
Hardware AMD Oberon PlayStation 5 SoC Die Delidded and Pictured

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
YOU'RE. Not YOUR. Keep that in mind next time you call someone a dumb fuck.

The fact that you dont understand why I am bringing up overclocking just shows how utterly clueless you are about how tflops and clocks correlate.


This right here is asinine. You are spitting in the face of over a decade of computer graphics to spout nonsense that has no basis in reality. Every PC GPU can already outperform its theoretical TFLOPS, because it can go beyond its advertised clock limits. We saw this in the video you continue to ignore, which shows the 5700 XT running well above the 1.91 GHz (1905 MHz) boost clock AMD themselves used to calculate the card's theoretical figure of 9.75 TFLOPS.

40 CUs * 64 shader processors * 2 * 1.905 GHz = 9.75 TFLOPS

For the 6600 XT they used the maximum boost clock of 2.589 GHz to calculate the card's theoretical TFLOPS.

32 CUs * 64 Shader cores * 2 * 2.589 Ghz = 10.6 Tflops
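
As a sanity check on the arithmetic above, here is a minimal Python sketch of the usual peak-FP32 formula (CUs x 64 lanes x 2 ops per FMA x clock). The function name is made up for illustration, and the 1.905 GHz value is AMD's listed boost clock for the 5700 XT, which rounds to the 1.91 GHz quoted here.

```python
def peak_tflops(cus: int, clock_ghz: float, lanes_per_cu: int = 64, ops_per_lane: int = 2) -> float:
    """Peak FP32 TFLOPS, assuming every lane issues one FMA (2 ops) per clock."""
    return cus * lanes_per_cu * ops_per_lane * clock_ghz / 1000.0

rx5700xt = peak_tflops(40, 1.905)   # AMD's listed boost clock
rx6600xt = peak_tflops(32, 2.589)   # boost clock quoted in the post
print(f"5700 XT: {rx5700xt:.2f} TFLOPS, 6600 XT: {rx6600xt:.2f} TFLOPS")
```

Plugging in a higher observed clock only scales the result linearly. The formula says nothing about how busy the lanes actually are at that clock.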



Notice how both cards hit higher clocks in most games, which means they are operating beyond their theoretical maximums. That is something you said never happens because cards never even reach their theoretical maximum.

Here is the 6600 XT running Horizon at 2.79 GHz, while the 5700 XT runs it at 2.153 GHz. Both are 200 MHz BEYOND the theoretical maximum AMD themselves advertised.




Your theory about cards not hitting max tflops is WRONG on every level. The only way the card will not be fully utilized is if they are capped at 30 fps or 60 fps and the dev is content with leaving a lot of performance on the table. But we have seen every game drop frames this gen and nearly every game utilize DRS which literally drops the resolution to allow the GPU to operate at full capacity so as to not leave performance on the table.

I have a UPS (Uninterruptible Power Supply) that lets me view the power consumption of my PS5, TV or PC at any given moment. I can easily see which games fully max out the APU and which ones dont. BC games without PS5 patches top out at 100w. These are your Uncharted 4's running at PS4 Pro clocks. Then you have games like Horizon which are patched to utilize higher PS5 clocks and they consume a bit more. Then you have games like Doom Eternal running on PS5 SDK fully utilizing the console and i can see the power consumption at 205-211 watts consistently. Same thing DF reported when they ran Gears 5's XSX native port. It was up to 211-220w at times. Whats consuming all that power if not the goddamn GPU running at its max clocks?

This lines up with whatever happens on my PC. When I run Hades at native 4k 120 fps, my GPU utilization sits at roughly 40%. If i leave the framerate uncapped, it goes up to 99% and runs the game at 350 fps. Games are designed to automatically scale up. its been this way for well over a decade since modern GPUs arrived in the mid 2000s. If they didnt scale, you would not see GoW and the Last guardian automatically hit 60 fps on the PS5 without any patches. If they didnt scale with CUs, you would not see Far Cry 6 have a consistent resolution advantage on the XSX. Same goes for Doom Eternal. These games run well because modern GPUs are able to utilize not just clocks but all the shader cores.
PC AIB GPU vendors sit in the same position as Sony or MS when it comes to out-of-the-box GPU clock speed profiles. AMD and NVIDIA can advise clock speed profiles, but the shipped profile is set by the AIB vendor. EVGA is responsible when the FTW ("For-The-Win") super overclocks it applies to its GPU cards cause abnormal product failures.

For example
MSI RTX 2080 Ti GX Trio is faster than RTX Titan XP reference.

MSI RTX 3080 Ti GX Trio is faster than RTX 3090 reference.
 

SlimySnake

Member
Feb 5, 2013
12,734
36,121
1,260
You replied to someone talking about the XBSX reaching 12TF, who was replying to someone saying the PS5 can't reach 10TF, with a post about PC GPU utilization.
Then they are both wrong. Both consoles operate at peak frequency.
 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
No GPU hits its theoretical TF limit because that is a best-case scenario. It would achieve that only if it calculated nothing but FP instructions and everything ran at 100%, including feeding all 52 CUs simultaneously at maximum. That would never happen while gaming.
I don't know why your telling me that if you got the SX GPU and overclocked it, it would hit higher TF limit? That was never the argument.
Note that the 12.147 TFLOPS number does NOT factor in the scalar units in each WGP.

This is literally the dumbest thing ive heard all week. The cards will not overheat because they are spec'd to hit those tflops out of the box. They will only overheat if you push the clocks beyond the power limits set by the manufacturer. Thats called overclocking. the 10.3 and 12.1 tflops limits set by console manufacturers are set precisely to avoid the cards overheating because thats what they have determined to be the safest and highest clocks that stay within the power limits.

You need to start paying attention to what other people post around here instead of dismissing everything. You might learn a thing or two. Otherwise you'd end up embarrassing yourself saying dumb shit like cards overheating because they are hitting their theoretical max. Absolute nonsense.
It's
10.28 TFLOPS vector units, not including scalar units.
12.147 TFLOPS vector units, not including scalar units.
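
For reference, both vector-unit figures fall out of the same CU x lanes x 2 x clock formula. A small Python sketch follows. The 36 CU / 2.23 GHz and 52 CU / 1.825 GHz inputs are the publicly listed PS5 and Series X specs rather than numbers from this post, and scalar units are left out exactly as noted above.

```python
# Vector-ALU peak only; scalar units are deliberately not counted.
CONSOLES = {
    "PS5": {"cus": 36, "clock_ghz": 2.23},    # assumed public spec (max variable clock)
    "XSX": {"cus": 52, "clock_ghz": 1.825},   # assumed public spec (fixed clock)
}

def vector_tflops(cus, clock_ghz, lanes_per_cu=64, ops_per_fma=2):
    """Peak vector FP32 TFLOPS: one FMA (2 ops) per lane per clock."""
    return cus * lanes_per_cu * ops_per_fma * clock_ghz / 1000.0

results = {name: vector_tflops(**spec) for name, spec in CONSOLES.items()}
for name, tf in results.items():
    print(f"{name}: {tf:.3f} TFLOPS")
```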
 
Last edited:

SlimySnake

Member
Feb 5, 2013
12,734
36,121
1,260
PC AIB GPU vendors sit in the same position as Sony or MS when it comes to out-of-the-box GPU clock speed profiles. AMD and NVIDIA can advise clock speed profiles, but the shipped profile is set by the AIB vendor. EVGA is responsible when the FTW ("For-The-Win") super overclocks it applies to its GPU cards cause abnormal product failures.

For example
MSI RTX 2080 Ti GX Trio is faster than RTX Titan XP reference.

MSI RTX 3080 Ti GX Trio is faster than RTX 3090 reference.
Precisely. AMD and Nvidia play it safe with their boost clocks. Nvidia actually advertises only 1.7 ghz of clockspeeds for my 2080 when in game i consistently hover around 1950 mhz. I am guessing they do this to avoid higher TDP values on their spec sheets or just lawsuits since they cant promise it wont drop below that clockspeed.

But if we go by those theoretical limits set by AMD and Nvidia then technically every single card is outperforming its theoretical flops. The AIB cards overclock even more. Well beyond the theoretical tflop maximum. It's just insane to say that GPUs will never hit the max clocks or that CUs wont be fully utilized. The PS5 has doubled the CUs of the PS4 and yet no one seems to have trouble utilizing those 2x extra CUs. Not in BC mode patches. Or when creating games on the new PS5 SDK.

it would be insanely dumb of AMD and Cerny to release a console that doesnt effectively utilize each and every CU. AMD would not be in business if their extra CUs werent effectively utilized because their entire 6000 series lineup relies on upping the CU counts to 72 all the way up to 80.
 

yewles1

Member
Mar 23, 2020
717
2,830
540
38
Indianapolis, In
Which is why this utterly ignorant insistence on focussing only on BS theoretical TFLOPs marketing numbers is just a massive exercise in stupidity and only serves to muddy any possibility for intelligent discourse on actual computing hardware performance.
It might piss off people even more to find out that we're realistically looking at MAYBE 4TFLOP/s average for PS5 and MAYBE 5TFLOP/s average for XSX. (don't ask how I got the numbers, it ain't pretty)
 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
I don't know if this workaround is still effective with GPUs that use delta color compression.
Note why AMD was pushing for Async compute workload (with texture read/write IO path connected to L2 cache) until PC RDNA 2 with 128 ROPS and super fast 128 MB L3 cache (render cache, the entire 4K frame buffer with DCC can fit).

Console RDNA 2 ROPS are connected to 4 to 5 MB L2 cache (refers to GPU's L2 cache not CPU's L2 cache, pipeline optimization involves micro-tile cache render methods).
 

Darius87

Member
Jul 16, 2018
1,093
2,638
525
What are you blind? Scroll up. I literally posted a comparison of 6600xt and 5700xt which shows both cards consistently hitting their peak clocks and 99% gpu utilization. Do you even read my posts?
Running at max clocks doesn't mean you're utilizing 100% of your GPU. You also have to understand that it isn't possible to switch every transistor inside the CUs every cycle, and that is what 100% CU utilization would be.
Take a few hours to watch YouTube videos of PC YouTubers benchmarking cards and see how overclocking is done to push the card beyond its limits. Go look at AIBs’ versions of gpus that are cooled with better cooling solutions and overclocked to get better performance for the same exact chip but with a higher clock. Look at the gpu utilization during these benchmarks of games and demos. It will almost always be 99%. Because even with higher clocks on the same chip with the same CU count, the clocks define the performance gains.
Overclocking has nothing to do with reaching Theoretical max of your GPU.
Youtubers aren't programmers how do they know what's CU utilization?
This right here is asinine. You are spitting in the face of over a decade of computer graphics to spout nonsense that has no basis in reality. Every PC GPU can already out perform its theoretical tflops. It can go beyond its theoretical clock limits. We saw this in the video you continue to ignore because it shows the 5700xt hitting well above the 1.91 Ghz clockspeeds AMD themselves used to calculate the card's theoretical tflops number. 9.75.
I suggest you first figure out why the term "theoretical" is used in so many cases when talking about TFLOPS.
Your theory about cards not hitting max tflops is WRONG on every level. The only way the card will not be fully utilized is if they are capped at 30 fps or 60 fps and the dev is content with leaving a lot of performance on the table. But we have seen every game drop frames this gen and nearly every game utilize DRS which literally drops the resolution to allow the GPU to operate at full capacity so as to not leave performance on the table.
No dude just stop this nonsense there's no game that fully utilizes all ALU's here's proof:
"I think you're asking what happens if there is a piece of code intentionally written so that every transistor (or the maximum number of transistors possible) in the CPU and GPU flip on every cycle. That's a pretty abstract question, games aren't anywhere near that amount of power consumption. In fact, if such a piece of code were to run on existing consoles, the power consumption would be well out of the intended operating range and it's even possible that the console would go into thermal shutdown.
https://www.eurogamer.net/articles/digitalfoundry-2020-playstation-5-the-mark-cerny-tech-deep-dive

Think logically: if games could utilize the CUs at 99% in a given frame, there wouldn't be anything left for async compute to do... Horizon is one game that's heavy on async compute. Does that mean it's not using async compute on PC because CU utilization is at 99%?
Let's say we have a 12 TFLOPS GPU. What you're telling me is that when it's running at 99%, this GPU does nearly 12 TRILLION float operations, that's 12 000 000 000 000, in 1 second. Just think about this number and how it would be possible to feed that many operations to the GPU in 1 second.
This lines up with whatever happens on my PC. When I run Hades at native 4k 120 fps, my GPU utilization sits at roughly 40%. If i leave the framerate uncapped, it goes up to 99% and runs the game at 350 fps. Games are designed to automatically scale up. its been this way for well over a decade since modern GPUs arrived in the mid 2000s. If they didnt scale, you would not see GoW and the Last guardian automatically hit 60 fps on the PS5 without any patches. If they didnt scale with CUs, you would not see Far Cry 6 have a consistent resolution advantage on the XSX. Same goes for Doom Eternal. These games run well because modern GPUs are able to utilize not just clocks but all the shader cores.
The GPU utilization graph on screen doesn't show CU utilization only; it reflects multiple parts of the GPU. The fastest, most powerful silicon on the GPU, the CUs, can only run as fast as the other parts of the GPU/CPU in the game pipeline allow. When you program, you have to parallelize work for the CUs, which is hard to do because there are so many cores: 1 CU has 64 cores/processors that can add/multiply floats.
 

John Wick

Member
Jul 23, 2015
2,037
2,152
560
United Kingdom
You've written an essay to explain what?
Teraflops are calculated from the number of floating-point instructions the GPU can perform per second. To reach its maximum limit it would have to perform with everything working at 100% all the time. This is about games. Games aren't made exclusively with TFs. Because as soon as you start doing other game related work do you think the GPU will still reach the maximum TF? So just imagine how many tasks a GPU is doing every second in a game? So do you still think the TF maximum will be reached?
I think you're confusing hitting max clocks with utilising a GPU 100%. It's impossible because there isn't any code written that could.
 
Last edited:

John Wick

Member
Jul 23, 2015
2,037
2,152
560
United Kingdom
Note that RTX 3090's TFLOPS is split between INT/FP and FP CUDA cores.

Turing SM has INT and FP CUDA cores. Ampere SM evolved Turing INT cores into INT/FP cores.



Integer shader workloads did NOT disappear when RTX Ampere was released!

AMD RDNA has common shader units for both integer and floating units. Typical TFLOPS argument between Turing vs RDNA hides Turing's extra TIOPS compute capability.

Ampere RTX' extra shader compute power is useful for mesh shaders, denoise raytracing, DirectStorage decompression, DirectML, and 'etc'.


17.79 TFLOPS from the FP CUDA cores. If you add 40% extra TIOPS, it lands on 24.906 TOPS. The difference between RTX 2080 Ti and RTX 3090 is the extra TFLOPS (INT units able to convert into FP units in Ampere), hence RTX 2080 Ti's 24.906 TOPS vs RTX 3090's 38.58 TOPS yields 58% advantage for RTX 3090.

In most games, RTX 3080 Ti and RTX 3090 have about 58% advantage over RTX 2080 Ti.
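
The TOPS arithmetic above can be reproduced in a few lines of Python. This is only a sketch using the post's own numbers (17.79 TFLOPS, a 40% integer uplift, 38.58 TOPS); the variable names are illustrative.

```python
turing_fp_tflops = 17.79      # RTX 2080 Ti FP CUDA figure quoted above
int_uplift = 0.40             # integer work assumed to be 40% of FP work
ampere_tops = 38.58           # RTX 3090 figure quoted above

# Turing's separate INT cores add integer throughput on top of the FP figure.
turing_tops = turing_fp_tflops * (1 + int_uplift)
advantage = ampere_tops / turing_tops - 1

print(f"Turing: {turing_tops:.3f} TOPS, Ampere advantage: {advantage:.1%}")
```

With these exact inputs the ratio comes out at roughly 55%, a little under the quoted 58%, so the percentage is best read as approximate.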
But no card is gonna reach that maximum. The code doesn't exist that can utilise the GPU 100% all the time. To do that the GPU would have to be running at max clock speed, as would the memory, with all CUs saturated simultaneously and ALUs and ROPs all running at 100%. Impossible to do. The card would overheat and downclock.
 
Last edited:

John Wick

Member
Jul 23, 2015
2,037
2,152
560
United Kingdom
No idea what this has to do with what I posted. I never questioned PS5's variable clock speeds. I never even mentioned the PS5 in that post. You are either confusing me with someone else or you think I am an xbox fanboy just because I dared to defend GPU's hitting their theoretical tflops max. Something we just took for granted when the PS4 said they were 1.84 tflops because their clock was 800 mhz and they had 18 CUs, and when the PS4 Pro came out and they said it was now 36 CUs at 911 Mhz. We consistently saw the PS4 have a 40% resolution advantage over the x1 last gen which settled around 900p after a rough first year. Then we consistently saw the Ps4 Pro offer 2x more resolution than the PS4 in line with the 2.2x increase in theoretical tflops. Then we saw the X1X consistently offer a 40% increase in pixels over the PS4 consistent with its theoretical tflops difference over the PS4 Pro.

But now we are throwing away a decade of console tflops performance results based on theoretical tflops because?
Mate no one is arguing that the SX is more powerful in compute. It will have a 10-15% advantage in games that favour that. But the SX isn't gonna reach the 12.15tf theoretical limit just because it has fixed clocks that are sustained.
I'll also apologise for being rude to you. Sorry about that. You are one of the better contributors on this forum.
 
Sep 18, 2019
1,549
3,920
390
Note why AMD was pushing for Async compute workload (with texture read/write IO path connected to L2 cache) until PC RDNA 2 with 128 ROPS and super fast 128 MB L3 cache (render cache, the entire 4K frame buffer with DCC can fit).

Console RDNA 2 ROPS are connected to 4 to 5 MB L2 cache (refers to GPU's L2 cache not CPU's L2 cache, pipeline optimization involves micro-tile cache render methods).
ROPs are connected to L1 cache. Both consoles have the same amount of L1 cache.

 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
ROPs are connected to L1 cache. Both consoles have the same amount of L1 cache.

L1 cache is tiny. The major improvement from RDNA v1 to PC RDNA v2 is the L3 cache, which is large enough for typical frame buffers and yields increased memory bandwidth.



I purposely ignored RDNA's L1 cache improvement since it's not a major factor for PC RDNA v2's memory bandwidth improvements.
 
Last edited:

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
But no card is gonna reach that maximum. The code doesn't exist that can utilise the GPU 100% all the time. To do that the GPU would have to be running at max clock speed, as would the memory, with all CUs saturated simultaneously and ALUs and ROPs all running at 100%. Impossible to do. The card would overheat and downclock.
Note the purpose of boost clocks, i.e. when the GPU is not fully utilized, the available TDP is used for increased clock speed. An out-of-the-box AIB OC card has higher base and boost clock speeds than the reference design. My argument was NOT about 100% utilization.

The use case for larger-scale GPU design is increased performance.

My post is against your "" comment when it should be 35.58 TOPS when comparing Turing (INT + FP CUDA cores) vs Ampere (INT/FP + FP CUDA cores) GPUs. Your argument has hidden the separate Tera Integer Operations Per Second (TIOPS) capability.

AMD's TFLOPS are shared with TIOPS shader workloads, hence Tera Operations Per Second (TOPS) should be used.



Reference RTX 3090 can reach 38.419 TFLOPS via the compute/TMU read-write IO path. Nearly half of 38.419 TFLOPS are shared with 20.761 TIOPS (integer).

RTX 2080 Ti is unable to convert INT CUDA cores into FP CUDA cores. We do know last-gen game's integer workloads can reach 40% of the FP workload.

Most of PC DirectX12U's new features are compute path related e.g. Mesh Shader related Next-Generation Graphics Pipeline (NGGP), RT denoise, DirectML, DirectStorage Decompression (GpGPU path), and 'etc'.
It's nearly a no-brainer as to why RTX 3080 Ti/3090 beats NAVI 21 on DirectX12U enabled performance i.e. AMD doesn't have TFLOPS/TIOPS high ground.

Running at max clocks doesn't mean you're utilizing 100% of your GPU. You also have to understand that it isn't possible to switch every transistor inside the CUs every cycle, and that is what 100% CU utilization would be.

Overclocking has nothing to do with reaching Theoretical max of your GPU.
Youtubers aren't programmers how do they know what's CU utilization?

I suggest you first figure out why the term "theoretical" is used in so many cases when talking about TFLOPS.

No dude just stop this nonsense there's no game that fully utilizes all ALU's here's proof:

https://www.eurogamer.net/articles/digitalfoundry-2020-playstation-5-the-mark-cerny-tech-deep-dive

Think logically: if games could utilize the CUs at 99% in a given frame, there wouldn't be anything left for async compute to do... Horizon is one game that's heavy on async compute. Does that mean it's not using async compute on PC because CU utilization is at 99%?
Let's say we have a 12 TFLOPS GPU. What you're telling me is that when it's running at 99%, this GPU does nearly 12 TRILLION float operations, that's 12 000 000 000 000, in 1 second. Just think about this number and how it would be possible to feed that many operations to the GPU in 1 second.

The GPU utilization graph on screen doesn't show CU utilization only; it reflects multiple parts of the GPU. The fastest, most powerful silicon on the GPU, the CUs, can only run as fast as the other parts of the GPU/CPU in the game pipeline allow. When you program, you have to parallelize work for the CUs, which is hard to do because there are so many cores: 1 CU has 64 cores/processors that can add/multiply floats.
Doom 2016 is also a heavy Async Compute enabled game and when Async Compute is enabled, MSAA is disabled since TMU read-write path doesn't have ROP's MSAA hardware.
 
Last edited:

Tripolygon

Member
May 6, 2012
4,855
7,576
1,160
NYC
What are you blind? Scroll up. I literally posted a comparison of 6600xt and 5700xt which shows both cards consistently hitting their peak clocks and 99% gpu utilization. Do you even read my posts?

You need to read up on this stuff a bit more. Everything you said is wrong. Literally everything. I don’t even know where to start. You are seriously asking me which tests these companies do to determine their peak clocks? Are you new to gaming? How old are you? Serious question. Have you ever owned a gaming PC?

I have never met anyone who thinks GPUs will overheat and die if ran at max clocks out of the box for more than 10 seconds. Its hilarious to see you post lol emojis to every post because your replies are laughable.

Take a few hours to watch YouTube videos of PC YouTubers benchmarking cards and see how overclocking is done to push the card beyond its limits. Go look at AIBs’ versions of gpus that are cooled with better cooling solutions and overclocked to get better performance for the same exact chip but with a higher clock. Look at the gpu utilization during these benchmarks of games and demos. It will almost always be 99%. Because even with higher clocks on the same chip with the same CU count, the clocks define the performance gains.
You are misunderstanding what that statistic means. I have made that mistake before and was corrected.

The % utilization you commonly see in benchmark tools is just a trace of the kernel(s) that were executed on the GPU over a given time period. Let us use 1 second as an example. Say program X was executed on the GPU and took 40 ms to complete: the percent utilization of the GPU (time) over a 1-second span would be 4%. This statistic does not tell you how many transistors were flipped, and that figure would be nowhere near 90%, as that would draw a shit ton of power and burn the GPU out.

GPU manufacturers design GPUs under the universally accepted rule that there is no perfect code. Power viruses are an example of code that emulates lots of transistors flipping.


utilization.gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
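
To make that definition concrete, here is a small Python sketch of a time-based utilization counter. It is not how any vendor tool is implemented; the interval list and function name are hypothetical, and it only mirrors the "percent of time a kernel was executing" wording.

```python
def time_utilization(kernel_intervals_ms, sample_period_ms=1000.0):
    """Percent of the sample period during which at least one kernel was running.

    kernel_intervals_ms: list of (start, end) times in ms. Overlapping
    intervals are merged so concurrent kernels are not double counted.
    """
    busy = 0.0
    current_start, current_end = None, None
    for start, end in sorted(kernel_intervals_ms):
        if current_end is None or start > current_end:
            # Gap before this kernel: close out the previous busy stretch.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current busy stretch: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return 100.0 * busy / sample_period_ms

one_kernel = time_utilization([(0, 40)])                 # single 40 ms kernel
back_to_back = time_utilization([(0, 500), (400, 990)])  # overlapping kernels
print(one_kernel, back_to_back)
```

A reading of 99% here means a kernel was resident for 990 ms out of 1000 ms. It says nothing about how many ALUs that kernel kept busy.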
 
Last edited:

Md Ray

Member
Nov 12, 2016
4,128
13,309
785
You are misunderstanding what that statistic means. I have made that mistake before and was corrected.

The % utilization you commonly see in benchmark tools is just a trace of the kernel(s) that were executed on the GPU over a given time period. Let us use 1 second as an example. Say program X was executed on the GPU and took 40 ms to complete: the percent utilization of the GPU (time) over a 1-second span would be 4%. This statistic does not tell you how many transistors were flipped, and that figure would be nowhere near 90%, as that would draw a shit ton of power and burn the GPU out.

GPU manufacturers design GPUs under the universally accepted rule that there is no perfect code. Power viruses are an example of code that emulates lots of transistors flipping.


utilization.gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
Good post. Bookmarked this for future reference.
 
Last edited:

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
Yes, because that's where the "2 Ops per clock" in the TFLOPs calculation comes from. It's completely unrealistic to expect a GPU to just constantly compute FMA OPs and nothing else. Real-world workloads are much more varied.
The argument between FMA and non-FMA wouldn't change the basic position when one of the hardware platforms has less compute power. Compute power can be bound by memory bandwidth issues. Pixel grid array is one of many easy workloads to be parallelized.
 
Last edited:

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
You've written an essay to explain what?
Teraflops are calculated from the number of floating-point instructions the GPU can perform per second. To reach its maximum limit it would have to perform with everything working at 100% all the time. This is about games. Games aren't made exclusively with TFs. Because as soon as you start doing other game related work do you think the GPU will still reach the maximum TF? So just imagine how many tasks a GPU is doing every second in a game? So do you still think the TF maximum will be reached?
I think you're confusing hitting max clocks with utilising a GPU 100%. It's impossible because there isn't any code written that could.
Most games are driven mostly by TOPS (Tera Operations Per Second), which includes integer and floating-point data types. Issues with ROPS power involve read-write IO bottlenecks that can be overcome by the TMU read-write IO path, but the argument over raw ROPS and other read-write capabilities hides a major bottleneck: external memory bandwidth.

AMD's push for Async Compute (using TMU IO path connected to L2 cache) marketing is just a cover for GCN ROPS IO path design weakness during Maxwell/Pascal era. Maxwell/Pascal's TMU and ROPS IO path has connections to multi-MB size L2 cache.

AMD's Async Compute marketing drive was silently dropped when NAVI 21 was released with 128 ROPS and enough for 4K frame buffer's 128 MB L3 cache (with "DCC everywhere"). But the majority of DirectX12U's new features are towards compute shader path with hardware-accelerated functions. AMD NAVI 2X design is fighting the last-gen PC GPU battles when NVIDIA moved the goal post.

There's also a design issue with GCN: a Wave64 needs 4 clock cycles to pass through the ALUs, while NVIDIA's CUDA Warp32 needs 1 clock cycle. GCN's Wave64 therefore needs many wavefronts in flight to hide ALU pipeline latency.
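
The 4-cycle versus 1-cycle difference follows from wave width divided by SIMD width. A rough Python sketch, assuming GCN's 16-lane SIMDs and a 32-lane issue for a Warp32 (the latter is an assumption that fits Maxwell/Pascal-style SM partitions, not every NVIDIA design):

```python
import math

def issue_cycles(wave_size: int, simd_lanes: int) -> int:
    """Clock cycles needed to push one wavefront/warp through a SIMD."""
    return math.ceil(wave_size / simd_lanes)

gcn_wave64 = issue_cycles(64, 16)    # GCN: 64 work-items over a 16-lane SIMD
cuda_warp32 = issue_cycles(32, 32)   # assumed 32 lanes available per warp
print(gcn_wave64, cuda_warp32)
```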
 
Last edited:

twilo99

Member
Mar 9, 2021
998
1,029
330
The 6600 XT is the lowest tier RDNA2 card and it seems like it's shipping with VRR... I am still perplexed as to why the PS5 didn't ship with it, since it's part of the architecture and is still missing almost a year later.
 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
I don't see how this has anything to do with what I just said.
When both machines execute non-FMA workloads, both machines' TFLOPS drops, hence the argument between FMA and non-FMA wouldn't change the basic position.

12.147 TFLOPS potential drops to 6.0735 TFLOPS potential with 100% non-FMA.
10.28 TFLOPS potential drops to 5.14 TFLOPS potential with 100% non-FMA.
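
The halving is just the FMA factor dropping from 2 ops per clock to 1. A quick Python sketch with a hypothetical `fma_fraction` parameter shows the in-between cases as well:

```python
def effective_tflops(peak_fma_tflops: float, fma_fraction: float) -> float:
    """Peak scaled by instruction mix: an FMA counts as 2 ops, anything else as 1."""
    ops_per_clock = 2 * fma_fraction + 1 * (1 - fma_fraction)
    return peak_fma_tflops * ops_per_clock / 2

for name, peak in (("XSX", 12.147), ("PS5", 10.28)):
    print(name, effective_tflops(peak, 0.0), effective_tflops(peak, 0.5))
```

Whatever mix is chosen, the ratio between the two machines stays the same as long as both run the same mix, which is the point being made above.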
 
Last edited:

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
When both machines execute non-FMA workloads, both machines' TFLOPS drops, hence the argument between FMA and non-FMA wouldn't change the basic position.

12.147 TFLOPS potential drops to 6.0735 TFLOPS potential with 100% non-FMA.
10.28 TFLOPS potential drops to 5.14 TFLOPS potential with 100% non-FMA.
I'm not sure that's entirely true - because even in non-FMA workloads power draw is a concern and the second system being able to switch clock multiple times per frame IIRC will have higher efficiency/throughput in transforming watts into work done by transistors.

AFAIK in the 12.147TFLOPS it will either limit power usage by the DirectX translation layer or throttling bandwidth, etc to avoid exceeding the power budget, when non-fma code would still otherwise demand too much watts at that fixed clock.
 

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
Most games are driven mostly by TOPS (Tera Operations Per Second), which includes integer and floating-point data types. Issues with ROPS power involve read-write IO bottlenecks that can be overcome by the TMU read-write IO path, but the argument over raw ROPS and other read-write capabilities hides a major bottleneck: external memory bandwidth.
AMD's push for Async Compute (using TMU IO path connected to L2 cache) marketing is just a cover for GCN ROPS IO path design weakness during Maxwell/Pascal era. Maxwell/Pascal's TMU and ROPS IO path has connections to multi-MB size L2 cache.
Yeah, but the integer operations per second are as much about the CPU and the coupling/cohesion of those TOPS driven by the CPU and GPU, IMHO. With AMD significantly stronger positioned in the brawny CPU market, I presume their design considerations differ from those of Nvidia, who are leaning more on their GPU technology - with more redundancy for a brawny CPU - to complement the weaker ARM CPU tech they are doing APUs with, IMO.

I also think AMD are more in line with Carmack's and Sweeney's 10-year-old comments regarding general purpose graphics programming, and the "design weakness" you refer to is actually a difference in design choice - where Nvidia seem to be clinging more to dedicated ASICs to keep their market position, IMO.
AMD's Async Compute marketing drive was silently dropped when NAVI 21 was released with 128 ROPS and enough for 4K frame buffer's 128 MB L3 cache (with "DCC everywhere"). But the majority of DirectX12U's new features are towards compute shader path with hardware-accelerated functions. AMD NAVI 2X design is fighting the last-gen PC GPU battles when NVIDIA moved the goal post.

There's also a design issue with GCN's Wave64: it needs 4 clock cycles to pass through the ALUs, while NVIDIA's CUDA Warp32 needs 1 clock cycle. AMD GCN's Wave64 therefore needs many more wavefronts in flight to hide ALU pipeline latency.
The clock cycle count required for each looks consistent with the design strategy difference of: general purpose versus dedicated ASICs IMO, ASICs gain in performance by reducing clock cycles at the cost of flexibility.

Given the way console gaming tends to evolve the graphics software tech recursively - multiple times - each console generation, and with AMD being aligned to both PlayStation and Xbox I believe AMD think their more generalised compute strategy is correct for the longterm - because it is informed by the two companies that buy the hardware from them that sets the baseline for AAA game development, and Epic's UE5 is also pushing that design philosophy too, IMO.
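
For reference, the 4-cycle versus 1-cycle figures quoted above fall out of dividing the wave width by the number of SIMD lanes it is issued over. A minimal Python sketch, assuming a GCN Wave64 is executed on a 16-lane SIMD and a Warp32 is issued across 32 lanes; the lane counts and the function name are assumptions for illustration and differ between GPU generations.

```python
import math


def issue_cycles(wave_width, simd_lanes):
    """Clock cycles needed to push one wave/warp instruction through the ALU lanes."""
    return math.ceil(wave_width / simd_lanes)


# Assumed configurations, not taken from vendor documentation quoted in this thread.
print("GCN Wave64 on a 16-lane SIMD:", issue_cycles(64, 16), "cycles")
print("Warp32 on 32 lanes:", issue_cycles(32, 32), "cycle")
```

This only counts issue cycles per instruction; it says nothing about which design ends up faster once occupancy and latency hiding are taken into account.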
 
Mar 7, 2017
3,051
6,374
520
When both machines execute non-FMA workloads, both machines' TFLOPS drops, hence the argument between FMA and non-FMA wouldn't change the basic position.

12.147 TFLOPS potential drops to 6.0735 TFLOPS potential with 100% non-FMA.
10.28 TFLOPS potential drops to 5.14 TFLOPS potential with 100% non-FMA.

I never made an argument contrary to this. I agree with you.
 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
Yeah, but the integer operations per second are as much about the CPU and the coupling/cohesion of those TOPS driven by the CPU and GPU, IMHO. With AMD significantly stronger positioned in the brawny CPU market, I presume their design considerations differ from those of Nvidia, who are leaning more on their GPU technology - with more redundancy for a brawny CPU - to complement the weaker ARM CPU tech they are doing APUs with, IMO.

I also think AMD are more in line with Carmack's and Sweeney's 10-year-old comments regarding general purpose graphics programming, and the "design weakness" you refer to is actually a difference in design choice - where Nvidia seem to be clinging more to dedicated ASICs to keep their market position, IMO.

The clock cycle count required for each looks consistent with the design strategy difference of: general purpose versus dedicated ASICs IMO, ASICs gain in performance by reducing clock cycles at the cost of flexibility.

Given the way console gaming tends to evolve the graphics software tech recursively - multiple times - each console generation, and with AMD being aligned to both PlayStation and Xbox I believe AMD think their more generalised compute strategy is correct for the longterm - because it is informed by the two companies that buy the hardware from them that sets the baseline for AAA game development, and Epic's UE5 is also pushing that design philosophy too, IMO.

Integer (e.g. INT32) shader operations didn't disappear in GPU shader programs.



AMD GPU's ALUs execute both Integer and floating-point.




NAVI 21's INT24 and FP32 have almost the same 1 to 1 performance. INT24 targets 24-bit color data. NAVI 21's INT32 for some reason is crippled.
 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
I'm not sure that's entirely true - because even in non-FMA workloads power draw is a concern and the second system being able to switch clock multiple times per frame IIRC will have higher efficiency/throughput in transforming watts into work done by transistors.

AFAIK the 12.147 TFLOPS system will either limit power usage through the DirectX translation layer or throttle bandwidth, etc., to avoid exceeding the power budget when non-FMA code would otherwise still demand too many watts at that fixed clock.
Direct3D API layer is not a major issue due to vendor-specific API access.

For example


Doom 2016 (Vulkan PC) was the launch title. This direct access feature was missing on AMD's Mantle API.



The Xbox One and Series S|X GPUs have an extra microcode engine to translate DirectX 12 calls to the GPU's ISA without the CPU's involvement.


The translation from Direct3D ASM to the GPU instruction set mostly impacts CPU overhead.

A translation layer (with good resource tracking) would be needed when evolving GPU without constraints like on PS5's 36 CU backward compatibility requirement that is double the PS4's 18 CU count.

Xbox Series X's 52 CU count shows it's not limited by the multiples of XBO's 12 CU count or multiples of X1X's 40 CU count.

NVIDIA also has shader intrinsics access

During the port of the EGO® engine to next-gen console, we discovered that warp/wave-level operations enabled substantial optimisations to our light culling system. We were excited to learn that NVIDIA offered production-quality, ready-to-use HLSL extensions to access the same functionality on GeForce GPUs. We were able to exploit the same warp vote and lane access functionality as we had done on console, yielding wins of up to 1ms at 1080p on a GTX 980. We continue to find new optimisations to exploit these intrinsics.

-------------------------
Note that DirectX 12's Shader Model 6.x exposes warp/wave-level operations for AMD and NVIDIA GPUs.
 

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
Integer (e.g. INT32) shader operations didn't disappear in GPU shader programs.



AMD GPU's ALUs execute both Integer and floating-point.




NAVI 21's INT24 and FP32 have almost the same 1 to 1 performance. INT24 targets 24-bit color data. NAVI 21's INT32 for some reason is crippled.
I'm struggling to make the connection between what I wrote and what you have replied with, and whether you are agreeing or disagreeing - I was talking in broader terms as the specifics of each hardware mean little compared to relative performance of say...cutting edge general purpose graphics software such as nanite/lumen when comparing AMD's top card to Nvidia's.

My view is that the rendering techniques are only going to get more generalised from here on, and with worldwide initiatives in place to limit power use on electrical devices, having GPU silicon that might be idle, and gives rise to much wider sets of TDP is probably the wrong solution going forward. AMD's hardware is quite power efficient, and the solution in PS5 with constant power by deterministic clock (per workload) seems like a logical step in the right direction too - compared to Nvidia IMO.

IMO (going forward) UE5 seems to demonstrate that AMD and Nvidia are similar for performance, but Nvidia's setup is less efficient by silicon used and power.
 

MonarchJT

Banned
Sep 25, 2020
2,971
4,589
410
Your dumb as fuck. Who is talking about overclocking? Have I mentioned anything about overclocking the GPU? So do you think by overclocking the 3090 you will hit it's 35.5tf theoretical performance?
This is about the theoretical 12.15 teraflops peak performance that keeps on getting bandied about on here. That the SX can achieve it because of its fixed clock speed.
Which I explained no GPU can reach it's theoretical teraflops number because everything would have to work at its maximum speed to achieve it. Imagine feeding all 52 CU's and keeping them at maximum all the time? You would run into bottlenecks long before and overheating with downclocking kicking in.
As I stated before the SX will have an advantage in compute and RT. About 10-15% on average.
Reported. You never change.
 
  • Empathy
Reactions: arvfab

MonarchJT

Banned
Sep 25, 2020
2,971
4,589
410
No idea what this has to do with what I posted. I never questioned PS5's variable clock speeds. I never even mentioned the PS5 in that post. You are either confusing me with someone else or you think I am an Xbox fanboy just because I dared to defend GPUs hitting their theoretical TFLOPS max. That is something we just took for granted when the PS4 was said to be 1.84 TFLOPS because its clock was 800 MHz and it had 18 CUs, and when the PS4 Pro came out and they said it was now 36 CUs at 911 MHz. We consistently saw the PS4 have a 40% resolution advantage over the X1 last gen, which settled around 900p after a rough first year. Then we consistently saw the PS4 Pro offer 2x more resolution than the PS4, in line with the 2.2x increase in theoretical TFLOPS. Then we saw the X1X consistently offer a 40% increase in pixels over the PS4 Pro, consistent with its theoretical TFLOPS difference over the PS4 Pro.

But now we are throwing away a decade of console tflops performance results based on theoretical tflops because?
Because the PS5 has an objectively weaker GPU, and even though there is nothing wrong with admitting this simple, objective truth, we have to spend hours and hours, month after month, reading absurd spin that tries to play down this deficit in the eyes of fanboys.
 
Jun 1, 2016
2,848
3,742
795
absolutely yes...in fact it is just like this
I assumed it would be like cpu cores. Takes time to spread the load but after a while they get better at it.
I do know it's much easier to double cu's compared to doubling clocks.
Just look at the 6900 and its 80 CUs.
 

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
Direct3D API layer is not a major issue due to vendor-specific API access.

For example


Doom 2016 (Vulkan PC) was the launch title. This direct access feature was missing on AMD's Mantle API.



The Xbox One and Series S|X GPUs have an extra microcode engine to translate DirectX 12 calls to the GPU's ISA without the CPU's involvement.


The translation from Direct3D ASM to the GPU instruction set mostly impacts CPU overhead.

A translation layer (with good resource tracking) would be needed when evolving GPU without constraints like on PS5's 36 CU backward compatibility requirement that is double the PS4's 18 CU count.

Xbox Series X's 52 CU count shows it's not limited by the multiples of XBO's 12 CU count or multiples of X1X's 40 CU count.

NVIDIA also has shader intrinsics access

During the port of the EGO® engine to next-gen console, we discovered that warp/wave-level operations enabled substantial optimisations to our light culling system. We were excited to learn that NVIDIA offered production-quality, ready-to-use HLSL extensions to access the same functionality on GeForce GPUs. We were able to exploit the same warp vote and lane access functionality as we had done on console, yielding wins of up to 1ms at 1080p on a GTX 980. We continue to find new optimisations to exploit these intrinsics.

-------------------------
Note that DirectX 12's Shader Model 6.x exposes warp/wave-level operations for AMD and NVIDIA GPUs.
Again, I'm not sure of the relevance of your response

The XsX workloads that run on the silicon are generated by the translation layer, so it can't generate a workload that will allow the game software to exceed the power limits of the XsX, so as I mentioned in my post you quoted, either the DX12 translation layer will limit power when a non-FMA workload would exceed the power limit draw of the GPU in the XsX, or the system will throttle the bandwidth because of the fixed clock.

My comment was more to illustrate that both systems operate differently and AFAIK don't scale down equally as you said, and that the fixed clock narrative that Xbox gave following the confusion about the PS5 was completely misleading because of how the system will limit GPU performance elsewhere under too much load.
 

Loxus

Member
Sep 18, 2020
700
2,901
345
The Caribbean
I did some research on Infinity Cache and unified L3 cache and realized we made all this fuss about whether the PS5 has them or not without really understanding whether it needs them.

L2 vs. L3 cache: What’s the Difference?
At the simplest level, an L3 cache is just a larger, slower version of the L2 cache. Back when most chips were single-core processors, this was generally true. The first L3 caches were actually built on the motherboard itself, connected to the CPU via the back-side bus (as distinct from the front-side bus). When AMD launched its K6-III processor family, many existing K6/K-2 motherboards could accept a K6-III as well. Typically these boards had 512K-2MB of L2 cache — when a K6-III, with its integrated L2 cache was inserted, these slower, motherboard-based caches became L3 instead.

The reason I mention motherboard-based caches, is because the PS5 has 512MB DDR4 Cache.

This cache is most likely the reason the SSD is low latency/high bandwidth.



AMD’s Ryzen processors based on the Zen, Zen+, and Zen 2 cores all share a common L3, but the structure of AMD’s CCX modules left the CPU functioning more like it had 2x8MB L3 caches, one for each CCX cluster, as opposed to one large, unified L3 cache like a standard Intel CPU.


Private L1/L2 caches and a shared L3 is hardly the only way to design a cache hierarchy, but it’s a common approach that multiple vendors have adopted. Giving each individual core a dedicated L1 and L2 cuts access latencies and reduces the chance of cache contention — meaning two different cores won’t overwrite vital data that the other put in a location in favor of their own workload. The common L3 cache is slower but much larger, which means it can store data for all the cores at once. Sophisticated algorithms are used to ensure that Core 0 tends to store information closest to itself, while Core 7 across the die also puts necessary data closer to itself.

Having a unified L3 cache doesn't really matter, as each core will still mostly access the data closest to itself. And with the way the PS5 is laid out, having one large pool of L3 cache may not be viable.



Unlike the L1 and L2, which are nearly always CPU-focused and private, the L3 can also be shared with other devices or capabilities. Intel’s Sandy Bridge CPUs shared an 8MB L3 cache with the on-die graphics core (Ivy Bridge gave the GPU its own dedicated slice of L3 cache in lieu of sharing the entire 8MB). Intel’s Tiger Lake documentation indicates that the onboard CPU cache can also function as a LLC for the GPU.

The PS5's GPU having access to the CPU's L3 cache isn't far-fetched and has been a thing for a while, especially with the use of Infinity Fabric.


Here is how things get interesting by not having Infinity Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets

The Infinity Cache is a Victim Cache

The victim cache idea is a legacy of CPUs under the Zen architectures that has been adapted to RDNA 2.

Victim Cache Zen


In the Zen cores the L3 cache is what we call a victim cache: it is in charge of collecting the cache lines discarded from the L2 instead of being part of the usual cache hierarchy. That is to say, in Zen cores the data that comes from RAM does not follow the path RAM → L3 → L2 → L1 or vice versa, but instead follows the path RAM → L2 → L1, since the L3 cache acts as a victim cache.

In the case of the Infinity Cache, the idea is to rescue the lines of the L2 Cache of the GPU without having to access the VRAM , which allows the energy consumed per instruction to be much lower and therefore higher speeds can be achieved.

Infinity Cache Power Consumption


However, although the capacity of 128 MB may seem very high, it does not seem to be enough to keep all the discarded lines from ending up in VRAM, since in the best of cases it only manages to rescue 58%. This means that in future iterations of its RDNA architecture it is very likely that AMD will increase the capacity of the Infinity Cache.
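
The victim-cache behaviour the article describes (RAM fills L2 directly, and the L3 only catches lines evicted from L2) can be sketched as a toy model. This is a minimal Python sketch under stated assumptions: both levels are fully associative with LRU replacement, sizes are counted in lines, and the class and function names are invented for illustration. It is not a model of how Zen or RDNA 2 hardware is actually built.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        return False

    def remove(self, addr):
        self.lines.pop(addr, None)

    def insert(self, addr):
        """Insert a line and return the evicted address, or None if nothing was evicted."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None


def simulate(trace, l2_size, victim_size):
    """Count where each access is served: L2, the victim cache, or RAM."""
    l2, victim = LRUCache(l2_size), LRUCache(victim_size)
    stats = {"l2_hit": 0, "victim_hit": 0, "ram": 0}
    for addr in trace:
        if l2.lookup(addr):
            stats["l2_hit"] += 1
            continue
        if victim.lookup(addr):
            stats["victim_hit"] += 1
            victim.remove(addr)        # the rescued line moves back up into L2
        else:
            stats["ram"] += 1          # RAM -> L2 directly, bypassing the victim cache
        evicted = l2.insert(addr)
        if evicted is not None:
            victim.insert(evicted)     # L2 castoffs land here; its own evictions are dropped
    return stats


print(simulate([0, 1, 2, 0, 1, 2], l2_size=2, victim_size=2))
print(simulate([0, 1, 2, 0, 1, 2], l2_size=2, victim_size=0))
```

With the victim cache present, the repeated accesses are rescued instead of going back to RAM; with a capacity of zero, every access goes to RAM. How much a real 128 MB cache rescues depends entirely on the workload, which this toy trace does not try to represent.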


So having Infinity Cache in the PS5 isn't feasible: without enough cache, the discarded lines will still end up in VRAM. Increasing the capacity isn't feasible either, because it would dramatically increase the size of the SoC.

So to combat not having Infinity Cache, Cerny implemented Cache Scrubbers.

Inside PlayStation 5: the specs and the tech that deliver Sony's next-gen vision
"Coherency comes up in a lot of places, probably the biggest coherency issue is stale data in the GPU caches," explains Cerny in his presentation. "Flushing all the GPU caches whenever the SSD is read is an unattractive option - it could really hurt the GPU performance - so we've implemented a gentler way of doing things, where the coherency engines inform the GPU of the overwritten address ranges and custom scrubbers in several dozen GPU caches do pinpoint evictions of just those address ranges."

Basically, what this means in regards to Infinity Cache is that instead of flushing the GPU caches and putting all the discarded lines into an L3 cache, the cache scrubbers evict only the overwritten lines.
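
To make the difference between a full flush and the "pinpoint evictions" Cerny describes concrete, here is a toy Python sketch. It assumes a cache can be modelled as a dictionary keyed by line address and that the coherency engine hands over a half-open address range; the real scrubbers are hardware and their interface is not public, so every name here is hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line, assumed for illustration


def flush_all(cache):
    """Drop every line, valid or stale. Returns the number of lines evicted."""
    evicted = len(cache)
    cache.clear()
    return evicted


def scrub_range(cache, start, end):
    """Evict only the lines whose address falls in the overwritten range [start, end)."""
    stale = [addr for addr in cache if start <= addr < end]
    for addr in stale:
        del cache[addr]
    return len(stale)


# A cache holding 16 lines covering addresses 0..1023.
cache = {addr: f"data@{addr}" for addr in range(0, 1024, LINE_SIZE)}

# The SSD overwrites bytes 128..255, so only the two lines in that range are stale.
print("scrubbed:", scrub_range(cache, 128, 256), "lines, kept:", len(cache))
print("a full flush would now throw away:", flush_all(cache), "still-valid lines")
```

The point of the sketch is only that the scrub keeps the lines that are still valid, so the GPU does not have to refetch them after an SSD read.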

We only want features so we can say my console has something your console doesn't, without fully understanding whether it's really needed or not.
 

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
Again, I'm not sure of the relevance of your response

The XsX workloads that run on the silicon are generated by the translation layer, so it can't generate a workload that will allow the game software to exceed the power limits of the XsX, so as I mentioned in my post you quoted, either the DX12 translation layer will limit power when a non-FMA workload would exceed the power limit draw of the GPU in the XsX, or the system will throttle the bandwidth because of the fixed clock.

My comment was more to illustrate that both systems operate differently and AFAIK don't scale down equally as you said, and that the fixed clock narrative that Xbox gave following the confusion about the PS5 was completely misleading because of how the system will limit GPU performance elsewhere under too much load.
Not a complete narrative.

Like the Xbox 360 with its microcode access, the Xbox One XDK has shader intrinsic functions, and its Direct3D 12 layer is hardware-accelerated by a semi-custom DirectX 12 microcode engine.

You just ignored the statement on this issue by id Software's Tiago Sousa.


Notice "consoles", i.e. more than one console.


For the PC, AMD released direct-access shader intrinsic extensions while NVIDIA revealed their existing shader intrinsic extensions.
 
  • Thoughtful
Reactions: PaintTinJr

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
I did some research on Infinity Cache and unified L3 cache and realized we made all this fuss about whether the PS5 has them or not without really understanding whether it needs them.

L2 vs. L3 cache: What’s the Difference?
At the simplest level, an L3 cache is just a larger, slower version of the L2 cache. Back when most chips were single-core processors, this was generally true. The first L3 caches were actually built on the motherboard itself, connected to the CPU via the back-side bus (as distinct from the front-side bus). When AMD launched its K6-III processor family, many existing K6/K-2 motherboards could accept a K6-III as well. Typically these boards had 512K-2MB of L2 cache — when a K6-III, with its integrated L2 cache was inserted, these slower, motherboard-based caches became L3 instead.

The reason I mention motherboard-based caches, is because the PS5 has 512MB DDR4 Cache.

This cache is most likely the reason the SSD is low latency/high bandwidth.


AMD’s Ryzen processors based on the Zen, Zen+, and Zen 2 cores all share a common L3, but the structure of AMD’s CCX modules left the CPU functioning more like it had 2x8MB L3 caches, one for each CCX cluster, as opposed to one large, unified L3 cache like a standard Intel CPU.


Private L1/L2 caches and a shared L3 is hardly the only way to design a cache hierarchy, but it’s a common approach that multiple vendors have adopted. Giving each individual core a dedicated L1 and L2 cuts access latencies and reduces the chance of cache contention — meaning two different cores won’t overwrite vital data that the other put in a location in favor of their own workload. The common L3 cache is slower but much larger, which means it can store data for all the cores at once. Sophisticated algorithms are used to ensure that Core 0 tends to store information closest to itself, while Core 7 across the die also puts necessary data closer to itself.


Having a unified L3 cache doesn't really matter, as each core will still mostly access the data closest to itself. And with the way the PS5 is laid out, having one large pool of L3 cache may not be viable.


Unlike the L1 and L2, which are nearly always CPU-focused and private, the L3 can also be shared with other devices or capabilities. Intel’s Sandy Bridge CPUs shared an 8MB L3 cache with the on-die graphics core (Ivy Bridge gave the GPU its own dedicated slice of L3 cache in lieu of sharing the entire 8MB). Intel’s Tiger Lake documentation indicates that the onboard CPU cache can also function as a LLC for the GPU.

The PS5's GPU having access to the CPU's L3 cache isn't far-fetched and has been a thing for a while, especially with the use of Infinity Fabric.


Here is how things get interesting by not having Infinity Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets

The Infinity Cache is a Victim Cache

The victim cache idea is a legacy of CPUs under the Zen architectures that has been adapted to RDNA 2.

In the Zen cores the L3 cache is what we call a victim cache: it is in charge of collecting the cache lines discarded from the L2 instead of being part of the usual cache hierarchy.
That is to say, in Zen cores the data that comes from RAM does not follow the path RAM → L3 → L2 → L1 or vice versa, but instead follows the path RAM → L2 → L1, since the L3 cache acts as a victim cache.

In the case of the Infinity Cache, the idea is to rescue the lines of the L2 Cache of the GPU without having to access the VRAM , which allows the energy consumed per instruction to be much lower and therefore higher speeds can be achieved.

However, although the capacity of 128 MB may seem very high, it does not seem to be enough to keep all the discarded lines from ending up in VRAM, since in the best of cases it only manages to rescue 58%. This means that in future iterations of its RDNA architecture it is very likely that AMD will increase the capacity of the Infinity Cache.


So having Infinity Cache in the PS5 isn't feasible: without enough cache, the discarded lines will still end up in VRAM. Increasing the capacity isn't feasible either, because it would dramatically increase the size of the SoC.

So to combat not having Infinity Cache, Cerny implemented Cache Scrubbers.

Inside PlayStation 5: the specs and the tech that deliver Sony's next-gen vision
"Coherency comes up in a lot of places, probably the biggest coherency issue is stale data in the GPU caches," explains Cerny in his presentation. "Flushing all the GPU caches whenever the SSD is read is an unattractive option - it could really hurt the GPU performance - so we've implemented a gentler way of doing things, where the coherency engines inform the GPU of the overwritten address ranges and custom scrubbers in several dozen GPU caches do pinpoint evictions of just those address ranges."

Basically, what this means in regards to Infinity Cache is that instead of flushing the GPU caches and putting all the discarded lines into an L3 cache, the cache scrubbers evict only the overwritten lines.

We only want features so we can say my console has something your console doesn't, without fully understanding whether it's really needed or not.

On AMD, iGPU access to the CPU's L3 cache is not as tightly integrated as on Intel's ring bus, whose client nodes include the L3 cache, the iGPU, and the CPU cores.



In the roughly 2 TFLOPS iGPU market segment, Intel's Tiger Lake Xe beats AMD's Vega iGPU in OpenCL.

-------------------------
PC RDNA 2 Infinity Cache's design goals.



By having a larger cache, AMD minimizes the number of trips to GDDR6 memory. Those trips are physically further and off-chip so they use more power and have higher latency. As a result, the larger L3 cache means the architecture is feeding the compute units more efficiently.

Notice that PC RDNA 2's Infinity Cache (L3 cache) has links to external memory, i.e. GDDR6. AMD shows "RAM → L3 → L2 → L1". PC RDNA 2's Infinity Cache is not a direct copy-and-paste of the Zen/Zen 2/Zen 3 L3 cache.

Remember, XBO's 32 MB eSRAM is large enough for typical 1600x900 frame buffers without DCC (delta color compression). PC RDNA 2's 128 MB Infinity Cache with DCC holds at least four times as much as XBO's 32 MB eSRAM without DCC.

NAVI 21's 128 MB Infinity Cache size was deliberate for 4K PC gaming.
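
The frame-buffer sizing argument can be checked with some quick arithmetic. This is a minimal Python sketch, assuming uncompressed 32-bit (4 bytes per pixel) render targets and binary megabytes; the helper names are made up, and real engines use a mix of formats, so treat the counts as ballpark only.

```python
MIB = 1024 * 1024


def render_target_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed render target (4 bytes per pixel is an assumption)."""
    return width * height * bytes_per_pixel


def targets_that_fit(cache_mib, width, height, bytes_per_pixel=4):
    """How many whole render targets of this size fit in a cache of cache_mib MiB."""
    return (cache_mib * MIB) // render_target_bytes(width, height, bytes_per_pixel)


for label, cache_mib, w, h in [("32 MB eSRAM at 1600x900", 32, 1600, 900),
                               ("128 MB Infinity Cache at 3840x2160", 128, 3840, 2160)]:
    size_mib = render_target_bytes(w, h) / MIB
    print(f"{label}: {size_mib:.1f} MiB per target, "
          f"{targets_that_fit(cache_mib, w, h)} targets fit")
```

A 4K target has 5.76 times the pixels of a 1600x900 one while the cache is only 4 times larger, which is where DCC comes into the argument above.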
 
  • Thoughtful
Reactions: PaintTinJr

rnlval

Member
Jun 26, 2017
1,375
1,123
460
Sector 001
gpucuriosity.wordpress.com
I'm struggling to make the connection between what I wrote and what you have replied with, and whether you are agreeing or disagreeing - I was talking in broader terms as the specifics of each hardware mean little compared to relative performance of say...cutting edge general purpose graphics software such as nanite/lumen when comparing AMD's top card to Nvidia's.

My view is that the rendering techniques are only going to get more generalised from here on, and with worldwide initiatives in place to limit power use on electrical devices, having GPU silicon that might be idle, and gives rise to much wider sets of TDP is probably the wrong solution going forward. AMD's hardware is quite power efficient, and the solution in PS5 with constant power by deterministic clock (per workload) seems like a logical step in the right direction too - compared to Nvidia IMO.

IMO (going forward) UE5 seems to demonstrate that AMD and Nvidia are similar for performance, but Nvidia's setup is less efficient by silicon used and power.
The recent UE5 demo with PS5 influence doesn't show large-scale hardware RT usage.
 

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
The recent UE5 demo with PS5 influence doesn't show large-scale hardware RT usage.
Large-scale hardware RT usage in UE5 cripples frame rates on everything, and it isn't general purpose: it doesn't use the Nanite micro-polygon geometry with HW RT and instead falls back to vertex pipeline geometry. My 12-core Xeon/3060/32GB RAM/980 Pro drops to 2-4fps with HW RT enabled in the UE5 Valley of the Ancient demo, regardless of resolution and other performance settings. The data processing needed for the HW RT is seemingly the bottleneck, as a new default FPS project with all lighting enabled runs at about 90fps IIRC.

UE5 Nanite/Lumen provides an excellent benchmark of the consoles and the latest PC offerings from AMD and Nvidia IMHO, and a view of future performance trends in rendering.
 

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
Not a complete narrative.

Like the Xbox 360 with its microcode access, the Xbox One XDK has shader intrinsic functions, and its Direct3D 12 layer is hardware-accelerated by a semi-custom DirectX 12 microcode engine.

You just ignored the statement on this issue by id Software's Tiago Sousa.


Notice "consoles", i.e. more than one console.


For the PC, AMD released direct-access shader intrinsic extensions while NVIDIA revealed their existing shader intrinsic extensions.
I'm not saying the info you have linked isn't of interest, but it is at a tangent to the point being made.
Performance is a vague term in this discussion without a point of reference to measure, unlike stating more performance was had in getting from 5fps to 7fps.

I was specifically talking about transforming power into work done, because Cerny described it as a paradigm shift, i.e. nothing else pre-emptively maintains power draw by adjusting clock speed. The id Software developer isn't saying that the consoles can conjure more thermal headroom and draw power beyond the limit of their respective power supplies, whether or not they get 5% more performance over their previous, more abstract efforts, and that's why I didn't respond to the quote you provided.

In the non-FMA situation you described, at loads that would exceed what the 315-watt (IIRC) PSU in the XsX can supply, the Xbox will throttle harder to keep within its power limits than the PS5, going by what Cerny described in the Road to PS5 and in the DF interview that followed it.
 

PaintTinJr

Member
Jan 30, 2020
1,219
2,699
475
Oxfordshire, England
I did some research on Infinity Cache and unified L3 cache and realized we made all this fuss about whether the PS5 has them or not without really understanding whether it needs them.

L2 vs. L3 cache: What’s the Difference?
At the simplest level, an L3 cache is just a larger, slower version of the L2 cache. Back when most chips were single-core processors, this was generally true. The first L3 caches were actually built on the motherboard itself, connected to the CPU via the back-side bus (as distinct from the front-side bus). When AMD launched its K6-III processor family, many existing K6/K-2 motherboards could accept a K6-III as well. Typically these boards had 512K-2MB of L2 cache — when a K6-III, with its integrated L2 cache was inserted, these slower, motherboard-based caches became L3 instead.

The reason I mention motherboard-based caches, is because the PS5 has 512MB DDR4 Cache.

This cache is most likely the reason the SSD is low latency/high bandwidth.



AMD’s Ryzen processors based on the Zen, Zen+, and Zen 2 cores all share a common L3, but the structure of AMD’s CCX modules left the CPU functioning more like it had 2x8MB L3 caches, one for each CCX cluster, as opposed to one large, unified L3 cache like a standard Intel CPU.


Private L1/L2 caches and a shared L3 is hardly the only way to design a cache hierarchy, but it’s a common approach that multiple vendors have adopted. Giving each individual core a dedicated L1 and L2 cuts access latencies and reduces the chance of cache contention — meaning two different cores won’t overwrite vital data that the other put in a location in favor of their own workload. The common L3 cache is slower but much larger, which means it can store data for all the cores at once. Sophisticated algorithms are used to ensure that Core 0 tends to store information closest to itself, while Core 7 across the die also puts necessary data closer to itself.

Having a unified L3 cache doesn't really matter, as each core will still mostly access the data closest to itself. And with the way the PS5 is laid out, having one large pool of L3 cache may not be viable.



Unlike the L1 and L2, which are nearly always CPU-focused and private, the L3 can also be shared with other devices or capabilities. Intel’s Sandy Bridge CPUs shared an 8MB L3 cache with the on-die graphics core (Ivy Bridge gave the GPU its own dedicated slice of L3 cache in lieu of sharing the entire 8MB). Intel’s Tiger Lake documentation indicates that the onboard CPU cache can also function as a LLC for the GPU.

The PS5's GPU having access to the CPU's L3 cache isn't far-fetched. This kind of sharing has been a thing for a while, especially with the use of Infinity Fabric.


Here is where things get interesting with the PS5 not having Infinity Cache.
Infinity Cache, Discover Its Usefulness, Operation and Secrets

The Infinity Cache is a Victim Cache

The victim cache idea is inherited from CPUs based on the Zen architectures and has been adapted to RDNA 2.

[Image: Victim cache in Zen]


In Zen cores, the L3 cache is what we call a victim cache: it collects the cache lines discarded from the L2 instead of being part of the usual cache hierarchy. That is to say, in Zen cores the data that comes from RAM does not follow the path RAM → L3 → L2 → L1 or vice versa, but instead follows the path RAM → L2 → L1, since the L3 cache acts as a victim cache.
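
To make the fill path described above concrete, here is a minimal Python sketch of a victim-cache arrangement. It is only an illustration under simplifying assumptions: both caches are tiny, fully associative and LRU, the L3 is treated as strictly exclusive, and the names (`LRUCache`, `access`, `run`) are made up for this example rather than taken from any AMD documentation.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny fully associative LRU cache keyed by line address (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        return False

    def remove(self, addr):
        self.lines.pop(addr, None)

    def insert(self, addr):
        """Insert a line; return the evicted line address, or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)  # drop least recently used
            return victim
        return None

def access(addr, l2, l3, stats):
    """Service one access that already missed in L1."""
    if l2.lookup(addr):
        stats["l2_hit"] += 1
        return
    if l3.lookup(addr):
        stats["l3_hit"] += 1
        l3.remove(addr)    # assumed exclusive: the line moves back up to L2
    else:
        stats["ram"] += 1  # fill comes from RAM straight into L2, bypassing L3
    victim = l2.insert(addr)
    if victim is not None:
        l3.insert(victim)  # L3 only ever receives lines evicted from L2

def run(trace, l2_size=2, l3_size=4):
    l2, l3 = LRUCache(l2_size), LRUCache(l3_size)
    stats = {"l2_hit": 0, "l3_hit": 0, "ram": 0}
    for addr in trace:
        access(addr, l2, l3, stats)
    return stats, l2, l3

stats, l2, l3 = run(["A", "B", "C", "A"])
print(stats)  # {'l2_hit': 0, 'l3_hit': 1, 'ram': 3}
```

In this trace, line A is pushed out of the two-entry L2 when C arrives, lands in the L3, and is then found there on the second access to A instead of being fetched from RAM again.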

In the case of the Infinity Cache, the idea is to rescue the lines evicted from the GPU's L2 cache without having to access VRAM, which makes the energy consumed per instruction much lower and therefore allows higher speeds to be achieved.

[Image: Infinity Cache power consumption]


However, although a capacity of 128 MB may seem very high, it does not seem to be enough to keep all the discarded lines from ending up in VRAM, since in the best case it only manages to rescue 58%. This means that in future iterations of its RDNA architecture it is very likely that AMD will increase the capacity of the Infinity Cache.
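
As a rough back-of-the-envelope illustration of that 58% figure, the sketch below splits a batch of requests that missed the GPU's L2 into those served by the Infinity Cache and those that still go out to VRAM. The request count of 1000 and the function name are assumptions for the example. Only the 58% best-case rate comes from the quoted article.

```python
def split_l2_misses(l2_misses, rescue_rate):
    """Split L2 misses into Infinity Cache hits and VRAM accesses.

    rescue_rate is the fraction of lines the Infinity Cache manages to rescue.
    """
    served_by_cache = round(l2_misses * rescue_rate)
    sent_to_vram = l2_misses - served_by_cache
    return served_by_cache, sent_to_vram

# Hypothetical batch of 1000 L2 misses at the article's best-case 58% rate.
cache_hits, vram_accesses = split_l2_misses(1000, 0.58)
print(cache_hits, vram_accesses)  # 580 420
```

Even in the best case, roughly four out of every ten of those requests still have to be served from VRAM.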


So having Infinity Cache in the PS5 isn't feasible: without enough cache, the discarded lines will still end up in VRAM. Increasing the capacity isn't feasible either, because it would dramatically increase the size of the SoC.

So to compensate for not having Infinity Cache, Cerny implemented cache scrubbers.

Inside PlayStation 5: the specs and the tech that deliver Sony's next-gen vision
"Coherency comes up in a lot of places, probably the biggest coherency issue is stale data in the GPU caches," explains Cerny in his presentation. "Flushing all the GPU caches whenever the SSD is read is an unattractive option - it could really hurt the GPU performance - so we've implemented a gentler way of doing things, where the coherency engines inform the GPU of the overwritten address ranges and custom scrubbers in several dozen GPU caches do pinpoint evictions of just those address ranges."

Basically, here is what this means with regard to Infinity Cache: instead of flushing the GPU caches and putting all the discarded lines into the L3 cache, the cache scrubbers only evict the overwritten lines.
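
The difference between a full flush and the pinpoint evictions Cerny describes can be sketched in a few lines of Python. This is a toy model, not how the PS5 hardware is implemented: the `GpuCache` class, the 64-byte line size and the address range are all assumptions made up for the illustration.

```python
class GpuCache:
    """Toy cache mapping line-aligned addresses to data (illustrative only)."""
    LINE = 64  # assumed line size in bytes

    def __init__(self):
        self.lines = {}

    def fill(self, addr, data):
        self.lines[addr - addr % self.LINE] = data  # align to the start of the line

    def flush_all(self):
        """Drop everything, stale or not. Returns the number of lines lost."""
        dropped = len(self.lines)
        self.lines.clear()
        return dropped

    def scrub(self, start, end):
        """Evict only the lines overlapping the overwritten range [start, end)."""
        stale = [a for a in self.lines if a < end and a + self.LINE > start]
        for a in stale:
            del self.lines[a]
        return len(stale)

def make_cache(n_lines=10):
    cache = GpuCache()
    for i in range(n_lines):
        cache.fill(i * GpuCache.LINE, f"data-{i}")
    return cache

# Hypothetical case: the SSD overwrites bytes 128..255 of memory.
flushed = make_cache().flush_all()
scrubbed_cache = make_cache()
scrubbed = scrubbed_cache.scrub(128, 256)
print(flushed, scrubbed, len(scrubbed_cache.lines))  # 10 2 8
```

The flush throws away all ten lines, while the scrub removes only the two that overlap the overwritten range and leaves the other eight valid in the cache.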

We only want features so we can say my console has something your console doesn't, without fully understanding whether it's really needed or not.
Very interesting about it being a victim cache. I'm wondering if the 58% is typically retained as a natural function of the data changing between one shader workload and the next.

With the IO complex linking the CPU, GPU, RAM and SSD controller/DDR4 cache, and with it having its own eSRAM, is it possible the eSRAM is a small shared L3, with the DDR4 acting like an L4 cache (LLC)?
 
Have a play with Visual6502

But even when you know how these things work, it still seems like alien technology. Just the GPU on that chip has 18 billion transistors and miles of wire, and you can just about make that out in the picture. We have technology that creates tools, which develop another technology, which creates tools that enable us to make yet another technology, and so forth. It's surprisingly fragile and we take a lot of it for granted. Once we stopped manufacturing CRT displays, we forgot how to make them within 10-20 years. We would pretty much have to reverse engineer one and go back to the manufacturing drawing board to make another. If for whatever reason all CPU technology vanished and we only had our knowledge, it would still take us decades to get back to where we are.
Sorry to quote an old post but are there any good YouTube documentaries on the birth and evolution of chipsets? Many thanks.