DonMigs85 · Oct 8, 2020

Lysandros said:
XSX has 64 ROPs confirmed.

Dang, thought it was 80. So PS5 can actually beat it in raw fillrate but lags in memory bandwidth and shader power

LordOfChaos · Oct 8, 2020

Lysandros said:
XSX has 64 ROPs confirmed.

Going to send a correction to the Techpowerup GPU database guy, seems to be leading some astray

AMD Xbox Series X GPU Specs

AMD Scarlett, 1825 MHz, 3328 Cores, 208 TMUs, 64 ROPs, 10240 MB GDDR6, 1750 MHz, 320 bit

www.techpowerup.com

Jigsaah · Oct 8, 2020

These ACTUAL multi-plat game comparisons can't come soon enough. I am so fucking tired of speculation.

FunkMiller · Oct 8, 2020

Jigsaah said:
These ACTUAL multi-plat game comparisons can't come soon enough. I am so fucking tired of speculation.

Things will be a lot clearer and more straightforward once we actually see next gen games on the XSX. It’s the only piece of the unknown we still have that’s important.

FrankWza · Oct 8, 2020

Soapy Wooder said:
No need to evolve then. Just don’t devolve into the lower life console warrior.

But how does evolution not stagnate when there's no Demon's Souls remake on 11/12-19/20?

ethomaz · Oct 8, 2020

DonMigs85 said:
I think the Series X has more ROPs and TMUs to compensate for its clock deficit right? Heck even the One X had half the ROPs of PS4 Pro but the memory bandwidth and shader count more than made up for it

It has the same amount of ROPs (64).
TMUs are tied to the CU so it indeed has more.
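
A quick sanity check of the "TMUs are tied to the CU" point, as a minimal sketch. It assumes 4 TMUs per CU (the usual RDNA layout, which nobody states in this thread); the 52 CU count and 1825 MHz clock are the figures from the TechPowerUp listing quoted earlier. The function names are made up for illustration.

```python
TMUS_PER_CU = 4  # assumption: RDNA-style layout, not stated in the thread


def tmu_count(compute_units: int) -> int:
    """TMUs scale with the CU count because they live inside each CU."""
    return compute_units * TMUS_PER_CU


def texel_rate_gtexels(compute_units: int, clock_ghz: float) -> float:
    """Paper texture fill rate: one texel per TMU per clock."""
    return tmu_count(compute_units) * clock_ghz


print(tmu_count(52))  # 208, matching the listing
print(f"{texel_rate_gtexels(52, 1.825):.1f} Gtexels/s")
```

Under that assumption, more CUs always means more TMUs, which is why the TMU count differs between the consoles even though the ROP count does not.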

Nhranaghacon · Oct 8, 2020

The correct contextual performance metric for a single teraflop is 4 trillion calculations a second.

Attempting the redefine/confuse the teraflop metric by citing CU count's are worthless even when it take's less CU's to produce more Teraflops. Teraflop's the metric that matters - CU count's will always effectively grow or shrink with architecture improvements - as needed - but should not be considered the relevant performance metric in relation to Teraflop Performance.

LordOfChaos said:
Bolded
1) I did?
2) You've already undone your argument. The Gflops we talk about in console terms are a purely theoretical paper calculation of shader ALUs * 2 ops per core per clock * clock speed. By talking about optimization, you've already acknowledged this is far from all there is to it.

It also only looks at shader theoretical performance alone. The Xbox certainly has more. But that's not all there is on a GPU, as Cerny alluded to, several developers, DF, NX, and others have winked at. A 23% clock speed advantage also means everything else is clocked higher on the GPU, the command processors, the buffers, the caches, the coprocessors, there may be other advantages to the GPU outside of the peak shader theoretical performance console warriors like to boil things down to. That's without considering the geometry engine, API differences, etc.

A Teraflop is not a theoretical metric. A Teraflop is a concrete measurement of 4 trillion calculations. Period. To imply a teraflop in theory might perform 4 trillion calculation's a second is to imply it really isn't capable of 4 trillion calculations in most cases.

Also, a Teraflop does much more than shader performance. A single Teraflop is also indicative of pure polygon count. As you wrongly cited, "It also only looks at shader theoretical performance alone" And polygon count's are not shader performance - shader's are a derivative texturing/overlay methods that allow the artist's to do more with less and a Teraflop in fact effects all object's on screen not just shader performance.

Please see here where I predicted both Consoles performance in Teraflop's (and calculated/added CPU teraflop performance for both systems - which is exactly 1 teraflop with each CPU) exactly 2 year's ahead of their reveal using my expertise in computer science and cgi creation to learn more.

ethomaz · Oct 8, 2020

LordOfChaos said:
Going to send a correction to the Techpowerup GPU database guy, seems to be leading some astray

AMD Xbox Series X GPU Specs

AMD Scarlett, 1825 MHz, 3328 Cores, 208 TMUs, 64 ROPs, 10240 MB GDDR6, 1750 MHz, 320 bit

www.techpowerup.com

TMUs 208
ROPs 64

onQ123 · Oct 8, 2020

LordOfChaos said:
Going to send a correction to the Techpowerup GPU database guy, seems to be leading some astray

AMD Xbox Series X GPU Specs

AMD Scarlett, 1825 MHz, 3328 Cores, 208 TMUs, 64 ROPs, 10240 MB GDDR6, 1750 MHz, 320 bit

www.techpowerup.com

SMH they still haven't fixed that after the facts have been out there

Lysandros · Oct 8, 2020

LordOfChaos said:
Going to send a correction to the Techpowerup GPU database guy, seems to be leading some astray

AMD Xbox Series X GPU Specs

AMD Scarlett, 1825 MHz, 3328 Cores, 208 TMUs, 64 ROPs, 10240 MB GDDR6, 1750 MHz, 320 bit

www.techpowerup.com

Information (Hot Chips slide) about the ROPs is at 15 min: "116 Gpix/sec". He should correct it. I don't know about the TMUs, sadly.

LordOfChaos · Oct 8, 2020

Nhranaghacon said:
The correct contextual performance metric for a single teraflop is 4 trillion calculations a second.

Attempting the redefine/confuse the teraflop metric by citing CU count's are worthless even when it take's less CU's to produce more Teraflops. Teraflop's the metric that matters - CU count's will always effectively grow or shrink with architecture improvements - as needed - but should not be considered the relevant performance metric in relation to Teraflop Performance.

A Teraflop is not a theoretical metric. A Teraflop is a concrete measurement of 4 trillion calculations. Period. To imply a teraflop in theory might perform 4 trillion calculation's a second is to imply it really isn't capable of 4 trillion calculations in most cases.

By theoretical, I mean it's a calculated measure of the peak output of the system. You can't seriously say there's never any difference between a paper calculation of ALUs * clock speed * 2 ops per ALU per cycle, vs a graphics chip's output in real world performance. Are there never differences in cache contention? Hit rates? Bandwidth? Schedulers? Keeping the ALUs filled with relevant non-redundant work? THAT is the difference between a measurement of the peak flops and real end performance, otherwise why do you think performance between generations and architectures can't be compared purely by looking at peak flops?

Vega killed Nvidia, because more flops? Right? Right? No, because...No.

No one is saying one teraflop is a different number than what it is. What everyone DOES understand is a paper figure is one thing, applying the performance to real world graphics is another, and the difference between the two would be the utilization rate and various other factors.
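
To make the paper-vs-real distinction concrete, here is a minimal sketch of the peak calculation described above (ALUs * 2 ops per ALU per clock * clock speed). The 64 ALUs per CU is the standard RDNA figure and matches the 3328 cores / 52 CUs in the listing; the utilization value is purely hypothetical, picked only to show that sustained throughput is the peak scaled by how busy the ALUs actually are.

```python
def peak_tflops(compute_units: int, clock_ghz: float,
                alus_per_cu: int = 64, ops_per_alu_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPS: shader ALUs * ops per clock * clock."""
    shader_alus = compute_units * alus_per_cu
    gflops = shader_alus * ops_per_alu_per_clock * clock_ghz
    return gflops / 1000.0


def sustained_tflops(peak: float, utilization: float) -> float:
    """Scale the paper peak by an ALU utilization factor between 0 and 1."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak * utilization


xsx_peak = peak_tflops(52, 1.825)
print(f"{xsx_peak:.3f} TF peak")  # 12.147 TF peak

# Hypothetical utilization, for illustration only
print(f"{sustained_tflops(xsx_peak, 0.6):.2f} TF sustained at 60% utilization")
```

The peak number never changes for a given chip and clock; everything argued about in this thread (caches, schedulers, bandwidth) lives in the utilization factor.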

Soodanim · Oct 8, 2020

Nhranaghacon said:
The correct contextual performance metric for a single teraflop is 4 trillion calculations a second.

Attempting the redefine/confuse the teraflop metric by citing CU count's are worthless even when it take's less CU's to produce more Teraflops. Teraflop's the metric that matters - CU count's will always effectively grow or shrink with architecture improvements - as needed - but should not be considered the relevant performance metric in relation to Teraflop Performance.

A Teraflop is not a theoretical metric. A Teraflop is a concrete measurement of 4 trillion calculations. Period. To imply a teraflop in theory might perform 4 trillion calculation's a second is to imply it really isn't capable of 4 trillion calculations in most cases.

Also, a Teraflop does much more than shader performance. A single Teraflop is also indicative of pure polygon count. As you wrongly cited, "It also only looks at shader theoretical performance alone" And polygon count's are not shader performance - shader's are a derivative texturing/overlay methods that allow the artist's to do more with less and a Teraflop in fact effects all object's on screen not just shader performance.

Please see here where I predicted both Consoles performance in Teraflop's (and calculated/added CPU teraflop performance for both systems - which is exactly 1 teraflop with each CPU) exactly 2 year's ahead of their reveal using my expertise in computer science and cgi creation to learn more.

That Reddit post in short: a graphics whore with terrible grammar got a job in the industry and made a ridiculously long post about polygons and trees to just say "MS is smart and playing the long game and Sony are dumb and will lose because weaker".

theddub · Oct 8, 2020

Lysandros said:
XSX has 64 ROPs confirmed.

Can you please post the information confirming this, I've only heard it rumored but not confirmed.

Lysandros

Lysandros · Oct 8, 2020

theddub said:
Can you please post the information confirming this, I've only heard it rumored but not confirmed.

Lysandros

Already did, see my earlier post.

theddub · Oct 8, 2020

Lysandros said:
Already did, see my earlier post.

post number or link?

Lysandros

aries_71 · Oct 8, 2020

Birdo said:
Multiplat games will look almost identical next gen.

Makes me wonder if this will make DF redundant.

No, there will always be a shadow with a slightly lower resolution on one machine than the other. Thousands of posts and hours of videos will be poured into it.

Lysandros · Oct 8, 2020

theddub said:
post number or link?

Lysandros

Lysandros said:
Information (Hot Chips slide) about the ROPs is at 15 min: "116 Gpix/sec". He should correct it. I don't know about the TMUs, sadly.

Kssio_Aug · Oct 8, 2020

Jigsaah said:
These ACTUAL multi-plat game comparisons can't come soon enough. I am so fucking tired of speculation.

I have to agree with you. In the beginning the rumours and speculations are kinda interesting... but after a while it gets so damn tiring. I can't wait to see the real deal.

Thirty7ven · Oct 8, 2020

Maybe he meant Xbox warriors are going to be disappointed?

Everyone else knows what’s up.

theddub · Oct 8, 2020

Lysandros I didn't watch the RedTechGaming video (I'm working right now); however, the video is a report that references Hot Chips, and the Hot Chips Series X presentation transcript does NOT mention ROPs. So does RedTechGaming have unreleased, insider (or claiming insider) information, or is he just guessing? It does NOT seem to be CONFIRMATION, but just a guess.

alabtrosMyster · Oct 8, 2020

Calm down people, no matter what shocked you in this empty statement.

It is empty.

thicc_girls_are_teh_best · Oct 8, 2020

Elog said:
There are some strange statements in the last few posts.

Do I ever dare inquire whose posts you could be referring to, dear sir?

Every additional CU to a GPU adds less output than the previously added CU. This is a mathematical fact unless a task can be parallelised to 100% (which no task can).

This is a reductionist way of viewing load scheduling for task deployment among the CUs. If the scale of the work is not predicated on a certain amount of unique data to be distributed among those blocks of CUs, then the advantage favors a system that can run more parts of the instruction in parallel than the one that runs them faster, because the latter predicates itself on previous parts of the data pipeline for the instruction to be calculated beforehand.

Computer science 101.
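
For anyone following along, the "each additional CU adds less" claim quoted above is essentially Amdahl's law for a fixed workload. A minimal sketch, assuming a hypothetical parallel fraction of 0.95 (not a measured figure for any console); it says nothing about workloads that grow with the hardware, which is the scenario the reply is describing.

```python
def amdahl_speedup(parallel_fraction: float, units: int) -> float:
    """Speedup over one unit when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


def marginal_gain(parallel_fraction: float, units: int) -> float:
    """Extra speedup gained by adding one more unit to `units`."""
    return (amdahl_speedup(parallel_fraction, units + 1)
            - amdahl_speedup(parallel_fraction, units))


P = 0.95  # hypothetical parallel fraction, illustration only
for cus in (36, 52):
    print(cus, round(amdahl_speedup(P, cus), 2), round(marginal_gain(P, cus), 4))
```

The marginal gain shrinks as the unit count grows, but it stays positive, so both sides of the argument are describing the same curve from different ends.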

Secondly, under heavy load the bottle-neck for a GPU is most often the unified cache (L2 in AMDs language) which results in CUs either idling or doing redundant work (i.e. performs a task that is either no longer required or is done by another CU). This is a real problem and if you look around at solutions where people try to increase cache efficiency on GPUs you can see crazy numbers such as +50% in actual output in terms of computational performance.

Right, the L2$ that MS provided an extra MB's worth on the GPU. Now per CU if you divide it down that would still give Sony the advantage in L2$ size per CU, but you're talking ~ 15 KB difference CU-for-CU, and this still ignores that the larger GPU is going to have a larger block of the fastest L0$.
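
The per-CU figure can be checked with a short sketch. The absolute L2 sizes here are assumptions (5 MB for Series X and 4 MB for PS5, consistent with the "extra MB" mentioned above but not stated in this thread); the CU counts are the 52 and 36 used elsewhere in the discussion.

```python
def l2_kib_per_cu(l2_mib: float, compute_units: int) -> float:
    """L2 capacity divided evenly across CUs, in KiB per CU."""
    return l2_mib * 1024.0 / compute_units


xsx = l2_kib_per_cu(5, 52)  # assumed 5 MiB L2
ps5 = l2_kib_per_cu(4, 36)  # assumed 4 MiB L2

print(f"XSX: {xsx:.1f} KiB/CU, PS5: {ps5:.1f} KiB/CU, difference: {ps5 - xsx:.1f} KiB")
```

With those assumed sizes the gap comes out at roughly 15 KiB per CU, which is where the figure above comes from.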

As for saturation, that point would hold true for both systems. You yourself have made some rather aggrandizing conclusions about specific GE and PS customizations regarding PS5 based a lot on hopeful wishing and a single graphics demo which you conflated as being a statement more than it was while not being aware the company which made that demo has a history of doing all of their demos on Sony platforms going back two decades. But I digress...

As for your conclusions in that other thread, I already pointed out how fanciful they were, so there's no reason to repeat it here. At the end of the day we should just acknowledge that the issue of saturation affects both platforms but you really have to ask yourself, if frontend saturation was so bad on GPUs going beyond your seemingly favored 36 CU preference...why are all of these companies making bigger GPUs in the first place!? Something tells me they know better on this concern of hardware saturation than yourself otherwise they would not be spending billions into parallelizing GPU workloads.

Actually when you think about it, the concern itself is invalid in the first place. Try applying this to the CPU space 15 years ago: by that logic, we should've simply kept cranking up the clocks. But engineers realized a reality when it comes to pushing clocks too high. At some point, parallelism wins out as a design metric. We are starting to see the benefits of that from AMD themselves, going with larger discrete GPUs, never mind Nvidia or Intel who are in similar pursuit (especially the latter). So I'd say market realities would indicate that these companies, including AMD, have addressed the vast majority of concerns regarding frontend saturation of their GPU resources. After all, you kind of need to have done that in order to further pursue things such as chiplets, which AMD are rumored to be doing for RDNA3 (no word from Nvidia on that front yet; both trail behind Intel in the chiplet area for GPUs however).

Since we lack data on the PS5's cache system it is hard to make a comparison. All we know is that Sony has spent some serious work in trying to increase cache efficiency. We will see if they have succeeded.

I'd say they have, for their design performance targets. Which would match up favorably with MS's design performance targets for their next-gen hardware. The issue comes when some folks (some who have done so in this very thread, even on this very same page) use this as a means of tribalistic bickering to say one approach is good only if the other one is trash.

They may not say it in such words, but their language and whatnot suggests as such.

However, as seen in the +50% number above good cache management can yield much better results if done right than just adding CUs. Please note that I am not claiming that the PS5 is getting +50% output from the CUs - it is just an example to make clear that proper cache management can yield results that are much more impressive than a few % in increased actual performance.

Okay...so, where does this suddenly mean only PS5 sees this "good cache management"? Again, it's binary thinking: A chose to go with X, so A cannot have also chosen Y. That type of stuff. No company is making their decisions that way, it's foolish to pretend that they are.

Both systems have smart cache management. We already know Series's GPU can snoop the CPU caches, and the inverse is true as well (but software-only in that case). Technically speaking, if the GPU can snoop the CPU caches, that is an analogous approach in cache management comparable to cache scrubbers. In many ways they're attempting to resolve the same issues when it comes to stale data in the GPU caches to sync with correct data in those caches while attempting to cut down on flushing the entire caches or taking a hit going back to main memory for the data to copy back into the caches.

So again, there's really very little "either/or" hard massive compromises on either system. They're taking smart approaches in every part of the pipeline borrowing bits from anything relevant. The sooner this is accepted as a common-sense conclusion I think a lot of the FUD WRT either would die off drastically. But until then, we'll just keep debating where we feel it's merited.

theddub said:
Lysandros I didn't watch the Redtechgaming video(I'm working right now), however he references Hot Chips and the Hot Chips Series X presentation ...transcript.... does NOT mention ROPS....so does RedTechGAming have unreleased, insider(or claiming insider) information or is he just guessing? It does NOT seem to be CONFIRMATION, but just a guess.

If that's the case he's probably going from some of the early Arden leaks, which IIRC mentioned 64 ROPs. Or was it a spec sheet?

Either way, it's probably accurate tbh. While we're doing math: 52 CUs * 64 shader ALUs per CU * 2 ops per clock * 1.825 GHz = 12.147 TF, exactly what Series X is at (the 64 in that formula is ALUs per CU, not ROPs, so it doesn't confirm the ROP count on its own). I'm not even sure if AMD have increased ROPs for forthcoming GPUs; that is something they may do for RDNA3 when they can shift to 5nm (gives them more density budget).

Regardless, some folks are trying to take the ROPs and exaggerate a certain platform having inefficiencies as a result of that, but they aren't really thinking about this stuff too deeply to see how that thought process falls apart. After all, according to a lot of these same folks, if TFs are not indicative of absolute performance efficiency (and I agree: they aren't), why are we suddenly using a metric that directly contributes to that previously-agreed-to incomplete measure of performance, as a ding against a "particular" platform's performance efficiency capabilities?

People can't keep changing their standards and conditions at random and not expect to get scoped for it.

Lysandros · Oct 8, 2020

theddub said:
Lysandros I didn't watch Rtechgaming video, however he references Hot Chips and the Hot Chips Series X presentation ...transcript.... does NOT mention ROPS....so does RedTechGAming have unreleased information or is he guessing?

Just google Xbox Series X GPU Evolution and you'll see that slide yourself. No need to mention the 'ROPs'; it says 116 Gpix/sec right in the Microsoft presentation slide. The only way of getting this number at the known clock frequency is with 64 ROPs. It's been a known quantity for almost two months. Anything else?
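
The inference from the slide can be sketched as follows, assuming the usual paper model of one pixel per ROP per clock. The 116 Gpix/sec and 1.825 GHz figures come from this thread; the PS5 line uses 64 ROPs at up to 2.23 GHz, which is an assumption not stated here, added only to illustrate the raw fillrate comparison made earlier.

```python
def pixel_fillrate_gpix(rops: int, clock_ghz: float) -> float:
    """Paper pixel fill rate: one pixel per ROP per clock."""
    return rops * clock_ghz


def implied_rops(fillrate_gpix: float, clock_ghz: float) -> float:
    """Work backwards from a quoted fill rate to a ROP count."""
    return fillrate_gpix / clock_ghz


print(f"{pixel_fillrate_gpix(64, 1.825):.1f}")  # 116.8, the slide's ~116 Gpix/sec
print(f"{pixel_fillrate_gpix(80, 1.825):.1f}")  # 146.0, what 80 ROPs would give
print(round(implied_rops(116.0, 1.825)))        # 64

# Assumption: PS5 with 64 ROPs at up to 2.23 GHz (not from this thread)
print(f"{pixel_fillrate_gpix(64, 2.23):.1f}")
```

Working backwards from 116 Gpix/sec lands on 64 rather than 80, which is the whole basis for calling the count confirmed.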

ethomaz · Oct 8, 2020

theddub said:
Lysandros I didn't watch the Redtechgaming video(I'm working right now), however he, the video is a report, references Hot Chips and the Hot Chips Series X presentation ...transcript.... does NOT mention ROPS....so does RedTechGAming have unreleased, insider(or claiming insider) information or is he just guessing? It does NOT seem to be CONFIRMATION, but just a guess.

MS's own Hot Chips presentation confirmed it in August.

Romulus · Oct 8, 2020

I love how DF was supposedly a bunch of Xbox fanboys when the Xbox one X came out, but now they're uber Sony bots.

Lysandros · Oct 8, 2020

ethomaz said:
Hot Chips presentation confirmed it.

Thank you! I couldn't manage to add that image for some reason.

gamer82 · Oct 8, 2020

What does it matter? It's not a PC; you can't upgrade it bar storage. Who cares how many flops etc.? Are you really going to lose any sleep over something you can't change?

All this talk about them being shills: yes, they may be, for both. As long as you buy a console they are getting paid, either via Patreon, YouTube, whatever, and no doubt freebies from the companies. I don't think they really care about being called shills or anything else; they probably just have a good laugh about it.

If you want the best there's a thing called a PC, and you will just have to come to terms with the fact that not every game will come to PC, so join the club.

geordiemp · Oct 8, 2020

thicc_girls_are_teh_best said:
Man, I can almost taste the facetiousness in this post...

Anyway, it's like I said earlier about the PS5 caches being "faster"; that's something that begins to factor in if the graphical tasks in question need a certain duration of cycles to where it actually becomes a factor. Otherwise, on a cycle-for-cycle basis, the larger GPU with the larger physical amount of cache is going to be able to crunch more data in parallel than the smaller GPU with a smaller physical array of cache.

I don't know what was being discussed before my reply, but everything I'm bringing up fits neatly into that discussion. It's contingent to it, it's pertinent. You don't get to determine something that fits relatively close in with what you were discussing prior (does not physical cache allocation on the L0$ level affect CU efficiency? I surely would think it does) just because it brings up a point you either didn't consider or, in light of being indicated, don't like.

Why would it not be possible to increase the L1$ size? These systems are at least on 7nm DUV Enhanced; even a few slight architectural changes here and there would allow for more budget to cache sizes. I'm not saying it's 100% a lock they did increase the L1$ size, just that it's premature to assume they did not, when they've already increased the L2$ amount.

Otherwise yes, it's true if the L1$ sizes are the same for both then Series X feeds more at a 20% reduced speed. But you won't be needing to access the L1$ frequently in the first place if you have more physical L0$ allowing for a higher amount of unique data to be retained in the absolute fastest cache pool available. And that's where, on same architectures, the larger GPU has a very clear advantage in; always have and always will (unless we're talking about GPUs of two different architectures where the smaller one has a much larger L1$, but that's not what we're dealing with here regards RDNA2. Only discrete GPUs I can think of doing this are some upcoming Intel Xe ones that are very L0$-happy).

It'd be really nice if we stopped confusing cache speed with cache bandwidth. IMO the former should pertain to overall data throughput measured in overall time (cycle) duration. The latter should pertain to single-cycle throughput, which is dependent on actual cache sizes. Assuming L1$ is the same, their bandwidth is the same and the speed advantage for a given graphical task getting crunched on the caches only starts showing a perceptible difference in favor for that with the faster clocks if a certain threshold of data processing for that task in the caches is done. We can apply this to the L0$ as well.

No, because if PC GPU benchmarks are anything to go by many, MANY reserve a chunk of the VRAM as just-in-case cache, even if the game isn't actually occupying the cache in that moment of time.

So you'd think smarter utilization of the VRAM by cutting down on the use of chunks of it as a cache would make better use of it...thankfully MS have developed things into XvA like SFS to enable that type of smarter utilization of a smaller VRAM budget. Sony has a great solution too; it's different to MS, but both are valid and make a few tradeoffs to hit their marks. At least regarding MS's, I don't think those tradeoffs are what you're highlighting here, going by extensive research into this.

You do realize the RAM still needs to hold the OS, CPU-bound tasks and audio data, correct? Realistically we're looking at 14 GB for everything outside of the OS reserve for PS5 (NX Gamer's brought up the whole idea of caching the data to the SSD before; not that it's a realistic option IMHO outside of some tertiary OS utilities seeing as how the vast bulk of critical OS tasks expect the speed and byte-level addressable granularity of volatile memory to work with), and if we're talking games with similar CPU and audio budgets on both platforms, at most you have 1 extra GB for the GPU on PS5 vs. Series X, but you sacrifice half a gig of RAM for CPU and audio-bound data.

Yes it does have a faster SSD but there's still a lot of aspects of the I/O data pathway that are apparently CPU-bound once the data is actually in RAM.

Glad we agree on this part.

Just because MS happens to have more TF performance doesn't mean they didn't aim for a balanced design target, either. This is a common misconception and comes from a binary mode of thinking, where everything's either a hard either/or. Console design is much more complicated than that.

You're talking as if the XSX is slow but wide, and with 4 shader arrays it's not that wide, is it? Also the path from L1 to L0 will be longer on the XSX on silicon.

The L1 and L2 sizes are very important, as is how they are arranged, their speed and their efficiency, since effective bandwidth = bandwidth / factor of cache misses.

You will note that the PC RDNA2 parts leaked so far have 10 CUs in each shader array, and the 80 CU part has 8 shader arrays of 10 CUs.

So XSX is certainly the odd arrangement; I suspect they wanted 4 shader arrays because of 4 server instances.

At Hot Chips MS said they could not reveal the L1 cache details or any process details when asked.

Let's see how it works, eh?
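
One way to read the effective bandwidth point above is as a hit-rate-weighted blend of cache and memory bandwidth. This is only a sketch of one simple model (a harmonic mean weighted by hit rate), not a formula from the post or from any vendor, and the bandwidth numbers are hypothetical.

```python
def effective_bandwidth(hit_rate: float, cache_gbps: float, memory_gbps: float) -> float:
    """Average bandwidth when a fraction `hit_rate` of bytes is served from
    cache and the rest falls through to slower memory."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    # Time to move one GB is the weighted sum of time spent at each level.
    seconds_per_gb = hit_rate / cache_gbps + miss_rate / memory_gbps
    return 1.0 / seconds_per_gb


# Hypothetical numbers, for illustration only
for hit in (0.5, 0.8, 0.95):
    print(hit, round(effective_bandwidth(hit, 2000.0, 500.0), 1))
```

In this model a few extra points of hit rate move the result much more than a modest change in raw bandwidth, which is why cache size and arrangement matter so much.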

nani17 · Oct 8, 2020

For all the people claiming they were called shills because of Xbox: no, it was because one guy claimed Halo was OK and wasn't that bad at all, it was just "lighting".

I wouldn't call that a shill, just a fan not seeing what 90% of people saw. Then, after the fact, the game was delayed because even they knew it was not up to par as a next-gen title. I believe it was the first time that I can recall people calling them out on something.

Again, just a bad review of what he saw.

Thirty7ven · Oct 8, 2020

nani17 said:
For all the people claiming they were called shills because of Xbox: no, it was because one guy claimed Halo was OK and wasn't that bad at all, it was just "lighting".

I wouldn't call that a shill, just a fan not seeing what 90% of people saw. Then, after the fact, the game was delayed because even they knew it was not up to par as a next-gen title. I believe it was the first time that I can recall people calling them out on something.

Again, just a bad review of what he saw.

Alex knows what he’s doing. He’s a PlayStation hater, and went on a Halo/Xbox defense because like the rest of the shills, they didn’t want to believe what they were and he had to go on defense.

Spider-Man RT? Oh but of course it’s because the hardware is weak, comparable to a 2060S... Dude is a bitch.

PaintTinJr · Oct 8, 2020

ethomaz said:
It has the same amount of ROPs (64).
TMUs are tied to the CU so it indeed has more.

But I thought in the hotchips presentation for the XsX there was someone mentioning that it is either TMUs or RT calcs, but not both at the same time, am I remembering that correctly?

LordOfChaos · Oct 8, 2020

PaintTinJr said:
But I thought in the hotchips presentation for the XsX there was someone mentioning that it is either TMUs or RT calcs, but not both at the same time, am I remembering that correctly?

Yes, each TMU does either a texture operation or a BVH per cycle

ethomaz · Oct 8, 2020

PaintTinJr said:
But I thought in the hotchips presentation for the XsX there was someone mentioning that it is either TMUs or RT calcs, but not both at the same time, am I remembering that correctly?

It is an RDNA 2 thing... the intersection units are part of the TMUs so they share the same silicon... if you use it for one task you can't use the same for another at the same time.

It is like Async Compute... if you are using the CU to render graphics you can't use it for Async Compute... only the unused CUs can be used for Async Compute.

thicc_girls_are_teh_best · Oct 8, 2020

geordiemp said:
You're talking as if the XSX is slow but wide, and with 4 shader arrays it's not that wide, is it? Also the path from L1 to L0 will be longer on the XSX on silicon.

The L1 and L2 sizes are very important, as is how they are arranged, their speed and their efficiency, since effective bandwidth = bandwidth / factor of cache misses.

You will note that the PC RDNA2 parts leaked so far have 10 CUs in each shader array, and the 80 CU part has 8 shader arrays of 10 CUs.

So XSX is certainly the odd arrangement; I suspect they wanted 4 shader arrays because of 4 server instances.

At Hot Chips MS said they could not reveal the L1 cache details or any process details when asked.

Let's see how it works, eh?

Probably because that stuff is at the discretion of AMD to disclose, not MS.

L1$ to L0$ path would be longer for the CUs further down the shader array, not necessarily those at the top. That's where it'd seem features like GPU snooping of CPU cache were put in place to help mitigate: cache misses.

No doubt Series X's arrangement is an unusual one, and I'd agree 4 SAs were probably determined for the Azure server usage. But does any of this have some perceivable impact on system performance when it comes to the actual games? From what I've been addressing on this page and from what info we have out there (official, partially related to concepts in discussion etc.), I don't think so.

The problem comes in when people start trying to peg some of this stuff to imply one side is lying about their performance metrics, or that they can't "really" hit certain targets after all. These companies have some of the smartest engineers in the world working for them; they would know of shortcomings in a fraction of the time it takes any of us to even conceive of them, and will have worked around them.

Zathalus · Oct 8, 2020

geordiemp said:
We disagree, that's fine. PS4 Pro had nearly everything in boost mode; Sony were just covering the odd one. I can't think of any off the top of my head.

XSX was not cut down in CUs, and it was running DX code; emulating or abstract API is a fine line of definition as it's the MS way of operation.

Emulating is just wrong; it's x86 and DX12, so no, just no.

Saying GCN game code is different to RDNA game code is why that dev mocked, and so do I.

Ok, whatever. Microsoft has already explained that it's not taking advantage of any IPC gains, or any of the newer features of the XSX, so if you want to spread FUD go right ahead.

Nhranaghacon · Oct 8, 2020

LordOfChaos said:
By theoretical, I mean it's a calculated measure of the peak output of the system. You can't seriously say there's never any difference between a paper calculation of all of ALUs * clock speed * 2 ops per ALU per cycle, vs a graphics chips output in real world performance. Are there never differences in cache contention? Hit rates? Bandwidth? Schedulers? Keeping the ALU's filled with relevant non redundant work? THAT is the difference between a measurement of the peak flops and real end performance, otherwise why do you think performance between generations and architectures can't be compared purely by looking at peak flops?

Vega killed Nvidia, because more flops? Right? Right? No, because...No.

No one is saying one teraflop is a different number than what it is. What everyone DOES understand is a paper figure is one thing, applying the performance to real world graphics is another, and the difference between the two would be the utilization rate and various other factors.

Yes and you've within your first statement and then later statement attempted to re-frame that a teraflop only theoretically performs 4 trillion calculations and the difference is in "real end performance".

A teraflop is currently always capable of 4 trillion calculation's per second "real end performance" no matter how you frame it.

What you are attempting to say is, applying 1 teraflop/4 trillion calculation's a second to real world performance isn't the issue when there are all these other overhead factors mitigating performance.

I understand that.

What I am saying to you is that a peak performance rating in Teraflops - Teraflops that can be used for AI/Objects on Screen/High Polygon Counts or, as you wrongly cited, only shader performance - is in fact the only calculative REAL WORLD performance metric that matters in relation to graphics. Period.

Overclock the cpu/gpu frequencies? Add another half teraflop/whole teraflop to the overall performance ceiling, as that is the correct metric by which to measure real world performance.

Faster ssd added? Does not affect the overall polygon output or graphical real world performance. Better optimized ALUs added? Watch software optimizations without those roadblocks bolster 4 trillion calculations to 7 trillion calculations of real world performance.

You are saying essentially, do not pay attention to Teraflop performance because.... ultimately software will only EVER INCREASE teraflop performance so Teraflop Performance has somehow lost its meaning in Real World - "Real End" performance.

And I am pointing out that is a fallacy.

I personally do in fact compare performance between generations/architectures by looking specifically at Teraflop performance - and then consider what these improved architectures mean for software performance IN RELATION TO TERAFLOP performance. Higher, faster rated memory often does not affect teraflop performance directly - but as a computer scientist and overclocker for about 20 years now - faster memory does allow you to overclock the cpu and gain more Teraflop performance. Teraflops are real world, real "end" performance metrics. And this is what ALL SANE GAME DEVELOPERS AND COMPUTER SCIENTISTS look at when determining how much to simulate/and what type of game they can create aside from the benefits of near instantaneous ssd loadtimes.

"I'm going to make a 12 teraflop simulation - well that simulation will still work on older architectures that are 12 teraflops but these newer 12 Teraflop GPUs will certainly only become more powerful as The Law of Accelerating Returns allows a teraflop's importance to only climb in performance"

You are seriously saying "IF Teraflops are the main performance factor - then why release updated architectures with the same amount of teraflops"

And as I cited originally - this is because through more efficient software optimization - the concrete metric of 1 Teraflop = 4 trillion calculations per second - on these new architectures - means the significance of that single teraflop, the only true indicator of actual REAL WORLD PERFORMANCE - can be increased ad infinitum through better software (just because hardware utilizes software optimization to improve Teraflop performance does not negate teraflop performance) once these faults in architecture have been corrected. Essentially making 4 trillion calculations turn into 135 trillion calculations.

And this metric is the only metric that matters in relation to graphical fidelity. All roads lead to Teraflop performance. Whether older architectures with the same amount of teraflops as newer architectures can utilize methods that only increase one metric - teraflop performance - is another discussion entirely.

martino · Oct 8, 2020

nani17 said:
For all the people claiming they were called shills because of Xbox: no, it was because one guy claimed Halo was OK and wasn't that bad at all, that it was just "lighting".

I wouldn't call that a shill just a fan not seeing what 90% of people say. Hence after the fact, the game was delayed because even they knew it was not up to par as a next-gen title. I believe it was the first time that I can recall people called them out on something.

Again just a bad review on what he saw

It's too early for this kind of revisionism, and the other video about Halo, the one you're leaving out so that this can be somewhat true, will not disappear either.
Listen to the opinion they all agree on here:

The "at least that" is from the infamous one, btw, and then they also complain about no RT, of course.

Azurro · Oct 8, 2020

NXGamer said:
MS noted this and made the hot swap SSD which is liked a lot.

The info on the Infinity Cache and the PS5 "scrubbers", once AMD share it, will likely not be a happy moment for some either.

Hi, I was wondering if you had a bit more detail, is there some important performance improvement due to the PS5 Cache scrubbers? Is it the same technology as the Infinity Cache?

LordOfChaos · Oct 8, 2020

Nhranaghacon said:
I personally do in fact compare performance between generations/architectures by looking specifically at Teraflop performance - and then consider what these improved architectures mean for software performance IN RELATION TO TERAFLOP performance. Higher, faster rated memory often does not affect teraflop performance directly - but as a computer scientist and overclocker for about 20 years now - faster memory does allow you to overclock the cpu and gain more Teraflop performance. Teraflops are real world, real "end" performance metrics. And this is what ALL SANE GAME DEVELOPERS AND COMPUTER SCIENTISTS look at when determining how much to simulate/and what type of game they can create aside from the benefits of near instantaneous ssd loadtimes.

You sure write a lot while saying little. If me calling ALUs shaders is your gotcha, have it, I was around when they unified pipelines and it's a force of habit.

Let me focus on this as the meat of what you said. Real world, end performance. Yes. Those measurements are absolutely real. But that's not what console wanking numbers are, no one has measured the output of these things outside of closed doors yet. As in, what we're talking about is only the peak use of all hardware on paper.

And as I cited originally - this is because through more efficient software optimization - the concrete metric of 1 Teraflop = 4 trillion calculations per second - on these new architectures

And it's not all software. When a pipeline stalls for work, or a cache miss, that increases the gap between what an architecture can do on paper, and what it really does. Why, in your words, did AMD architectures underperform Nvidia per flop, at least until the near future, if then? Why Nvidia's focus on command processors keeping the ALU utilization rate up, to historically the detriment of peak flops? Software is never going to magically use every flop available and jump past every architectural bottleneck, that's just magical thinking.

magnumpy · Oct 8, 2020

this is the worst thing that's ever happened

Elog · Oct 8, 2020

thicc_girls_are_teh_best said:
This is a reductionist way of viewing load scheduling for task deployment among the CUs. If the scale of the work is not predicated on a certain amount of unique data to be distributed among those blocks of CUs, then the advantage favors a system that can run more parts of the instruction in parallel than the one that runs them faster, because the latter predicates itself on previous parts of the data pipeline for the instruction to be calculated beforehand.

Computer science 101.

Actually when you think about it, the concern itself is invalid in the first place. Try applying this to the CPU space 15 years ago: by that logic, we should've simply kept cranking up the clocks. But engineers realized a reality when it comes to pushing clocks too high. At some point, parallelism wins out as a design metric. We are starting to see the benefits of that from AMD themselves, going with larger discrete GPUs, never mind Nvidia or Intel who are in similar pursuit (especially the latter). So I'd say market realities would indicate that these companies, including AMD, have addressed the vast majority of concerns regarding frontend saturation of their GPU resources. After all, you kind of need to have done that in order to further pursue things such as chiplets, which AMD are rumored to be doing for RDNA3 (no word from Nvidia on that front yet; both trail behind Intel in the chiplet area for GPUs however).

You do realize that you do not even contradict me with your statements?

Of course there is value in adding cores. I simply pointed out that each additional core offers less of a performance boost than the previously added core. The performance discount increases with how hard a specific task is to parallelize. Fortunately, graphical pipeline tasks are relatively easy to parallelize. Let us make an arbitrary example. Let's assume that a graphical task has 2% of workflow that cannot be parallelized (probably in the vicinity of the truth). Let us assume that each CU adds 1 TFLOP in theoretical performance (ofc silly - just for simplicity).

First CU adds 1 TFLOP. Second CU adds 0.98 TFLOP.

Already here you can conclude that going wide is good. You would need to increase frequency by a whopping 98% to compete with adding one more CU - going wide is your choice.

Third CU would add 0.96 TFLOP....tenth CU would add 0.82 TFLOP.

At 10 CUs you would list your GPU as having a theoretical max of 10 TFLOPs in performance. However, that number can never be achieved. Even under perfect circumstances you will only get 9.15 TFLOPs, because that 2% of your GPU workflow cannot be parallelized, even though your technology specifications list 10 TFLOPs. Most people do not realize this hard truth. The more CUs, the wider the gap is between that theoretical TFLOP number and your actual TFLOP performance.
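
The arithmetic in this example can be checked with a short Python sketch. The function names are hypothetical. Two readings of "each added CU is worth 2% less" are shown, because the per-CU figures quoted above (0.98, 0.96, ... 0.82) follow a linear drop while the 9.15 total matches a compounding 2% drop. Amdahl's law, the textbook formula for a fixed serial fraction, is included for comparison and gives a lower figure still. None of this models a real GPU; it only reproduces the toy example.

```python
def linear_total(n_units: int, loss_per_unit: float) -> float:
    """Unit k (counting from 0) contributes 1 - loss * k: 1.00, 0.98, 0.96, ..."""
    return sum(1.0 - loss_per_unit * k for k in range(n_units))


def compounding_total(n_units: int, loss_per_unit: float) -> float:
    """Unit k contributes (1 - loss) ** k: 1.00, 0.98, 0.9604, ..."""
    return sum((1.0 - loss_per_unit) ** k for k in range(n_units))


def amdahl_speedup(n_units: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup over one unit when a fixed fraction stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_units)


# Toy numbers from the post: 10 CUs, 1 TFLOP each on paper, 2% non-parallel work.
cus, loss = 10, 0.02
print(f"linear drop:      {linear_total(cus, loss):.2f} TFLOPs")       # 9.10
print(f"compounding drop: {compounding_total(cus, loss):.2f} TFLOPs")  # 9.15
print(f"Amdahl's law:     {amdahl_speedup(cus, loss):.2f} TFLOPs")     # 8.47
```

Whichever reading you pick, the conclusion of the example is the same: the total lands below the 10 TFLOPs on the spec sheet, and the shortfall grows as units are added.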

Every manufacturer is sitting with these trade-offs. Adding CUs vs increasing frequency. Increasing frequency also requires faster memory and a better cooling solution - both add cost. It is a complex equation.

Your assumption that cache management is well handled is very much contradicted by research papers in this field - there is a lot to get from better task distribution and cache management. Real GPU utilization is not even close to 100% during rendering as it is today regardless of what your hardware monitor is stating.

thicc_girls_are_teh_best said:
Okay...so, where does this suddenly mean only PS5 sees this "good cache management"? Again, it's binary thinking: A chose to go with X, so A cannot have also chosen Y. That type of stuff. No company is making their decisions that way, it's foolish to pretend that they are.

Both systems have smart cache management. We already know Series's GPU can snoop the CPU caches, and the inverse is true as well (but software-only in that case). Technically speaking, if the GPU can snoop the CPU caches, that is an analogous approach in cache management comparable to cache scrubbers. In many ways they're attempting to resolve the same issues when it comes to stale data in the GPU caches to sync with correct data in those caches while attempting to cut down on flushing the entire caches or taking a hit going back to main memory for the data to copy back into the caches.

So again, there's really very little "either/or" hard massive compromises on either system. They're taking smart approaches in every part of the pipeline borrowing bits from anything relevant. The sooner this is accepted as a common-sense conclusion I think a lot of the FUD WRT either would die off drastically. But until then, we'll just keep debating where we feel it's merited.

We know some things. We know that PS5 has hardware pieces to manage caches that the XSX does not have. Then we have indications (we need more data) that seemingly the PS5 has more L2 cache per CU than the XSX and unless the XSX has increased the L1 cache size compared to the standard RDNA2 design, the PS5 also has more L1 cache per CU compared with the XSX. L0 caches are most likely identical.

So there is data to indicate that the PS5 has a better cache management system than the XSX: Larger cache per CU at the L1 and L2 levels, hardware features to manage those caches and increased cache memory frequency. As you rightfully point out the XSX has a system to use the CPU caches as an 'L3' cache - however that is not the same as the cache scrubbers. Cache scrubbers keep track of which data can be purged at the cache level to increase cache turnaround times. Using the CPU cache is a cache overflow system. Two different things.

What the XSX has to its advantage is the increased VRAM to cache memory bandwidth for the first 10GB. That is important and a clear advantage.

In reality, putting all this together, the PS5 will perform closer to the theoretical max TFLOP number than the XSX. How much closer? Only tests with applications that are optimized on both will tell the story. And as I have stated before, given the new GE in the PS5 I believe the first multi-plats will perform better on the XSX simply because of an underutilized GE on the PS5 - I might be wrong of course but that is what I assume.

Clear · Oct 8, 2020

Nhranaghacon said:
The correct contextual performance metric for a single teraflop is 4 trillion calculations a second.

Attempting to redefine/confuse the teraflop metric by citing CU counts is worthless, even when it takes fewer CUs to produce more Teraflops. Teraflops are the metric that matters - CU counts will always effectively grow or shrink with architecture improvements - as needed - but should not be considered the relevant performance metric in relation to Teraflop Performance.

A Teraflop is not a theoretical metric. A Teraflop is a concrete measurement of 4 trillion calculations. Period. To imply a teraflop in theory might perform 4 trillion calculations a second is to imply it really isn't capable of 4 trillion calculations in most cases.

Also, a Teraflop does much more than shader performance. A single Teraflop is also indicative of pure polygon count. As you wrongly cited, "It also only looks at shader theoretical performance alone". And polygon counts are not shader performance - shaders are derivative texturing/overlay methods that allow the artists to do more with less, and a Teraflop in fact affects all objects on screen, not just shader performance.

Please see here where I predicted both Consoles' performance in Teraflops (and calculated/added CPU teraflop performance for both systems - which is exactly 1 teraflop with each CPU) exactly 2 years ahead of their reveal using my expertise in computer science and cgi creation to learn more.

Peak performance, by whatever metric you choose, be it Tflops, triangles/sec, etc., is always theoretical because it assumes full resource utilization at the time of measurement.

It's a rough measure at best because real-world workloads are never the same operation x million times; it's a huge variety of things, each with their own contingencies and propensity for bottle-necking/stalling. Hence you can have massive potential Teraflop counts but see less actual performance in practice because other parts of the pipeline are lowering occupancy.

There are real, practical benefits to trying to methodically eliminate inefficiencies across the board, versus simply ramping up raw capacity. The perception problem is that on PCs particularly, the overall system architecture is typically mostly homogeneous, especially for benchmarking purposes, allowing for variances to be tied down to a narrow set of metrics. Hence the fixation on rough stats like Tflop count as the "true" measure of performance, when all it's actually demonstrating is that when all else is equal this parameter has the most pronounced impact.

P.S. Thanks for the laugh with that giant Reddit thing. Most outlandish fan-fic I've read in a while. MS's business model is real simple: Sell subscriptions to as wide an audience as possible. The end.

geordiemp · Oct 8, 2020

Zathalus said:
Ok whatever, Microsoft has already explained that it's not taking advantage of any IPC gains, or any of the newer features of the XSX, so if you want to spread FUD go right ahead.

No, they said it's not using RDNA2 features, such as VRS and mesh shading, which have to be programmed for specifically. Or SFS for that matter. That is correct. If that is loosely an IPC gain, that's marketing I guess.

However, anything that is done automatically will be used when called on by the APIs, as that's what they do: control the hardware and tell it what to do when a program asks for a task.

Which things have I said that are made up?

Just because it might go against your thoughts does not mean it is FUD as I have made nothing up.

Nhranaghacon · Oct 8, 2020

LordOfChaos said:
You sure write a lot while saying little. If me calling ALUs shaders is your gotcha, have it, I was around when they unified pipelines and it's a force of habit.

Let me focus on this as the meat of what you said. Real world, end performance. Yes. Those measurements are absolutely real. But that's not what console wanking numbers are, no one has measured the output of these things outside of closed doors yet. As in, what we're talking about is only the peak use of all hardware on paper.

And as I cited originally - this is because through more efficient software optimization - the concrete metric of 1 Teraflop = 4 trillion calculations per second - on these new architectures

And it's not all software. When a pipeline stalls for work, or a cache miss, that increases the gap between what an architecture can do on paper, and what it really does. Why, in your words, did AMD architectures underperform Nvidia per flop, at least until the near future, if then? Why Nvidia's focus on command processors keeping the ALU utilization rate up, to historically the detriment of peak flops? Software is never going to magically use every flop available and jump past every architectural bottleneck, that's just magical thinking.

AMD underperformed Nvidia due to architecture that did not utilize the software properly.

Teraflop performance is the only metric that any of the architecture improvements you've cited (more efficient ALUs in the architecture, less redundancy in the cache, etc.) are meant to fix/alleviate. Architecture improvements do this solely by improving the ability to utilize a Teraflop more effectively. And these enhancements always subsequently do one thing - and that is improve the amount of calculations a Teraflop is capable of.

Inferior architecture with equal Teraflops will always underperform Superior Architecture with equal Teraflops, as the software suddenly dictates how efficiently these Teraflops are utilized across both inferior and superior parts.

But ultimately, both ecosystems will be capable of the same amount of instructions - however the inferior architecture will not increase the amount of instructions it can deal with.

So 12 teraflops will remain capable of outputting 12 teraflops of Performance once brute force software optimization has fixed any perceivable issues.

However that same architecture will not be able to provide INCREASED teraflop performance due to limitations in architecture - meaning the superior Nvidia Part will eventually - through the utilization of software optimization - be capable of rendering twice as many, or 24 teraflops' worth of instructions, due to its superior architecture. The other, inferior architecture will simply be stuck delivering 12 Teraflops of instructions.

I'm going to highlight a flaw in what you've stated below and update it

"And it's not all software. When a pipeline stalls for work, or a cache miss, that increases the gap between what an architecture can do on paper, and what it really does."

to say

"If an architecture is inefficient and/or broken - it suddenly renders the Teraflop metric an inaccurate form of performance measurement and thus that metric should be ignored"

This is what you are really attempting to say here.

Yes I assure you - we are all aware that inferior architecture can render the teraflop useless -

What you are doing is wrongly inferring we should ignore teraflop performance due to possible perceived imperfections in architecture. Or in other words: that CU could be bad, games don't even run with bad Compute Units without crashing - let's change the narrative and move away from teraflop performance and focus on other aspects of the architecture!

As if all of these architecture improvements do not in fact hinge on improving one thing - Teraflop performance.

alabtrosMyster · Oct 8, 2020

thicc_girls_are_teh_best said:
Computer science 101.

You should get reimbursed, here is an extract from some research paper on the subject:

We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.

Nhranaghacon · Oct 8, 2020

Clear said:
Peak performance, by whatever metric you choose, be it Tflops, triangles/sec, etc., is always theoretical because it assumes full resource utilization at the time of measurement.

It's a rough measure at best because real-world workloads are never the same operation x million times; it's a huge variety of things, each with their own contingencies and propensity for bottle-necking/stalling. Hence you can have massive potential Teraflop counts but see less actual performance in practice because other parts of the pipeline are lowering occupancy.

There are real, practical benefits to trying to methodically eliminate inefficiencies across the board, versus simply ramping up raw capacity. The perception problem is that on PCs particularly, the overall system architecture is typically mostly homogeneous, especially for benchmarking purposes, allowing for variances to be tied down to a narrow set of metrics. Hence the fixation on rough stats like Tflop count as the "true" measure of performance, when all it's actually demonstrating is that when all else is equal this parameter has the most pronounced impact.

P.S. Thanks for the laugh with that giant Reddit thing. Most outlandish fan-fic I've read in a while. MS's business model is real simple: Sell subscriptions to as wide an audience as possible. The end.

Teraflops are not a rough measurement of performance - a 12 teraflop card will always be able to utilize 12 teraflops of performance barring an existential error in hardware that in fact renders that part useless. Whereas a 12 Teraflop card with superior architecture will allow you to in fact improve Teraflop performance radically. Period.

And I'm not sure how exactly announcing the console specs accurately for both Xbox Series X and Playstation 5 - 2 whole years before they were announced - with the inclusion of also factoring in the CPU teraflop performance on top of GPU performance - is a fanfic thing unless... you're a hardcore digital foundry fan - *cringe* but that's your prerogative I suppose.

And this was written 2 years before gamepass was public. Include a mention of gamepass and cite the acquisition (not creation) of new IPs and I'd say it's pretty accurate.

Zathalus · Oct 8, 2020

geordiemp said:
No, they said it's not using RDNA2 features, such as VRS and mesh shading, which have to be programmed for specifically. Or SFS for that matter.

Anything that is done automatically will be used when called on by the APIs.

Which things have I said that are made up?

Just because it might go against your thoughts does not mean it is FUD, that's called giving up the debate.

Xbox Series X back-compat tested: up to double the performance in the most demanding games

UPDATE: Sometimes, we just can't leave well enough alone. I returned to Xbox Series X to take a look at more Xbox One a…

www.eurogamer.net

while Series X runs old games with full clocks, every compute unit and the full 12 teraflop of compute, it does so in compatibility mode - you aren't getting the considerable architectural performance boosts offered by the RDNA 2 architecture.

There may be some consternation that Series X back-compat isn't a cure-all to all performance issues on all games, but again, this is the GPU running in compatibility mode, where it emulates the behaviour of the last generation Xbox - you aren't seeing the architectural improvements to performance from RDNA 2, which Microsoft says is 25 per cent to the better, teraflop to teraflop. And obviously, these games are not coded for RDNA 2 or Series X, meaning that access to the actual next-gen features like variable rate shading or mesh shaders simply does not exist.

"Compatibility mode", "emulates", and "not coded for RDNA 2", it literally can't get any clearer than that.

thicc_girls_are_teh_best · Oct 8, 2020

alabtrosMyster said:
You should get reimbursed, here is an extract from some research paper on the subject:

We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.

Wow, a whole one research paper! How many people are in this field, again? How long ago was this paper you quoted written? Was it in reference to specific commercial GPUs (highly unlikely), or generic theoretical GPU models (more likely)?

Might want to consider these things before posting them carte blanche. Like I said, GPU saturation has traditionally been a problem vendors have sought to resolve over the years, but today's cards from today's vendors have pushed closer to solving this particular issue than any generation of the past.

Eventually, GPUs will perfectly match the parallelized function of modern-day multi-core CPUs; right now they are on the cusp of it, some more than others. You would hope AMD are one of those at the edges of being there; after all that would benefit both pieces of plastic at the end of the day.

Elog said:
You do realize that you do not even contradict me with your statements?

Of course there is value in adding cores. I simply pointed out that each additional core offers less of a performance boost than the previously added core. The performance discount increases with how hard a specific task is to parallelize. Fortunately, graphical pipeline tasks are relatively easy to parallelize. Let us make an arbitrary example. Let's assume that a graphical task has 2% of workflow that cannot be parallelized (probably in the vicinity of the truth). Let us assume that each CU adds 1 TFLOP in theoretical performance (ofc silly - just for simplicity).

First CU adds 1 TFLOP. Second CU adds 0.98 TFLOP.

Already here you can conclude that going wide is good. You would need to increase frequency by a whopping 98% to compete with adding one more CU - going wide is your choice.

Third CU would add 0.96 TFLOP....tenth CU would add 0.82 TFLOP.

Okay, we gotta stop right here, because this is literally not how GPUs work. Each added CU for an AMD card does not arbitrarily add less compute performance. Without factoring in clocks or ROPs or IPC, each CU has exactly the same calculation capability.

Those other three things influence their peak TF performance, as a whole. However everything comes down to parallelization of the taskwork issued among the CUs of the GPU. GPUs are embarrassingly good at parallelized workloads, they are super-DSPs essentially, tuned for three-dimensional graphical data crunching.

At 10 CUs you would list your GPU as having a theoretical max of 10 TFLOPs in performance. However, that number can never be achieved. Even under perfect circumstances you will only get 9.15 TFLOPs, because that 2% of your GPU workflow cannot be parallelized, even though your technology specifications list 10 TFLOPs. Most people do not realize this hard truth. The more CUs, the wider the gap is between that theoretical TFLOP number and your actual TFLOP performance.

I think most of us discussing this are well aware of that. That's why they're called TFLOPs; theoretical floating point operations per second. It's in the term itself. What we do know is that RDNA2 has much better frontend saturation and utilization than RDNA1, otherwise there would be no reason to push for larger GPUs.

Going by your logic, AMD should be listing Big Navi as only "really" capable of 2/3 of its TF claims. The truth is, performance per CU does not scale as linearly or as neatly as your example wants to frame it and it also ignores advances in GPU technology as well as AMD's own advances in the RDNA2 architecture design.

Every manufacturer is sitting with these trade-offs. Adding CUs vs increasing frequency. Increasing frequency also requires faster memory and a better cooling solution - both add cost. It is a complex equation.

Your assumption that cache management is well handled is very much contradicted by research papers in this field - there is a lot to get from better task distribution and cache management. Real GPU utilization is not even close to 100% during rendering as it is today regardless of what your hardware monitor is stating.

You're also dealing with a gaming landscape (3P PC games) where nothing is coded to a specific GPU hardware spec, but rather a base range, with various things like high-res textures and effects to push performance on the top-tier cards. So not exactly a fair reference point here.

I even said in another post that a lot of games reserve a chunk of the VRAM as a cache; you can glean from that I would be in agreement with the general idea that cache management could be improved going forward. I just disagree with your implied notion only one of these two systems has an answer for it.

We know some things. We know that PS5 has hardware pieces to manage caches that the XSX does not have.

You mean like the cache scrubbers? Which aren't even necessarily the only way to selectively evict parts of data in the cache? Did you guys not read through the Series X Hot Chips presentation? The GPU can snoop the CPU caches, why do you think that ability is there if for no other reason than to aid in cache management and efficiency of use?

Then we have indications (we need more data) that seemingly the PS5 has more L2 cache per CU than the XSX and unless the XSX has increased the L1 cache size compared to the standard RDNA2 design, the PS5 also has more L1 cache per CU compared with the XSX. L0 caches are most likely identical.

More L2 cache per CU != more L2$. The way you guys are looking at this argument is dangerously close to how some try framing the "average bandwidth" numbers for Series X, weighing the fast and slow pools of GDDR6 as an even average, which would never be anywhere near the ratio in real-world use.
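
To show why an even "average bandwidth" figure can mislead, here is a rough Python sketch. The pool sizes and speeds (10 GB at 560 GB/s, 6 GB at 336 GB/s) are Microsoft's published Series X figures and are not stated in this thread, so treat them as an assumption here. The function names and the 90% traffic share are hypothetical, and the model is deliberately crude: it assumes accesses to the two pools never overlap, so transfer times simply add up.

```python
def capacity_weighted_average(pools):
    """Naive average: weight each pool's bandwidth by its size in GB."""
    total_gb = sum(size_gb for size_gb, _ in pools)
    return sum(size_gb * bw for size_gb, bw in pools) / total_gb


def traffic_weighted_bandwidth(fast_bw, slow_bw, fast_fraction):
    """Effective GB/s when `fast_fraction` of all bytes moved hit the fast pool.

    Simplifying assumption: transfers are serialized, so the time per GB
    is the traffic-weighted sum of the time per GB in each pool.
    """
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    time_per_gb = fast_fraction / fast_bw + (1.0 - fast_fraction) / slow_bw
    return 1.0 / time_per_gb


# Assumed Series X pools: (size in GB, bandwidth in GB/s).
pools = [(10, 560.0), (6, 336.0)]
print(f"capacity-weighted average: {capacity_weighted_average(pools):.0f} GB/s")  # 476

# Hypothetical case: 90% of GPU traffic goes to the fast pool.
print(f"90% of traffic in fast pool: {traffic_weighted_bandwidth(560.0, 336.0, 0.90):.0f} GB/s")  # 525
```

The capacity-weighted number only describes real use if traffic happens to split in proportion to pool size, which is exactly the ratio being disputed above.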

Framing this as a per-CU argument also contradicts your own insistence on GPU saturation issues you mentioned earlier; your arbitrary example isn't good enough to stave this off. There's nothing to state that frontend saturation suddenly arbitrarily jumps up on a smaller GPU just because it's smaller; it's not like the frontend is redesigned for one size of GPU vs. another.

The L0$ sizes per CU on both may be the same, but the Series X would have more by virtue of having more CUs.
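
The per-CU versus total distinction is easy to put in numbers. The Python sketch below uses assumed figures, not anything confirmed in this thread: 52 and 36 active CUs are the publicly stated counts, 5 MB of L2 for Series X comes from its Hot Chips presentation, while 4 MB of L2 for PS5 and 16 KB of L0 per CU are placeholders based on standard RDNA parts. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuCacheConfig:
    name: str
    compute_units: int
    l0_kb_per_cu: float  # assumed identical on both consoles
    l2_kb_total: float   # shared across the whole GPU

    @property
    def l0_kb_total(self) -> float:
        # L0 is private to each CU, so the total scales with CU count.
        return self.compute_units * self.l0_kb_per_cu

    @property
    def l2_kb_per_cu(self) -> float:
        # L2 is shared, so a per-CU figure is only a ratio, not a private slice.
        return self.l2_kb_total / self.compute_units


# Assumed, unconfirmed figures for illustration only.
xsx = GpuCacheConfig("XSX", compute_units=52, l0_kb_per_cu=16, l2_kb_total=5 * 1024)
ps5 = GpuCacheConfig("PS5", compute_units=36, l0_kb_per_cu=16, l2_kb_total=4 * 1024)

for gpu in (xsx, ps5):
    print(
        f"{gpu.name}: L0 total {gpu.l0_kb_total:.0f} KB, "
        f"L2 total {gpu.l2_kb_total:.0f} KB, "
        f"L2 per CU {gpu.l2_kb_per_cu:.1f} KB"
    )
```

Under these assumptions both claims in the argument hold at once: the smaller GPU has more L2 per CU, and the larger GPU has more L2 and more L0 in total. Which of the two matters more for real workloads is the part the numbers alone cannot settle.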

So there is data to indicate that the PS5 has a better cache management system than the XSX: Larger cache per CU at the L1 and L2 levels, hardware features to manage those caches and increased cache memory frequency

I've already illustrated why selectively narrowing the focus on L2$ to a per-CU figure is erroneous: you would still have some of the same supposed frontend saturation issues on a smaller GPU as on a larger one, since the frontend design stays consistent between GPUs of the same architecture. But I'll just leave it there to keep the reminder in place.

Also, again, this is offset through Series X having more L0$ in the GPU, which would have the shortest latency (1 cycle) out of any of the caches. You're assuming Series X has no hardware features for cache management because of no mention of cache scrubbers, but forget about GPU snooping. Speaking of...

. As you rightfully point out the XSX has a system to use the CPU caches as an 'L3' cache - however that is not the same as the cache scrubbers. Cache scrubbers keep track of what data that can be purged at the cache level to increase cache turnaround times. Using the CPU cache is a cache overflow system. Two different things.

1: Who said it only pertains to the CPU's L3$? Is that you assuming something just because?

2: I never said they were the same. I said they were analogous. And, I also stated that such a feature would not have been built in the system if it weren't aimed at increasing cache management.

When you combine either system's approaches here with the software/API-level features needed to ensure stability and effectiveness at the kernel level, at the end of the day we have to ask whether this is essentially meaningless cherry-picking. I'm almost certain both systems will have robust cache management features in place.

What the XSX has to its advantage is the increased VRAM to cache memory bandwidth for the first 10GB. That is important and a clear advantage.

In reality, putting all this together, the PS5 will perform closer to the theoretical max TFLOP number than the XSX.

This reads like an assumption pulled out the bum rather than anything that can be arrived at through sensible, neutral analysis of data and market realities. Both systems are using the same fundamental GPU architecture. Both have methods of ensuring the caches are being smartly managed. Both have their own choice customizations to their GPUs.

Insisting one gets drastically closer to its TF peak than the other shows that, for starters, you never really shifted away from a "power matters!" perspective on these discussions; you've just found a more elegant way to frame that type of viewpoint. Secondly, it usually means ignoring any of the design language and optimizations the other product has put out. And thirdly, it ignores various market realities.
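For reference, the "theoretical max TFLOP number" being argued over is plain arithmetic. A minimal sketch using the published CU counts and clocks (52 CUs at 1.825 GHz for Series X, 36 CUs at up to 2.23 GHz for PS5), with 64 shader lanes per CU and 2 FLOPs per lane per clock (one fused multiply-add). The function names are hypothetical, and nothing here says what utilization either console actually reaches.

```python
LANES_PER_CU = 64              # shader lanes per compute unit
FLOPS_PER_LANE_PER_CLOCK = 2   # one fused multiply-add counts as 2 FLOPs


def peak_tflops(compute_units, clock_ghz):
    """Theoretical FP32 peak in TFLOPs for a given CU count and clock."""
    gflops = compute_units * LANES_PER_CU * FLOPS_PER_LANE_PER_CLOCK * clock_ghz
    return gflops / 1000.0


def break_even_utilization(larger_peak, smaller_peak):
    """Fraction of its peak the larger GPU must sustain to match the
    smaller GPU running at 100% of its own peak."""
    return smaller_peak / larger_peak


XSX_PEAK = peak_tflops(52, 1.825)
PS5_PEAK = peak_tflops(36, 2.23)  # 2.23 GHz is the PS5's maximum clock

if __name__ == "__main__":
    print(f"XSX peak: {XSX_PEAK:.2f} TF")
    print(f"PS5 peak: {PS5_PEAK:.2f} TF")
    print(f"XSX break-even utilization: "
          f"{break_even_utilization(XSX_PEAK, PS5_PEAK):.1%}")
```

The break-even figure is the bar any "closer to peak" claim has to clear: the efficiency gap between the two would need to be at least that large before the paper advantage disappears.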

How much closer? Only tests with applications that are optimized on both will tell the story. And as I have stated before, given the new GE in the PS5 I believe the first multi-plats will perform better on the XSX simply because of an underutilized GE on the PS5 - I might be wrong of course but that is what I assume.

You've been doing a lot of assuming; that's kind of part of the problem.

Anyway, I've also touched on this in other replies up near the top, so I'm not very interested in treading old ground. Multiplat performance will be generally close between the systems, but it won't be due to developers "not understanding the magic" of Sony's GE. That particular rumor was kicked around by MLID anyway, who is something of an ardent Sony fan when it comes to consoles, which tends to skew his console-based analysis and perspectives.

Some PS5 multiplats may run better in the launch phase, but that would probably be mainly due to the fact that GameCore devkits for MS were "running behind", fitting some of the earlier rumors. But after that phase I don't think it should be surprising to see Series X 3P games routinely running better, with some exceptions here and there.

The question is what that range will be. For some, it'll probably be virtually imperceptible. For others, somewhat more noticeable. And for others still, it may be quite noticeable in a given area or two. But results are going to look quite crisp and polished on both systems regardless.

Elog · Oct 8, 2020

thicc_girls_are_teh_best said:
Wow, one whole research paper! How many people are in this field, again? How long ago was this paper dated? Was it in reference to specific commercial GPUs (highly unlikely), or generic theoretical GPU models (more likely)?

Might want to consider these things before posting them carte blanche. Like I said, GPU saturation has traditionally been something vendors have sought to resolve over the years, but today's cards from today's vendors have pushed closer to solving this particular issue than any generation of the past.

Eventually, GPUs will perfectly match the parallelized function of modern-day multi-core CPUs; right now they are on the cusp of it, some more than others. You would hope AMD are one of those at the edges of being there; after all that would benefit both pieces of plastic at the end of the day.

Seriously Thicc - he is 100% right. And even in the example of tasks that are relatively easy to parallelize Amdahl's law still applies (i.e. each additional computational unit adds less than the previous one as a function of how well the task(s) is suited for parallelization).
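A quick sketch of that diminishing return, using the standard Amdahl's law formula, speedup = 1 / ((1 - p) + p / n). The parallel fraction `p = 0.95` in the usage lines is a made-up illustrative value, not a measurement of any GPU workload, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, units):
    """Amdahl's law: speedup over a single unit when `parallel_fraction`
    of the work scales perfectly across `units` and the rest is serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / units)


def marginal_gain(parallel_fraction, units):
    """Extra speedup from adding one more unit on top of `units`."""
    return (amdahl_speedup(parallel_fraction, units + 1)
            - amdahl_speedup(parallel_fraction, units))


if __name__ == "__main__":
    p = 0.95  # hypothetical parallel fraction, for illustration only
    for n in (1, 8, 36, 52):
        print(f"{n:>2} units: speedup {amdahl_speedup(p, n):.2f}, "
              f"next unit adds {marginal_gain(p, n):.3f}")
```

With any `p` below 1, the speedup is capped at 1 / (1 - p) no matter how many units are added, and each extra unit contributes less than the one before it.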

Nhranaghacon · Oct 8, 2020

Edit: I meant to say 2 whole years before the consoles were announced and 3 months before gamepass was public - could not beat the edit timer.


Digital Foundry's John: Been talking to developers, people will be pleasantly surprised with PS5's results
