
Xbox Velocity Architecture - 100 GB is instantly accessible by the developer through a custom hardware decompression block

cireza

Banned
Actually, there is a hardware decompression block; I remember reading something like that. So there is a component dedicated to decompressing the data.


The hardware decompression block plays a vital role, allowing games to consume less space via compression on the SSD. That hardware is devoted to tackling run-time decompression, keeping games running smoothly without giving more work to the CPU. It uses Zlib, a general-purpose data-compression library, and a mysterious new system named "BCPack," geared to GPU textures.
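To picture what that block offloads, here is a minimal software sketch of the same job using Python's zlib module. The file name is hypothetical, and on the XSX the equivalent work (including the BCPack path for textures) is said to happen in fixed-function hardware between the SSD and RAM instead of on a CPU core.

import zlib

# Software decompression path: the work the dedicated block is said to take off the CPU.
# "textures.bin.z" is an illustrative, made-up file name.
with open("textures.bin.z", "rb") as f:
    compressed = f.read()

raw = zlib.decompress(compressed)  # burns CPU time proportional to the data size
print(f"{len(compressed)} bytes on SSD -> {len(raw)} bytes in RAM")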
 
That question is irrelevant. It's like asking: where does the data in RAM go when you transfer it to the GPU? Yeah. Where does it go? The answer is the same for the XSX transferring from the SSD, because the system sees the mapped 100 GB on the SSD as RAM, as has been postulated a gazillion times.

CPUs don't read anything directly; they read from RAM or from the L1–L3 caches. Cache on a CPU is not directly manageable. On a GPU it is; we'll get back to this.

The rest of what you said is nonsense.

To me, it doesn't sound like you're disagreeing with her at all.

She's talking about the following.

Flushing cache on SSD read. This really doesn't mean anything / is nonsense. Flushing a cache is verboten on an actual working system unless you are in serious trouble.
She basically says cache misses are somehow OK because "I know I will miss and read from RAM." Again, that doesn't really make sense when the cache isn't the issue; accessing the RAM is.
Downplaying granular cache flushes as somehow not needed? Granular cache kills improve the cache-hit ratio and improve performance. This person doesn't understand caching at all.

You never want to flush an active cache unless everything has become invalid. Then you get a stampeding-herd problem straight after.

The caches used by GPUs will not be loaded from the SSD directly; they are usually filled AFTER the data is in RAM and you've actually created a vertex buffer or a shader and used the texture.
Some of these caches are measured in KB. What does an SSD give you here?

Typically, when removing items from a cache, it's hard to know what to target, as you would need some sort of record, and that record itself becomes a bottleneck. You can use some heuristics to target certain regions of the cache and kill all entries in there, but you will be removing hot items from that cache, which isn't ideal and will cause excess misses. A dedicated system that does this asynchronously, without a dev managing it, is gold.
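For readers unfamiliar with the bookkeeping being described, here is a toy least-recently-used cache in Python. It is purely illustrative of the record-keeping and eviction trade-off above, not of how the consoles' hardware caches or scrubbers actually work.

from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: the OrderedDict is the 'record' of access order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None                       # miss: caller falls back to RAM/SSD
        self.entries.move_to_end(key)         # mark as recently used (hot)
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry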
 
Actually, there is a hardware decompression block; I remember reading something like that. So there is a component dedicated to decompressing the data.


Right. I'm just wondering if the GPU accesses the decompression block along with the CPU (which would probably mean they share the access path, similar to how they do with main system RAM), or if it's just the CPU that accesses it. Things like that haven't been disclosed yet, so we don't really have a way of knowing.

Panajev2001a Yeah, more or less; however, I think what Trueblakjedi was referring to wasn't the GPU accessing more than 10 GB of RAM at a time, but it being able to access the four 1 GB modules it has exclusive access to while the CPU accesses its 6 GB on the 2 GB modules (and while the CPU is accessing those, the GPU cannot access that other 6 GB of RAM).

I'm curious if that's a possible feature they've added in terms of bus access; 4 GB isn't a lot of RAM, but if even that can still be accessed while the other 6 GB is being accessed by the CPU and other components like audio, that would help with some of the bus-contention issues APUs inherently bring.
 

Vaztu

Member
all comes down to what priorities they focused on in their design of the APUs

I'm skeptical about this, because the PS5 goes to great lengths to cut I/O bottlenecks and have a fast SSD. Surely they would utilize this AMD feature.



Even Tim says it was efficient to load directly from the SSD after decompression. All signs point to the PS5 having this feature.
 

THE:MILKMAN

Member
7:00 An 18 GB/s or something SSD for PC; there's also 2 fast SSDs, probably won't be cheap


It would be cute how wrong Linus is with respect to the PS5 storage architecture if it wasn't so annoying! I bet after calling out Tim he feels a right plonker.

This right here is a real-time demo of why just saying "it has a fast SSD" makes a fool out of even tech-savvy people.
 
Right. I'm just wondering if the GPU accesses the decompression block along with the CPU (which would probably mean they share the access path, similar to how they do with main system RAM), or if it's just the CPU that accesses it. Things like that haven't been disclosed yet, so we don't really have a way of knowing.

Panajev2001a Yeah, more or less; however, I think what Trueblakjedi was referring to wasn't the GPU accessing more than 10 GB of RAM at a time, but it being able to access the four 1 GB modules it has exclusive access to while the CPU accesses its 6 GB on the 2 GB modules (and while the CPU is accessing those, the GPU cannot access that other 6 GB of RAM).

I'm curious if that's a possible feature they've added in terms of bus access; 4 GB isn't a lot of RAM, but if even that can still be accessed while the other 6 GB is being accessed by the CPU and other components like audio, that would help with some of the bus-contention issues APUs inherently bring.

To clarify, the XSX memory is reported to be six 2 GB modules, each with an upper and lower 1 GB address range (= 12 GB), and four 1 GB modules, making ten memory chips in total.

Each of these ten modules is accessed over two bidirectional 16-bit channels (32 bits × 10 modules = a 320-bit bus). Here is where the info gets a little murky about the architecture:

Those 16-bit channels should be able to access all memory addresses on the chip they are attached to.

But the way the architects describe it, the CPU is assigned access to 6 of the 10 channels, and it is reserved access to the upper memory address range of the 2 GB modules only.

The lower 1 GB address range on those 2 GB chips is reserved for GPU work (6 GB).

The four 1 GB modules are reserved for the GPU at all times... so the lower 6 GB of the 2 GB modules plus the 4 × 1 GB = 10 GB of VRAM. The 2 GB chips might be subject to contention, depending on whether accessing the upper 1 GB of a 2 GB module uses both channels at full speed (both 16-bit channels @ 56 GB/s) or half speed (one 16-bit channel at 28 GB/s).

This setup is quite confusing, because it seems that if the GPU were to use full-bandwidth access to all ten modules, the entire system bandwidth would be used and the CPU wouldn't have access to the other 6 GB.

I can't find a permutation of accesses by the CPU and GPU where they can both use their full bandwidth simultaneously. The most I could come up with is the GPU taking the full bandwidth of the 4 × 1 GB chips (4 × 56 = 224 GB/s) plus half the bandwidth to access the lower memory addresses of the 2 GB chips (28 × 6 = 168 GB/s), for 392 GB/s max bandwidth without contention. The CPU would be the consumer of the remaining bandwidth (168 GB/s).
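For reference, the arithmetic above can be laid out like this (a sketch using the poster's per-chip figure of 56 GB/s, i.e. 14 Gbps GDDR6 on a 32-bit interface; none of these splits are officially confirmed):

PER_CHIP = 56.0                       # GB/s for one 32-bit GDDR6 chip at 14 Gbps

gpu_optimal = 10 * PER_CHIP           # all ten chips at once: 560 GB/s
gpu_no_contention = 4 * PER_CHIP + 6 * (PER_CHIP / 2)   # 224 + 168 = 392 GB/s
cpu_remainder = 6 * (PER_CHIP / 2)    # upper halves of the six 2 GB chips: 168 GB/s

print(gpu_optimal, gpu_no_contention, cpu_remainder)    # 560.0 392.0 168.0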

Sorry for the long post.
 
CPUs don't read anything directly; they read from RAM or from the L1–L3 caches. Cache on a CPU is not directly manageable. On a GPU it is; we'll get back to this.

The rest of what you said is nonsense.



She's talking about the following.

Flushing cache on SSD read. This really doesn't mean anything / is nonsense. Flushing a cache is verboten on an actual working system unless you are in serious trouble.
She basically says cache misses are somehow OK because "I know I will miss and read from RAM." Again, that doesn't really make sense when the cache isn't the issue; accessing the RAM is.
Downplaying granular cache flushes as somehow not needed? Granular cache kills improve the cache-hit ratio and improve performance. This person doesn't understand caching at all.

You never want to flush an active cache unless everything has become invalid. Then you get a stampeding-herd problem straight after.

The caches used by GPUs will not be loaded from the SSD directly; they are usually filled AFTER the data is in RAM and you've actually created a vertex buffer or a shader and used the texture.
Some of these caches are measured in KB. What does an SSD give you here?

Typically, when removing items from a cache, it's hard to know what to target, as you would need some sort of record, and that record itself becomes a bottleneck. You can use some heuristics to target certain regions of the cache and kill all entries in there, but you will be removing hot items from that cache, which isn't ideal and will cause excess misses. A dedicated system that does this asynchronously, without a dev managing it, is gold.


Thank you for your analysis. Can you describe the utility of the GPU cache scrubbers in this scenario?
 
Thank you for your analysis. Can you describe the utility of the GPU cache scrubbers in this scenario?
Either indirectly requesting that an asset be deleted, without knowing the lookup, in combination with an LRU when under memory pressure. If I can use indirection to request that an asset be deleted, I don't need to keep memory addresses around. Deleting an entry directly means I can keep good cache around so I don't get a bunch of misses. I imagine something similar to a consistent-hash method derived from metadata about the asset, or a HW lookup table similar to how HW PRT works.
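A rough software analogy of that indirection, under the assumption that entries are keyed by a hash of asset metadata rather than by address (the real scrubbers are fixed-function hardware and their tables aren't public):

import hashlib

cache = {}  # hashed asset key -> cached data

def asset_key(asset_id: str) -> str:
    # Stable key derived from asset metadata, so nobody has to track addresses.
    return hashlib.sha1(asset_id.encode()).hexdigest()[:8]

def fill(asset_id, data):
    cache[asset_key(asset_id)] = data

def scrub(asset_id):
    # Targeted kill: only this asset's entry goes; the rest of the cache stays hot.
    cache.pop(asset_key(asset_id), None)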
 
Either indirectly requesting that an asset be deleted, without knowing the lookup, in combination with an LRU when under memory pressure. If I can use indirection to request that an asset be deleted, I don't need to keep memory addresses around. Deleting an entry directly means I can keep good cache around so I don't get a bunch of misses. I imagine something similar to a consistent-hash method derived from metadata about the asset, or a HW lookup table similar to how HW PRT works.

So if the GPU knows that the data resident in the cache isn't useful anymore, then it can successfully flush with no penalty, or, instead of flushing, simply overwrite?
 
So if the GPU knows that the data resident in the cache isn't useful anymore, then it can successfully flush with no penalty, or, instead of flushing, simply overwrite?
The GPU asks for the data or asks to remove it, but the cache scrubbers know what to do and work in the background, and other processes may have asked for a cache kill as well. There will just suddenly be a cache miss, then a fallback to request the original asset, which may or may not be in RAM either. This then overwrites the previous location. An LRU usually works by removing the oldest and least-used entries to make room for new entries.
 

jimbojim

Banned
This time the die-size difference is all about the CUs. XSX has 45% more CUs than the PS5, an even bigger factor than PS4 over XBO. We would be having a completely different conversation this coming gen had the clock frequencies been similar, but increasing the clock was the only way Sony could compete with the power difference, and it's one of the things you can adjust last minute, the other being RAM. In fact, it was expected for Sony to do this as an answer. I can confidently bet that it was never Sony's intention to go with clocks as high as 2.23 GHz from the beginning.

I do honestly think Sony was looking for at least a 2 GHz clock on the GPU, since they decided from the get-go on a 36 CU GPU, and that meant they could only get the desired performance with high clocks and a big focus on the cooling system. Maybe they even planned for variable frequency much earlier on (they could not test that on Ariel, though, since it was an RDNA1 chip, and at least two of the early Oberon revisions were using a fixed-frequency setup; Cerny seems to have suggested this himself).

Oh, I agree; they locked in at 36 CUs way before, with the intent of pushing frequencies to a considerable 2 GHz. The 9.2-teraflop figure was no mistake, but I do believe they pushed harder once MS revealed their 12 TF figure. They HAD to.

So Sony pushed the GPU to a variable 2.23 GHz in three months, just like that, after MS confirmed 12 TF in December last year? Yeah, no! What you said doesn't make any sense at all. Btw, they patented their cooling solution back in August last year. Just because GitHub didn't have any data for 2.23 GHz doesn't mean Sony pushed the GPU to 2.23 GHz after that. It doesn't make any sense.
 
So Sony pushed the GPU to a variable 2.23 GHz in three months, just like that, after MS confirmed 12 TF in December last year? Yeah, no! What you said doesn't make any sense at all. Btw, they patented their cooling solution back in August last year. Just because GitHub didn't have any data for 2.23 GHz doesn't mean Sony pushed the GPU to 2.23 GHz after that. It doesn't make any sense. AT ALL!

I'm not buying the "panicked Sony" theory either.

If the PS5 already had a cooling system that could handle those clocks then I don't see why they would go with lower clocks in the first place.
 

jimbojim

Banned
I'm not buying the "panicked Sony" theory either.

If the PS5 already had a cooling system that could handle those clocks then I don't see why they would go with lower clocks in the first place.

Every patent needs to be tested. It also needs certifications. And that takes time.
 
Every patent needs to be tested. It also needs certifications. And that takes time.

Pretty much. If they were to increase the clocks, they would have to test the system to make sure it doesn't suffer from overheating issues. And if it does, they either have to lower the clocks or change the system's design.

The only way that I can see them boosting the clocks at the last minute is if they overbuilt their cooling system. This is similar to what Microsoft did with the X1. However, there were rumors of the systems overheating, so that rumor contradicts this theory.
 

Ascend

Member
CPUs don't read anything directly; they read from RAM or from the L1–L3 caches. Cache on a CPU is not directly manageable. On a GPU it is; we'll get back to this.

The rest of what you said is nonsense.
I didn't mention the CPU at any point here, so I'm not sure why you're bringing it up. XVA seems to be targeted at feeding specifically the GPU, not the CPU. And it will primarily be used for high-quality mip pages.

Flushing cache on SSD read. This really doesn't mean anything / is nonsense. Flushing a cache is verboten on an actual working system unless you are in serious trouble.
She basically says cache misses are somehow OK because "I know I will miss and read from RAM." Again, that doesn't really make sense when the cache isn't the issue; accessing the RAM is.
Downplaying granular cache flushes as somehow not needed? Granular cache kills improve the cache-hit ratio and improve performance. This person doesn't understand caching at all.

You never want to flush an active cache unless everything has become invalid. Then you get a stampeding-herd problem straight after.
I'm not sure you've read everything she says. I still get the sense that you two are saying the same thing. Let me quote the whole thing, just to make sure everyone is on the same page...

  1. Regarding XVA vs. the PS5 I/O engine: each is designed in a different way, leading to systems that are better/worse at certain things. PS5 is better for no-processing, direct-to-RAM I/O and dumping raw data into RAM; XVA is better when you need to process the data in question.
  2. And remember the PS5's cache scrubbers? These are a thing due to a weakness of the PS5 I/O engine that isn't an issue in the first place in XVA.
  3. When you overwrite data in RAM, it's possible that data is mirrored in the GPU cache. But it's also possible something else is in the cache and is not the data being overwritten.
  4. The obvious solution would be flushing GPU caches when the SSD is read; that way, no matter what, the GPU doesn't get a cache miss (it knows the cache is clear, so it ignores the cache and looks in RAM).
  5. But (as Cerny says) this will really hurt GPU performance. Solution? Cache scrubbers / coherency engine.
  6. The CE tells the cache scrubbers what part of RAM was overwritten, and they check the caches for the mirroring of said data, wiping it if found.
  7. XVA inherently avoids this issue by feeding the CPU/GPU directly with SSD data. Since the data is either discarded by the CPU/GPU or written back to RAM, the CPU/GPU always knows what data needs to be wiped from cache. No need for scrubbers.
  8. This is just one of many advantages/disadvantages of the two consoles' I/O systems.

The caches used by GPUs will not be loaded from the SSD directly; they are usually filled AFTER the data is in RAM and you've actually created a vertex buffer or a shader and used the texture.
Some of these caches are measured in KB. What does an SSD give you here?
Yes, it is usually done after the data is in RAM. The SSD would give you a reduction in RAM requirements. Why do you specifically look at the smallest caches, rather than the largest?

Typically, when removing items from a cache, it's hard to know what to target, as you would need some sort of record, and that record itself becomes a bottleneck. You can use some heuristics to target certain regions of the cache and kill all entries in there, but you will be removing hot items from that cache, which isn't ideal and will cause excess misses. A dedicated system that does this asynchronously, without a dev managing it, is gold.
That's exactly what the sampler feedback is for.
 

jimbojim

Banned
Pretty much. If they were to increase the clocks, they would have to test the system to make sure it doesn't suffer from overheating issues. And if it does, they either have to lower the clocks or change the system's design.

The only way that I can see them boosting the clocks at the last minute is if they overbuilt their cooling system. This is similar to what Microsoft did with the X1. However, there were rumors of the systems overheating, so that rumor contradicts this theory.

I think that rumor was being pushed by Jez Corden. Later he backpedalled. Why? Read the thread in the spoiler (it isn't large, just a few pages). It's hilarious. You'll get it.

 
I think that rumor was being pushed by Jez Corden. Later he backpedalled. Why? Read the thread in the spoiler (it isn't large, just a few pages). It's hilarious. You'll get it.


I personally believe that rumor was false. Jez Corden is a journalist who is also a fanboy, in my opinion. He lets his bias get in the way of his work. Not to mention he communicates and does podcasts with the most extreme of fanboys.
 
I didn't mention the CPU at any point here, so I'm not sure why you're bringing it up. XVA seems to be targeted at feeding specifically the GPU, not the CPU. And it will primarily be used for high-quality mip pages.


I'm not sure you've read everything she says. I still get the sense that you two are saying the same thing. Let me quote the whole thing, just to make sure everyone is on the same page...

  1. Regarding XVA vs. the PS5 I/O engine: each is designed in a different way, leading to systems that are better/worse at certain things. PS5 is better for no-processing, direct-to-RAM I/O and dumping raw data into RAM; XVA is better when you need to process the data in question.
All data needs to be processed. You have HW for certain processes, software for others. Decompression in HW is something you want; otherwise you have software doing it.
Any single decompression job you can think of, unless specially written for multicore decompression with synchronization between threads (which is rare), is a single-threaded operation. Kraken is one of these rare examples, in that you can use up to two cores in special situations. Otherwise you have the same level playing field: Zen 2 processors doing "stuff".
And remember the PS5's cache scrubbers? These are a thing due to a weakness of the PS5 I/O engine that isn't an issue in the first place in XVA
Cache invalidation is an issue in every system you can think of. These scrubbers are a pretty good solution. There are plenty of ways to handle cache invalidation, but it still needs to be done. Why is it not an issue for XVA? Well, you ask about small caches; maybe you think we should store stack data on the SSD.

When you overwrite data in RAM, it's possible that data is mirrored in the GPU cache. But it's also possible something else is in the cache and is not the data being overwritten.
The obvious solution would be flushing GPU caches when the SSD is read; that way, no matter what, the GPU doesn't get a cache miss (it knows the cache is clear, so it ignores the cache and looks in RAM).

No, you do not flush cache. Ignoring cache and hitting RAM is a negative, not a positive. I think they think cache is fed from an SSD. If you miss cache, you hit RAM, or you read the SSD into RAM, then into your code, which then utilizes cache. You try to avoid this as much as possible. You do not flush cache on some arbitrary action like "reading an SSD". Remember, they say "you flush cache and you know it's clear so you don't get a miss". That's actually a miss. The next few operations will have no cache and will all stampede the RAM/SSD at once (thundering-herd problem, or distributed locking! yay).
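A read-through sketch of the hierarchy described here, with the cache/RAM/SSD stand-ins modeled as plain dictionaries and a callable (all names are illustrative):

def read(addr, cache, ram, ssd_read):
    """Cache -> RAM -> SSD fallback; each step down is much slower than the last."""
    if addr in cache:
        return cache[addr]              # hit: the cheap path you want most of the time
    if addr not in ram:
        ram[addr] = ssd_read(addr)      # page the data in from the SSD first
    cache[addr] = ram[addr]             # the access itself warms the cache
    return cache[addr]

# Example: a miss that falls all the way through to the "SSD".
print(read(0x10, {}, {}, lambda a: f"data@{a:#x}"))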

But (as Cerny says) this will really hurt GPU performance. Solution? Cache scrubbers / coherency engine.
The CE tells the cache scrubbers what part of RAM was overwritten, and they check the caches for the mirroring of said data, wiping it if found.
Coherency is about managing cache in multiple locations. The CE wouldn't tell the scrubber anything; if anything, it would ensure that if L1 wrote to cache, L2 replicated it for subsequent reads, otherwise you get stale reads. The scrubber deletes cache. The scrubbers might use the CE to find the locations of the cache, in case it does sit in multiple locations. Distributed cache is hard across multiple levels.
XVA inherently avoids this issue by feeding the CPU/GPU directly with SSD data. Since the data is either discarded by the CPU/GPU or written back to RAM, the CPU/GPU always knows what data needs to be wiped from cache. No need for scrubbers.
No, the CPU is not reading registers from the SSD. Does it write them back too? Does the SSD store the stack and heap? How large is this cache table?

No, this is what RAM is for. The CPU/GPU executes instructions with data loaded onto a stack, then moves on to the next instruction. You are not writing to this.


Yes, it is usually done after the data is in RAM. The SSD would give you a reduction in RAM requirements. Why do you specifically look at the smallest caches, rather than the largest?
This is how every CPU, CD-ROM, or any other device that accesses memory works, anywhere in the world. The smaller cache is orders of magnitude faster. A typical single operation uses a few bytes of memory.

That's exactly what the sampler feedback is for.

No, sampler feedback is there to tell the engine, or the API in this case, what the next textures / mip levels to fetch are, based on current RAM residency and the needs of the engine, and maybe to deliver less than the engine requests based on previous requests and the use of said textures. It has absolutely nothing to do with cache. It's kind of a weird understanding you have; when you have 96 KB of L1 cache in these things, what do you imagine an SSD is doing? Is it just imagination?
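As a rough sketch of how that feedback could drive streaming decisions (data layout invented for illustration; the real sampler-feedback maps are opaque GPU resources):

def plan_streaming(feedback, resident):
    """feedback: (texture, mip, tile) tuples the shaders actually sampled last frame.
    resident: the set of such tuples already in RAM. Returns what to fetch next."""
    return [key for key in feedback if key not in resident]

# Example: only the missing tile of mip 0 gets requested.
feedback = [("rock", 0, (3, 7)), ("rock", 1, (1, 1))]
resident = {("rock", 1, (1, 1))}
print(plan_streaming(feedback, resident))   # [('rock', 0, (3, 7))]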

I have no dog in this fight; both consoles are awesome. The XSX will kill the PS5 in framerates, and most of the benefits will be seen in first-party titles for either console.
I'm interested in tech. I've worked as a principal architect at a company building massively distributed systems for the last decade, and I have built games as a hobby for the 360 and PC. I don't pretend to know everything, but the stuff we are talking about really has nothing to do with games; it's the typical architecture of any scalable system. I really feel people are looking for some magic bullet here to push their preferred massive corporation's new product because they have decided to buy it. That Twitter account is probably a graphics-enthusiast college student who wrote some VB and has a GitHub account with zero code in it (as far as I saw, that was the case). I commend their enthusiasm; I do not appreciate matter-of-fact nonsense, though.

If you want to go into any of these concepts, we don't even need to use video games as a jump off point.
 

semicool

Banned
All data needs to be processed. You have HW for certain processes, software for others. Decompression in HW is something you want; otherwise you have software doing it.
Any single decompression job you can think of, unless specially written for multicore decompression with synchronization between threads (which is rare), is a single-threaded operation. Kraken is one of these rare examples, in that you can use up to two cores in special situations. Otherwise you have the same level playing field: Zen 2 processors doing "stuff".

Cache invalidation is an issue in every system you can think of. These scrubbers are a pretty good solution. There are plenty of ways to handle cache invalidation, but it still needs to be done. Why is it not an issue for XVA? Well, you ask about small caches; maybe you think we should store stack data on the SSD.

No, you do not flush cache. Ignoring cache and hitting RAM is a negative, not a positive. I think they think cache is fed from an SSD. If you miss cache, you hit RAM, or you read the SSD into RAM, then into your code, which then utilizes cache. You try to avoid this as much as possible. You do not flush cache on some arbitrary action like "reading an SSD". Remember, they say "you flush cache and you know it's clear so you don't get a miss". That's actually a miss. The next few operations will have no cache and will all stampede the RAM/SSD at once (thundering-herd problem, or distributed locking! yay).

Coherency is about managing cache in multiple locations. The CE wouldn't tell the scrubber anything; if anything, it would ensure that if L1 wrote to cache, L2 replicated it for subsequent reads, otherwise you get stale reads. The scrubber deletes cache. The scrubbers might use the CE to find the locations of the cache, in case it does sit in multiple locations. Distributed cache is hard across multiple levels.

No, the CPU is not reading registers from the SSD. Does it write them back too? Does the SSD store the stack and heap? How large is this cache table?

No, this is what RAM is for. The CPU/GPU executes instructions with data loaded onto a stack, then moves on to the next instruction. You are not writing to this.

This is how every CPU, CD-ROM, or any other device that accesses memory works, anywhere in the world. The smaller cache is orders of magnitude faster. A typical single operation uses a few bytes of memory.

No, sampler feedback is there to tell the engine, or the API in this case, what the next textures / mip levels to fetch are, based on current RAM residency and the needs of the engine, and maybe to deliver less than the engine requests based on previous requests and the use of said textures. It has absolutely nothing to do with cache. It's kind of a weird understanding you have; when you have 96 KB of L1 cache in these things, what do you imagine an SSD is doing? Is it just imagination?

I have no dog in this fight; both consoles are awesome. The XSX will kill the PS5 in framerates, and most of the benefits will be seen in first-party titles for either console.
I'm interested in tech. I've worked as a principal architect at a company building massively distributed systems for the last decade, and I have built games as a hobby for the 360 and PC. I don't pretend to know everything, but the stuff we are talking about really has nothing to do with games; it's the typical architecture of any scalable system. I really feel people are looking for some magic bullet here to push their preferred massive corporation's new product because they have decided to buy it. That Twitter account is probably a graphics-enthusiast college student who wrote some VB and has a GitHub account with zero code in it (as far as I saw, that was the case). I commend their enthusiasm; I do not appreciate matter-of-fact nonsense, though.

If you want to go into any of these concepts, we don't even need to use video games as a jump off point.
From "Ronaldo8" on beyond3d.

Begin quote

There seems to be a lot of misconceptions about the Xbox Velocity Architecture. The goal of both the PS5's and the Series X's I/O implementations is to increase the complexity of the content presented on screen without a corresponding increase in load times/memory footprint, but they go about it in totally different ways. Since the end of the cartridge era, an increase in geometry/texture complexity was usually accompanied by an increase in load times. This was because, while RAM bandwidth might be adequate, the throughput of the link feeding the RAM from the HDD was not. Hence, HDDs and the associated I/O architecture were the bottleneck.
One way to address this issue was to "cache" as much as possible in RAM so as to get around the aforementioned bottleneck. However, this solution comes with its own problem in that the memory footprint just kept ballooning ("MOAR RAM"). This is brilliantly explained by Mark Cerny in his GDC presentation with the 30 seconds of gameplay paradigm. PlayStation's answer to this problem is to increase the throughput to the RAM in an unprecedented way. Thus, instead of caching for the next 30 seconds of gameplay, you might only need to cache for the next 1 second of gameplay, which results in a drastic reduction in memory footprint. Indeed, the point of it all is that for a system with the old HDD architecture to have the same jump in texture and geometry complexity, either the amount of RAM needed for caching would have to be exorbitant, or frametime would have to be increased to allow enough time for the textures to stream in (low framerates), or gameplay design would have to be changed to allow for texture loading (long load times). The PS5 supposedly will achieve all of this with none of those drawbacks, thanks to alleviating the bottleneck between persistent memory and RAM (the bottleneck still exists, because RAM is still quicker than the SSD, but it is good enough for the PS5's rendering capacity and hence doesn't matter anyway; you just don't load textures from the SSD to the screen).

We can now see why the throughput from the SSD to RAM has become the one-and-only metric for judging the I/O capability of next-gen systems in the minds of gamers. After all, it does make perfect sense. BUT... is there an alternative way of doing things? Microsoft went in a completely different direction. Is the persistent-memory-to-RAM throughput still the bottleneck? Yes! Why is more throughput needed? To stream more textures, evidently. The defining question is then: how much of it is actually needed? After careful research assessing how games actually utilise textures on a per-frame basis, MS seems to have come to a surprising answer: not that much, actually.

Indeed, by loading higher-detail MIPs than necessary while keeping the persistent-memory-to-RAM throughput constant, load times and memory footprint are increased. Let's quote Andrew Goossen in the Eurogamer deep dive for reference:

"We observed that typically, only a small percentage of memory loaded by games was ever accessed," reveals Goossen. "This wastage comes principally from the textures. Textures are universally the biggest consumers of memory for games. However, only a fraction of the memory for each texture is typically accessed by the GPU during the scene. For example, the largest mip of a 4K texture is eight megabytes and often more, but typically only a small portion of that mip is visible in the scene and so only that small portion really needs to be read by the GPU."

The upshot of it all is that by knowing what MIP levels are actually needed on a per-frame basis and loading only that, the amount that needs to be streamed is radically reduced, and so are the throughput requirement of the SSD-to-RAM link and the RAM footprint. Can this just-in-time streaming solution be implemented via software? MS acknowledges that it is possible to do so but concedes that it is very inaccurate and requires changes to shader/application code. The hardware implementation for determining the residency maps associated with partially resident textures is sampler feedback.

While sampler feedback is great, it is not sampler feedback streaming. You now need a hardware implementation for:

(1) transitioning from a lower MIP level to a higher one seamlessly
(2) falling back to a lower MIP level if the requested one is not yet resident in memory, and blending back to the higher one when it becomes available after a few frames.

Microsoft claims to have devised a hardware implementation for doing just that. This is the so-called "texture filters" feature described by James Stanard. Do we have more information about Microsoft's implementation? Of course we do. SFS is patented hardware technology and is described in patent US10388058B2, titled
"Texture residency hardware enhancements for graphics processors", with co-inventors Mark S. Grossman and... Andrew Goosen.

Combined with DirectStorage (presumably a new API that revamps the file system, although information about it is sparse) and the constant high throughput of the SSD, this is how Microsoft claims to achieve a 2x-3x increase in efficiency. Hence, the "brute force" meme about the Series X is wildly off-base.

As for which of the PS5 or Series X I/O system is better? I say let the DF face-offs begin.

End quote

New quote (starts with quoting ShiftyGeezer)

I will quote your own thoughts on the matter as response (from the UE5 thread):

"The moment the data is arranged this way, we can see how virtualised textures would also apply conceptually to the geometry in a 2D array, along with how compression can change from having to crunch 3D data. You don't need to load the whole texture to show the model, but only the pieces of it that are viewable, which is the same problem as picking which texture tiles with virtual texturing.

Very clever stuff."

Ronaldo8:

The Unreal Engine team has devised a software solution for a problem that Microsoft has resolved in hardware.

But sampler feedback in truth answers two questions:
(1) What MIP level was ultimately sampled (the LOD problem), i.e. what MIP level to load next.
(2) Where exactly in the resource it was sampled (which tiles were sampled). This is based on what's visible to the camera; basically, what MIP to load next.

SFS is the streaming of only visible assets at the correct level of detail. So yeah, a software implementation of a solution already found in hardware.

End new quote

Read this article:


And one more quote from Scott_Arm

Start quote

Microsoft's solution is virtual texturing with sampler feedback for accurate mip and tile selection, plus some hardware filters to blend from a low-resolution mip to a high-resolution mip in case the high-resolution mip is not loaded in time for the current frame. So they have some guarantee of the low-quality mip arriving on time, and then they blend to the high quality if it's late, so you don't notice pop-in. It should be overall more efficient in making sure they don't waste memory on pages they don't need.

End quote
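A tiny sketch of the fallback-and-blend behaviour described in the quote above, with mips represented as plain sampled values (on XSX this blending is described as a hardware texture-filter feature; the code is purely illustrative):

def sample_with_fallback(mips, wanted, coarse, blend_t):
    """mips maps mip level -> sampled value, or None if that level isn't resident yet.
    'coarse' is a low-detail level assumed to always be loaded."""
    if mips.get(wanted) is None:
        return mips[coarse]                                       # high-detail mip arrived late
    return mips[coarse] * (1 - blend_t) + mips[wanted] * blend_t  # fade in over a few frames

# Frame where the detailed mip just streamed in, 25% of the way through the blend:
print(sample_with_fallback({0: 0.9, 3: 0.5}, wanted=0, coarse=3, blend_t=0.25))  # 0.6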

Beyond3d thread:

 

Ascend

Member
All data needs to be processed. You have HW for certain processes, software for others. Decompression in HW is something you want; otherwise you have software doing it.
Any single decompression job you can think of, unless specially written for multicore decompression with synchronization between threads (which is rare), is a single-threaded operation. Kraken is one of these rare examples, in that you can use up to two cores in special situations. Otherwise you have the same level playing field: Zen 2 processors doing "stuff".

Cache invalidation is an issue in every system you can think of. These scrubbers are a pretty good solution. There are plenty of ways to handle cache invalidation, but it still needs to be done. Why is it not an issue for XVA? Well, you ask about small caches; maybe you think we should store stack data on the SSD.

No, you do not flush cache. Ignoring cache and hitting RAM is a negative, not a positive. I think they think cache is fed from an SSD. If you miss cache, you hit RAM, or you read the SSD into RAM, then into your code, which then utilizes cache. You try to avoid this as much as possible. You do not flush cache on some arbitrary action like "reading an SSD". Remember, they say "you flush cache and you know it's clear so you don't get a miss". That's actually a miss. The next few operations will have no cache and will all stampede the RAM/SSD at once (thundering-herd problem, or distributed locking! yay).

Coherency is about managing cache in multiple locations. The CE wouldn't tell the scrubber anything; if anything, it would ensure that if L1 wrote to cache, L2 replicated it for subsequent reads, otherwise you get stale reads. The scrubber deletes cache. The scrubbers might use the CE to find the locations of the cache, in case it does sit in multiple locations. Distributed cache is hard across multiple levels.
I posted the tweets in sequential order so you could get the whole picture, not pick them apart individually. The part about not flushing cache because it tanks performance is true, and she basically says that too at #5.

As for the coherency engine not telling the scrubber anything, how would the scrubber know what to delete if the coherency engine is not managing it? Even if the coherency engine is about managing cache in multiple locations, it can't properly manage it if the scrubber can delete what it wants. Whether the scrubber uses the CE or the CE commands the scrubber, the end result is the same. But they have to work together somehow. I think her explanation of the CE and scrubbing is more poor wording than a lack of understanding.

No, the CPU is not reading registers from the SSD. Does it write them back too? Does the SSD store the stack and heap? How large is this cache table?

No, this is what RAM is for. The CPU/GPU executes instructions with data loaded onto a stack, then moves on to the next instruction. You are not writing to this.
You're still thinking from the perspective of the traditional setup. What happens if the system sees 100 GB of the SSD as RAM, having a total RAM pool of 116 GB?


This is how every CPU, CD-ROM, or any other device that accesses memory works, anywhere in the world. The smaller cache is orders of magnitude faster. A typical single operation uses a few bytes of memory.
Yes, but if you're going to feed the GPU from SSD, you're not going to do it directly to the smallest cache, obviously. You don't do that from RAM either.


No, sampler feedback is there to tell the engine, or the API in this case, what the next textures / mip levels to fetch are, based on current RAM residency and the needs of the engine, and maybe to deliver less than the engine requests based on previous requests and the use of said textures. It has absolutely nothing to do with cache. It's kind of a weird understanding you have; when you have 96 KB of L1 cache in these things, what do you imagine an SSD is doing? Is it just imagination?
Again, you're still seeing the SSD as storage rather than as extended RAM. Additionally, RAM is technically just a higher level of cache, being larger and slower.

I have no dog in this fight; both consoles are awesome. The XSX will kill the PS5 in framerates, and most of the benefits will be seen in first-party titles for either console.
I'm interested in tech. I've worked as a principal architect at a company building massively distributed systems for the last decade, and I have built games as a hobby for the 360 and PC. I don't pretend to know everything, but the stuff we are talking about really has nothing to do with games; it's the typical architecture of any scalable system. I really feel people are looking for some magic bullet here to push their preferred massive corporation's new product because they have decided to buy it. That Twitter account is probably a graphics-enthusiast college student who wrote some VB and has a GitHub account with zero code in it (as far as I saw, that was the case). I commend their enthusiasm; I do not appreciate matter-of-fact nonsense, though.

If you want to go into any of these concepts, we don't even need to use video games as a jump off point.
I'm interested in tech as well. The PS5 SSD has been talked about so much that we mostly understand it. XVA is another story. I'm focusing on trying to understand it. Some people confuse that with hyping up the Xbox.
 
Not sure what that quote wall was related to, but I have talked about how virtual texturing already uses feedback buffers to optimize fetches in this case. SFS is when you don't have the buffers in software and use drivers/HW to make the decisions for you, and it can improve things (and it can improve software solutions because, hey, it's hardware). There were two issues with software virtual texturing that HW solved: page-index-table storage and lookups, and filtering. Software filtering of pages resulted in less-than-optimal filtering between tiles from different pages, especially for high levels of anisotropic filtering, as these usually use large sampling kernels that might split across pages and mip levels, so you need to pull adjacent pages or just do a sub-optimal job. Hardware for this has shown up in graphics cards recently, and I guess MS has packaged a HW solution as part of their XVA concept.
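A bare-bones sketch of the software page-table half of that (the filtering half is exactly what the hardware helps with and isn't modelled here; all structures are invented for illustration):

page_table = {}           # (mip, page_x, page_y) -> slot in the physical texture atlas
free_slots = list(range(256))
load_requests = []        # pages the streamer should fetch before a later frame

def touch(page):
    """Called for every page the feedback pass says the frame sampled."""
    if page in page_table:
        return page_table[page]        # resident: sample from this atlas slot
    load_requests.append(page)         # miss: schedule a fetch...
    return None                        # ...and sample a coarser mip this frame

def page_loaded(page):
    page_table[page] = free_slots.pop()   # streamer finished: map it into the atlas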
 
I posted the tweets in sequential order so you could get the whole picture, not pick them apart individually. The part about not flushing cache because it tanks performance is true, and she basically says that too at #5.

As for the coherency engine not telling the scrubber anything, how would the scrubber know what to delete if the coherency engine is not managing it? Even if the coherency engine is about managing cache in multiple locations, it can't properly manage it if the scrubber can delete what it wants. Whether the scrubber uses the CE or the CE commands the scrubber, the end result is the same. But they have to work together somehow. I think her explanation of the CE and scrubbing is more poor wording than a lack of understanding.

You're still thinking from the perspective of the traditional setup. What happens if the system sees 100 GB of the SSD as RAM, having a total RAM pool of 116 GB?
That's virtual memory. You can use MMIO to reference a page of the 100 GB directly, or it will swap pages in and out when you access outside of the resident memory. This isn't new; any database you can think of does the same, or you can use memory-mapped files for massive files.
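A minimal memory-mapping example of that idea using Python's mmap module; "assets.pak" is a hypothetical file, and the OS, not the program, decides which pages are actually resident:

import mmap

with open("assets.pak", "rb") as f:
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = view[:16]     # touching these bytes faults the first page into RAM
    tail = view[-16:]      # ...and this faults in a page near the end of the file
    view.close()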
Yes, but if you're going to feed the GPU from SSD, you're not going to do it to the smallest cache, obviously.
You read through the cache, miss, hit the RAM, then fill the cache with whatever the instruction/data was. You rarely pre-warm caches in such a volatile environment.

Again, you're still seeing the SSD as storage rather than as extended RAM. Additionally, RAM is technically just a higher level of cache, being larger and slower.
You are right in that it's a higher level, but I understand virtual memory very well. That's all this is. The PS5 supports mapping the entire drive, so...
 

Ascend

Member
That's virtual memory. You can use MMIO to reference a page of the 100 GB directly, or it will swap pages in and out when you access outside of the resident memory. This isn't new; any database you can think of does the same.
Ok... And... You can't think of that having any benefit on the XSX, particularly due to the SSD being used as virtual memory rather than an HDD?

You read through the cache, miss, hit the RAM, then fill the cache with whatever the instruction/data was. You rarely pre-warm caches in such a volatile environment.
Agreed. But then again, the XSX is a bit complicated here, because its memory setup itself is also a bit peculiar, not counting any SSD virtual memory.

You are right in that it's a higher level, but I understand virtual memory very well. That's all this is. The PS5 supports mapping the entire drive, so...
Well, yeah. But that's kind of overkill, don't you think? No game is going to use the entire drive. At least I hope not, lol.
Since you understand virtual memory, let's imagine this scenario...

Your GPU needs a certain high-level mip. It reads through the different levels of cache, misses, 'arrives' at RAM, and in the 16 GB RAM pool there is physically only the low-level mip. However, the SSD is virtual RAM, and obviously the high-level mip is there, which means the GPU thinks it is available in RAM. How does the transfer of that high-level mip take place?
 
Ok... And... You can't think of that having any benefit on the XSX, particularly due to the SSD being used as virtual memory rather than an HDD?


Agreed. But then again, the XSX is a bit complicated here, because its memory setup itself is also a bit peculiar, not counting any SSD virtual memory.


Well, yeah. But that's kind of overkill, don't you think? No game is going to use the entire drive. At least I hope not, lol.
Since you understand virtual memory, let's imagine this scenario...

Your GPU needs a certain high-level mip. It reads through the different levels of cache, misses, 'arrives' at RAM, and in the 16 GB RAM pool there is physically only the low-level mip. However, the SSD is virtual RAM, and obviously the high-level mip is there, which means the GPU thinks it is available in RAM. How does the transfer of that high-level mip take place?

This is paging. This is something your phone does. The XSX is not doing anything any other machine cannot do. The PS5 supports this. This is typically slow on most machines. Do you see why these new machines benefit from their SSDs, and in particular from their latency? Especially when pulling in a small file!
 

Ascend

Member
This is paging. This is something your phone does. The XSX is not doing anything any other machine cannot do. The PS5 supports this. This is typically slow on most machines. Do you see why these new machines benefit from their SSDs, and in particular from their latency? Especially when pulling in a small file!
Let me ask the same question another way. If the system cannot differentiate between the virtual memory and the actual RAM, would the required high-level mip in the scenario above be transferred as:

a) SSD -> RAM -> GPU
or
b) SSD -> GPU
 
Let me ask the same question another way. If the system cannot differentiate between the virtual memory and the actual RAM, would the required high-level mip in the scenario above be transferred as:

a) SSD -> RAM -> GPU
or
b) SSD -> GPU
a) Swapped into RAM (which, in both of these consoles' cases, is the GPU RAM, although with the XSX you have two options, slow and fast).
Both will use different paths to get there (one a pure HW path, one staged via HW and then the CPU).

For the caller, they do not know if the data is resident; the system does, though, via a lookup table from virtual memory into physical memory (where a miss may mean swapping in from disk).
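A toy version of that lookup, with the "page fault" path reading from a stand-in for the SSD into RAM first, i.e. option (a) above (all names and sizes are illustrative):

PAGE = 4096
page_table = {}                      # virtual page number -> offset of that page in RAM
ram = bytearray(16 * PAGE)           # tiny stand-in for physical memory
next_free = 0

def read_byte(vaddr, ssd_read_page):
    """ssd_read_page(vpn) returns PAGE bytes; it models the slow swap-in from disk."""
    global next_free
    vpn, offset = divmod(vaddr, PAGE)
    if vpn not in page_table:        # not resident: the 'page fault' path
        page_table[vpn] = next_free
        ram[next_free:next_free + PAGE] = ssd_read_page(vpn)
        next_free += PAGE
    return ram[page_table[vpn] + offset]

# Example: the first access to page 2 faults it in; the second is a plain RAM read.
fake_ssd = lambda vpn: bytes([vpn]) * PAGE
print(read_byte(2 * PAGE + 5, fake_ssd), read_byte(2 * PAGE + 6, fake_ssd))  # 2 2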
 

Ascend

Member
a) Swapped into RAM (which, in both of these consoles' cases, is the GPU RAM, although with the XSX you have two options, slow and fast).
Both will use different paths to get there (one a pure HW path, one staged via HW and then the CPU).

For the caller, they do not know if the data is resident; the system does, though, via a lookup table from virtual memory into physical memory (where a miss may mean swapping in from disk).
Well, if the XSX works like that too, we're back at square one in trying to figure out what the Velocity Architecture actually is.
 
D

Deleted member 775630

Unconfirmed Member
From "Ronaldo8" on beyond3d.

Begin quote

There seems to be a lot of misconceptions about the Xbox Velocity Architecture. The goal of both the PS5's and the Series X's I/O implementations is to increase the complexity of the content presented on screen without a corresponding increase in load times/memory footprint, but they go about it in totally different ways. Since the end of the cartridge era, an increase in geometry/texture complexity was usually accompanied by an increase in load times. This was because, while RAM bandwidth might be adequate, the throughput of the link feeding the RAM from the HDD was not. Hence, HDDs and the associated I/O architecture were the bottleneck.
One way to address this issue was to "cache" as much as possible in RAM so as to get around the aforementioned bottleneck. However, this solution comes with its own problem in that the memory footprint just kept ballooning ("MOAR RAM"). This is brilliantly explained by Mark Cerny in his GDC presentation with the 30 seconds of gameplay paradigm. PlayStation's answer to this problem is to increase the throughput to the RAM in an unprecedented way. Thus, instead of caching for the next 30 seconds of gameplay, you might only need to cache for the next 1 second of gameplay, which results in a drastic reduction in memory footprint. Indeed, the point of it all is that for a system with the old HDD architecture to have the same jump in texture and geometry complexity, either the amount of RAM needed for caching would have to be exorbitant, or frametime would have to be increased to allow enough time for the textures to stream in (low framerates), or gameplay design would have to be changed to allow for texture loading (long load times). The PS5 supposedly will achieve all of this with none of those drawbacks, thanks to alleviating the bottleneck between persistent memory and RAM (the bottleneck still exists, because RAM is still quicker than the SSD, but it is good enough for the PS5's rendering capacity and hence doesn't matter anyway; you just don't load textures from the SSD to the screen).

We can now see why the throughput from the SSD to RAM has become the one-and-only metric for judging the I/O capability of next-gen systems in the minds of gamers. After all, it does make perfect sense. BUT... is there an alternative way of doing things? Microsoft went in a completely different direction. Is the persistent-memory-to-RAM throughput still the bottleneck? Yes! Why is more throughput needed? To stream more textures, evidently. The defining question is then: how much of it is actually needed? After careful research assessing how games actually utilise textures on a per-frame basis, MS seems to have come to a surprising answer: not that much, actually.

Indeed, by loading higher-detail MIPs than necessary while keeping the persistent-memory-to-RAM throughput constant, load times and memory footprint are increased. Let's quote Andrew Goossen in the Eurogamer deep dive for reference:

"We observed that typically, only a small percentage of memory loaded by games was ever accessed," reveals Goossen. "This wastage comes principally from the textures. Textures are universally the biggest consumers of memory for games. However, only a fraction of the memory for each texture is typically accessed by the GPU during the scene. For example, the largest mip of a 4K texture is eight megabytes and often more, but typically only a small portion of that mip is visible in the scene and so only that small portion really needs to be read by the GPU."

The upshot of it all is that by knowing what MIP levels are actually needed on a per-frame basis and loading only that, the amount that needs to be streamed is radically reduced, and so are the throughput requirement of the SSD-to-RAM link and the RAM footprint. Can this just-in-time streaming solution be implemented via software? MS acknowledges that it is possible to do so but concedes that it is very inaccurate and requires changes to shader/application code. The hardware implementation for determining the residency maps associated with partially resident textures is sampler feedback.

While sampler feedback is great, it is not sampler feedback streaming. You now need a hardware implementation for:

(1) transitioning from a lower MIP level to a higher one seamlessly
(2) falling back to a lower MIP level if the requested one is not yet resident in memory, and blending back to the higher one when it becomes available after a few frames.

Microsoft claims to have devised a hardware implementation for doing just that. This is the so-called "texture filters" feature described by James Stanard. Do we have more information about Microsoft's implementation? Of course we do. SFS is patented hardware technology and is described in patent US10388058B2, titled
"Texture residency hardware enhancements for graphics processors", with co-inventors Mark S. Grossman and... Andrew Goosen.

Combined with DirectStorage (presumably a new API that revamps the file system, although information about it is sparse) and the constant high throughput of the SSD, this is how Microsoft claims to achieve a 2x-3x increase in efficiency. Hence, the "brute force" meme about the Series X is wildly off-base.

As for which of the PS5 or Series X I/O system is better? I say let the DF face-offs begin.

End quote

New quote (starts with quoting ShiftyGeezer)

I will quote your own thoughts on the matter as response (from the UE5 thread):

"The moment the data is arranged this way, we can see how virtualised textures would also apply conceptually to the geometry in a 2D array, along with how compression can change from having to crunch 3D data. You don't need to load the whole texture to show the model, but only the pieces of it that are viewable, which is the same problem as picking which texture tiles with virtual texturing.

Very clever stuff."

Ronaldo8:

The Unreal Engine team has devised a software solution for a problem that Microsoft has resolved in hardware.

But sampler feedback in truth answers two questions:
(1) What MIP level was ultimately sampled (the LOD problem), i.e. what MIP level to load next.
(2) Where exactly in the resource it was sampled (which tiles were sampled). This is based on what's visible to the camera; basically, what MIP to load next.

SFS is the streaming of only visible assets at the correct level of detail. So yeah, a software implementation of a solution already found in hardware.

End new quote

Read this article:


And one more quote from Scott_Arm

Start quote

Microsoft's solution is virtual texturing with sampler feedback for accurate mip and tile selection, plus some hardware filters to blend from a low-resolution mip to a high-resolution mip in case the high-resolution mip is not loaded in time for the current frame. So they have some guarantee of the low-quality mip arriving on time, and then they blend to the high quality if it's late, so you don't notice pop-in. It should be overall more efficient in making sure they don't waste memory on pages they don't need.

End quote
This sums everything up perfectly, thank you for that. Basically, the difference isn't as big as the SSD speeds would have you think, and we can look forward to amazing games utilising these fast solutions. The June/July game events can't come soon enough.
 

Panajev2001a

GAF's Pleasant Genius
This sums everything up perfectly, thank you for that. Basically the difference isn't as big as the SSD speeds would have you think

The Ronaldo user from B3D is speculating and comparing apples to oranges: the scenario he paints has Sony brute-forcing it by pushing the SSD I/O throughput through the roof, while only XVA would be doing hardware-assisted virtual texturing / texture streaming, to make the gap appear much, much smaller, if not absent.

SFS may make virtual texturing more accessible to developers, as there is less to reimplement in software, and it may be more efficient (easier to get the right thing done), but it is not an entirely new paradigm bringing in improvements of such magnitude. XVA is a great step forward for the industry; it is not the end of the world if PS5's SSD solution is still a lot faster all things considered (virtual texturing with or without tiled resources / PRT, and a 2x faster SSD I/O pipe).
 
D

Deleted member 775630

Unconfirmed Member
The Ronaldo user from B3D is speculating and comparing apples to oranges: the scenario he paints has Sony brute-forcing it by pushing the SSD I/O throughput through the roof, while only XVA would be doing hardware-assisted virtual texturing / texture streaming, to make the gap appear much, much smaller, if not absent.
He clearly explains the two different approaches, and you might read it as a brute-force approach, but that's not how I read it. Just two different approaches to solving the same problem. The fact that Microsoft says they can instantly access 100 GB should tell you enough about the magnitude of the implementation.
 

geordiemp

Member
To clarify, the XSX memory is reported to be six 2 GB modules, each with an upper and lower 1 GB address range (= 12 GB), plus four 1 GB modules, making ten memory segments.

These 10 modules are each accessed over 2 bidirectional 16-bit lanes (32 bits x 10 modules = 320-bit bus). Here is where the info about the architecture gets a little murky:

Those 16-bit lanes should be able to access all memory addresses on the chips to which they are attached.

But the way the architects describe it, the CPU is assigned access to 6 out of the 10 channels, and it is reserved the upper 1 GB address range of the 2 GB modules only.

The lower 1 GB address range on those 2 GB chips is reserved for GPU work (6 GB).

The four 1 GB modules are reserved for the GPU at all times... so the lower 6 GB of the 2 GB modules plus the 4 x 1 GB = 10 GB of VRAM. The 2 GB chips might be subject to contention, depending on whether accessing the upper 1 GB of a 2 GB module uses both lanes at full speed (both 16-bit lanes at 56 GB/s) or half speed (one 16-bit lane at 28 GB/s).

This setup is quite confusing because it seems that if the GPU were to use full-bandwidth access to all 10 modules, the entire system bandwidth would be consumed and the CPU wouldn't have access to its 6 GB.

I can't find a permutation of accesses where the CPU and GPU both get their full bandwidth simultaneously. The most I could come up with is the GPU taking the full bandwidth of the 4 x 1 GB chips (4 x 56 = 224 GB/s) plus half the bandwidth to the lower address range of the 2 GB chips (6 x 28 = 168 GB/s), for 392 GB/s max bandwidth without contention. The CPU would consume the remaining bandwidth (168 GB/s).
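
For what it's worth, here is that arithmetic laid out explicitly, assuming 14 Gbps GDDR6 (i.e. 56 GB/s per 32-bit chip); the split itself is the speculation described above, not anything MS has confirmed:

Code:
# Back-of-the-envelope check of the bandwidth scenario described above,
# assuming 14 Gbps GDDR6 on a 32-bit interface per chip: 14 * 32 / 8 = 56 GB/s.
PER_CHIP_FULL = 14 * 32 / 8        # 56 GB/s, both 16-bit lanes of a chip
PER_CHIP_HALF = PER_CHIP_FULL / 2  # 28 GB/s, one 16-bit lane only

total_bus    = 10 * PER_CHIP_FULL  # 560 GB/s across all ten chips
gpu_1gb      = 4 * PER_CHIP_FULL   # the four 1 GB chips at full rate: 224 GB/s
gpu_2gb_half = 6 * PER_CHIP_HALF   # lower half of the 2 GB chips: 168 GB/s
gpu_total    = gpu_1gb + gpu_2gb_half    # 392 GB/s in this scenario
cpu_leftover = total_bus - gpu_total     # 168 GB/s left over for the CPU

print(total_bus, gpu_total, cpu_leftover)  # 560.0 392.0 168.0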

Sorry for the long post.

A nice article on how Nvidia does GDDR6 is useful reading: there are 4 actions per clock going on, the signals need careful timing, and memory is read as a slice across all the modules to give the desired speed. Hence it's hard to read differing parts due to timing differences if you don't slice "evenly"...

But we have no info other than the MS spec. It is almost an analogue / waveform signal constraint, and as differing memory densities are not a normal thing we see, I am sure the teardown will be very interesting.




The signals from GDDR6 memory are not what you would think of as normal digital signalling; it's complex.
 

Panajev2001a

GAF's Pleasant Genius
He clearly explains the two different approaches, and you might read it as a brute-force approach, but that's not how I read it. Just two different approaches to solving the same problem. The fact that Microsoft says they can instantly access 100 GB should tell you enough about the magnitude of the implementation.

I think he is comparing apples to oranges. Sony's approach seems roughly similar to the approach MS took with XVA, with likely more logic spent to accelerate SSD I/O and higher throughput (GB/s).
XVA is a marketing term for a solution Sony has no name for; maybe they should have called it Lightning Data Transfer Architecture or Infinite RAM Architecture ;).

The block Cerny described in his presentation seems not to be trivially designed, but something they spent quite a few transistors and a good amount of R&D time on, resources they did not add to the GPU (hence the TFLOPS gap)... they are probably happy, as they were able to open a bigger gap in I/O than the one they lost to XSX in the TFLOPS war.
 

Jon Neu

Banned
He clearly explains the two different approaches, and you might read it as a brute-force approach, but that's not how I read it. Just two different approaches to solving the same problem. The fact that Microsoft says they can instantly access 100 GB should tell you enough about the magnitude of the implementation.

Exactly.

I think I should visit Beyond3D more; there seems to be far more information there without the obvious agenda.
 
D

Deleted member 775630

Unconfirmed Member
I think he is comparing apples to oranges. Sony's approach seems roughly similar to the approach MS took with XVA, with likely more logic spent to accelerate SSD I/O and higher throughput (GB/s).
XVA is a marketing term for a solution Sony has no name for; maybe they should have called it Lightning Data Transfer Architecture or Infinite RAM Architecture ;).

The block Cerny described in his presentation seems not to be trivially designed, but something they spent quite a few transistors and a good amount of R&D time on, resources they did not add to the GPU (hence the TFLOPS gap)... they are probably happy, as they were able to open a bigger gap in I/O than the one they lost to XSX in the TFLOPS war.
Yeah, I get that. It's just that Microsoft is basically saying: we believe you don't need such fast I/O when applying our XVA solution. But the games will tell.
 

Panajev2001a

GAF's Pleasant Genius
Yeah, I get that. It's just that Microsoft is basically saying: we believe you don't need such fast I/O when applying our XVA solution. But the games will tell.

Sure, though I do not see XVA as something you apply so much as the name for what they have assembled. Your statement, which may be fine (I disagree, but that is neither here nor there), is essentially saying that the extra bandwidth going from 4.8 GB/s to 8-9 GB/s is not needed, and maybe it is not needed for the console experience MS wants to drive overall, so they invested their budget elsewhere (and it shows, it is a solid piece of HW :)).

This is not what you said earlier and what that Ronaldo poster from B3D implied: that the I/O gap was actually really narrow because of XVA or something. I do not think that is the case, but if you have evidence to the contrary I would be happy to discuss it over here please :).
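
For a sense of scale on the bandwidth figures mentioned above, a trivial calculation of how long a streaming burst would take at the two quoted compressed rates (4.8 GB/s is MS's number, 9 GB/s the commonly cited typical figure for PS5; both are vendor claims, not measurements):

Code:
# Rough time to stream a burst of compressed texture data at the quoted rates.
for gb in (0.5, 1.0, 2.0):
    t_xsx = gb / 4.8 * 1000  # milliseconds at 4.8 GB/s
    t_ps5 = gb / 9.0 * 1000  # milliseconds at 9 GB/s
    print(f"{gb} GB: ~{t_xsx:.0f} ms vs ~{t_ps5:.0f} ms")
# 0.5 GB: ~104 ms vs ~56 ms
# 1.0 GB: ~208 ms vs ~111 ms
# 2.0 GB: ~417 ms vs ~222 ms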
 
The use of the word "instantly" in this scenario doesn't really mean anything unless MS defines what it means in a specific context (for the time being, it can only be described as a relative term).

The I/O throughput numbers for both platforms have been officially revealed, and if this doesn't fall within the scope of those revealed I/O numbers, MS should come up with a new metric that can explain it.
 
Last edited:
D

Deleted member 775630

Unconfirmed Member
This is not what you said earlier and what that Ronaldo poster from B3D implied: that the I/O gap was actually really narrow because of XVA or something. I do not think that is the case, but if you have evidence to the contrary I would be happy to discuss it over here please :).
The evidence is the interviews with Microsoft engineers saying that 100 GB is instantly accessible. Unless you are saying their engineers are lying? Obviously I can't prove what they are saying, because I don't have the hardware and can't run tests myself. So if they say this is possible, I just believe Microsoft, since we can't do our own tests at the moment.
 
The evidence is the interviews with Microsoft engineers saying that 100 GB is instantly accessible. Unless you are saying their engineers are lying? Obviously I can't prove what they are saying, because I don't have the hardware and can't run tests myself. So if they say this is possible, I just believe Microsoft, since we can't do our own tests at the moment.
They reserve 100 GB for paging. That's it. Nothing stopping Sony from doing the same.
 

CobraXT

Banned
Why is the SSD supplying data to the GPU directly a good thing? The SSD is still nowhere near the performance of GDDR6. Shouldn't the SSD supply data to the VRAM, and then the VRAM to the GPU?
 
Last edited:
Why is the SSD supplying data to the GPU directly a good thing? The SSD is still nowhere near the performance of GDDR6. Shouldn't the SSD supply data to the VRAM, and then the VRAM to the GPU?
It's delivering to VRAM directly; this doesn't usually happen on your PC. It's directly mappable to the VRAM, and the SSD can be used as a fast swap space.
 
And on XSX you are limited to 100 GB to do this, but on PS5 you should have the entire SSD to do it instantly, right?
You typically reserve space on the drive for this, or a dedicated partition. If you did this with your whole drive, well, then you would probably not have many games on there :)

You can map an entire 64-bit address space to a disk. That's a pretty large disk.
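
As a rough PC analogy for "map a big file into the address space and let the OS page it in on demand" (this is only the general concept; it is not how either console actually implements it), Python's mmap module can demonstrate the idea; the file name here is made up:

Code:
# PC analogy only: map a large file into the process address space so the OS
# pages chunks in on demand; touching an offset triggers the page-in.
import mmap
import os

path = "big_asset_pack.bin"  # hypothetical file standing in for game data
if not os.path.exists(path):
    with open(path, "wb") as f:
        f.truncate(256 * 1024 * 1024)  # sparse 256 MB file just for the demo

with open(path, "rb") as f:
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Nothing has been bulk-read yet; slicing faults the needed pages in.
    offset = 128 * 1024 * 1024
    chunk = view[offset:offset + 4096]
    print(len(chunk))  # 4096
    view.close()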
 

CobraXT

Banned
Perhaps you could show how you arrive at this conclusion....

I still think the performance difference will lean more towards the CU count than towards the TF number difference.


Bumping the clocks up to 2.1 GHz on the RX 5700 (an 18 percent boost over stock) generally yields just 5-10 percent higher performance. This is due to various reasons, including sensitive boosting behavior.
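
One way to read that quote: compare the relative clock increase against the relative performance gain to get a scaling efficiency (numbers taken straight from the quote above):

Code:
# Scaling efficiency of the overclock described above.
clock_gain = 0.18                 # +18% core clock on the RX 5700
perf_low, perf_high = 0.05, 0.10  # 5-10% measured uplift
print(perf_low / clock_gain, perf_high / clock_gain)
# Roughly 0.28 to 0.56: only about a quarter to a half of the clock increase
# shows up as real performance, i.e. well short of linear scaling.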
 
Last edited: