
Xbox Velocity Architecture - 100 GB is instantly accessible by the developer through a custom hardware decompression block

Panajev2001a

GAF's Pleasant Genius
So it’s 12 GB/s, things just got interesting. It does not run that but it compares to 12 GB/s, as you can’t go faster than 4.8 GB/s.

Sure, but by the same token you can double or triple the number on PS5’s side the same way. While the enhanced PRT implementation they have on PS5 may not be as automated or efficient as XSX’s one (the difference is more likely to be some minor shader/compute cost than a lot less data transferred), it still allows you to load less data into memory, as you only stream in what is actually needed.

So, this number on both sides is the bandwidth you would need to transfer the full data without using tiled resources/PRT/SFS or virtual texturing schemes. It is not incorrect, but it can be misleading, especially if used to compare XSX to XOX, PS5 or PS4.

GODbody

Member

Many are focusing on the bandwidth savings but the magic and revolution comes from the streaming aspect.




This speaks to transferring pages as needed on a per-frame basis: keeping only the visible part of the residency sample resident in memory, beginning the data transfer for the rest of the residency sample for the next frame, and streaming pages in and out as needed so they are delivered "just in time" to be used by the GPU.

This process is hardware accelerated and has moved the residency map to dedicated hardware giving performance back to the GPU. Due to PS5's current GPU setup I don't think it's reasonable for them to increase the gap between their GPU and the Series X GPU in order to try and emulate SFS.

But yes, if devs want to try and emulate SFS they probably could. Would this be feasible though? By taking a hit to GPU and CPU performance and, depending on the latency of the data delivery and whether the system can sustain a consistent bandwidth, they could deliver something similar to SFS. But this implementation would likely not be on a frame-by-frame basis, and would have worse LOD transitions without the custom texture filters.

Streaming data on a per-frame basis, and keeping resident in memory only the pages the GPU needs for the next frame, is the difference between PRT and SFS. That's where the real bandwidth and memory savings happen. They've also improved the texture filtering process to improve blending and reduce visible LOD transitions.

A first enhancement includes a hardware residency map feature comprising a low-resolution residency map that is paired with a much larger PRT, and both are provided to hardware at the same time. The residency map stores the mipmap level of detail resident for each rectangular region of the texture. PRT textures are currently difficult to sample given sparse residency. Software-only residency map solutions typically perform two fetches of two different buffers in the shader, namely the residency map and the actual texture map. The primary PRT texture sample is dependent on the results of a residency map sample. These solutions are effective, but require considerable implementation changes to shader and application code, especially to perform filtering the residency map in order to mask unsightly transitions between levels of detail, and may have undesirable performance characteristics. The improvements herein can streamline the concept of a residency map and move the residency map into a hardware implementation.

A second enhancement includes an enhanced type of texture sample operation called a “residency sample.” The residency sample operates similarly to a traditional texture sampling, except the part of the texture sample that requests texture data from cache/memory and filters the texture data to provide an output value is removed from the residency sample operation. The purpose of the residency sample is to generate memory addresses that reach the page table hardware in the graphics processor but do not continue on to become full memory requests. Instead, the residency of the PRT at those addresses is checked and missing pages are non-redundantly logged and requested to be filled by the OS or a delegate.
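The two enhancements quoted above can be illustrated with a small toy model. This is only a sketch under my own assumptions, not the patent's or the console's implementation: the class `SparseTexture`, its fields and the `fill` method are hypothetical names, mip 0 is taken to be the most detailed level, and only the coarsest mip is assumed resident at start.

```python
from dataclasses import dataclass, field


@dataclass
class SparseTexture:
    """Toy partially resident texture (hypothetical model, mip 0 = most detailed)."""
    mip_count: int
    regions: int  # number of regions tracked by the low-resolution residency map
    # residency_map[r] = most detailed mip level currently resident for region r
    residency_map: list = field(default_factory=list)
    # missing (region, mip) pages, logged without duplicates
    pending: set = field(default_factory=set)

    def __post_init__(self):
        if not self.residency_map:
            # Assumption: only the coarsest mip is resident everywhere at start.
            self.residency_map = [self.mip_count - 1] * self.regions

    def sample(self, region, wanted_mip):
        """Regular sample: clamp to the best resident mip instead of waiting."""
        return max(wanted_mip, self.residency_map[region])

    def residency_sample(self, region, wanted_mip):
        """Check residency only; log each missing page once, fetch no texels."""
        for mip in range(wanted_mip, self.residency_map[region]):
            self.pending.add((region, mip))

    def fill(self, region, mip):
        """Stand-in for the OS or a delegate loading a requested page."""
        self.pending.discard((region, mip))
        self.residency_map[region] = min(self.residency_map[region], mip)


tex = SparseTexture(mip_count=4, regions=2)
tex.residency_sample(0, 1)   # ask for mip 1 in region 0
tex.residency_sample(0, 1)   # asking again logs nothing new
print(sorted(tex.pending))   # pages waiting to be streamed in
tex.fill(0, 1)
print(tex.sample(0, 0))      # best mip now available for region 0
```

The point of the sketch is the split between the two operations: `sample` never waits for data and simply falls back to a coarser mip, while `residency_sample` only records what was missing so it can be streamed in for a later frame.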
 

Panajev2001a

GAF's Pleasant Genius


SFS is the application of the GPU-provided feedback (the SF part that tells you what the GPU was trying to render) to help you automatically stream in data and trigger page faults before you actually need to use the data (additional instructions that do not block until the sampled texture reaches GPU memory, but verify that the data is there or essentially trigger the page fault that will cause it to be loaded in). This is now done for you in HW at a low cost, but not a zero cost.

Concerning SF/SFS, there is nothing I am aware of in the DX tech literature (videos and docs presenting this feature to devs) that suggests that a.) this is magical and a game changer and, most importantly, b.) that SF is free and you should rely on it to stream your resources in and out.

I doubt that a.) Sony has stuck to 11-year-old PRT without any HW enhancements and b.) that the two solutions, SFS and PRT + async compute shaders (there are more free, otherwise stalled resources than you would think), have very different per-frame characteristics.

Why were SF and SFS invented? They improve the state of the art or codify vendor-specific implementations in the DX spec and democratise a feature/make it easier to use: both consoles have small main RAM compared to the previous generation (a 2x jump, or less if you consider XOX), which means you must be able to do per-frame streaming.

BTW, per frame texture streaming is again not new: N64 and PS2 heavily depended on it.
 

ZywyPL

Banned
The SFS sounds really cool, but the real question is - is it something built into the system, automatic, with no effort required from the devs, or does it actually require a lot of work and engine modifications from the devs to make it actually work? Because if it's the latter, then I don't think it will be used outside of 1st party games, as always with proprietary solutions.
 
Panajev2001a, I don't think it's ever been suggested SF is "free"; free of an otherwise heightened silicon resource cost if implemented in a more generic fashion? Of course. But a feature that's just there to automate the process? No. That's part of the reason the GPU customization for texture/mip blending has been added, in case there is a miss and the higher-level mip isn't available in time.

That should suggest on its own there is a "cost" in terms of programming efficiency, but that really does fall back on the programmer of the software and might come with a slight learning curve to master. I'm assuming a dev can utilize ML to "train" a behavior for SFS logic and rely on the dedicated hardware to implement it as required, since it's tailored to the purpose.

No, I don't think Sony's just sticking to the same PRT features in PS4 and simply "port that over" to PS5; they've probably done some work on it. But, there's nothing guaranteeing they've taken the same approach as SFS and there's also no guarantee their approach will be as efficient, either. Assuming any sort of PRT "2.0" implementations in PS5 also leveraging async compute are present, you'd still have to consider how much async compute resources you'd need to push near the throughput of MS's efforts with SFS. And even in that case, SFS can probably be usable alongside more generic supplemental approaches that can leverage async compute of the Series GPU all the same.

I think the biggest question would be if any hypothetical PRT "2.0" on PS5 has the level of customization needed to warrant explicit mention by Sony (honestly I think if this were the case they would mention it in Road to PS5 wouldn't they?), and has been scaled up to a point of being usable for next-gen data streaming workloads. At the very least, whatever areas on this Sony are not focused on, devs can rely on roughly analogous approaches in engines like UE5, even if it means using a few more system resources to simulate it.

The SFS sounds really cool, but the real question is - is it something built into the system, automatic, with no effort required from the devs, or does it actually require a lot of work and engine modifications from the devs to make it actually work? Because if it's the latter, then I don't think it will be used outside of 1st party games, as always with proprietary solutions.

It'll most likely be a mixture, leaning towards a more automated process. But it will require some effort: there's a reason there's blending hardware for the mips in case the higher-quality mip doesn't arrive in time.

Devs probably need to get acquainted with it a bit but, since it's a (pretty massive) evolution from PRT in terms of the foundation, devs should be able to get easily familiar with it and can push utilization of it throughout the hardware's life cycle.
 

Panajev2001a

GAF's Pleasant Genius

Lots of valid points, and surely they put a lot of emphasis on this tech in the context of XVA, as they had tons more work to do to overcome developers' and customers' perception of how it compares against Sony’s solution. It would not surprise me if you go from ~2x in raw bandwidth difference to 1.6-1.8x when all is said and done (counting the improvements MS may have made in the most optimistic way possible, assuming Sony does not have any comparable solution, and thus taking their work on texture streaming improvements in the worst possible way).

I do not think that the “if they had improvement X over feature Y, it would be in the Road to PS5 talk” argument is as tight a proof as you posit here, especially given the three pillars the presentation was centered on (pursuing the SSD dream or listening to developers, balancing evolution and revolution, and finding new dreams or Tempest 3D audio).

If we wanted to take this quite literally, maybe PS5 dropped PRT and tiled resources altogether, as they were not mentioned in the presentation at all, and we could write paragraphs about HW features and the transistors used to implement them, clock speed, and how the two are related... thus reaching the conclusion that yes, perhaps PRT was indeed removed, as the SSD is fast enough with compression, and they sacrificed ALL features not mentioned explicitly in the presentation to lower complexity and increase clock speed.
 

GODbody

Member
SFS is the application of the GPU provided feedback (the SF part that tells you what the GPU was trying to render) to help you automatically stream in data and trigger page faults before you will need to actually use the data (additional instructions that do not block until the sampled texture reaches GPU memory, but verify that the data is there or essentially triggers the page fault that will cause that to be loaded in). This is now done for you in HW at a low cost, but not a zero cost.

Concerning SF/SFS, there is nothing I am aware of in the DX tech literature (videos and docs presenting this feature to devs) that suggest that a.) this is magical and a game changer and most importantly b.) that SF is free and you should rely on it to stream your resources in and out.

I doubt that a.) Sony has stuck to 11 years old PRT without any HW enhancements and b.) that both solutions PRT+async compute shaders (there are more free otherwise stalled resources than you would think) have much different per frame characteristics.

Why were SF and SFS invented? They improve the state of the art or codify vendor specific implementations in the DX spec and democratise a feature/make it easier to use: both consoles have small main RAM compared to the previous generation (2x jump or less if you consider XOX means you must be able to do per frame streaming).

BTW, per frame texture streaming is again not new: N64 and PS2 heavily depended on it.

SFS doesn't cause the GPU to stall and wait for the data in question; it delivers a lower quality mip if the higher quality one isn't readily available. So it's not fair to compare it to a page fault, as a page fault would cause the application to wait for the data, thus stalling the GPU and increasing frame-time.

I'm sure Sony has made improvements to PRT, but using PRTs does increase frame-times, and if they had made improvements to PRTs significant enough to reduce the data footprint in a noteworthy way, I'm sure they would have mentioned that.

The Series X spec breakdown by Digital Foundry
A technique called Sampler Feedback Streaming - SFS - was built to more closely marry the memory demands of the GPU, intelligently loading in the texture mip data that's actually required with the guarantee of a lower quality mip available if the higher quality version isn't readily available, stopping GPU stalls and frame-time spikes. Bespoke hardware within the GPU is available to smooth the transition between mips, on the off-chance that the higher quality texture arrives a frame or two later. Microsoft considers these aspects of the Velocity Architecture to be a genuine game-changer, adding a multiplier to how physical memory is utilised.

MS considers it a game changer.

While the N64 and PS2 had per-frame streaming of data, they weren't streaming 4K+ textures at 2.4 GB/s at a target of 60 fps.

Lots of valid points and surely they put a lot of emphasis on this tech in the context of XVA as they had tons more work to do to overcome the developers and customers perception around how it compares against Sony’s solution and it does not surprise me if you go from ~2x in raw bandwidth difference to 1.6-1.8x when all is said and done (counting some improvements MS may have done in the most optimistic way possible and assuming Sony does not have any comparable solution and thus taking their work on texture streaming capability improvements in the worst possible way).

I do not think that the “if they had improvement X over feature Y in the road to PS5 talk” argument is as tight a proof as you posit here. Especially in terms of advancing the three pillars the presentation was centered on (pursuing the SSD dream or listening to developers, balancing evolution and revolution, and finding new dreams or Tempest 3D audio).

If we wanted to take this quite literally maybe PS5 dropped PRT and tiles resources altogether as it was not mentioned in the presentation at all and we could write paragraphs about HW features and transistors used to implement them, clock speed, and how the two are related... thus reaching the conclusion that yes perhaps PRT was indeed removed as the SSD is fast enough with compression in and they sacrificed ALL features not mentioned in the presentation explicitly to lower complexity and increase clock sped.

MS have already stated that SFS is an average improvement of 2.5x on bandwidth and memory. James Stanard says this stacks with compression. There's no need to suggest anything otherwise.

 

geordiemp

Member
This process is hardware accelerated and has moved the residency map to dedicated hardware giving performance back to the GPU. Due to PS5's current GPU setup I don't think it's reasonable for them to increase the gap between their GPU and the Series X GPU in order to try and emulate SFS.

Source for "hardware accelerated" and "dedicated hardware"?

All I read was about software, or what it's doing (function), or the benefit, not what dedicated hardware is used.

The only new bespoke hardware discussed by Goossen is below:

A technique called Sampler Feedback Streaming - SFS - was built to more closely marry the memory demands of the GPU, intelligently loading in the texture mip data that's actually required with the guarantee of a lower quality mip available if the higher quality version isn't readily available, stopping GPU stalls and frame-time spikes. Bespoke hardware within the GPU is available to smooth the transition between mips, on the off-chance that the higher quality texture arrives a frame or two later.
 

just tray

Member
Faster isn't always better. A game still needs to operate in real time and an SSD is no GPU.

Both systems will be great, but we are talking about an Xbox vs PS2, if not greater, performance delta, and on July 23rd not only will Halo impress but so will the 60 fps and 120 fps games. I suspect that, like Dirt 5, many games will opt for a sub-4K 120 fps or checkerboard approach. PS5 is good but not on par with Series X.

It's not like you only have to stream in textures. What about A.I. and physics? What about ray traced levels and worlds? The GPU still has to be fed info.

SSDs are not GPUs, for the last time!
 

Panajev2001a

GAF's Pleasant Genius
SFS doesn't cause the GPU to stall and wait for the data in question, it delivers a lower quality mip if the higher quality one isn't readily available. So it's not a fair comparison to compare it to a page fault as a page fault would cause the application to wait for the data, thus stalling the GPU and increasing frame-time
No, I said that the instruction they added/described avoids the stall not that it causes it. In virtual memory terms it is a page miss that is generated by a custom non blocking sample instruction which does not wait for the data to become available. I am not sure why you are taking an issue with the least contentious part of what I wrote hehe.

I'm sure Sony has made improvements to PRT but using PRTs does increase frame-times and if they had made some significant enough improvements to PRTs to reduce the footprint of data enough to make it note worthy I'm sure they would have mentioned that.
Not sure they needed to hence why they did not, at least in the public Road to PS5 video. That presentation had three clear goals to hit properly, did its job just fine, and was already 54 minutes long or so.

MS considers it a game changer
... and? Consider it? Market it? Be it? And...? Sorry, they are both saying the same thing: transformative, and changing the way gaming was before it. Sony had already announced an SSD solution faster than any commercially available one back in early 2019; MS had more motive to try to change the narrative on it.

While the N64 and PS2 had per frame streaming of data they weren't streaming 4k+ textures at 2.4 GB/s at a target of 60 fps
PS2 had lots of 60 FPS games and had only 4 MB of total scratch pad video memory (depth, front, back, and off-screen render buffers shared the same 4 MB space as texture memory... say in many cases you had ~2 MB for texture data per frame, not heavily compressed in terms of how it was generally stored in video memory), 32 MB of main RAM running at 3.2 GB/s (plus 2 MB of RAM for the sound processor), and a 4.7/8 GB DVD disc. Nevertheless it relied on streaming assets, individual mip levels and portions of textures per frame: sometimes decompressing them on the GS itself, and sometimes they were JPEG-like images decompressed by the IPU on demand (150 MPixels/s worth of data).

Comparatively PS2 was streaming a huge amount of data.

MS have already stated that SFS is an average improvement of 2.5x on bandwidth and memory. James Stanard says this stacks with compression. There's no need to suggest anything otherwise.



Not disputing that, I am disputing that such a 2.5x multiplier has PRT based virtual texturing or similar as baseline.
 

Lort

Banned

Not disputing that, I am disputing that such a 2.5x multiplier has PRT based virtual texturing or similar as baseline.

yup they said 2.5x current gen solutions... so that def includes virtual texturing and PRT.
 

Panajev2001a

GAF's Pleasant Genius
yup they said 2.5x current gen solutions... so that def includes virtual texturing and PRT.

No, it does not necessarily mean that from what they said. It would not explain a 2.5x improvement in streaming bandwidth alone, based on anything I have seen from people here, on Twitter or wherever. Stanard's examples mention loading only the visible portion of the texture instead of the whole texture, and use that to derive the “multiplier”... you already have the fastest console TFLOPS-wise, no need to add unsubstantiated/unrealistic stuff on top.
 

THE:MILKMAN

Member
yup they said 2.5x current gen solutions... so that def includes virtual texturing and PRT.

The question is what is the context of the 2.5x? James Stanard gives the example using standard 64KB tiles. This is tiny in RAM terms and a 2.5x multiplier of 64KB isn't so sexy and of course we don't yet know if Sony will have anything similar in any case.
 

GODbody

Member
Source for "hardware accelerated" and "dedicated hardware"?

All I read was about software, or what it's doing (function), or the benefit, not what dedicated hardware is used.

The only new bespoke hardware discussed by Goossen is below:

It's in the patent filing.

A first enhancement includes a hardware residency map feature comprising a low-resolution residency map that is paired with a much larger PRT, and both are provided to hardware at the same time. The residency map stores the mipmap level of detail resident for each rectangular region of the texture. PRT textures are currently difficult to sample given sparse residency. Software-only residency map solutions typically perform two fetches of two different buffers in the shader, namely the residency map and the actual texture map. The primary PRT texture sample is dependent on the results of a residency map sample. These solutions are effective, but require considerable implementation changes to shader and application code, especially to perform filtering the residency map in order to mask unsightly transitions between levels of detail, and may have undesirable performance characteristics. The improvements herein can streamline the concept of a residency map and move the residency map into a hardware implementation.
That new bespoke hardware is the hardware based residency map.

No, I said that the instruction they added/described avoids the stall not that it causes it. In virtual memory terms it is a page miss that is generated by a custom non blocking sample instruction which does not wait for the data to become available. I am not sure why you are taking an issue with the least contentious part of what I wrote hehe.


Not sure they needed to hence why they did not, at least in the public Road to PS5 video. That presentation had three clear goals to hit properly, did its job just fine, and was already 54 minutes long or so.


... and? Consider it? Market it? Be it? And...? Sorry, they both are saying the same thing: transformative and changed the way gaming was before It. Sony has already announced a faster than any commercially available SSD’s solution back in early 2019, MS had more motive to try to change the narrative on it.


PS2 had lots of 60 FPS games and had only 4 MB of total scratch pad video memory (depth, front, back, and off screen render buffers shared the same 4 MB space as texture memory... say in many cases you had ~2 MB for texture data per frame not heavily compressed in terms of how it was generally stored in video memory) and 32 MB of main RAM running at 3.2 GB/s (and 2 MB of RAM for the sound processor) and a 4.7/8 GB DVD disc. Nevertheless it relied on streaming assets, individual mip levels and portion of textures per frame: sometimes decompressing them on the GS itself and sometimes they were JPEG like images decompresses by the IPU on demand (150 MPixels/s worth of data).

Comparatively PS2 was streaming a huge amount of data.



Not disputing that, I am disputing that such a 2.5x multiplier has PRT based virtual texturing or similar as baseline.

My apologies, I misinterpreted your statement on page faults.

You stated that
Concerning SF/SFS, there is nothing I am aware of in the DX tech literature (videos and docs presenting this feature to devs) that suggest that a.) this is magical and a game changer and most importantly b.) that SF is free and you should rely on it to stream your resources in and out.

I was just giving you an example of that. I don't know how they can call something a game changer without it sounding like marketing.

As for the PS2 comparison, I just felt it was a bit silly to bring up the texture streaming capabilities from 3 generations ago in a discussion on next-gen consoles, considering the order of magnitude leap in speeds.

The 2.5x multiplier should just be considered as it's been stated in official reports:

This innovation results in approximately 2.5x the effective I/O throughput and memory usage above and beyond the raw hardware capabilities on average. SFS provides an effective multiplier on available system memory and I/O bandwidth, resulting in significantly more memory and I/O throughput available to make your game richer and more immersive.

It's likely not considering the efficiencies and savings brought forth by PRT, just a flat multiplier on raw bandwidth and available memory (for textures). So the effective throughput (for textures) becomes 6 GB/s (which, incidentally, is the speed of the decompression block), giving an effective bandwidth of 12 GB/s worth of textures if their compression ratio is 2:1, with the memory being able to hold roughly 25 GB worth of textures in the texture-optimized pool of memory.

It's really going to depend on how effective their compression ratio for BCPack is. If they've pulled off some wizardry and gotten their compression ratio for textures to something like 5:1, that's an effective 30 GB/s of texture data.
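The arithmetic in this post can be written out as a short sketch. It only reproduces the post's own reasoning (a flat 2.5x multiplier on raw numbers, then a compression ratio on top); the 10 GB pool size is my assumption, chosen so that the result matches the "roughly 25 GB" figure, and the function names are made up for illustration.

```python
# Figures from the thread; the 10 GB pool size is an assumption for illustration.
RAW_SSD_GBPS = 2.4       # raw SSD throughput mentioned earlier in the thread
SFS_MULTIPLIER = 2.5     # Microsoft's stated average multiplier
TEXTURE_POOL_GB = 10.0   # assumed size of the texture-optimized memory pool


def effective_texture_gbps(compression_ratio):
    """Flat multiplier on raw bandwidth, then scaled by the compression ratio."""
    return RAW_SSD_GBPS * SFS_MULTIPLIER * compression_ratio


def effective_texture_pool_gb():
    """Flat multiplier on the memory available for textures."""
    return TEXTURE_POOL_GB * SFS_MULTIPLIER


for ratio in (1, 2, 5):
    print(f"{ratio}:1 compression -> {effective_texture_gbps(ratio):.1f} GB/s of texture data")
print(f"effective texture pool: {effective_texture_pool_gb():.1f} GB")
```

Whether the 2.5x figure can really be stacked on top of the compression ratio like this is exactly what is being argued in the thread; the sketch only shows where the 6, 12, 25 and 30 figures come from.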
 
So Ronaldo8 on B3D found some interesting papers related to MS, published by Anirudh Badam as Principal Research Scientist, via IEEE.

microsoft.com/en-us/research/wp-content/uploads/2016/02/flashmap_isca2015.pdf

Here's some bits from the first couple of pages that might be of interest related to memory mapping and addressing latency:

Applications can map data on SSDs into virtual memory to transparently scale beyond DRAM capacity, permitting them to leverage high SSD capacities with few code changes.

Sounds a lot like an HBCC is involved here to facilitate this.

Obtaining good performance for memory-mapped SSD content

They even literally say memory-mapped right here xD. So that relates with what the Louise Kirby person discussed on Twitter a month or so ago.

the file system and the flash translation layer (FTL) perform address translations, sanity and permission checks independently from each other. We introduce FlashMap, an SSD interface that is optimized for memory-mapped SSD-files. FlashMap combines all the address translations into page tables that are used to index files and also to store the FTL-level mappings without altering the guarantees of the file system or the FTL. It uses the state in the OS memory manager and the page tables to perform sanity and permission checks respectively.

By combining these layers, FlashMap reduces critical-path latency and improves DRAM caching efficiency. We find that this increases performance for applications by up to 3.32x compared to state-of-the-art SSD file-mapping mechanisms. Additionally, latency of SSD accesses reduces by up to 53.2%.

Bolded the pertinent parts. So here we see they are addressing the FTL (this relates to an idea of what MS may have been doing that user 'function' mentioned on B3D; I linked their stuff in an earlier post). It again refers to memory mapping, suggests this would be handled in the OS (again tying into some things users 'function' and 'DSoup' mentioned on B3D in their speculation), and indicates a multiple-fold improvement for this approach compared to current file mapping on current SSDs (Jason on Twitter mentioned multiple-fold improvements in I/O throughput that would come with SFS).

Of particular interest is the noted reduction in latency of up to 53.2%.

Quickly scanning and reading more on this design they've been working on, I'm coming across some other interesting things like this:

The SSD-location of the page must be stored elsewhere while the page is cached in DRAM. We design an auxiliary index to store the SSD-locations of all the pages cached in DRAM. The auxiliary index is implemented using a simple one-to-one correspondence between DRAM pages and the SSD-Location of the block that the DRAM page may hold – a simple array of 8 byte values
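
As a rough illustration of the "simple array of 8 byte values" in that excerpt, here is a toy version of such an index. This is only a sketch of the data structure as described, not FlashMap's code: the class name, method names and the sentinel value are all hypothetical.

```python
from array import array

# Hypothetical sentinel meaning "this DRAM page holds no SSD-backed data".
NOT_CACHED = 2**64 - 1


class AuxiliaryIndex:
    """One unsigned 64-bit slot per DRAM page, holding the SSD location
    of the block currently cached in that page."""

    def __init__(self, num_dram_pages):
        # "Q" is an unsigned 64-bit integer, so each slot is 8 bytes.
        self.slots = array("Q", [NOT_CACHED]) * num_dram_pages

    def record(self, dram_page, ssd_location):
        """Remember where on the SSD the contents of this DRAM page live."""
        self.slots[dram_page] = ssd_location

    def lookup(self, dram_page):
        """Return the SSD location for a DRAM page, or None if it is empty."""
        location = self.slots[dram_page]
        return None if location == NOT_CACHED else location

    def evict(self, dram_page):
        """Clear the slot and return the SSD location it held (or None)."""
        location = self.lookup(dram_page)
        self.slots[dram_page] = NOT_CACHED
        return location


if __name__ == "__main__":
    index = AuxiliaryIndex(num_dram_pages=4)
    index.record(2, 123456)
    print(index.lookup(2), index.lookup(0), index.slots.itemsize)
```

Because the index is addressed directly by DRAM page number, a lookup is a single array read, which fits the paper's emphasis on keeping extra translation steps off the critical path.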

and this:

The aim of the experiments in this section is to demonstrate that FlashMap also brings benefits for high-end SSDs with much lower device latencies, as it performs single address translation, single sanity and permission check in the critical path to reduce latency.

5.3. DRAM vs. SSD-memory In this section, we analyze the cost effectiveness of using SSD as slow non-volatile memory compared to using DRAM with the aim of demonstrating FlashMap’s practical impact on data-intensive applications. We survey three large-scale memory intensive applications (as shown in Table 3) to conduct the cost-effectiveness analysis. For this evaluation, we ignore the benefits of non-volatility that SSDs have and purely analyze from the perspective of cost vs performance for workloads that can fit in DRAM today. Additionally, we analyze how real-world workloads affect the wear of SSDs used as memory.

We use three systems for the analysis: Redis which is an in-memory NoSQL database, MySQL with “MEMORY” engine to run the entire DB in memory and graph processing using the GraphChi library. We use YCSB for evaluating Redis, TPC-C [53] for evaluating MySQL, and page-rank and connected-component labeling on a Twitter social graph dataset for evaluating GraphChi. We modify these systems to use SSDs as memory in less than 50 lines of code each. The results are shown in Table 3. The expected life is calculated assuming 3,000 P/E and 10,000 P/E cycles respectively for the SATA and PCIe SSDs, and a write-amplification factor of two. The results show that write traffic from real-world workloads is not a problem with respect to wear of the SSD.

SSDs match DRAM performance for NoSQL stores. We find that the bottleneck to performance for NoSQL stores like Redis is the wide-area network latency and the router throughput. Redis with SATA SSD is able to saturate a 1GigE network router and match the performance of Redis with DRAM. Redis with PCIe SSD is able to saturate a 10GigE router and match the performance of Redis with DRAM. The added latency from the SSDs was negligible compared to the wide-area latency.

SSD-memory is cost-competitive when normalized for performance of key-value stores. For a 1TB workload, the SATA setup and PCIe setup cost 26.3x and 11.1x less compared to the DRAM setup ($30/GB for 32GB DIMMs, $2/GB for PCIe SSDs, $0.5/GB for SATA SSDs). The base cost of the DRAM setup is $1,500 higher as the server needs 32 DIMM slots and such servers are usually expensive because of specialized logic boards designed to accommodate a high density of DIMM slots.

Memory-mapped SSDs. Several systems have proposed using SSDs as memory [5, 6, 29, 44, 45, 47, 55, 57]. However, these systems do not present optimizations for reducing the address translation overhead. FlashMap is the first system to provide the benefits of filesystems, exploit the persistence of SSDs and provide the ability to map data on SSDs into virtual memory with low-overhead address translation.

Using SSDs as memory helps applications leverage the large capacity of SSDs with minimal code modifications. However, redundant address translations and checks in virtual memory, file system and flash translation layer reduce performance and increase latency. FlashMap consolidates all the necessary address translation functionalities and checks required for memory-mapping of files on SSDs into page tables and the memory manager. FlashMap’s design combines these layers but does not lose their guarantees. Experiments show that with FlashMap the performance of applications increases by up to 3.32x, and the latency of SSD-accesses reduces by up to 53.2% compared to other SSD-file mapping mechanisms.

In the future, we are taking FlashMap in two directions. First, we are investigating how to provide transactional and consistency guarantees by leveraging the proto-SSD for storing a transactional log. For example, we could leverage a log/journal on the proto-SSD to implement atomic modifications to the memory-mapped file. Second, we are investigating the benefits of combining the memory and file system layers for byte-addressable persistent memories. In particular, we are evaluating the benefits of a combined indirection layer for leveraging existing file system code as a control plane to manage persistent memory while leveraging virtual memory as a high-performance data plane to access persistent memory.

There's a lot here that is seemingly valuable for whatever MS is doing with XvA, so it's definitely worth giving a thorough read (I only skimmed parts quickly). At the very least it shows there's legitimate research they've put into resolving issues regarding latency and filesystem mapping that have produced tangible results, and some of the recent info we've learned regarding XvA features like SFS seem to be leveraging this work. Not only that, but it backs up a lot of speculation many of us have had throughout this thread and abroad regarding low-latency memory-mapped access, treating that 100 GB partition of the SSD as a natural extension of RAM, massively reducing overhead and filesystem bottlenecks, etc.
 
Yes, this paper is very interesting.

To me it seems Microsoft realized that even building a true monster of a machine (which they have with XSX) is not enough for making a true generational leap, so they've been trying to find clever ways to extract as much performance as possible from the hardware they have: FlashMap, Sampler Feedback, Machine Learning, replacing the traditional geometry pipeline with Mesh/Amplification shaders, etc.
 
Memory Mapped I/O is a thing on all operating systems ever.

That's literally not the point of the papers that were linked or what the research focused on, obviously.

You took my own quick inference of a piece of info from the paper and are thinking that's what the paper actually focuses on? C'mon now.

Yes, I looked into this. If the 53% is correct, the SSD will be just a little shy compared to the GDDR6, which makes you wonder.

Not necessarily. Maybe in terms of latency it significantly cuts down things getting in the way, but in terms of bandwidth that still would depend on the SSD's raw figures in that regard. Realistically that raw bandwidth never changes, so when we've seen comparisons of, say, 12 GB/s from Jason himself on Twitter, that's in reference to particular workloads and data, and how their solution greatly increases the throughput for those operations which would mean, if you were doing those given operations for the duration of a second, you'd get from that the equivalent of pulling 12 GB/s of data from storage on a more traditional approach.

But yeah, it's the great improvements upon latency that is the particularly stimulating thing here. Shaving off that much latency has big gains, and again we can turn to statements from people such as the DiRT 5 developer providing actual examples of this.
 
Read about memory-mapped I/O. I'm not saying it's a bad thing, and fast disks help. Kirby Louise was certainly spouting something about bank switching for extended memory access, but that's not applicable.

Here's a scenario where this might work out, and a scenario I myself need to deal with on a daily basis.

I need to work with a large set of objects in memory, and it may take time to build this set (in my case, several hours). When the application shuts down, the memory is lost, so I must restart the hydration process. We moved this process to a memory-mapped file that is snapshotted occasionally. As our process is linear, reading append-only streams, it's easy to snapshot and recover to the latest state, but it requires a fast disk to read and write the snapshot. For game data, access would be more random. When dealing with the file, a pointer is created, and then you use the set of objects as normal. Underneath all of this is a view management process that figures out which pages to load into memory when accessing discrete portions of state, and which to leave on disk. This is especially important when the snapshot itself is larger than the available RAM. Memory-mapped I/O plus virtualisation with fast I/O makes this possible. So the only thing left for Microsoft to do is structure their assets on disk as a memory-mapped file, rather than a typical folder structure or whatever.
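
For anyone who hasn't used memory-mapped files, a minimal sketch of the scenario above using Python's standard `mmap` module follows. The file name, sizes and function name are made up for illustration; this is the plain OS feature, not anything specific to Xbox.

```python
import mmap
import os
import tempfile

# Offsets below are multiples of the platform's mapping granularity.
PAGE = mmap.ALLOCATIONGRANULARITY


def demo_memory_mapped_snapshot(num_pages=64, page_index=10):
    """Map a snapshot file, touch one page through the mapping,
    then confirm the change reached the file on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "snapshot.bin")  # hypothetical snapshot file

        # Create an empty snapshot file of the desired size.
        with open(path, "wb") as f:
            f.truncate(num_pages * PAGE)

        with open(path, "r+b") as f:
            # Length 0 maps the whole file. The OS decides which pages
            # are actually resident; we just index it like a byte array.
            with mmap.mmap(f.fileno(), 0) as view:
                offset = page_index * PAGE
                view[offset:offset + 5] = b"hello"
                view.flush()  # ask the OS to write dirty pages back
                through_view = bytes(view[offset:offset + 5])

        # Read the same bytes back with ordinary file I/O.
        with open(path, "rb") as f:
            f.seek(page_index * PAGE)
            on_disk = f.read(5)

    return through_view, on_disk


if __name__ == "__main__":
    print(demo_memory_mapped_snapshot())
```

The application only ever sees a slice of bytes; which pages sit in RAM and which stay on disk is handled underneath, which is the "view management" part described above.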

What I'm trying to get to the crux of is.

This is not magic, and as with every new console, magical thinking is predominant.

Think of it this way. If a console vendor could sell you an internet connection with sub 1ms latency and terabytes of bandwidth, they could offer insanely huge datasets streamed into the world, and it would be amazing. The big change here, is the internet, not the above techniques. The SSD is the change here, not some magical methods.
 

Bernkastel

Ask me about my fanboy energy!
From Hot Chips 2020
[10 slide images]
 

Ascend

Member
i think ms did a really good job in making XsX a well rounded console.
Indeed they did. And I like the innovations. I'll need some time to understand the details of sampler feedback streaming. Now we have more information about it, but reading it once, I still am not sure what it is exactly doing. But it seems that you only load part of the texture that is visible, and that part is also divided into different LODs, and the LOD is loaded based on how far away the texture is from the camera. If it works like that, I can understand why they claim such a significant increase in bandwidth and RAM savings. It's basically reducing 'waste' as much as possible.
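
A toy model of that reading of it (skip tiles that aren't visible, and pick a coarser mip the further away a tile is) might look like the sketch below. It only illustrates the bookkeeping and is not how SFS is actually implemented: the 64 KB tile size is the usual tiled-resource granularity, while the distance thresholds, the `Tile` type and the function names are assumptions.

```python
from dataclasses import dataclass

TILE_BYTES = 64 * 1024  # assumption: standard 64 KB tiled-resource tile


@dataclass
class Tile:
    visible: bool
    distance: float  # hypothetical distance from the camera, arbitrary units


def mip_for_distance(distance, max_mip=4):
    """Drop one mip level each time the distance doubles (assumed rule)."""
    mip = 0
    while distance >= 2.0 and mip < max_mip:
        distance /= 2.0
        mip += 1
    return mip


def bytes_to_stream(tiles):
    """Bytes needed when only visible tiles are loaded at their chosen mip."""
    total = 0
    for tile in tiles:
        if not tile.visible:
            continue  # never sampled, so never loaded
        mip = mip_for_distance(tile.distance)
        # Each mip level holds a quarter of the texels of the one above it.
        total += TILE_BYTES // (4 ** mip)
    return total


if __name__ == "__main__":
    tiles = [Tile(True, 1.0), Tile(True, 4.0), Tile(False, 1.0), Tile(False, 8.0)]
    needed = bytes_to_stream(tiles)
    full = len(tiles) * TILE_BYTES
    print(f"{needed} of {full} bytes ({needed / full:.0%})")
```

In this made-up scene only about a quarter of the full-resolution data is requested, which is the kind of 'waste' reduction described above.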
 
Indeed they did. And I like the innovations. I'll need some time to understand the details of sampler feedback streaming. Now we have more information about it, but reading it once, I still am not sure what it is exactly doing. But it seems that you only load part of the texture that is visible, and that part is also divided into different LODs, and the LOD is loaded based on how far away the texture is from the camera. If it works like that, I can understand why they claim such a significant increase in bandwidth and RAM savings. It's basically reducing 'waste' as much as possible.

Does the XsX have a version of DLSS? In the NVIDIA thread it's being touted as the new holy grail of 4K.
 
Also on the PC side, will there be new standards for SSDs from hardware manufacturers for Windows 10? New standards such as faster raw/compressed bandwidth speeds, getting rid of I/O bottlenecks, a Velocity Architecture API, Sampler Feedback Streaming, and designing games around the SSD?
 

Ascend

Member
Does the XsX have a version of DLSS? In the NVIDIA thread it's being touted as the new holy grail of 4K.
It does have ML support. How a DLSS equivalent would be implemented is another story. We'll have to wait and see.

Also on the PC side, will there be new standards for SSDs from hardware manufacturers for Windows 10? New standards such as faster raw/compressed bandwidth speeds, getting rid of I/O bottlenecks, a Velocity Architecture API, Sampler Feedback Streaming, and designing games around the SSD?
I think AMD's Ryzen platforms can already support DirectStorage, which feeds the graphics card directly from the SSD, bypassing all the other layers. That would already make things quite a bit better. SFS would be dependent on the graphics card.

As for raw/compressed bandwidth speeds, we know the max bandwidth limit of PCIe on PCs, so, that's what SSDs will be built around. The compression is the main issue, because PCs don't have dedicated decompression blocks and do everything through the CPU. And I don't see that changing as of now. So from that aspect, maybe 16 core CPUs might have more use in the near future, since many cores would be solely for decompression. But we have to wait and see how things develop.
 

GODbody

Member
Not sure I understand what the new XSX filter is doing.

This is just my interpretation. With bilinear filtering you use an algorithm to determine the properties (color, transparency) of a texel, which is the smallest unit of a texture map of a polygon, using the weighted average of the properties of the 4 nearest texels. It's able to fill in the blanks between mipmap transitions and make a texture's edges appear smoother. The Series X's texture filter changes this by weighting these averages differently, with lower-resolution, coarser textures given a heavier weight. That should help the transitions between distant, lower-resolution textures and closer, higher-resolution textures appear smooth instead of sharp, and reduce artifacting.

It says "Slope = 100" over the left image and "Slope = 5" over the right image in that slide, so I'm guessing that they made a change in the gradient. It's hard for me to interpret exactly what it does based on that slide as well, but the Sampler Feedback Streaming patent is a bit more forgiving and could help us both understand it better. So I'll add it here.

[Image: patent figure]


[0042] Figures 4 and 5 are provided to illustrate some examples of the filtering processes with regards to residency map 305. Figure 4 illustrates bilinear filtering weights adjusted to sharpen the resulting filtered image. Example 401 illustrates raw data for 4x4 texels, example 402 illustrates raw data bilinearly sampled as if it were a 2048x2048 texture, and example 403 illustrates raw data bilinearly sampled as if it were a 2048x2048 texture with bilinear weights adjusted to sharpen filtering.

[0043] When filtering the residency map, a modified bilinear filter can be used to sample LOD clamp values. The bilinear weights for filtering adjacent residency map samples are adjustable to simulate the residency map being closer or equivalent to the dimensions of the PRT. Specifically, the filter weights can blend between adjacent residency map samples much closer to the (u,v) space boundaries of the rectangular regions in the PRT. Parameters could be provided in the sampler or descriptor to bias the bilinear lerp threshold by independent powers of 2 in each of U and V directions.

[Image: patent figure]


[0044] Figure 5 illustrates a one-dimensional view of modified bilinear filtering for residency maps, compared to standard bilinear and raw texel value/data. Graph 500 illustrates three curves, namely raw sample value 510, standard bilinear filtering 511, and modified bilinear filtering 512. Element 510 can correspond to example 401 in Figure 4, element 511 can correspond to example 402 in Figure 4, and element 512 can correspond to example 403 in Figure 4, although variations are possible.

[0045] The filtering linear interpolation (e.g. lerp) value can be modified so that the entire range of the blend between adjacent samples of differing value occurs within the sample of lowest value. For example, if two adjacent residency map samples contained values 0 and 3, the blend from 0 to 1 to 2 to 3 would occur entirely within the region represented by the 0 sample. Once the sampling location crosses the boundary from 0 to 3, the resulting value should be 3.
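
To help picture paragraph [0045], here is a one-dimensional sketch comparing standard bilinear filtering with one possible reading of the modified filter. It is my own interpretation, not the patent's implementation: I assume the blend starts at the centre of the lower-valued sample and reaches the neighbour's value exactly at the shared boundary, and the function names are made up.

```python
def standard_filter_1d(samples, x):
    """Ordinary linear filtering: blend between sample centres (at i + 0.5)."""
    u = min(max(x - 0.5, 0.0), len(samples) - 1)
    i = int(u)
    j = min(i + 1, len(samples) - 1)
    t = u - i
    return samples[i] + (samples[j] - samples[i]) * t


def modified_filter_1d(samples, x):
    """Assumed reading of [0045]: the whole blend happens inside the
    lower-valued sample; the higher-valued sample stays flat."""
    i = min(int(x), len(samples) - 1)
    f = x - i  # position inside sample i, 0.0 to 1.0
    value = samples[i]
    if f >= 0.5:
        j = min(i + 1, len(samples) - 1)
        t = (f - 0.5) * 2.0  # 0 at the centre, 1 at the right boundary
    else:
        j = max(i - 1, 0)
        t = (0.5 - f) * 2.0  # 0 at the centre, 1 at the left boundary
    neighbour = samples[j]
    if neighbour > value:
        # We are in the lower sample: ramp up towards the neighbour.
        return value + (neighbour - value) * t
    # We are in the higher (or equal) sample: no blending on this side.
    return float(value)


if __name__ == "__main__":
    residency = [0, 3]  # the two adjacent values used in the patent's example
    for x in (0.5, 0.75, 1.0, 1.5):
        print(x, standard_filter_1d(residency, x), modified_filter_1d(residency, x))
```

At the boundary (x = 1.0) the standard filter is still halfway through the blend, while the modified one has already reached 3, which matches the "once the sampling location crosses the boundary ... the resulting value should be 3" wording.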
 

Ascend

Member
Although science has its uses, let's not devolve into scientism. Even though it is very useful, science has its limits, and not everything can or should be referred back to science.

"Instant" is anything that is perceived to be immediate by the observer. In this case, the observers are developers. So the term is used from a developer's perspective, not a scientific one. I guess they are saying there is no delay between the request to load something and the time it actually starts loading.
 
If people read the FlashMap papers and looked at the context of "instant" in terms of comparison to the usual/traditional pipeline (WRT getting texture data from storage to the GPU), they wouldn't refer to the idea in such a mocking tone. It's relatively instant in terms of the low latency levels and likely the way the texture data is packaged for the GPU.

Also there was something mentioned that seemed in reference to this at Hot Chips, I'll try quoting it.

09:32PM EDT - Q: Can you stream into the GPU cache? A: Lots of programmable cache modes. Streaming modes, bypass modes, coherence modes.

This was asked during the part of the presentation speaking about SFS, and isn't referring to GPU pulling data from RAM to "stream" into the cache because there's nothing unique to that to make it worth asking as a question in the first place. I wish there was more clarification to both some of the questions and answers tho, tbh.
 

GODbody

Member
Science says 'instant' isn't a measurable quantity.
The dictionary says 'instant' refers to a very short span of time. That DiRT dev talked about requesting data and receiving it within the same frame. If they were speaking about 60 fps, that's within roughly 16 ms. Sounds fairly "instant" to me. No need to take it as literally meaning within 1 nanosecond.
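
For reference, the frame-time arithmetic is just 1000 ms divided by the frame rate. The sketch below uses hypothetical request latencies purely to show the comparison; they are not measured figures from any console.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def fits_in_frame(request_latency_ms, fps):
    """True if a request issued at the start of a frame returns before it ends."""
    return request_latency_ms <= frame_budget_ms(fps)


if __name__ == "__main__":
    for fps in (30, 60, 120):
        print(f"{fps} fps -> {frame_budget_ms(fps):.2f} ms per frame")
    # Hypothetical latencies, for illustration only.
    print(fits_in_frame(5.0, 60), fits_in_frame(20.0, 60))
```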
 
i wish there was a tech demo that showed off how the Xbox Velocity Architecture contributes to hi fidelity graphics with level of detail and GPU graphics, along with its higher memory bandwidth, more compute units, and fast CPU.

It seriously is a well rounded well balanced system where all of its components have been significantly beefed up.

Please dont reply with craig memes
 

sendit

Member
i wish there was a tech demo that showed off how the Xbox Velocity Architecture contributes to hi fidelity graphics with level of detail and GPU graphics, along with its higher memory bandwidth, more compute units, and fast CPU.

It seriously is a well rounded well balanced system where all of its components have been significantly beefed up.

Please dont reply with craig memes

There is:

 

Ascend

Member
I have to say it...

I GODDAMNED TOLD YOU SO



For the ones that need context, start reading around Page 50 or so.
Here's a relevant quote of one of my posts.
The idea of transferring directly from SSD to the GPU would mean transferring to the GPU cache only the portions that are purely necessary. The whole idea of SFS seems to be to have a low quality texture in place at all times in RAM, and after a high quality texture has been confirmed to be needed, you stream that texture in. Some people think it HAS to go from SSD to RAM to GPU. Some of us think it's more efficient to bypass the RAM and read it directly to the GPU cache from the SSD, since the SSD is being seen as RAM anyway. That's what seems to be the closest to what MS's marketing says.
In either case, you would have loaded that texture ages ago into RAM with the traditional way of rendering, and it might not have been used at all. That's where the bandwidth savings come from; reading and loading only what is actually needed, rather than what we suspect will be needed.
 
Ascend Ascend thanks for sharing the news!

1)My question is (i am not a programmer, software engineer), will this allow XsX/XsS to do LODS, and Billions of triangles the size of pixels similar to Unreal 5.0 tech demo from PS5?


Edit:
"The idea of transferring directly from SSD to the GPU would mean transferring to the GPU cache only the portions that are purely necessary. The whole idea of SFS seems to be to have a low quality texture in place at all times in RAM, and after a high quality texture has been confirmed to be needed, you stream that texture in. Some people think it HAS to go from SSD to RAM to GPU. Some of us think it's more efficient to bypass the RAM and read it directly to the GPU cache from the SSD, since the SSD is being seen as RAM anyway. That's what seems to be the closest to what MS's marketing says. In either case, you would have loaded that texture ages ago into RAM with the traditional way of rendering, and it might not have been used at all. That's where the bandwidth savings come from; reading and loading only what is actually needed, rather than what we suspect will be needed"

2)Is it safe to assume that the SSD itself is being used as an additional source of RAM for low bandwidth needs? The bandwidth speeds are almost equal to the original Xbox, PS2 and Gamecube, thus freeing up the MAIN GDDR6 RAM.


3) Does the PS5 GPU have direct access to the SSD as RAM same way as XsX, or does it still involve its main GDDR6 RAM?
 

Ascend

Member
(i am not a programmer, software engineer)
Neither am I lol.

will this allow XsX/XsS to do LODS, and Billions of triangles the size of pixels similar to Unreal 5.0 tech demo from PS5?
The billions of triangles is marketing speak. The models were built to have billions of triangles, but they are not actually being rendered with those billions of triangles. The models are dynamically scaled down to at "most" (I use this term loosely) one triangle per pixel, as to not waste rendering budget and at the same time make the model look as sharp as possible on screen.

It is still impressive that the geometry can be dynamically scaled to whatever is required. But this must not be confused with actually rendering billions of triangles.
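
To illustrate the "at most one triangle per pixel" idea, here is a deliberately simple sketch that halves a model's triangle count until it no longer exceeds the number of pixels the model covers on screen. This is not how Unreal Engine 5 does it; the halving-per-level scheme, the numbers and the function name are all assumptions for illustration.

```python
def pick_lod(full_triangles, pixels_covered, max_lod=32):
    """Return (lod_level, triangle_count) such that the triangle count
    does not exceed the number of pixels the model covers."""
    lod = 0
    triangles = full_triangles
    while triangles > pixels_covered and lod < max_lod:
        triangles //= 2  # assumed: each LOD level has half the triangles
        lod += 1
    return lod, triangles


if __name__ == "__main__":
    # A hypothetical billion-triangle source model covering ~2 million pixels.
    print(pick_lod(1_000_000_000, 2_000_000))
```

However detailed the source asset is, the amount actually rendered ends up bounded by the screen area it occupies, which is the point being made above.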

Wasn't this known for quite some time since the Dirt 5 dev said it?
Ok. Possibly. I did not know of this until I stumbled upon this tweet.
 

Nikana

Go Go Neo Rangers!
Neither am I lol.


The billions of triangles is marketing speak. The models were built to have billions of triangles, but they are not actually being rendered with those billions of triangles. The models are dynamically scaled down to at "most" (I use this term loosely) one triangle per pixel, as to not waste rendering budget and at the same time make the model look as sharp as possible on screen.

It is still impressive that the geometry can be dynamically scaled to whatever is required. But this must not be confused with actually rendering billions of triangles.


Ok. Possibly. I did not know of this until I stumbled upon this tweet.

Gotcha. I believe there was debate awhile ago that this was somehow exclusive to the PS5 but was debunked in an interview with a dirt 5 dev on Xbox wire.
 

THE:MILKMAN

Member
Until otherwise stated Jason is just saying SFS is streaming data (in a pseudo just in time fashion) rather than parking data in RAM. I don't think this is what you think it is, Ascend.....

Maybe an actual dev that knows can explain it more thoroughly and in terms we all can understand?
 

BlueHawk357

Member
I have to say it...

I GODDAMNED TOLD YOU SO



For the ones that need context, start reading around Page 50 or so.
Here's a relevant quote of one of my posts.

Interesting, but I do wonder how this will benefit XSX. I understand it clears up RAM, but what effect can we see from that, honestly? Hopefully it frees up enough RAM to let us have a 4K UI 😂
 

Ascend

Member
2)Is it safe to assume that the SSD itself is being used as an additional source of RAM for low bandwidth needs? The bandwidth speeds are almost equal to the original Xbox, PS2 and Gamecube, thus freeing up the MAIN GDDR6 RAM.
Basically, yes. It brings more complexities for programming, but Microsoft apparently believes the benefits outweigh the drawbacks.

3) Does the PS5 GPU have direct access to the SSD as RAM same way as XsX, or does it still involve its main GDDR6 RAM?
I am not sure. Microsoft's info always pointed in this direction. The PS5 info has been less detailed, and to be honest, I didn't focus as much on its architecture anymore after I got tired of the console warring on here. I focused more on the PC side of things (RTX3000 & RDNA2). I don't know what additional info has been provided by Sony regarding the PS5.
If things stayed generally the same, I would say that the PS5 can refresh its RAM faster than the XSX. I lack information on the PS5 to be able to say whether it can bypass RAM and directly feed the GPU.

Until otherwise stated Jason is just saying SFS is streaming data (in a pseudo just in time fashion) rather than parking data in RAM. I don't think this is what you think it is, Ascend.....

Maybe an actual dev that knows can explain it more thoroughly and in terms we all can understand?
It makes no sense to add additional latency by "streaming" the data through RAM first.
Imagine you have a bike and a car. The bike has a box that needs to be delivered to the destination, and the car is empty. The car and the destination are both 1 mile away from the bike. It would make zero sense to go to the car first, just because the car can go faster. You will always be slower than simply going directly to the destination on the bike.

"Without putting it directly in memory" to me sounds like the data on the SSD being transferred directly to the GPU cache, and if the data in the GPU cache needs to be overwritten for whatever reason, the data is 'downgraded' (i.e. copied) to RAM instead. So the data can be put into RAM after it has been used by the GPU, as to avoid having to read/stream/fetch it again and again from the SSD.

He means streaming vs preloading. Nothing about GPU caches is mentioned there.
Where else are you going to store the data if you're not "putting it directly in memory"?
 

THE:MILKMAN

Member
It makes no sense to add additional latency by "streaming" the data through RAM first.
Imagine you have a bike and a car. The bike has a box that needs to be delivered to the destination, and the car is empty. The car and the destination are both 1 mile away from the bike. It would make zero sense to go to the car first, just because the car can go faster. You will always be slower than simply going directly to the destination on the bike.

"Without putting it directly in memory" to me sounds like the data on the SSD being transferred directly to the GPU cache, and if the data in the GPU cache needs to be overwritten for whatever reason, the data is 'downgraded' (i.e. copied) to RAM instead. So the data can be put into RAM after it has been used by the GPU, as to avoid having to read/stream/fetch it again and again from the SSD.

He's PR'ing the hell out of his tech! Here is the blurb direct from Microsoft about SFS:

Microsoft said:
Sampler Feedback Streaming (SFS) – A component of the Xbox Velocity Architecture, SFS is a feature of the Xbox Series X and Xbox Series S hardware that allows games to load into memory, with fine granularity, only the portions of textures that the GPU needs for a scene, as it needs it.

You still haven't explained, if the RAM is being bypassed, how GBs of data fit in KBs to MBs of GPU cache. It doesn't make sense to me, but I'll continue to hope a dev will chime in and explain it all in a clear way we can all understand.
 