You should understand that it's tongue-in-cheek. Obviously it can run much better; we just don't know yet.
It's hard to tell through text alone, at least for me.
This isn't necessarily about XvA in particular, and it's a multi-system setup that obviously isn't directly applicable to a single video game console, but the following does give some indication of the kind of bandwidth Nvidia is getting via GPUDirectStorage (basically their implementation of DirectStorage) on 4x DGX-2 systems:
So they're hitting peaks of around 168 GB/s raw bandwidth on the SSD I/O with 4x DGX-2 systems. Divide that by four and it's 42 GB/s raw bandwidth for a single DGX-2. Each DGX-2 has 16 Nvidia Tesla V100s, and each of those is about 14 TF on the Volta architecture (which is older than Turing). Honestly though, the GPUs aren't important here because we're just looking at SSD I/O capability with GPUDirectStorage in mind.
Each DGX-2 comes with 8x 3.84 TB SSDs. 42/8 gives 5.25 GB/s raw bandwidth per SSD. However, the SSD they actually use, the Micron 9200, tops out at about 3.5 GB/s raw, so the physical peak in a stock setup would be 112 GB/s. It's more likely that, since this is data NASA is using for simulation and would be uncompressed, they added 4 more Micron 9200s per system, because 42/12 gets you exactly 3.5 GB/s.
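Just to lay the arithmetic out plainly (assuming the 168 GB/s peak and the ~3.5 GB/s sequential read of the Micron 9200 hold; this is only a back-of-the-envelope sketch):

```python
# Back-of-the-envelope check of the GPUDirectStorage figures above.
# Assumes the ~168 GB/s peak over 4x DGX-2 and ~3.5 GB/s per Micron 9200 hold.
peak_total = 168.0                  # GB/s across 4x DGX-2 systems
per_system = peak_total / 4         # 42 GB/s per DGX-2

ssd_read = 3.5                      # GB/s sequential read, Micron 9200
stock_ssds = 8                      # SSDs in a stock DGX-2

print(per_system / stock_ssds)      # 5.25 GB/s needed per SSD -> above the drive's spec
print(stock_ssds * ssd_read * 4)    # 112 GB/s physical peak with stock drives
print(per_system / ssd_read)        # 12 drives per system would line up exactly
```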
Unfortunately that doesn't really tell us much about compression capabilities, just that GPUDirectStorage (and therefore DirectStorage) is very scalable and networks really well across clusters of machines, which will have obvious benefits. I'm hoping I can find some figures from GPUDirectStorage tests being run with compressed data, because it would be relatively easy to do a bit of work to figure out how that could translate to Series X SSD I/O performance at compressed rates (provided those tests are using all the same features as XvA; DirectStorage is just one part of XvA).
Thanks as always for a good discourse!
The only statement I disagreed with is the bolded one above. From the information I have seen and received, the PS5 actually delivers both higher bandwidth and lower latency in terms of I/O, and the difference between the two platforms is significantly larger in latency than in bandwidth (which in turn means that the practical bandwidth difference is very much in favor of the PS5 when reading hundreds of files such as textures, since latency then dominates the use case). The reason for this is two-fold: 1) little to no overhead in reading and decompressing textures from the SSD into RAM, due to dedicated silicon for both steps, and 2) cache scrubbers that allow data to move faster from RAM to the GPU cache. The XSX, AFAIK, has CPU/driver overhead in reading the textures into RAM after decompression, and a more standard PC solution for moving data from RAM to the GPU cache (my two sources for this are public information and dialogue with one developer who has direct access to one dev kit and indirect access to the other dev kit - please note that *he takes the NDAs seriously but has made a few remarks regarding publicly available information and how it relates to what *he has seen).
May I ask what you base your statement on?
Edit: Missed one word!
Well, to address what you're bringing up, I guess the best way I can put my perspective on it is like so:
Yep, we've known for a while about PS5's dedicated processor for moving data in/out of RAM. But I think what isn't being considered there is that it still has to contend with other processors on the bus when doing this. I actually think this is another reason they need the cache scrubbers: if the dedicated processor in the I/O block is DMA'ing over the memory bus for read/write operations to RAM, then the CPU, GPU etc. will have to wait their turn, as is to be expected with hUMA architectures. And rather than having the GPU, after already waiting for its turn on the bus, spend additional cycles copying data from RAM into its caches wholesale, the cache scrubbers are there to cut that period down. Hence the selective eviction of data within the GPU caches.
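To illustrate what I mean by selective eviction versus a wholesale refresh, here's a purely conceptual sketch; none of these names or structures reflect actual PS5 hardware, it's just the idea:

```python
# Purely conceptual: "scrub" only the lines whose backing RAM the I/O unit
# just overwrote, instead of flushing the whole cache and refilling it later.
class GpuCache:
    def __init__(self):
        self.lines = {}          # cached address -> data

    def flush_all(self):
        # Traditional approach: invalidate everything after new data lands in
        # RAM, then pay to re-fetch even the lines that were still valid.
        self.lines.clear()

    def scrub(self, overwritten_addresses):
        # Cache-scrubber idea: evict only the stale lines; everything else
        # stays warm, so the GPU wastes fewer cycles refilling the cache.
        for addr in overwritten_addresses:
            self.lines.pop(addr, None)
```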
So the Series systems might not have cache scrubbers (they may or may not have some equivalent; not necessarily the ECC-configured memory that's already been mentioned, which would serve a different role anyway), but part of the reason their design doesn't require them is that, at least for CPU-bound tasks, they don't have to wait on a dedicated I/O processor before getting access to the memory bus. So CPU-bound game logic can still access the bus when it needs to. Maybe not at the full bandwidth of the slower pool of GDDR6, but the capability is there. And keep in mind it's the same OS core on the Series systems handling that task, and we know what kind of CPUs these systems are using. Do we actually have information on what exactly the PS5's dedicated I/O processor is? Is it a repurposed Zen 2 core? If so, how? As in, is it cut down on local caches (which might also explain the reason for the Cache Coherency Engines, if it comes to that)?
Overall I think the estimate of overhead incurred by the Series systems for what you're describing is a bit much; keep in mind I believe MS were already aware of this, and that could've been a factor in them clocking their CPUs higher, to account for such overhead, however much it may be. You also have to keep in mind they are not literally dropping some PC version of Windows 10 into the system and leaving it there. Whatever overhead you might associate with W10 (speaking of which, you can cut it down to under 512 MB if you really want, granted you lose out on a lot of features), you can't automatically associate with Series X, because they already use their own OS, Xbox OS, built specifically for the console, even if it leverages Windows tech.
I don't know what you mean by a "standard PC-like" solution for how Series X moves data between storage and RAM. DirectStorage hasn't actually been fully deployed in the PC space yet, and other parts of XvA won't even be available on PC for a while. Nvidia GPUs support their own implementation of DirectStorage called GPUDirectStorage, and it appears to be extremely good (you can check out the clip I linked in this post replying to psorcerer). If that is the "standard PC implementation", then the label doesn't really matter much, because it clearly works well. MS are simply leveraging their strengths here, same as Sony with theirs, but one other thing to keep in mind is that MS are also using their developments with the Series systems to leverage in other markets they're involved in, such as PC, mobile, server/data center and more.
Similarly, I don't see where you are hearing that latency on the PS5 side is leagues better. Sony actually haven't said much of anything about latency in any aspect of their system. I don't doubt they have low latency, but I think MS have simply prioritized it more, while Sony have prioritized bandwidth more. One factor that benefits MS on latency, as I said before, is that they're using faster NAND devices with larger storage. You usually get higher bandwidth per NAND module with the bigger modules, and latency figures tend to improve as well. If you'd like some thoughts on what could be providing them an advantage in latency figures (plus maybe some speculation on some parts of XvA, particularly the "100 GB pool partition"), I'll quote a couple of insightful posts I read over on B3D:
function:
I've been wondering about the "Velocity Architecture" and MS's repeated mentions of low latency for their SSD - something that's been sort of backed up by the Dirt 5 technical director. There's also the talk of the "100GB of instantly accessible data" aka "virtual ram".
Granted I could be reading too much into some fairly vague comments, but I think there's probably something to them, and also that the two things are possibly related. So I think that maybe one of the key things that allows MS to have such (presumably) low latency from the SSD is also responsible for the strange-seeming "100GB" figure.
Now I'm assuming that the "virtual memory" is storing data as if it were already in, well, memory. So the setup, initialisation and all that is already done, and that saves you some time and overhead when accessing from storage compared to, say, loading assets from an SSD on PC. But this virtual memory will need to be accessed via a page table, which then has to go through a Flash Translation Layer. Normally this FTL is handled by the flash controller on the SSD, accessing, if I've got this right, an FTL stored either in an area of flash memory, or in dram on the SSD or on the host system.
XSX has a middling flash controller, and no dram on the SSD. So that should be relatively slow. But apparently it's not (if we optimistically run with the comments so far).
My hypothesis is that for the "100 GB of virtual ram" the main SoC is handling the FTL, doing so more quickly than the middling flash controller with no dram of its own, and storing a 100GB snapshot of the FTL for the current game in an area of system reserved / protected memory to make the process secure for the system and transparent to the game. Because this is a proprietary drive with custom firmware, MS can access the drive in a "raw mode"-like way, bypassing all kinds of checks and driver overhead that simply couldn't be bypassed on PC, and because it's mostly or totally read access other than during install / patching, data coherency shouldn't be a worry either.
My thought is that this map of physical addresses for the system managed FTL would be created at install time, updated when wear levelling operations or patching take place, and stored perhaps in some kind of meta data file for the install. So you just load it in with the game.
And as for the "100GB" number, well, the amount of reserved memory allocated to the task might be responsible for the arbitrary seeming 100GB figure too.
The best I could find from Google, in an MS research paper from 2012 (https://static.usenix.org/events/fast12/tech/full_papers/Grupp.pdf), was an estimate that the FTL might be costing about 30 microseconds of latency. Which wouldn't be insignificant if you could improve it somewhat.
So the plus side of this arrangement would be, by my thinking:
- Greatly reduced read latency
- Greatly improved QoS guarantees compared to PC
- No penalty for a dram-less SSD
- A lower cost SSD controller being just as good as a fast one, because it's doing a lot less
- Simplified testing for, and lower requirements from, external add-on SSDs
The down sides would be:
- You can only support the SSDs that you specifically make for the system, with your custom driver and custom controller firmware
- Probably some additional use of system reserved dram required (someone else will probably know more!)
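(Just to make function's host-managed FTL idea above a bit more concrete: the mapping itself would essentially be a big logical-to-physical lookup table held in reserved system RAM. A purely hypothetical sketch; none of these names or structures come from Microsoft:)

```python
# Hypothetical illustration of a host-managed flash translation layer (FTL):
# the SoC, not the drive's controller, resolves logical -> physical addresses.
class HostManagedFtl:
    def __init__(self, mapping):
        # mapping: logical block address (as the game's "virtual RAM" sees it)
        #          -> physical NAND address, built at install time and kept in
        #          reserved system memory rather than on the drive.
        self.mapping = mapping

    def translate(self, logical_block):
        # The read request sent to the drive already carries a physical
        # address, so it can skip the drive's own (dram-less) FTL lookup.
        return self.mapping[logical_block]

    def remap(self, logical_block, new_physical):
        # Re-pointed whenever the drive relocates data (wear levelling, patching).
        self.mapping[logical_block] = new_physical
```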
dsoup:
I can't offer much insight because, as you said, these are thoughts based on a number of vague comments, and much of my commentary was about the Windows I/O stack, which is likely very different from Xbox Series X. But it would indeed be truly amazing if Sony have prioritised raw bandwidth and Microsoft have prioritised latency.
My gut tells me that if this is what has happened, they'll largely cancel each other out except in cases where one scenario favours bandwidth over latency and another favours latency over bandwidth. Next-gen consoles have 16 GB of GDDR6, so raw bandwidth is likely to be preferable in cases where you want to start/load a game quicker, e.g. loading 10 GB in 1.7 seconds at 100ms latency compared to 3.6 seconds at 10ms latency. Where latency could make a critical difference is frame-to-frame rendering and pulling data off the SSD for the next frame, or the frame after.
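(dsoup's point is easy to see with a quick back-of-the-envelope model; the numbers below are his illustrative figures plus a made-up small-read scenario of mine, not real measurements of either console:)

```python
# Illustrative only: when does latency dominate over raw bandwidth?
# Naive model: pay the full latency once per request, then transfer at full speed.
def load_time(requests, size_per_request_gb, bandwidth_gbps, latency_s):
    return requests * (latency_s + size_per_request_gb / bandwidth_gbps)

# One big 10 GB streaming load: bandwidth wins (dsoup's figures).
print(load_time(1, 10.0, 5.9, 0.100))    # ~1.8 s on the higher-bandwidth drive
print(load_time(1, 10.0, 2.8, 0.010))    # ~3.6 s on the lower-latency drive

# The same 10 GB split into 500 small texture reads: latency dominates.
print(load_time(500, 0.02, 5.9, 0.100))  # ~51.7 s
print(load_time(500, 0.02, 2.8, 0.010))  # ~8.6 s
```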
At the end of the day, I can't claim I have connections to actual developers. Yes, I've talked to developers, but it's been mainly through this forum, in public posts and a couple in private, discussing the next-gen systems. With one in particular we ultimately ended up in a pretty stark disagreement, because a few things they mentioned didn't add up with actual publicly available information, and that kept happening too often.
However, I've definitely taken the time to read up on and research so much of this stuff, it's not even funny, because I like trying to make sense of all of it. I'm also the kind of person who likes forming their own opinion by listening to as many well-reasoned perspectives as possible, even if they come into conflict with each other here or there. That helped a lot when discussing the GPU leaks, and I think it's helping a lot here, too.
Trust me, if you can name a particularly technically well-reasoned person here, on Era, Beyond3D, wherever, or among the insiders and data miners on Twitter (Rogame, Komachi etc.)...chances are I've heard of them and seen what they've had to say. And the hard data too, of course.
Actual developer statements, too, even the controversial ones like the Crytek guy's. I've always seen merit in the stuff they've put forward, even if there are parts I either don't understand at first or don't agree with. The fun's in taking all of those points and trying to see what correlates with what, however disparately, and how.
And from that I try forming my own perspective on it, even if it's in flux on parts. I'm no expert on this, none of us are, and we all have our preferences when it comes to certain technological features, standards, methodologies, systems, architectures etc. shaped from first-hand experience, education, learned knowledge and critical thinking & logical reasoning.