Interesting, thanks for your detailed response. If you don't mind humoring my noobiness, I'm curious what areas of performance you expect these customizations to improve upon over the vanilla TX1 (besides the CPU HMP since you explained that pretty clearly).
Well, the wider memory bus is pretty simple: the wider the bus, the higher the bandwidth (i.e. we're looking at either a 64-bit bus giving 25.6GB/s or a 128-bit bus giving 51.2GB/s).
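As a quick sanity check on those figures, peak bandwidth is just bus width times transfer rate. The 3.2GT/s rate below is an assumption based on the TX1's stock LPDDR4, not a confirmed Switch spec:

```python
# Peak theoretical bandwidth = (bus width in bytes) x (transfer rate).
# 3.2e9 transfers/s is an assumed LPDDR4-3200-style rate, for illustration.

def bandwidth_gb_s(bus_width_bits, transfers_per_s):
    """Peak theoretical memory bandwidth in GB/s."""
    return (bus_width_bits / 8) * transfers_per_s / 1e9

print(bandwidth_gb_s(64, 3.2e9))   # 64-bit bus  -> 25.6 GB/s
print(bandwidth_gb_s(128, 3.2e9))  # 128-bit bus -> 51.2 GB/s
```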
The other two are to do with how efficiently that memory bandwidth is used. Nvidia's Maxwell and Pascal GPUs use a technique called tile-based rendering, or TBR, where the screen is broken down into individual tiles which are rendered one at a time, rather than rendering the entire screen in one go. The benefit is that you can keep the tile you're working on in cache right on the GPU die, so the GPU doesn't have to make loads of bandwidth-intensive accesses to main memory while working on the tile; instead it can work on it in cache and just send the finished tile to memory once it's done.
More GPU cache makes TBR more effective, as it means you can use larger tiles, and still potentially have plenty of cache left for other uses (the GPU L2 cache also has to cache textures and any other data being fed into the GPU).
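As a rough illustration of the cache/tile-size relationship, here's a sketch assuming the commonly reported 256KB L2 for the TX1's GPU and 8 bytes per pixel (RGBA8 colour plus 32-bit depth) — both of those figures are my assumptions, not confirmed Switch specs:

```python
# How large a square tile fits in the GPU L2 cache, if we only let the
# tile use a fraction of it (leaving the rest for textures etc.)?
# The 256KB figure is the commonly reported TX1 GPU L2 size (assumption).
import math

def max_tile_side(l2_bytes, bytes_per_pixel, cache_fraction=0.5):
    """Largest square tile side (in pixels) fitting in the cache budget."""
    budget = l2_bytes * cache_fraction
    return math.isqrt(int(budget // bytes_per_pixel))

# RGBA8 colour (4 B/px) + 32-bit depth (4 B/px) = 8 B/px:
print(max_tile_side(256 * 1024, 8))      # 256 KB L2 -> 128x128 tiles
print(max_tile_side(2 * 1024 * 1024, 8)) # hypothetical 2 MB L2 -> larger tiles
```

Doubling or quadrupling the L2 lets the tile side grow accordingly, which is why "more GPU cache makes TBR more effective" in practice.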
The problem with TBR, though, is that traditionally it has only worked well with "forward rendering". That is, graphics engines that perform pretty much all operations directly on the final framebuffer. Most modern graphics engines, though, are moving towards "deferred rendering", where they use intermediate buffers (called g-buffers) to store data about the scene and then only at the end do they use this data to create the final framebuffer. These don't tend to work as well with TBR because, under DirectX or OpenGL, g-buffers operate in a way which can't really be tiled, so you don't get the same level of bandwidth savings compared to a forward renderer.
Vulkan changes this up, though, due to the way it organises the graphics pipeline into what it calls renderpasses and subpasses. This is implemented in a way which allows g-buffers to be properly tiled by the GPU, potentially providing significant bandwidth savings for an engine which uses deferred rendering. For a system like Switch, which has relatively limited main memory bandwidth, but features TBR, having this fully implemented could be very beneficial, so it's the kind of thing which Nintendo should be looking at to get the most out of the hardware.
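To put a rough number on what's at stake, here's a back-of-the-envelope sketch of the main-memory traffic a deferred renderer's g-buffer could generate if it can't be tiled on-chip. The figures (1080p, four 32-bit render targets, one write by the geometry pass plus one read by the lighting pass, 60fps) are purely illustrative assumptions, not measurements:

```python
# Back-of-the-envelope g-buffer traffic for an untiled deferred renderer.
# All parameters are illustrative assumptions, not measured figures.

def gbuffer_traffic_gb_s(width, height, num_targets, bytes_per_px, fps):
    per_frame = width * height * num_targets * bytes_per_px
    # Written once by the geometry pass, read once by the lighting pass:
    return per_frame * 2 * fps / 1e9

print(gbuffer_traffic_gb_s(1920, 1080, 4, 4, 60))  # ~3.98 GB/s
```

Roughly 4GB/s just for g-buffer traffic would be a noticeable slice of a 25.6GB/s budget, which is why keeping those reads and writes inside on-chip tiles via Vulkan's subpasses could matter so much on Switch.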
Look at what Marcan got with "a razor blade, a DSLR, and a $100 microscope". We might not need them, as awesome as they were last time, if they don't want to give it out this time.
https://twitter.com/marcan42/status/803281643750363136
I'm kind of tempted to grab a "for parts" Shield TV or Pixel C to try this myself on the TX1, both for fun and because it would give us a point of comparison for a Switch die shot. There don't seem to be any available either locally or on eBay, though, plus there's a good chance that I would screw it up and mangle the chip, given I've never done it before.
Thanks for the link, there's some fantastic shots in there (plus some nice infra-red photography a couple of pages in). He's also got a YouTube video on how to do it. Hmmm....
Any techies here that know a good way of measuring memory bandwidth performance? If the Switch is operating at the clocks it is, I don't see why games won't be 1080p when docked. The Shield TV is pulling off 1080p 8xAA in Unity pretty comfortably! And this is using the Vulkan api too.
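On measuring bandwidth: a crude STREAM-style copy test can be knocked together in a few lines. This Python sketch is only illustrative — a proper native benchmark would be far more accurate — but the slice copy itself runs in C, so it gets surprisingly close to memcpy speed:

```python
# Crude STREAM-style "copy" bandwidth test (illustrative sketch only;
# a proper native benchmark is the right tool for real measurements).
import array
import time

N = 8 * 1024 * 1024               # 8 Mi doubles = 64 MiB per buffer
src = array.array('d', [0.0]) * N  # source buffer, zero-filled
t0 = time.perf_counter()
dst = src[:]                       # one full read + one full write
t1 = time.perf_counter()
gb_moved = 2 * N * 8 / 1e9         # bytes read + bytes written, in GB
print(f"~{gb_moved / (t1 - t0):.1f} GB/s")
```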
Using the same Vulkan API, same scene and same render quality settings (1080p 8x MSAA).
PC (i5 4690K @ 4GHz, GTX 970, 3.5 TF): 284 FPS
Shield TV (CPU clock limited to 1GHz, GPU fluctuating between 614MHz and 1GHz, averaging about 768MHz most of the time): 44 FPS
Frame rate fluctuates by barely 1-2 FPS on both platforms. Extrapolate from those platforms to get a rough idea of what it might be on the Xbox One and PS4, and how far off the Switch might be? Hah!
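As a sketch of that extrapolation — scaling by raw GPU compute alone, which ignores bandwidth, CPU, and driver differences, so treat it as a toy model. The Shield TV figure assumes 256 cores × 2 FLOPs × ~0.768GHz ≈ 0.39 TF, and the Xbox One/PS4 numbers are the commonly quoted 1.31/1.84 TF:

```python
# Naive FPS extrapolation by GPU TFLOPS alone (toy model: ignores
# bandwidth, CPU, API/driver differences, so heavy scepticism applies).
# Shield TV ~0.39 TF assumes 256 cores x 2 FLOPs x ~0.768 GHz average.

def extrapolate_fps(base_fps, base_tf, target_tf):
    """Scale a measured frame rate by the ratio of GPU compute throughput."""
    return base_fps * target_tf / base_tf

shield_fps, shield_tf = 44, 0.39
for name, tf in [("Xbox One", 1.31), ("PS4", 1.84)]:
    print(f"{name}: ~{extrapolate_fps(shield_fps, shield_tf, tf):.0f} FPS")
```

That puts the consoles in the ~150-210 FPS ballpark for this particular scene under these assumptions — directionally interesting, but nothing more than that.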
It would be worth testing out Sascha Willems's Vulkan examples, specifically the deferred rendering ones (as they're likely to be the most bandwidth intensive). The standard deferred shading example is probably reasonably representative of a modern, reasonably well implemented Vulkan engine, whereas the deferred shading and shadows example should be more bandwidth intensive again. He's got pre-compiled binaries for Android (plus Windows and Linux) here, so there shouldn't be any trouble getting it to run.