Given how closely Nintendo and Nvidia worked on this API, I wouldn't be shocked to find out that the cache was customized to minimize the need to move data in and out of main memory for additional post-processing subpasses. A sizable L3 cache for the GPU and a larger L2 cache for the CPU would save a lot of bandwidth.
Oh, I agree, and I do certainly expect some kind of customised cache for this reason, whether it's an increased GPU L2 or a shared L3 or whatever. Given Nintendo's historical desire to keep framebuffer accesses on-die, and Nvidia's use of tile based rendering from Maxwell onwards, it would seem a natural assumption.
The issue that was brought up earlier in this thread (by blu and Durante, I think?) is that, under pre-Vulkan APIs at least, intermediate buffers like g-buffers can't be tiled, regardless of the GPU or cache configuration, because you have to implement them as render-to-texture. This requires the full buffer to be rendered and pushed to memory before it can be read, and because a shader can access any pixel of the texture while reading, you can't tile the reads.
I sort of ended up answering my own question with Vulkan's renderpasses and subpasses, as they offer an alternative to render-to-texture (or, more specifically, a replacement, as I don't believe you're supposed to use render-to-texture at all in Vulkan) which can be efficiently tiled. By treating the g-buffer (or any kind of intermediate buffer) as an attachment within the renderpass and restricting shaders to only accessing data from the pixel they're operating on (potentially including the same pixel across multiple attachments), you get a g-buffer which can be tiled.
The subpasses can then define exactly when a given attachment is needed (or not needed), so the GPU can efficiently allocate a sufficient amount of cache in advance for a given tile, and can evict a g-buffer as soon as it's no longer needed to free up that cache. Effectively, a deferred renderer could keep the bulk of the render pipeline (including lighting) as subpasses within a single renderpass, meaning a tile-based GPU (such as Switch's) can do all of this one tile at a time, with the entire process (creating multiple g-buffers, calculating lighting and shading, etc.) completed on that tile within cache, without touching main memory at all until you have a near-final color buffer. Post-processing, screen-space reflections, etc. would still be done in a non-tiled manner, but you'd get basically the full benefit of TBR for a deferred renderer, which you can't do with any other API (even DX12).
Those tiles can be of any size, and they can be run concurrently or consecutively, or even on different GPUs (in theory, anyway). Most importantly, though, it avoids any need for hardware-specific extensions, which I imagine Nintendo would want to avoid (both to improve compatibility with third-party engines and to keep their options open for future Switch hardware). If a third-party dev (let's say id) has an existing, well-implemented Vulkan renderer (which I imagine id's is), then it should tile well on Switch right out of the box, without id having to rework their rendering pipeline just to accommodate Switch's GPU.
From Nintendo's point of view, then, all they need to do is make sure the GPU's tiling algorithm fully exploits Vulkan's renderpasses (I don't see why Nvidia would screw this up) and then make sure there's a large enough cache to accommodate tiles which include, potentially, a color buffer, a z-buffer, multiple g-buffers, and possibly even extra buffers for transparencies, without the tiles being too small.
If the article's wrong, then Thraktor's wrong, btw, since it was based on Thraktor's thread. I'm reading Thraktor's thread and still working through the posts, but it has 200+ pages and it's not easy going through all of them to reach a definitive conclusion. Some here prefer to troll instead of pointing me to the proper answer. "SMD IS SHIT!!11!!1" is not an answer.
Can you please be a kind soul, like a couple of others here, and point me to the post that proves this article wrong, so I can put this to rest?
I was wrong. The OP was based on initial assumptions, and after a while I was too busy to keep on top of the thread, so I stopped updating the OP. But the 176 Gflops figure is based on actual Nintendo documentation for Wii U, as AzaK says above, so there's not really any scope for it to be wrong.