Alright, I've been doing a few calculations around the die size. The total die is 146.48mm², and of that about 50.78%, or 74.38mm², is what I'll call "GPU logic" (that is, everything except the memory pools and interfaces). Now, looking at Durante's image from the first page:
Originally Posted by Durante
Let's assume, for a minute, that the four blue sections are TMUs, the red sections are shader clusters, and the two yellow sections down at the bottom right are ROP bundles. This would, we assume, produce a core configuration of 320:16:8. Now, if you measure out the sizes of these, you get only 28.9% of the total GPU logic space, just 21.48mm². What's going on with the other 52.9mm² of GPU logic? There's probably a DSP and ARM on there, but that accounts for a couple of mm² at most.
There are basically two possibilities here:
- The 320:16:8 core config is accurate, and there's ~50mm² of custom logic or "secret sauce" (more than twice the size of the conventional GPU logic).
- The 320:16:8 core config isn't accurate.
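For anyone who wants to check my numbers, here's a quick sketch of the arithmetic in Python. The three input figures are the ones quoted above; the variable names are just my own labels, and the 21.48mm² is my measurement of the coloured blocks, so treat it as an estimate rather than a hard fact.

```python
# Inputs quoted in the post. Names are illustrative labels only.
TOTAL_DIE_MM2 = 146.48       # whole die area
GPU_LOGIC_FRACTION = 0.5078  # share of the die counted as "GPU logic"
IDENTIFIED_MM2 = 21.48       # measured area of the blue, red and yellow blocks (estimate)

# Area of everything except the memory pools and interfaces.
gpu_logic_mm2 = TOTAL_DIE_MM2 * GPU_LOGIC_FRACTION

# How much of the GPU logic the assumed 320:16:8 blocks actually cover.
identified_share = IDENTIFIED_MM2 / gpu_logic_mm2

# What is left over, and how it compares to the identified blocks.
unaccounted_mm2 = gpu_logic_mm2 - IDENTIFIED_MM2
ratio = unaccounted_mm2 / IDENTIFIED_MM2

print(f"GPU logic:        {gpu_logic_mm2:.2f} mm^2")
print(f"Identified share: {identified_share:.1%}")
print(f"Unaccounted:      {unaccounted_mm2:.1f} mm^2")
print(f"Unaccounted / identified: {ratio:.2f}x")
```

This gives 74.38mm² of GPU logic, 28.9% identified, 52.9mm² unaccounted for, and a ratio of roughly 2.46, which is where "more than twice the size" comes from.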
Here's the interesting thing about the second possibility: it challenges one assumption that has gone unquestioned during our analysis, namely that all shaders are equal. What if they aren't?
What if Nintendo has gone for an asymmetrical architecture? What if they've decided that some of the shaders will be optimised for some tasks, and some for others? This doesn't necessarily require a complete reworking of the shader microarchitecture, and it could be as simple as having different shader clusters with different amounts of register memory. The ones with lots of register memory would be suited for compute tasks (we can assume that these are the red ones) and the others could be dedicated to graphical tasks with low memory reuse (the blue squares above the red squares might be a fit for these).
Why would I think Nintendo would do something like this? Well, for one, they've done exactly the same thing with the CPU. Although this is pending the CPU die photo, it appears that Nintendo have gone with three identical cores with very different amounts of cache, with two of the cores getting 512KB of L2 each, and the other core getting 2MB of L2. The assumed reason for this is that different threads are naturally going to have different cache requirements, so if developers are writing code specifically for the hardware, they can run cache-intensive threads on the cacheful core, and less cache-intensive threads on the other cores. The logic would be the same in this case. Not all GPU threads are created equal, so why give them all equal resources? Why not have register-heavy shader bundles for compute tasks to run on, alongside register-light shader bundles for other tasks?
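To make the "cache-hungry threads go on the cacheful core" idea concrete, here's a toy sketch in Python. Only the L2 sizes (two cores at 512KB, one at 2MB) come from the above. The core labels, the thread names, the working-set figures and the `place_threads` helper are all made up for illustration, and a real developer would be pinning threads by hand rather than running anything like this.

```python
# L2 sizes per core, in KB, as described in the post.
# The core labels are hypothetical; the post does not say which core is which.
CORE_L2_KB = {"big": 2048, "small_a": 512, "small_b": 512}


def place_threads(working_sets_kb, core_l2_kb=CORE_L2_KB):
    """Greedy toy placement: largest working set gets the largest cache.

    Assumes at most one thread per core; extra threads are simply left
    unplaced because zip() stops at the shorter sequence.
    """
    threads = sorted(working_sets_kb, key=working_sets_kb.get, reverse=True)
    cores = sorted(core_l2_kb, key=core_l2_kb.get, reverse=True)
    return dict(zip(threads, cores))


# Invented example workload: estimated cache working set per thread, in KB.
threads = {"physics": 1500, "audio": 200, "game_logic": 400}
placement = place_threads(threads)

for name, core in placement.items():
    print(f"{name} -> {core} ({CORE_L2_KB[core]} KB L2)")
```

The same shape of reasoning would apply to the GPU idea: swap "L2 per core" for "register memory per shader cluster" and "threads" for compute versus graphics workloads.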
I don't know as much about texture units and ROPs, but could the same principle be applied to them? Might there be different texture units specialised for different tasks? Could we have asymmetrical ROPs, for instance with some specialised for the Gamepad?