ROPless GPUs and Transactional Memory
For those who haven't read it, Popstar posted this over in the technical discussion thread:
*Random thinking out loud probably not related to the actual Wii U GPU*
If you have all that memory embedded right on the GPU and accessible to the shader units with low latency, do you need conventional ROP hardware at all? Or can you just do blending in the shader like a PowerVR / Tegra chip? Perhaps with mini-rops for Z / stencil test?
Someone later posted a link to this blog post, which explains why blending in shaders is usually a disaster. As far as my (limited) understanding goes, there are essentially two things a GPU needs for blending in shaders to work.
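The core hazard, as I understand it, is that blending is a read-modify-write operation on the framebuffer. A toy sketch (my own illustration, not real shader code) of standard alpha blending makes the problem visible:

```python
# Why in-shader blending is hard: alpha blending is a three-step
# read-modify-write sequence on the framebuffer (hypothetical model).
def blend_pixel(framebuffer, xy, src_rgb, alpha):
    dst = framebuffer[xy]                       # 1. read the current pixel
    out = tuple(alpha * s + (1.0 - alpha) * d   # 2. blend src over dst
                for s, d in zip(src_rgb, dst))
    framebuffer[xy] = out                       # 3. write the result back
    return out

# If two shader threads interleave these three steps on the same pixel,
# one thread's read goes stale and its result overwrites the other's:
# the classic lost-update problem that fixed-function ROPs avoid by
# serialising per-pixel access.
```

Fixed-function ROP hardware dodges this by owning the write port and ordering the blends itself; a shader-based scheme has to solve it some other way.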
The first is a low-latency memory pool large enough to hold both the framebuffer and Z-buffer. Wii U has this in the 32MB MEM1 pool of eDRAM.
The second, and I believe more important, aspect is that the GPU needs a fully transactional interface to this memory pool. (For a description of what I mean by transactional, have a read through this Ars article.) A GPU with blend shaders is an almost perfect example of the problem transactional memory is designed to solve: a large number of units performing operations on a common memory pool, where the small granularity of the data access makes a conventional locking scheme almost completely infeasible. Even more so on the Wii U, where three CPU cores would also be contending for access.
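To put the granularity problem in rough numbers (assuming a 1280x720 render target, which I'm using purely as an illustration): a per-pixel lock table would itself eat megabytes of fast memory, before you even count the cost of taking and releasing a lock around every blend.

```python
# Back-of-the-envelope cost of per-pixel locking (1280x720 is an
# assumed resolution for illustration, not a measured figure).
pixels = 1280 * 720            # 921,600 lockable locations
lock_table_bytes = pixels * 4  # one 32-bit lock word per pixel
# ~3.5 MB of fast memory spent on locks alone, and every blend
# still serialises on its lock word.
```

Coarser locks (per-tile, say) shrink the table but make unrelated pixels contend with each other, which is exactly the trade-off transactional memory sidesteps.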
So, how would a transactional memory interface for the eDRAM be implemented? In the BlueGene/Q chip it's implemented in a shared cache, but that isn't strictly necessary; all we actually need is a buffer. This buffer would operate in a (relatively) simple manner. Every time a thread starts an atomic op on the eDRAM, all reads and writes within that op are kept in the buffer. When the atomic op finishes, the buffer logic checks whether the data it read have since been changed by another unit: if not, it commits the buffered writes; if they have, it cancels the op and tells the thread to retry.
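That commit-or-retry cycle can be sketched in software. This is a toy model of the buffer logic described above, nothing more; the class name and the dict standing in for the eDRAM are my own inventions:

```python
class TransactionalBuffer:
    """Toy model of an optimistic transactional buffer sitting in
    front of a memory pool (a dict stands in for the eDRAM here)."""

    def __init__(self, memory):
        self.memory = memory

    def run_atomic(self, op, max_retries=10):
        for _ in range(max_retries):
            reads, writes = {}, {}             # the per-op buffer

            def load(addr):
                if addr in writes:             # see our own buffered writes
                    return writes[addr]
                reads.setdefault(addr, self.memory[addr])
                return reads[addr]

            def store(addr, value):            # buffered, not yet visible
                writes[addr] = value

            op(load, store)
            # Validate: has anything we read changed underneath us?
            if all(self.memory[a] == v for a, v in reads.items()):
                self.memory.update(writes)     # commit the buffered writes
                return True
            # Conflict: discard the buffer and retry the op.
        return False
```

A conflicting write landing between the read and the commit fails validation, and the op transparently retries against the new value, which is exactly the behaviour you'd want from hardware arbitrating blend shaders and CPU cores.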
Because the eDRAM is just 32MB, and framebuffer shader operations would touch very small pieces of data at a time, a transactional buffer on Latte wouldn't actually need to be very big, but it would need to be very fast. Latte has a perfect candidate for this in the 1MB of SRAM up in the left corner of the die. Block A (and possibly B) would house the necessary logic. The transactional buffer could handle MEM1 access for the texture units, pixel shaders, blend shaders, CPU cores, and possibly even the ARM core and DSP, and should be able to do so with a near-negligible increase in latency. In fact, since the SRAM is already there for BC, such a transactional interface would be useful for Latte even without blend shaders, given the potential difficulty of managing MEM1 between so many components.
Now, there is one issue with the notion of blending via pixel shaders, and it's this: if someone had solved the blending-in-shaders problem, wouldn't you expect the resulting GPU to devote a larger proportion of its die area to shader bundles? We, however, seem to have a lower shader-to-die-area ratio than we would have expected. In that case, it seems we'd be looking at dedicated blend shader units distinct from the pixel shaders.