Could the GPGPU make up for the lack of new SIMD units?
Lack of SIMD units in the CPU? Yes, sort of, or at least that seems to be Nintendo's intent. It's the case nowadays (as I said in the other thread) that when it comes to streaming SIMD-heavy code, GPUs handily beat CPUs in the Gflops/watt and Gflops/mm^2 metrics, which are exactly the metrics you want to optimise when designing a games console with a strict budget and a strict power/heat envelope to stay within. That said, I think it's still worth having some SIMD functionality on the CPU to handle code that wouldn't run well on the GPU but would still benefit from proper SIMD support. For that reason I assumed Nintendo would add a SIMD unit (or an A2-style FPU/SIMD hybrid) to just one of the cores to handle those kinds of tasks, although it seems that isn't the way they've gone. We'll have to wait and see whether that was a good idea or not.

Then again, Blu posted a test of Broadway's capabilities running SIMD-heavy code a page or two ago, and it actually didn't perform as badly as you might expect (and that's without even using paired singles), so it might not be as big a deal as I'd been assuming.
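To make it concrete, here's roughly the kind of streaming, SIMD-friendly kernel I'm talking about (just an illustrative sketch, not code from any actual SDK):

```c
/* A saxpy (y = a*x + y) over large arrays: every iteration is
 * independent, so it maps cleanly onto a GPU's shader arrays, but it
 * also benefits from CPU SIMD (e.g. Broadway's paired singles, which
 * process two floats per instruction). Function name and signature
 * are purely illustrative. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

It's code like this that a GPGPU-centric design expects you to move off the CPU entirely; the question is how much game code actually fits that mould.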
Also, I'm actually starting to think that the crazy 550GB/s eDRAM bandwidth might not be quite as crazy as I'd originally thought. I was reading through a description of the XBox360's memory subsystem, and it occurred to me that the 256GB/s bandwidth from ROPs to eDRAM wasn't actually overkill in the way I was assuming. I'd thought it was simply a matter of a 4096-bit interconnect being the only possible configuration for 10MB of eDRAM at the time, and hence MS just went with it and only ever used a few tens of GB/s of it. In fact, it's the other way round: the 256GB/s is the peak theoretical throughput they calculated the ROPs could demand, so anything less could have become a bottleneck (and in this particular case, a 4096-bit interconnect isn't really all that expensive).
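The arithmetic behind that 256GB/s figure is straightforward, for what it's worth. Here's a quick sanity check using the publicly quoted numbers (a 4096-bit bus at the 500MHz clock), with GB meaning 10^9 bytes:

```c
/* Back-of-the-envelope check of the XBox360's ROP<->eDRAM bandwidth:
 * bus width (bits) / 8 * clock = bytes per second. */
#include <stdio.h>

int main(void)
{
    double bus_bits = 4096.0;   /* ROP-to-eDRAM interconnect width */
    double clock_hz = 500e6;    /* daughter-die clock              */
    double bytes_per_sec = bus_bits / 8.0 * clock_hz;
    printf("XBox360 ROP<->eDRAM: %.0f GB/s\n", bytes_per_sec / 1e9); /* 256 */
    return 0;
}
```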
That brings me to the Wii U. Let's take it as given that my block diagram of memory access was correct, and that we're looking at a 420:24:12 configuration for the GPU (that's SPUs:texture units:ROPs, in Wikipedia numbering). So, there would be three ROP "bundles" and six texture unit "bundles", with each texture unit bundle aligned with an array of 70 SPUs. Now, if the XBox360's 2 ROP units (8 ROPs by Wikipedia numbering) each require a 2048-bit interconnect to avoid bottlenecking, then surely Wii U's 3 ROP units would require the same per unit? That comes to 6144 bits of interconnect taken up just by the ROPs to keep per-ROP parity with the XBox360. Then we've got the LSUs attached to the SIMD arrays, of which there are six. Let's say you want to give each of them a 256-bit connection, which isn't so crazy when you consider that each one is feeding 77 GFlops of processing power, and you're up to a total of 7680 wires coming from the eDRAM. Add a 512-bit connection to the CPU (which is feasible with the GPU and CPU on an MCM) and you're up to an 8192-bit-wide total interconnect and 550GB/s of theoretical bandwidth. Of course the full 550GB/s would never be reached in the real world (not even close), but it might need to be that high to give each component the bottleneck-free connection it requires.
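For anyone wondering where the 77 GFlops per array figure comes from: it's just 70 SPUs doing a multiply-add (2 flops) per cycle at the 550MHz GPU clock I've been assuming, which is itself unconfirmed:

```c
/* Per-array throughput under my assumptions: 70 SPUs x 2 flops/cycle
 * (one fused multiply-add) x 550 MHz. The clock is speculative. */
#include <stdio.h>

int main(void)
{
    double spus = 70.0, flops_per_cycle = 2.0, clock_hz = 550e6;
    printf("%.0f GFlops per SPU array\n",
           spus * flops_per_cycle * clock_hz / 1e9);  /* prints 77 */
    return 0;
}
```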
You'd then have a distribution of bandwidth that looks like this (there's a quick tally sketch below the list):
ROPs: 2048 bit / 137.5GB/s x 3 = 6144 bit / 412.5GB/s
SPUs: 256 bit / 17.2GB/s x 6 = 1536 bit / 103.1GB/s
CPU: 512 bit / 34.4GB/s
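Here's that tally as a little program, in case anyone wants to fiddle with the widths themselves. The only inputs are the rumoured 550GB/s aggregate and the bus widths I've guessed at; each client's share just scales with its width (incidentally, this implies an eDRAM clock of roughly 537MHz):

```c
/* Tally of the speculative 8192-bit eDRAM interconnect split.
 * All widths are my guesses; only the 550 GB/s aggregate is rumoured. */
#include <stdio.h>

int main(void)
{
    const double total_gbs = 550.0;
    const int widths[] = { 2048, 2048, 2048,             /* 3 ROP bundles   */
                           256, 256, 256, 256, 256, 256, /* 6 SPU/LSU ports */
                           512 };                        /* CPU connection  */
    const int n = sizeof widths / sizeof widths[0];

    int total_bits = 0;
    for (int i = 0; i < n; i++)
        total_bits += widths[i];                         /* comes to 8192 */

    printf("total: %d bits at %.1f GB/s\n", total_bits, total_gbs);
    for (int i = 0; i < n; i++)
        printf("client %2d: %4d bit -> %5.1f GB/s\n",
               i, widths[i], total_gbs * widths[i] / total_bits);
    return 0;
}
```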
Of course, I wouldn't necessarily bet on the 8192-bit scenario being true (for example, you could halve all those numbers to get a 4096-bit scenario that might work just as well, for all I know), but after reading up on the XBox360 a bit, I can no longer rule it out completely.
Anyway, after a year and a half of speculation and discussion with you fine folks, I finally have a Wii U sitting in a box next to me, and I'm about to head home to plug it in and start playing. So you'll have to excuse me if I start posting a whole lot less for the next few days.