I believe the NXP chip was an interesting possibility, but BC does kind of throw a monkey wrench into that idea. Could Nintendo have just bought the IP from Macronix at some point and be free to make their own changes?
Yep, it's even possible that Nintendo owned the IP outright from the start, as it was apparently a highly customised design.
This definitely leaves some more possibilities open. It did somewhat irk me that if Nintendo stuck to clean multipliers, it would seem to arbitrarily keep some clocks extra low (mainly RAM and CPU), and only because you're working from the least important part of your system (the DSP) as a base. I disagree with your assessment that there are 640 ALUs on the GPU (I think half that is more likely), but when I lowered my expectations in that regard, I figured they'd be able to crank the clock slightly higher (close to 600 MHz). Just to grab a number out of the air, 575 MHz with 320 SPUs would get you to 368 GFLOPS, just about 1.5x Xenos.
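For what it's worth, here's the back-of-envelope arithmetic behind those numbers, counting each shader ALU as one multiply-add (2 FLOPs) per cycle and taking Xenos at its commonly quoted 240 ALUs at 500 MHz:

```python
# Quick sanity check on the GFLOPS figures above.
def gflops(alus, clock_mhz, flops_per_alu=2):
    # One multiply-add per ALU per cycle = 2 FLOPs.
    return alus * flops_per_alu * clock_mhz / 1000.0

wiiu_guess = gflops(320, 575)   # 368.0 GFLOPS
xenos      = gflops(240, 500)   # 240.0 GFLOPS
print(wiiu_guess, wiiu_guess / xenos)   # ~368 GFLOPS, ~1.53x Xenos
```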
As already mentioned, 320 would result in a much smaller die, even with massive eDRAM overhead and some seriously transistor-intensive customisation of the GPU architecture. It's possible that we're looking at 480, if Nintendo have added a lot of transistors in there. Though I still think an RV740 base is likely;
have a look at this RV770 die shot (I couldn't find one for the RV740). Removing the GDDR5 and PCIe interfaces from the R700 architecture frees quite a bit of space on the die.
The system RAM I'd run as high as possible (800 MHz), and the CPU at twice that for a clean 1.6 GHz. As for the eDRAM, I'd still be surprised if it wasn't running at the same speed as the GPU, but perhaps it's also running at 800 MHz in order to match the bandwidth of the main memory pool (assuming there is a 128-bit connection from CPU to GPU). Just a huge guess here, but wouldn't it be better for the eDRAM to wait around for the GPU a few extra cycles (it could even be serving the CPU in that time) than the other way around?
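Purely to illustrate the bandwidth-matching part of that guess: at 800 MHz, a 128-bit link moves the same data per second as DDR3-1600 on a 64-bit bus. The 128-bit CPU-GPU link is the assumption stated above; the 64-bit DDR3 bus width is my own guess for the sake of the example.

```python
# Rough bandwidth matching for the guess above; bus widths are assumptions.
def gb_per_s(clock_mhz, bus_bits, transfers_per_clock=1):
    return clock_mhz * transfers_per_clock * bus_bits / 8 / 1000.0

ddr3 = gb_per_s(800, 64, transfers_per_clock=2)  # DDR: 2 transfers/clock -> 12.8 GB/s
link = gb_per_s(800, 128)                        # 800 MHz x 128-bit      -> 12.8 GB/s
print(ddr3, link)  # both come out at 12.8 GB/s, i.e. matched bandwidth
```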
My point on the identical speeds of the DDR3 and eDRAM is that you could have a single memory controller, operating at 800MHz, controlling access to both. Thus the CPU could just have a bus to that memory controller, rather than two separate busses for MEM1 and MEM2. Similarly the ARM cores and DSP could run all memory access through the same controller.
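Here's a toy sketch of what I mean by a single controller fronting both pools: every client (CPU, GPU, ARM cores, DSP) issues requests to one place, and the controller routes by address range instead of each client needing separate MEM1/MEM2 buses. The addresses and pool sizes below are made up purely for illustration.

```python
# Toy illustration of one shared memory controller routing by address range.
MEM1_BASE, MEM1_SIZE = 0x00000000, 32 * 1024 * 1024        # eDRAM pool (MEM1)
MEM2_BASE, MEM2_SIZE = 0x10000000, 2 * 1024 * 1024 * 1024  # DDR3 pool (MEM2)

def route(addr):
    """Single shared controller: pick the backing pool for a physical address."""
    if MEM1_BASE <= addr < MEM1_BASE + MEM1_SIZE:
        return "eDRAM (MEM1)"
    if MEM2_BASE <= addr < MEM2_BASE + MEM2_SIZE:
        return "DDR3 (MEM2)"
    raise ValueError("unmapped address")

print(route(0x00001000))  # eDRAM (MEM1)
print(route(0x10400000))  # DDR3 (MEM2)
```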
On the synchronised clocks side of things, there's very little reason for the CPU and DDR3 to be highly synchronised. Total random access latency for DDR3 is somewhere around 50ns. If you have CPU and RAM on asynchronous clocks, you increase that latency to perhaps 51ns. It's not a big enough difference to make artificially reducing your CPU clock worthwhile.
With eDRAM, though, the benefit is (largely) in the very low latencies involved. If you had eDRAM which operated at single-cycle latency, then asynchronous clocks between GPU and eDRAM could double that to two cycles (even if the eDRAM's operating at a higher clock), which is a significant increase. This isn't going to be much of an issue for the eDRAM's use as a framebuffer, which is largely concerned with high-bandwidth writes, but once you start pushing compute loads to the GPU it could certainly become an issue.
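Putting rough numbers on those two cases, just to show how differently the pools are affected. The ~1 ns clock-crossing penalty and the single-cycle eDRAM latency are illustrative assumptions, not measurements:

```python
# Relative cost of crossing an asynchronous clock boundary in each case.
def added_latency(base_ns, crossing_ns):
    return (base_ns + crossing_ns) / base_ns - 1.0

gpu_mhz = 575.0                   # the guessed GPU clock from earlier
gpu_cycle_ns = 1000.0 / gpu_mhz   # ~1.74 ns per GPU cycle

ddr3 = added_latency(50.0, 1.0)                    # ~50 ns -> ~51 ns: +2%
edram = added_latency(gpu_cycle_ns, gpu_cycle_ns)  # 1 cycle -> 2 cycles: +100%
print(f"DDR3 random access: +{ddr3:.0%}")
print(f"eDRAM access:       +{edram:.0%}")
```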
Consider, for example, Unreal Engine 4. UE4's main innovation is a lighting system based on sparse voxel octree global illumination (SVOGI). The final part of SVOGI consists of running cone-traces over the octree to determine the second-bounce illumination over the scene. This is heavily latency-bound code which is intended to be run on the GPU, and it's occurred to me recently that, if it is true that Epic have decided to support Wii U with UE4, it's likely because they've figured out a way to keep chunks of the octree in the eDRAM during the cone-traces, benefitting from the incredibly low latency the eDRAM provides. Asynchronous GPU/eDRAM clocks would give extra bandwidth for traditional GPU tasks, but it would significantly hinder latency-bound GPU code like this, which is likely to become more and more common as the generation progresses.
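To be clear, this isn't Epic's actual code, just a minimal sketch of why that kind of traversal is latency-bound: every step is a dependent read of the node the previous step returned, so the fetch latency sits on the critical path each iteration, unlike framebuffer writes, which mostly just need bandwidth.

```python
# Minimal illustration of pointer-chasing tree traversal (not Epic's SVOGI).
class Node:
    def __init__(self, children=None, value=None):
        self.children = children or []   # up to 8 children in an octree
        self.value = value

def descend(root, pick_child, depth):
    node = root
    for _ in range(depth):
        # pick_child() can't run until `node` has been fetched, and the next
        # fetch can't start until it returns: the latency can't be hidden.
        node = node.children[pick_child(node)]
    return node

# Tiny 2-level example: always walk into child 0.
leaf = Node(value="leaf")
root = Node(children=[Node(children=[leaf] * 8)] * 8)
print(descend(root, lambda n: 0, 2).value)   # -> "leaf"
```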
What currently stumps me with regard to the GPU's connection to the eDRAM is the fact that some games slow down in scenes with heavy alpha blending. Given that I expect even the worst ports to put their framebuffers into the eDRAM, I wonder what might cause this.
Is it possible that this is something as simple as a poorly-optimised API?
Edit: But perhaps we have evidence to the contrary. Wii U games are reportedly experiencing slowdown in scenes utilizing alpha blending. Might this point towards a slower connection than the one between Xenos' eDRAM and ROPs?
The bandwidth between the GPU and daughter die is much more important in Xenos; it's the bottleneck that Wii U's eDRAM would have to exceed. Based on my calculations from above (and assuming the link between GPU and ROPs/eDRAM is 32 GB/s on Xenos, which I know is disputed), the worst-case scenario is a 60% increase in bandwidth over that.
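One set of numbers that lands on that 60% figure, purely for illustration: eDRAM at 800 MHz on a 512-bit interface. The 512-bit width is my own guess here, not a known spec, and the 32 GB/s Xenos link is the disputed figure mentioned above.

```python
# Worst-case comparison against the Xenos GPU->daughter-die link.
def gb_per_s(clock_mhz, bus_bits):
    return clock_mhz * bus_bits / 8 / 1000.0

xenos_link = 32.0             # GB/s, GPU -> ROPs/eDRAM on Xenos (disputed)
wiiu_edram = gb_per_s(800, 512)   # 51.2 GB/s with the assumed 512-bit interface
print(wiiu_edram, wiiu_edram / xenos_link - 1.0)   # 51.2 GB/s, +60%
```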