
WiiU technical discussion (serious discussions welcome)

Thraktor

Member
I went with 15 GFlops/Watt as the top-end of what I could imagine a customized and improved Turks GPU delivering, which got me to the 450 GFlops upper limit. The point was to show that the pre-launch "3x" rumours for the GPU aren't really viable even under ideal assumptions, given what we now know about die sizes and power consumption.

Well, sort of. If it's a 40nm part, then a RV740 640:32:16 configuration at about 480MHz should give a greater gflops/w ratio than the Turks Pro 480:24:8 at 650MHz.

Alternatively, if we want to talk about "ideal" assumptions we have to consider the fact that we don't know what process the GPU is made on. The most likely bet is certainly 40nm, but 28nm isn't completely out of the question, and a wide R700-based chip on a 28nm process operating below 500MHz would give much better efficiency than any of AMD's GCN-based GPUs, which are more transistor-intensive and are all clocked in the 800MHz-1GHz range. A 28nm chip is unlikely, yes, but it can't be ruled out without hard evidence to the contrary, and any ideal case would have to take it into account.
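For anyone who wants the arithmetic behind that 450 GFlops ceiling, here's a minimal sketch. Both inputs are assumptions: the 15 GFlops/Watt efficiency figure from the post above, and a rough ~30W share of the console's measured draw going to the GPU (my guess, not a confirmed number).

Code:
# Rough upper bound on Wii U GPU throughput from an efficiency assumption.
# Both inputs are assumptions for illustration, not measured values.
gflops_per_watt = 15.0      # assumed best case for an improved 40nm VLIW5 part
gpu_power_budget_w = 30.0   # assumed share of total console draw going to the GPU

ceiling_gflops = gflops_per_watt * gpu_power_budget_w
print(ceiling_gflops)       # -> 450.0 GFLOPS upper limit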
 
Well, sort of. If it's a 40nm part, then a RV740 640:32:16 configuration at about 480MHz should give a greater gflops/w ratio than the Turks Pro 480:24:8 at 650MHz.

Alternatively, if we want to talk about "ideal" assumptions we have to consider the fact that we don't know what process the GPU is made on. The most likely bet is certainly 40nm, but 28nm isn't completely out of the question, and a wide R700-based chip on a 28nm process operating below 500MHz would give much better efficiency than any of AMD's GCN-based GPUs, which are more transistor-intensive and are all clocked in the 800MHz-1GHz range. A 28nm chip is unlikely, yes, but it can't be ruled out without hard evidence to the contrary, and any ideal case would have to take it into account.



Well, I doubt a RV740 at 480Mhz would draw something like 30W.
 

Thraktor

Member
Well, I doubt a RV740 at 480Mhz would draw something like 30W.

When you consider that (a) the 4770 has energy-sucking GDDR5 and other assorted sundries which the Wii U's GPU doesn't and (b) energy consumption has a convex relationship to clock speed*, it might not be that far off. I wasn't choosing it as a specific claim of what's in the Wii U, though, just as an example of something with similar performance to Turks at a lower TDP.

*Technically energy consumption is more directly related to voltage, but the statement is true in general for a device which isn't under/over-volted.
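To make the convexity point concrete, here's a toy dynamic-power model (the voltage and clock figures are purely illustrative assumptions, not RV740 measurements): dynamic power scales roughly with C·V²·f, and a lower clock usually allows a lower voltage, so power drops faster than linearly with frequency.

Code:
# Toy dynamic-power model: P ~ C * V^2 * f (capacitance C cancels in the ratio).
# Assumes voltage can be reduced along with frequency; real DVFS curves are chip-specific.
def relative_power(f_new, f_ref, v_new, v_ref):
    return (v_new / v_ref) ** 2 * (f_new / f_ref)

# Illustrative numbers only: 750MHz @ 1.00V (stock-ish) vs. 480MHz @ 0.85V (downclocked/undervolted).
print(relative_power(480, 750, 0.85, 1.00))   # ~0.46 -> well under half the dynamic power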
 

OryoN

Member
I think if Nintendo weren't so obsessed with low power consumption, the Wii U could have been more of an improvement.
I mean, would it have killed anyone if the console ate something more like 60-70W instead of that 30-40W?

In a perfect world, yeah, 20-30 more watts couldn't have hurt, but it would also mean that they'd have to make concessions elsewhere. Form factor has become a selling point for Nintendo these days too, and so they obviously felt strongly about that, even though some of us don't care. It's funny, because some journalists went as far as to call the console "huge". I shudder to think what PS4/Xbox Next will look like in their eyes.

With all other elements being equal, more power = more silicon = more heat = bigger cooling system = bigger console... all at more $. According to Nintendo, they believe they've found the perfect balance of performance per watt at a given size and price. Of course, you can't blame anyone for wanting more, but based on those factors, what the Wii U does with "so little" is already pretty impressive, and it will only get better as the console is pushed.
 

AzaK

Member
The only thing that still annoys me is that no one seems to have measured the power consumption with a larger variety of games. Since it's a more modern GPU than in the other consoles, the difference between low-load power consumption and high-load power consumption could also be greater.

I want to know this too.

In a perfect world, yeah, 20-30 more watts couldn't have hurt, but it would also mean that they'd have to make concessions elsewhere. Form factor has become a selling point for Nintendo these days too, and so they obviously felt strongly about that, even though some of us don't care. It's funny, because some journalists went as far as to call the console "huge". I shudder to think what PS4/Xbox Next will look like in their eyes.

With all other elements being equal, more power = more silicon = more heat = bigger cooling system = bigger console... all at more $. According to Nintendo, they believe they've found the perfect balance of performance per watt at a given size and price. Of course, you can't blame anyone for wanting more, but based on those factors, what the Wii U does with "so little" is already pretty impressive, and it will only get better as the console is pushed.

I just feel like a bit more BOM, a bit bigger console, a bit more power usage and even a bit more cost to the consumer could have really helped in a few main ways:

1) It could have given it a really good first showing WRT 360/PS3 (i.e. great marketing)
2) Those launch ports might have been able to run at a slightly higher framerate or resolution
3) It would have set it up a bit better against the Orbis/Durango.

I realise there's a line you have to draw in the sand and say "this is it", and that you could always have gone a little higher or lower, but it's looking like Nintendo just went a bit too far in the "too low" direction. We'll see soon enough, I guess. I really hope Retro's game is a graphical showpiece for the system.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Ok, a heretical thought occurred to me on my way back home from work today re the asymmetric caches. I think I'll toss it here.

What if one of the CPUs has so much more cache than the others because.. it just makes sense from an application's point of view? Let me explain.

Imagine you have a fixed silicon budget - you have 3 cores, and a fixed amount of cache to go with those at a given fab node. Your first thought would be to make it all symmetrical, yes? Well, perhaps not. Because you know two things:

1. Your system RAM is of less-than-stellar characteristics (and even if BW was much better, latency would not have been).
2. More often than not, games treat their multiple cores as domain workers - e.g. core 0 does the physics, core 1 does the draw calls, core 2 does the game logic, core 3 does the sound (blessed be those DSP-less systems), etc. Now, of those domains, not all workloads tend to have equal memory access patterns - some can do with more streaming-style access patterns (i.e. easier to handle with smaller caches), whereas others can be a cache's worst nightmare. By making things absolutely symmetrical you'd give some of those domains more cache than they'd actually need, while at the same time depriving other domains of the cache they could make good use of.

So, think again - would you be going symmetrical?
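As a purely illustrative sketch of that idea (the domain-to-core mapping below is hypothetical; the cache split just follows the rumoured 2x512KB + 1x2MB figures), you'd park the streaming-friendly domains on the small-cache cores and the pointer-chasing, cache-hungry domains on the big one:

Code:
# Hypothetical mapping of engine domains onto asymmetric cores.
# Cache sizes follow the rumoured 2x512KB + 1x2MB split; the domain assignments
# are just an example of the reasoning in the post above, not anything confirmed.
cores = {
    "core0": {"l2_kb": 512,  "domains": ["audio mix", "particle update"]},           # streaming-friendly
    "core1": {"l2_kb": 512,  "domains": ["draw call build", "animation"]},           # streaming-friendly
    "core2": {"l2_kb": 2048, "domains": ["game logic", "AI", "physics broadphase"]}, # cache-hungry
}

for name, core in cores.items():
    print(f"{name}: {core['l2_kb']:>4} KB L2 -> {', '.join(core['domains'])}")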
 

cyberheater

PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 Xbone PS4 PS4
Ok, a heretical thought occurred to me on my way back home from work today re the asymmetric caches. I think I'll toss it here.

What if one of the CPUs has so much more cache than the others because.. it just makes sense from an application's point of view? Let me explain.

Imagine you have a fixed silicon budget - you have 3 cores, and a fixed amount of cache to go with those at a given fab node. Your first thought would be to make it all symmetrical, yes? Well, perhaps not. Because you know two things:

1. Your system RAM is of less-than-stellar characteristics (and even if BW was much better, latency would not have been).
2. More often than not, games treat their multiple cores as domain workers - e.g. core 0 does the physics, core 1 does the draw calls, core 2 does the game logic, core 3 does the sound (blessed be those DSP-less systems), etc. Now, of those domains, not all workloads tend to have equal memory access patterns - some can do with more streaming-style access patterns (i.e. easier to handle with smaller caches), whereas others can be a cache's worst nightmare. By making things absolutely symmetrical you'd give some of those domains more cache than they'd actually need, while at the same time depriving other domains of the cache they could make good use of.

So, think again - would you be going symmetrical?

Interesting idea. Do you think there are more than 3 cores in the Wii U CPU? If there are only 3 logical/physical cores then it would make sense to have a symmetrical design.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
I'll quote myself from earlier in the thread:

I don't think it's that crazy an idea. The assumption that since the cache sizes are different the cores must be different is just that, an assumption.
My apologies for not seeing your post back then. I agree it's not a crazy idea at all - to the point of a 1:4 asymmetry ratio being OK.
 
Ok, a heretical thought occurred to me on my way back home from work today re the asymmetric caches. I think I'll toss it here.

What if one of the CPUs has so much more cache than the others because.. it just makes sense from an application's point of view? Let me explain.

Imagine you have a fixed silicon budget - you have 3 cores, and a fixed amount of cache to go with those at a given fab node. Your first thought would be to make it all symmetrical, yes? Well, perhaps not. Because you know two things:

1. Your system RAM is of less-than-stellar characteristics (and even if BW was much better, latency would not have been).
2. More often than not, games treat their multiple cores as domain workers - e.g. core 0 does the physics, core 1 does the draw calls, core 2 does the game logic, core 3 does the sound (blessed be those DSP-less systems), etc. Now, of those domains, not all workloads tend to have equal memory access patterns - some can do with more streaming-style access patterns (i.e. easier to handle with smaller caches), whereas others can be a cache's worst nightmare. By making things absolutely symmetrical you'd give some of those domains more cache than they'd actually need, while at the same time depriving other domains of the cache they could make good use of.

So, think again - would you be going symmetrical?

Good stuff, blu. I'm pretty satisfied with this explanation. So what domains in particular would the extra cache serve to benefit? AI?
 

Durante

Member
Ok, a heretical thought occurred to me on my way back home from work today re the asymmetric caches. I think I'll toss it here. [...]
So, think again - would you be going symmetrical?
I agree with your speculation, since it's basically the inverse argument of what I said a page back ;)
This premise is a bit too simplified IMHO. If "chewing through more data" just means doing more streaming computations, then in many cases more cache won't help you since there just isn't enough temporal reuse.

In fact, if you had 2 almost identical cores, the only difference being that one has a small cache and the other a large cache, you'd probably want to do streaming calculations with a low reuse factor on the former and things like latency-bound data structure traversal (which is usually not SIMD) on the latter.

The reason I posted that is that experimenting with cores which differ in cache size and how to efficiently automatically schedule applications across them is one thing we are investigating in a project related to low-power embedded systems. If you can expect people to manually decide which tasks to place on which core (as you probably can in a console) then all the better!
 

wsippel

Banned
I did some numerology again:

I was told the DSP would be running at 120MHz. Looking at Nintendo's MO, it's probably not really 120MHz, but 121.5MHz - same base clock as the Wii. Nintendo likes clean multipliers, so I would assume the RAM to be clocked at 729MHz (6 x 121.5). Same as the Wii CPU. Nintendo likes to keep RAM and CPU in sync, so the CPU should be running at 1458MHz (12 x 121.5). Accordingly, the GPU would be clocked at 486MHz (4 x 121.5), and the eDRAM at either 486 or 729MHz.

I don't know why Nintendo always seems to do this. I guess using a single fixed base clock and only changing multipliers for various components is simpler. And it definitely gives more predictable results. I don't see Nintendo giving that up.
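For what it's worth, here's the multiplier scheme wsippel is describing laid out as a quick sketch (all of these clocks are his speculation, not confirmed figures):

Code:
# Speculative clock tree built from a single 121.5MHz base, per the post above.
base_mhz = 121.5
multipliers = {"DSP": 1, "GPU": 4, "RAM": 6, "CPU": 12}

for part, mult in multipliers.items():
    print(f"{part}: {base_mhz * mult:g} MHz")
# DSP: 121.5 MHz, GPU: 486 MHz, RAM: 729 MHz, CPU: 1458 MHz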
 

efyu_lemonardo

May I have a cookie?
So, think again - would you be going symmetrical?

Very reasonable thought, considering this is a console. But, just to be certain, would it have been much more expensive to just go with a shared cache?

I don't know why Nintendo always seems to do this. I guess using a single fixed base clock and only changing multipliers for various components is simpler. And it definitely gives more predictable results. I don't see Nintendo giving that up.
Keeping the same base clock as the Wii would make it very easy to underclock for exact BC, but this isn't my area of expertise, so I'm not certain whether that's a good enough reason.
 

mrklaw

MrArseFace
The theory seems sound, and is basically what Durante was talking about before. But I'd like to see the examples of engine thread/domain usage from the big third party developers, which Nintendo used as evidence to customise like this. Or did they simply do it because that's how their in-house developers approach things?
 

AzaK

Member
Good stuff, blu. I'm pretty satisfied with this explanation. So what domains in particular would the extra cache serve to benefit? AI?

And something I'd want to know the answer to is, could this explain shit ports? Could engines like UE3 not be allocating the right thread to the right core? I assume it must have settings to adjust this.
 
I've been mulling over the CPU to GPU connection today. What would you guys consider a realistic bus width? Xenos has a 512-bit bus between its parent and daughter die. So I'd think anywhere between a 64-bit bus (as in Hollywood) and that is a possibility. Also noticed these quotes in the Wii U console Iwata Asks:

Takeda said:
This time we fully embraced the idea of using an MCM for our gaming console. An MCM is where the aforementioned Multi-core CPU chip and the GPU chip are built into a single component. The GPU itself also contains quite a large on-chip memory. Due to this MCM, the package costs less and we could speed up data exchange among two LSIs while lowering power consumption. And also the international division of labor in general, would be cost-effective.

Iwata said:
Compared to power flowing between chips in separate physical positions on the board, you can get by with less power inside a small module. The latency is also reduced, and the speed increases.

It definitely sounds to me that they sped up the connection between the GPU and CPU (so greater than the 64-bit connection of GCN/Wii) and this extra speed is coming from more than just the increased clocks of the 2 LSIs. Given what we know of the CPU, what would be an appropriate data rate in order to make accessing the GPU's 32 MB eDRAM actually useful? For example, if we are to take wsippel's proposed GPU/eDRAM clock of 486 Mhz and suppose a 256-bit bus, we'd get something like 15.5 GB/s data rate. Double the bus width to Xenos' 512 and we've got a decent ~31 GB/s. Would that last figure be feasible? Would it be overkill considering we're talking a modestly clocked CPU with no SMT and no VMX?
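Here's the arithmetic behind those two figures, assuming the speculative 486Mhz clock and one transfer per cycle (both assumptions, so treat the results as ballpark numbers):

Code:
# Bandwidth = bus width (bytes) * clock, assuming one transfer per cycle.
def bandwidth_gb_s(bus_bits, clock_mhz):
    return (bus_bits / 8) * clock_mhz * 1e6 / 1e9

print(bandwidth_gb_s(256, 486))   # 15.552 -> the ~15.5 GB/s figure above
print(bandwidth_gb_s(512, 486))   # ~31.1 GB/s for a 512-bit bus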
 

AzaK

Member
It definitely sounds to me that they sped up the connection between the GPU and CPU (so greater than the 64-bit connection of GCN/Wii) and this extra speed is coming from more than just the increased clocks of the 2 LSIs. Given what we know of the CPU, what would be an appropriate data rate in order to make accessing the GPU's 32 MB eDRAM actually useful? For example, if we are to take wsippel's proposed GPU/eDRAM clock of 486 Mhz and suppose a 256-bit bus, we'd get something like 15.5 GB/s data rate. Double the bus width to Xenos' 512 and we've got a decent ~31 GB/s. Would that last figure be feasible? Would it be overkill considering we're talking a modestly clocked CPU with no SMT and no VMX?

Silly question incoming. But if the main RAM <-> CPU is a paltry 17GB/s, then wouldn't EDRAM be much, much higher than that? Suggesting 31GB/s is not much more than the regular RAM speed of the 360/PS3.
 
Silly question incoming. But if the main RAM <-> CPU is a paltry 17GB/s, then wouldn't EDRAM be much, much higher than that? Suggesting 31GB/s is not much more than the regular RAM speed of the 360/PS3.

Not silly at all. I'm trying to wrap my head around it myself and may have gotten the calculations wrong. But the way I understand it, eDRAM is not dual data rate and the main benefits are that it can be placed on chip. Hence, you can take advantage of faster on-chip interconnects, increase bandwidth, and reduce latency a great deal without driving up heat/cost that would come from a high-bandwidth off-chip interface.
 

pottuvoi

Banned
I've been mulling over the CPU to GPU connection today. What would you guys consider a realistic bus width? Xenos has a 512-bit bus between its parent and daughter die.
Where did you get the 512bit bus from GPU to Daughter die?

This bus between the graphics core and the EDRAM die is a chip-to-chip bus (via substrate) operating at 1.8 GHz and 28.8 Gbytes/s.
An X360 white paper indicates it to be a 64-bit bus. (Edit: strangely, many other sources indicate the bandwidth to be 32GB/s.)
http://www.cis.upenn.edu/~milom/cis501-Fall08/papers/xbox-system.pdf


I do agree that on the Wii U the bus between CPU & GPU can be quite fast, which could mean a nice opportunity for CPU/GPU interaction.
 

Thraktor

Member
Shared L2 seems to be an outdated concept. I'm not aware of a single current processor using shared L2.

Bluegene/Q. The relevant fact here, though, is that IBM's eDRAM cache is designed as a shared cache, and the shared cache on all of IBM's CPUs is now implemented as eDRAM, whether it's L2, as in Bluegene/Q, or L3 as in Power7, or even L4, as in zEC12. This is what puzzles me a bit about the symmetric cores/asymmetric cache theory. On the one hand it makes sense if you know that the tasks handled by the cores are going to have very different cache requirements (which is a fair assumption in a video game console). On the other hand, if these cache requirements aren't explicitly known in advance, then wouldn't it make more sense to use a quasi-shared cache which dynamically re-allocates itself between cores as needed? Like, for example, the eDRAM cache that Nintendo's actually using in the Wii U's CPU...

The theory seems sound, and is basically what Durante was talking about before. But I'd like to see the examples of engine thread/domain usage from the big third party developers, which Nintendo used as evidence to customise like this. Or did they simply do it because that's how their in-house developers approach things?

At a guess, I'd say it's due to testing on their internal engines, but there were reports a while back about Nintendo tweaking the hardware after testing third party engines on it, so that may have been the cause. Alternatively, it could have been based on the performance of the bundled middleware from Havok and Autodesk.

I've been mulling over the CPU to GPU connection today. What would you guys consider a realistic bus width? Xenos has a 512-bit bus between its parent and daughter die. So I'd think anywhere between a 64-bit bus (as in Hollywood) and that is a possibility. Also noticed these quotes in the Wii U console Iwata Asks:

It definitely sounds to me that they sped up the connection between the GPU and CPU (so greater than the 64-bit connection of GCN/Wii) and this extra speed is coming from more than just the increased clocks of the 2 LSIs. Given what we know of the CPU, what would be an appropriate data rate in order to make accessing the GPU's 32 MB eDRAM actually useful? For example, if we are to take wsippel's proposed GPU/eDRAM clock of 486 Mhz and suppose a 256-bit bus, we'd get something like 15.5 GB/s data rate. Double the bus width to Xenos' 512 and we've got a decent ~31 GB/s. Would that last figure be feasible? Would it be overkill considering we're talking a modestly clocked CPU with no SMT and no VMX?

I would expect that the benefit of the connection between the CPU and eDRAM would be more a matter of latency than bandwidth, particularly when you're looking at sharing compute loads between CPU and GPU with the eDRAM as a scratchpad.
 
Where did you get the 512bit bus from GPU to Daughter die?


An X360 white paper indicates it to be a 64-bit bus. (Edit: strangely, many other sources indicate the bandwidth to be 32GB/s.)
http://www.cis.upenn.edu/~milom/cis501-Fall08/papers/xbox-system.pdf


I do agree that on the Wii U the bus between CPU & GPU can be quite fast, which could mean a nice opportunity for CPU/GPU interaction.

Hmmm, looking at that white paper provided some additional info I was unaware of. I got the 512-bit by looking at a Beyond3D thread, but I think they incorrectly assumed that the internal bus was running at 500 Mhz. Since it's running at 1.8 Ghz apparently, it means there is a mere 128-bit connection there, correct? So perhaps that is a more realistic upper boundary for the Wii U MCM.
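For reference, the back-calculation from the white paper's figures goes like this (assuming one transfer per cycle on that link):

Code:
# Infer the bus width from the quoted 28.8 GB/s at 1.8 GHz (one transfer/cycle assumed).
bandwidth_gb_s = 28.8
clock_ghz = 1.8

bus_bytes = bandwidth_gb_s / clock_ghz   # 16 bytes per cycle
print(bus_bytes * 8)                     # -> 128-bit interface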

When looking at Xenos specs there's some confusion between the interface between the two dies and the internal interface within the daughter die between the integrated ROPs and eDRAM.

Not in this case. See above.

I would expect that the benefit of the connection between the CPU and eDRAM would be more a matter of latency than bandwidth, particularly when you're looking at sharing compute loads between CPU and GPU with the eDRAM as a scratchpad.

You are probably correct that latency is the more important issue here. Still, more bandwidth couldn't hurt, right?
 

Thraktor

Member
A few calculations on the GPU:

From the Anandtech teardown, the Wii U's GPU die is 156.21mm². We can assume, although it must be stressed that it's only an assumption, that the GPU is manufactured on a 40nm process. The die likely contains a GPU derived from AMD's R700 series, 32MB of Renesas eDRAM, a memory controller, a pair of ARM cores and a DSP.

First, the eDRAM. Assuming a 40nm process, the eDRAM is going to be Renesas UX8LD, which comes in three configurations:

64Mb/256bit -> 1024bit interface for 32MB -> 51.2GB/s to 102.4GB/s
8Mb/256bit -> 8192bit interface for 32MB -> 409.6GB/s to 819.2GB/s
8Mb/128bit -> 4096bit interface for 32MB -> 204.8GB/s to 409.6GB/s

(The bandwidth ranges are based on a clock range of 400MHz to 800MHz)

Bandwidth of 400-800GB/s would certainly be massive overkill (as a comparison, the highest bandwidth on any currently available consumer GPU is the Radeon 7970's 288GB/s, and that's targeting much higher resolutions than the Wii U's 720p standard). The 1024bit interface at a high clock or the 4096bit at a low clock are probably the most likely, and the clock is likely to be either equal to, or a clean multiple of, the GPU's clock. For reference, the on-die interface between Xenos' ROPs and eDRAM is 4096bit at 500MHz for 256GB/s of bandwidth.
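The bandwidth ranges above are just width times clock; a quick sketch of the three configurations (macro widths as listed, clocks assumed to span 400-800MHz):

Code:
# eDRAM bandwidth for each macro configuration: total interface width / 8 * clock.
configs = {
    "64Mb/256bit (1024-bit total)": 1024,
    "8Mb/256bit (8192-bit total)":  8192,
    "8Mb/128bit (4096-bit total)":  4096,
}
for name, width_bits in configs.items():
    low  = width_bits / 8 * 400e6 / 1e9   # GB/s at 400MHz
    high = width_bits / 8 * 800e6 / 1e9   # GB/s at 800MHz
    print(f"{name}: {low:.1f} - {high:.1f} GB/s")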

The cell size of UX8LD is 0.06 square microns (µm²), which, if my maths hasn't failed me, means a total of 16.1mm² for 32MB. This leaves ~140.11mm² for the rest of the die.

Onto the ARM core(s). Renesas offers the following ARM cores on its 40nm process:

ARM9, ARM11, ARM11MP core, Cortex R4, A5, A8, A9

We can rule out ARM9 and ARM11, as they're single-core architectures. I also feel we can probably rule out the A8 and A9, as they're targeted at higher performance applications than the security/IO co-processor role they're likely to serve in the Wii U. Of the remaining three, I feel the Cortex A5 is the most likely bet. Why? It was revealed in 2009 (when development of the Wii U hardware was beginning) and is, apparently "the smallest and lowest power ARM multicore processor". It's also used in a very similar role (as a security co-processor) in AMD's 2013 APUs, which indicates its suitability. How big exactly are Cortex A5 cores?

Produced on a 40nm process, each Cortex-A5 occupies an area of only 0.9mm² (including the 64KB L1 cache)

(Source)

That comes to just 1.8mm² for the two cores, leaving us with ~138.31mm² left for the rest of the die.

On the DSP front, it gets a bit trickier. The DSP used in the Gamecube and Wii was designed by Macronix, who have since spun off their DSP design business to a company called Modiotek. Modiotek's current product line seems ill-suited to what Nintendo are looking for, though, as they're targeting digital answering machines and low-cost phones. Nintendo seem to agree, as for the 3DS DSP they instead went to a company called CEVA, whose TeakLite line is more suitable for gaming devices. The 3DS's DSP is, according to wsippel*, a modified TeakLite I, and a modified TeakLite III or IV seems a sensible choice for the Wii U. The TeakLite IV is probably the more interesting one, as CEVA specifically refers to it a number of times as being suitable for games consoles. It was only announced earlier this year, though it's possible Nintendo could have had early access to the design. Otherwise the TeakLite III seems feasible. One issue with the TeakLite family is the reported clockspeed of 121.5MHz. Both the TeakLite III and IV can hit 1GHz+, so it seems odd to have it running so low (much lower than the GPU it's embedded in, in fact).

Another possibility is that the DSP is ARM-based, as there are DSP extensions to the ARM architecture, including this particular one from NXP which Fourth Storm posted about a while back*, which is based on the Cortex-M3, and happens to run at 120MHz. Nintendo had the opportunity of going with an ARM-based DSP in the 3DS, though, and decided against it in favour of a dedicated architecture, so it would seem odd for them to take the opposite decision with the Wii U. It's also possible that Nintendo went for one of Renesas's DSP cores, in particular the SH3-DSP, but once again this is a CPU architecture repurposed as a DSP, so doesn't seem consistent with Nintendo's previous decisions. There are of course other dedicated DSP designers than CEVA, but it'd be a stab in the dark trying to pick the one which Nintendo would have gone with.

Let's just assume, for the moment, that Nintendo have gone with a CEVA TeakLite III DSP, and have kept a redundant GC/Wii Macronix DSP on there for BC purposes. The TeakLite III core is 0.47mm² on a 65nm process. I'd say it's fair to assume that, on 40nm, the TeakLite III + Macronix DSP shouldn't be larger than 1mm² combined, or at least any other DSP they might choose should be around that ballpark. This would bring our total remaining die to ~137.31mm², which would account for the GPU and memory controller.

Here's the thing. From very early on in the speculation threads, when we heard that the GPU was based on the R700 line, the RV740 was identified as the most likely candidate for a base for the chip. It's designed for a 40nm manufacturing process, it fits the reported 640 shader count, and (clocked down) it fits our performance expectations. It also happens that the RV740 die (which includes the memory controller) is exactly 137mm². Now, of course I've made a number of assumptions in my calculations, and of course any modifications Nintendo would have made to the RV740 would be quite unlikely to leave it at the exact same size, but it's still astonishing how closely the GPU's die size corresponds to what we'd expect from something based on the RV740, and at this point I'd be very surprised if it were anything but.

*It's always fun how often I come across GAF posts while researching these things.
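For clarity, here's the running die-area budget from the post above as a single calculation (every input is one of the stated assumptions, so treat the result as a plausibility check rather than a measurement):

Code:
# Die-area budget for the Wii U GPU die, using the assumptions stated above.
die_mm2   = 156.21                            # Anandtech teardown figure
edram_mm2 = (32 * 2**20 * 8) * 0.06 / 1e6     # 32MB at 0.06 um^2/cell -> ~16.1 mm^2 (no macro overhead)
arm_mm2   = 2 * 0.9                           # two Cortex-A5 cores (assumed)
dsp_mm2   = 1.0                               # TeakLite III + legacy DSP allowance (assumed)

remaining = die_mm2 - edram_mm2 - arm_mm2 - dsp_mm2
print(round(edram_mm2, 1), round(remaining, 2))   # 16.1, ~137.3 mm^2 vs. the RV740's 137 mm^2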
 

Argyle

Member
2. More often than not, games treat their multiple cores as domain workers - e.g. core 0 does the physics, core 1 does the draw calls, core 2 does the game logic, core 3 does the sound (blessed be those DSP-less systems), etc. Now, of those domains, not all workloads tend to have equal memory access patterns - some can do with more streaming-style access patterns (i.e. easier to handle with smaller caches), whereas others can be a cache's worst nightmare. By making things absolutely symmetrical you'd give some of those domains more cache than they'd actually need, while at the same time depriving other domains of the cache they could make good use of.

Is this really a common strategy in a modern game engine, though? I am sure some engines still operate this way (UE3 comes to mind with its two threads, unless it has been changed in the last few years), but I guess my impression was that things were moving towards things like job queues with any available worker thread/core picking up the next job that is ready to run.
 

AlStrong

Member
Hmmm, looking at that white paper provided some additional info I was unaware of. I got the 512-bit by looking at a Beyond3D thread

Not sure which thread (there were lots), but the only large bus is between the eDRAM and the ROPs (2048-bit @500MHz, read + write)

The cell size of UX8LD is 0.06 square microns (µm²), which, if my maths hasn't failed me, means a total of 16.1mm² for 32MB. This leaves ~140.11mm² for the rest of the die.

DRAM is built in groups/macros, which entail overhead for wiring and such.

For example, TSMC reported 0.0583 um^2 cell size with 0.145mm^2 per Mbit ( using 32k x 128-bit macro i.e. 4Mbit) with their process (~2.5x size for this case) to create a 16MB (128Mbit) eDRAM on 40nm (should be a 2048-bit connection, read/write, in this example). There could be a fair bit more overhead depending on how they wire up the cells to various parts of the GPU and/or a separate bus for CPU (not unlike the PS2's configuration, 1024-bit read + 1024-bit write + 512-bit texture). Alternatively, they could divvy up such a wide-bus between the various aspects instead of adding more.


edit: multiporting/re-using same ports is possible of course, though still larger cells.
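To put numbers on how much the macro overhead matters relative to raw cell size, here's the TSMC example worked through (these are the figures quoted above for TSMC's 40nm process, not Renesas numbers):

Code:
# Raw cell area per Mbit vs. TSMC's reported macro density on 40nm.
cell_um2_per_bit   = 0.0583
raw_mm2_per_mbit   = cell_um2_per_bit * 2**20 / 1e6   # ~0.061 mm^2/Mbit, cells only
macro_mm2_per_mbit = 0.145                            # reported, including wiring and overhead

print(round(macro_mm2_per_mbit / raw_mm2_per_mbit, 2))   # ~2.4x overhead factor
print(macro_mm2_per_mbit * 128)                          # ~18.6 mm^2 for 16MB (128Mbit)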
----

Btw, I guess no one saw my earlier post (or cared ;_;) in the other thread, but TSMC apparently bought Renesas Yamagata fab sometime this year. :p
 

Datschge

Member
Another possibility is that the DSP is ARM-based, as there are DSP extensions to the ARM architecture, including this particular one from NXP which Fourth Storm posted about a while back*, which is based on the Cortex-M3, and happens to run at 120MHz. Nintendo had the opportunity of going with an ARM-based DSP in the 3DS, though, and decided against it in favour of a dedicated architecture, so it would seem odd for them to take the opposite decision with the Wii U. It's also possible that Nintendo went for one of Renesas's DSP cores, in particular the SH3-DSP, but once again this is a CPU architecture repurposed as a DSP, so doesn't seem consistent with Nintendo's previous decisions. There are of course other dedicated DSP designers than CEVA, but it'd be a stab in the dark trying to pick the one which Nintendo would have gone with.

Could such an ARM-based DSP be flexible enough to be reprogrammed to behave like the Macronix DSP in Wii BC mode using specially adapted firmware? If not, having both CEVA TeakLite III/IV and Macronix DSPs at once is more likely, but also kind of a waste.
 

AzaK

Member
The die likely contains a GPU derived from AMD's R700 series, 32MB of Renesas eDRAM, a memory controller, a pair of ARM cores and a DSP.

Would the ARM cores have to be on the same die, or could they be in that little unknown die that we saw sitting on the MCM?

Edit: Oh, and thanks for the thoughts. Interesting read.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Is this really a common strategy in a modern game engine, though? I am sure some engines still operate this way (UE3 comes to mind with its two threads, unless it has been changed in the last few years), but I guess my impression was that things were moving towards things like job queues with any available worker thread/core picking up the next job that is ready to run.
Job queues and job domain differentiation are not mutually exclusive concepts. Actually, I'd say they're orthogonal. You can have different queues for different job domains, each queue handled by a best-fit (set of) core(s). For the jobs ending up on the same set of cores, you want them to have a certain level of access-pattern uniformity, i.e. that the jobs share the same locality patterns (read: cache needs), and for those landing on the exact same core - that they share the actual localities to some degree, so you'd reuse your cache content rather than thrash it. A deviation from the rule is Cell-like architectures, where the perfectly-identical cores (read: SPUs) don't really have cache. Instead, they have scratchpads, with which the SPUs effectively delegate the caching problem to the developer - they can reset/reload the scratchpad with each new job, or try to achieve some effective caching - it's their problem.

Answering my own question.

Yes, the bus is bidirectional, so the theoretical max sustained read bandwidth for the Wii U would be the same as the total theoretical bandwidth. And higher than the Xbox 360's read bandwidth if it does turn out to be ~12GB/s.

So one of the secrets to getting the maximum bandwidth out of the Wii U GPU would be to avoid write operations to the main memory as much as possible. On the 360, render to texture requires rendering to the 10MB eDRAM, and then transferring that texture to the shared RAM. On the Wii U you'd want to keep everything on-package within the eDRAM to avoid having to use some of your bandwidth for writing.

You'll also optimize for maximum bandwidth utilization the opposite way from the 360. On the 360, any time you're writing but not reading, or reading but not writing you're wasting potential bandwidth. On the Wii U you'll want to do the opposite. Try not to do any writing while reading, and group your writes together in bursts to avoid the penalty from switching the bus direction.
Duh, completely forgot about the direction split of the 360 RAM bus. Thank you for the heads-up. And yes, your conclusion re the use-case differences is perfectly sound.
 

Argyle

Member
Job queues and job domain differentiation are not mutually exclusive concepts. Actually, I'd say they're orthogonal. You can have different queues for different job domains, each queue handled by a best-fit (set of) core(s). For the jobs ending up on the same set of cores, you want them to have a certain level of access-pattern uniformity, i.e. that the jobs share the same locality patterns (read: cache needs), and for those landing on the exact same core - that they share the actual localities to some degree, so you'd reuse your cache content rather than thrash it. A deviation from the rule is Cell-like architectures, where the perfectly-identical cores (read: SPUs) don't really have cache. Instead, they have scratchpads, with which the SPUs effectively delegate the caching problem to the developer - they can reset/reload the scratchpad with each new job, or try to achieve some effective caching - it's their problem.

IMHO I think the bigger problem for most engines is simply keeping all the cores busy as much as possible, more than worrying about which jobs are going to stay parked on which cores to avoid thrashing the cache (at the cost of letting cores go idle).

I would expect any half decently optimized multiplatform engine to have already SPUified a lot of the work if it is running on PS3 anyway. I don't know much about the Wii U but apparently there is a chunk of edram that is shared between the GPU and CPU? If I were porting to Wii U from a multiplatform engine, I think I would try to allocate 256kb blocks per worker core in the edram and attempt to reuse as much of my PS3 job code as I can, to try to avoid hitting the main memory as much as possible (only reading/writing explicitly to main ram from edram the way you would DMA in and out of SPU ram). Even so, I would expect performance to be extremely lackluster vs. the PS3 if the Wii U is clocked at around half the speed.

I guess I still don't see the point of the larger cache on one core, does this mean the L2 cache is not shared between cores?
 

Matt

Member
I did some numerology again:

I was told the DSP would be running at 120MHz. Looking at Nintendo's MO, it's probably not really 120MHz, but 121.5MHz - same base clock as the Wii. Nintendo likes clean multipliers, so I would assume the RAM to be clocked at 729MHz (6 x 121.5). Same as the Wii CPU. Nintendo likes to keep RAM and CPU in sync, so the CPU should be running at 1458MHz (12 x 121.5). Accordingly, the GPU would be clocked at 486MHz (4 x 121.5), and the eDRAM at either 486 or 729MHz.

I don't know why Nintendo always seems to do this. I guess using a single fixed base clock and only changing multipliers for various components is simpler. And it definitely gives more predictable results. I don't see Nintendo giving that up.

I wouldn't get too into the multiplier idea.
 

Osiris

I permanently banned my 6 year old daughter from using the PS4 for mistakenly sending grief reports as it's too hard to watch or talk to her
I saw that pre-edit, you tease you ;)
 
I really wish we were getting some decent information leaking out from developers. What's wrong out there? Don't they burn to share? What the hell does the edram do?
 

Panajev2001a

GAF's Pleasant Genius
Ok, I think it's about time we tried to put all known Wii U specification things into its own thread and try to have a civil discussion.

Hard facts (either publicly disclosed, or a non-public leak which can be vouched by somebody trustworthy on this very forum):
  • MCM design: GPU+eDRAM die and CPU die on the same substrate.
  • 2 GB of gDDR3 memory @800MHz (DDR3-1600), organized in 4x 4Gb (256Mx16) modules, sitting on a 64bit bus (@800MHz). That gives a net BW of 12800MB/s (12.5GB/s). We can conveniently refer to this pool as 'MEM2'. Currently 1GB of that pool is reserved for the OS.
  • 32 MB of unknown organisation, unknown specs eDRAM, sitting with the GPU. We can conveniently refer to this pool as 'MEM1'
  • Tri-core CPU, binary compatible with Gekko/Broadway, featuring 3MB of cache in asymmetric config: 2x 512KB, 1x 2048KB; so far several things indicate CPU cache is implemented via eDRAM itself. Unknown clock, unknown architecture enhancements (e.g. SIMD, etc).
  • AMD R700-originating GPU (R700 is AMD architecture codename 'Wekiva'), evolved into its own architecture (AMD architecture codename 'Mario'), relying on MEM1 for framebuffer purposes, but also for local render targets and scratch-pad purposes.
  • Memory access specifics: both MEM1 and MEM2 are read/write accessible by the CPU, both subject to caching. GPU in its turn also has access to both pools, and is likely serving as the north bridge in the system (an educated guess, subject to calling out).
  • System is equipped with extra co-processors in the shape of a dual-core ARM (unknown architecture) and a DSP core (again of unknown architecture), primarily for sound workloads.
  • BluRay-based optical drive, 22.5MB/s, 25GB media.

Immediate logical implications from the above (i.e. implications not requiring large leaps of logic):
  • Not all WiiU CPU cores are equal - one of them is meant to do things the other two are not. Whether that is related to BC, OS tasks, or both, is unclear.
  • When it comes to non-local GPU assets (read: mainly textures), WiiU's main RAM BW increase over Nintendo's most advanced SD platform (i.e. 5.48GB/s -> 12.5GB/s) hints that WiiU is mainly targeted at a 2x-3x resolution increase over the Wii, or IOW, 480p -> 720p.
  • The shared access to the MEM1 pool by the GPU and CPU alike indicates the two units are meant to interact at low latency, not normally seen in previous console generations. Definitely a subject for interesting debates this one is.

Thanks for the summary, it puts the Wii U under an interesting lens. I am intrigued by the fact that MEM1 is not just for framebuffer purposes and that local render targets and other operations can be performed without having to spend time and bandwidth resolving out to main RAM. What you are saying there implies that the front display buffer resides in MEM2 though, correct (like on GCN and Wii)? The CPU and GPU both being able to read and write data at high speed and low latency in MEM1 could open very interesting possibilities, unless there is some weird bottleneck.

When the FlexIO specs were revealed, many coders were salivating at the prospect of the CPU and GPU talking over a fast two-way bus with over 20 GB/s of bandwidth each way... then it was revealed that the CPU could not read or write anywhere near that speed, only the GPU could... you seem to be saying that this is not the case on Wii U, and knowing Nintendo's history it should not be the case.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
IMHO I think the bigger problem for most engines is simply keeping all the cores busy as much as possible, more than worrying about which jobs are going to stay parked on which cores to avoid thrashing the cache (at the cost of letting cores go idle).
Cache thrashing is equivalent to core idling. Particularly on cores where a context switch is not such a heavy op, the wait to re-populate the caches with the relevant data can be anything but cheap. Gone are the times when ALU prowess would win the day - sustained CPU throughput today is primarily about data access patterns, with ALU coming second.

I would expect any half decently optimized multiplatform engine to have already SPUified a lot of the work if it is running on PS3 anyway. I don't know much about the Wii U but apparently there is a chunk of edram that is shared between the GPU and CPU? If I were porting to Wii U from a multiplatform engine, I think I would try to allocate 256kb blocks per worker core in the edram and attempt to reuse as much of my PS3 job code as I can, to try to avoid hitting the main memory as much as possible (only reading/writing explicitly to main ram from edram the way you would DMA in and out of SPU ram). Even so, I would expect performance to be extremely lackluster vs. the PS3 if the Wii U is clocked at around half the speed.
Avoiding hitting main mem is definitely a sound priority, but by going ps3-style on the WiiU you would under-utilize the platform - as you note, WiiU cores are not SPUs - the latter have much less local memory and much higher ALU rates (not to mention I don't expect U-CPU's conduit to the GPU's eDRAM to hit SPU scratchpad levels of performance in either BW or latency). IMO WiiU won't be the ideal port buddy for ps3-tailored pipelines. But then again, I don't see many platforms that would be good ps3 port buddies (yes, I'm aware of the Frostbite pipeline; no, I don't expect it to run equally well on the ps3 and 360 without a good dose of platform specialization).

I guess I still don't see the point of the larger cache on one core, does this mean the L2 cache is not shared between cores?
No, cores have their own L2 caches. You might be able to effectively use GPU's eDRAM as L3 shared cache scratchpad, which I'd argue would be a better approach than playing SPU-style isolated scratchpads in the same pool.
 

ozfunghi

Member
I wouldn't get too into the multiplier idea.

Matt, when you said the GPU clock was "a little slower" than 600Mhz... was that to be taken literally or not? Many seem to think you really meant "little" when you said "little" and are hoping for it to be somewhere between 550-599 Mhz for instance. While it could be interpreted as "it's going to be slower, guys" which could mean anything as low as 400 Mhz.
 

Thraktor

Member
DRAM is built in groups/macros, which entail overhead for wiring and such.

For example, TSMC reported 0.0583 um^2 cell size with 0.145mm^2 per Mbit ( using 32k x 128-bit macro i.e. 4Mbit) with their process (~2.5x size for this case) to create a 16MB (128Mbit) eDRAM on 40nm (should be a 4096-bit connection in this example). There could be a fair bit more overhead depending on how they wire up the cells to various parts of the GPU and/or a separate bus for CPU (not unlike the PS2's configuration, 1024-bit read + 1024-bit write + 512-bit texture). Alternatively, they could divvy up such a wide-bus between the various aspects instead of adding more.

Yeah, I didn't really know what, if any, overhead was required. It would certainly affect the calculations, although it's hard to say by how much without numbers from Renesas. I would imagine the 64Mb/256bit macros would maximise density/minimise overhead.

Btw, I guess no one saw my earlier post (or cared ;_;) in the other thread, but TSMC apparently bought Renesas Yamagata fab sometime this year. :p

Apparently that was just a rumour.

Could such an ARM-based DSP be flexible enough to be reprogrammed to behave like the Macronix DSP in Wii BC mode using specially adapted firmware? If not, having both CEVA TeakLite III/IV and Macronix DSPs at once is more likely, but also kind of a waste.

The issue here is that a DSP is effectively a highly specialised CPU. It has instructions, registers, ALUs, etc, it's just that the whole thing's put together in a way which optimises signal processing. As such, for another chip to run Wii's DSP code, it either needs to be binary compatible with that DSP (ie it must have the same instruction set as the DSP, or a superset thereof) in which case it effectively is a Macronix DSP, or it has to emulate the instruction set. If the latter (which is possible), then it makes more sense to run that emulation on a core of the PPC CPU.

Would the ARM cores have to be on the same die, or could they be in that little unknown die that we saw sitting on the MCM?

Edit: Oh, and thanks for the thoughts. Interesting read.

The Wii's ARM chip was on-die with the GPU, so I was just assuming the same this time around. As far as the little die, I was assuming it was EEPROM or something, I seem to recall Wii having similar.

No, cores have their own L2 caches. You might be able to effectively use GPU's eDRAM as L3 shared cache scratchpad, which I'd argue would be a better approach than playing SPU-style isolated scratchpads in the same pool.

The cores have their own allocated caches, but I'm still of the opinion that it's quasi-shared, in the sense that a given core's cache can "overflow" into another core's as necessary, at the expense of latency. Given that IBM's eDRAM cache is always implemented as either shared or quasi-shared, I think it's a reasonable assumption.

Also, as we're giving up on clean multiples, may I point out that both the DDR3 and eDRAM have maximum clocks of 800MHz, and this might not be a co-incidence?

On the other hand, if there were two components I would sync the clocks of, it'd be the GPU and eDRAM, as when you're dealing with latencies in the order of a couple of cycles the effects of asynchronous clocks would proportionally be very high (and, perhaps more importantly, would be rather unpredictable).

Edit: Also, Blu, I just noticed that in the OP you wrote "so far several things indicate CPU cache is implemented via eDRAM itself". I think this can be pretty safely taken as fact. For one thing, IBM confirmed the use of eDRAM about a year and a half ago, and for another, 3MB of SRAM cache would literally take up the entire CPU die, leaving no space left for those cores we're so curious about.
 
The issue here is that a DSP is effectively a highly specialised CPU. It has instructions, registers, ALUs, etc, it's just that the whole thing's put together in a way which optimises signal processing. As such, for another chip to run Wii's DSP code, it either needs to be binary compatible with that DSP (ie it must have the same instruction set as the DSP, or a superset thereof) in which case it effectively is a Macronix DSP, or it has to emulate the instruction set. If the latter (which is possible), then it makes more sense to run that emulation on a core of the PPC CPU.

I believe the NXP chip was an interesting possibility, but BC does kind of throw a monkey wrench into that idea. Could Nintendo have just bought the IP from Macronix at some point and be free to make their own changes?



Thraktor said:
The Wii's ARM chip was on-die with the GPU, so I was just assuming the same this time around. As far as the little die, I was assuming it was EEPROM or something, I seem to recall Wii having similar.

We seem to be thinking very much alike lately! ;)


Thraktor said:
Also, as we're giving up on clean multiples, may I point out that both the DDR3 and eDRAM have maximum clocks of 800MHz, and this might not be a co-incidence?

On the other hand, if there were two components I would sync the clocks of, it'd be the GPU and eDRAM, as when you're dealing with latencies in the order of a couple of cycles the effects of asynchronous clocks would proportionally be very high (and, perhaps more importantly, would be rather unpredictable).

This definitely leaves some more possibilities open. It did somewhat irk me that if Nintendo stuck to clean multipliers, it would seem to arbitrarily keep some clocks extra low (RAM and CPU mainly), and only because you're working with the least important part of your system (the DSP) as a base. I disagree with your assessment that there are 640 ALUs on the GPU (I think half that is more likely), but when I lowered my expectations in that regard, I figured they'd be able to crank the clock slightly higher (close to 600 Mhz). Just to grab a number out of the air, 575 Mhz with 320 SPUs would get you to 368 GFLOPS, just about 1.5x Xenos.

The system RAM, I'd run as high as possible (800 Mhz) and the CPU twice that for a clean 1.6 Ghz. As for the eDRAM, I'd still be surprised if it wasn't running at the same speed as the GPU, but perhaps it's also running at 800 Mhz in order to match the bandwidth of the main memory pool (assuming there is a 128-bit connection from CPU to GPU). Just a huge guess here, but wouldn't it be better for the eDRAM to have to wait around for the GPU a few extra cycles (and it can even be serving the CPU in that time) than the other way around?
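The arithmetic behind that 368 GFLOPS figure, for reference (the 320-shader count and 575Mhz clock are the guesses above; Xenos is 240 ALUs at 500MHz):

Code:
# VLIW5-style ALUs do one multiply-add (2 FLOPs) per clock.
def gflops(alus, clock_mhz):
    return alus * 2 * clock_mhz / 1000

print(gflops(320, 575))                       # 368 GFLOPS
print(gflops(320, 575) / gflops(240, 500))    # ~1.53x Xenos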
 
I believe the NXP chip was an interesting possibility, but BC does kind of throw a monkey wrench into that idea. Could Nintendo have just bought the IP from Macronix at some point and be free to make their own changes?





We seem to be thinking very much alike lately! ;)




This definitely leaves some more possibilities open. It did somewhat irk me that if Nintendo stuck to clean multipliers, it would seem to arbitrarily keep some clocks extra low (RAM and CPU mainly), and only because you're working with the least important part of your system (the DSP) as a base. I disagree with your assessment that there are 640 ALUs on the GPU (I think half that is more likely), but when I lowered my expectations in that regard, I figured they'd be able to crank the clock slightly higher (close to 600 Mhz). Just to grab a number out of the air, 575 Mhz with 320 SPUs would get you to 368 GFLOPS, just about 1.5x Xenos.

The system RAM, I'd run as high as possible (800 Mhz) and the CPU twice that for a clean 1.6 Ghz. As for the eDRAM, I'd still be surprised if it wasn't running at the same speed as the GPU, but perhaps it's also running at 800 Mhz in order to match the bandwidth of the main memory pool (assuming there is a 128-bit connection from CPU to GPU). Just a huge guess here, but wouldn't it be better for the eDRAM to have to wait around for the GPU a few extra cycles (and it can even be serving the CPU in that time) than the other way around?




Considering the die size, 320 seems really low, or a waste of space.
 

AzaK

Member
Duh, completely forgot about the direction split of the 360 RAM bus. Thank you for the heads-up. And yes, your conclusion re the use-case differences is perfectly sound.
Direction split? So it can't read at the full BW (20-odd GB/s)? Is the BW fixed for read and fixed for write at some portion of the total?


And Matt, Lherre, that wasn't fair to do to wsippel :)
 