
WiiU technical discussion (serious discussions welcome)

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
But GPGPU is almost by definition a brute force approach. It's all well and good to say that you can't achieve the peak efficiency quoted on the other console CPU's, but the fact is trying to do the same work on a GPU is way more wasteful. People only do it because the sheer number of calculations you can bring to bear on the problem is so much higher with so many ALUs on a GPU. But all the people out there using GPGPU for scientific research, supercomputers, bitcoin mining aren't stealing time from graphics work to do non-graphics calculations. Moving work from the CPU to the GPU may in some cases get the work done faster, but it is absurd to suggest it is a more efficient use of resources. This is especially true on older GPU architectures like the WiiU employs compared to recent architectures like GCN and Fermi that explicitly target higher GPGPU performance.
I'm not sure what you're saying here. Are you suggesting games should not use GPGPU per se?
 
I'm not sure what you're saying here. Are you suggesting games should not use GPGPU per se?

I'm saying there are good use cases for GPGPU in very specific and limited workloads, but you can't expect to offload everything from the CPU to the GPU, because in order to even hope to match the performance of the 360 and PS3 CPUs, you would have to reduce the actual graphics workload to almost nothing. In essence I'm saying what a number of us have been saying all along: GPGPU is not a magic solution to all the WiiU's shortcomings. It's not even a half measure solution.
 

On parallelizable number crunching tasks like matrix multiplication? Certainly. Though GPUs are faster for such tasks nowadays.
In any case, it is largely irrelevant for game code.

Certainly doesn't sound that bad and would indicate a quite decent throughput. (I would bet that it gets over 50 GFLOPS, at least in some cases.. ;)

Active SPU usage is nice and good but should not be confused with actual performance (it does not mean 65% of peak GFLOPs). Also, they probably only reach it because they use the SPUs to help out RSX quite a lot, which is relatively well suited for them.
 

Fredrik

Member
Interesting, it sounds like a PS3 situation, only inverted.

PS3 has a slow GPU but with time great devs like Naughty Dog and Guerrilla learned to use the CPU to fill in the gaps.
On WiiU the CPU is slow, so I guess in time great devs will learn to use the GPGPU to fill in some gaps.

Added bonus is that you have a dedicated chip doing OS and sound work, which will offload both the CPU and GPGPU and in turn fill in some more gaps.

Downside to all this is that it might take a few years before we see a boost in performance, especially in third party titles unless WiiU is the main platform right from the start. PS3 situation yet again. = Bad :/
 

pottuvoi

Banned
Active SPU usage is nice and good but should not be confused with actual performance (it does not mean 65% of peak GFLOPs). Also, they probably only reach it because they use the SPUs to help out RSX quite a lot, which is relatively well suited for them.
Not using it to help RSX in vertex tasks would be stupid. ;)

Also, SPUs are heavily used for non-graphics tasks in games and should get quite decent performance over what is possible with the PPU. (Which apparently is quite weak on PS3.)
 

hodgy100

Member
On parallelizable number crunching tasks like matrix multiplication? Certainly. Though GPUs are faster for such tasks nowadays.
In any case, it is largely irrelevant for game code.

But animations and movement of all objects use matrix multiplications!! If anything, it is deeply ingrained into game code!!!!
 

mrklaw

MrArseFace
I'm saying there are good use cases for GPGPU in very specific and limited workloads, but you can't expect to offload everything from the CPU to the GPU, because in order to even hope to match the performance of the 360 and PS3 CPUs, you would have to reduce the actual graphics workload to almost nothing. In essence I'm saying what a number of us have been saying all along: GPGPU is not a magic solution to all the WiiU's shortcomings. It's not even a half measure solution.

How many of those good use cases and limited workloads are good for games development? Isn't this similar to the argument that was had with the PS3 and Cell?

If enough of the tasks regularly undertaken in games are suited to the types of processes carried out faster on a GPGPU, then it is reasonable to trade CPU silicon for extra GPU capacity.
 
Not using it to help RSX in vertex tasks would be stupid. ;)

Agreed! I just wanted to say that it'd be more difficult to make full use of the SPUs for classical CPU game code.

Also, SPUs are heavily used for non-graphics tasks in games and should get quite decent performance over what is possible with the PPU. (Which apparently is quite weak on PS3.)

I'm not saying they are useless, just not as useful as it might seem on paper. From what I've heard it's often difficult to reach even Xenon levels.

But animations and movement of all objects use matrix multiplications!! If anything, it is deeply ingrained into game code!!!!

As far as I know this doesn't take too much of the total CPU time.


I'm no game programmer though and don't want to go out on a limb with all that. It's just what I heard from people that I think know what they are talking about (eg. an engine programmer from Crytek).
 

z0m3le

Banned
http://www.neogaf.com/forum/showpost.php?p=44966628&postcount=646 Just to bring up this post from a few pages back, I had a question...

If 1 Xenon core only outperformed Broadway by 20%, then could this be why Wii U's CPU cores were originally clocked at 1GHz (a clock speed increase of 37%)? Wouldn't that then put the Wii U's 3 cores at 1.24GHz with as good or better performance clock for clock than Broadway, quite a bit beyond Xenon (Broadway would match or beat a Xenon core at only 875MHz?), so more like 1.5x (I know we love those multiples) or 50% faster than Xenon, according to the emulator developer from the quoted post above?

Obviously certain tasks that are SIMD-heavy would still likely fall short on Wii U, but for general computation of game logic and whatnot, wouldn't Wii U's CPU cores be reasonably faster? Or was the developer missing some other key component of Xenon's power that was unlocked later on?
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
I'm saying there are good use cases for GPGPU in very specific and limited workloads, but you can't expect to offload everything from the CPU to the GPU, because in order to even hope to match the performance of the 360 and PS3 CPUs, you would have to reduce the actual graphics workload to almost nothing.
I'm assuming you're referring to 'offload everything VMX/SPE from the CPU to the GPU'. /disclaimer

Of course you can't offload everything from a CPU SIMD to the GPU. I don't think anybody ever suggested that. You can offload workloads depending on (a) latency requirements and (b) memory access patterns. Situations where you need to perform a small number of vector ops quickly so that some branchy game logic can decide what to do next will not translate well to GPGPU offloading, not without a major workflow redesign. But situations where the CPU was already doing massive-throughput SIMD in bulk, with no ultra-tight latency requirements, are good candidates for offloading. You can think of it this way: what workloads on the Cell warranted offloading to the SPUs, and what workloads were best handled on-board the PPE FPU/SIMD?
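To make it a bit more concrete, here's a trivial sketch of the two cases (made-up names, nothing from a real engine):

Code:
// Case (b): thousands of independent transforms, no tight latency requirement.
// This bulk, uniform SIMD work is the kind of thing that's a reasonable GPGPU
// candidate - hand the whole array to a compute kernel and pick up the results later.
#include <cstddef>

struct Vec3 { float x, y, z; };

void skin_positions_bulk(const Vec3* in, Vec3* out, std::size_t count,
                         const float m[12])          // 3x4 bone matrix, row-major
{
    for (std::size_t i = 0; i < count; ++i) {
        const Vec3 p = in[i];
        out[i].x = m[0]*p.x + m[1]*p.y + m[2]*p.z  + m[3];
        out[i].y = m[4]*p.x + m[5]*p.y + m[6]*p.z  + m[7];
        out[i].z = m[8]*p.x + m[9]*p.y + m[10]*p.z + m[11];
    }
}

// Case (a): a handful of vector ops whose result immediately feeds branchy game
// logic. Shipping this to the GPU and waiting for the answer costs far more than
// just doing it on the CPU.
bool target_in_range(const Vec3& a, const Vec3& b, float radius)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return (dx*dx + dy*dy + dz*dz) <= radius * radius;
}

int main()
{
    const float bone[12] = {1,0,0,0, 0,1,0,0, 0,0,1,0};   // identity 3x4
    Vec3 in[3] = {{1,2,3},{4,5,6},{7,8,9}}, out[3];
    skin_positions_bulk(in, out, 3, bone);
    return target_in_range(in[0], in[2], 12.0f) ? 0 : 1;
}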

As re the 'you would have to reduce the actual graphics workload to almost nothing' - you'd have to be more specific than that. What cases do you have in mind?

In essence I'm saying what a number of us have been saying all along: GPGPU is not a magic solution to all the WiiU's shortcomings. It's not even a half measure solution.
There are no magic solutions. In terms of efficiency, though, since you brought it up in your original post, allow me:

A single core from a Bloomfield@3.2GHz here produces 10 GFLOPS in a basic matmul synthetic test (same test I used for the Broadway/Ontario comparison earlier). In a similar matmul test (not even CL-based but GL-based, deduct points for rasterizer/TMU overhead) an 80-shader Evergreen@400MHz produces 13 GFLOPS. The Bloomfield (a Xeon W3565) is rated at 130W TDP, which equates to 32.5W per core. The Evergreen is part of a 9W APU (2x Bobcats + 80-shader Evergreen). While both tests are synthetic, the GPU one actually multiplies two matrix vectors worth of 4MB each, so there are memory access patterns involved. The CPU test multiplies 128 bytes worth of matrices. Even at such favorable conditions for the CPU test, the power-efficiency advantage of the GPU is overwhelming.

You can say that matmul is too rudimentary a workload, and is already largely delegated to GPUs nowadays, and yet gamecode still does tons of matmuls today, for stuff ranging from physics, to skeletal animations, to visibility culls.
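And just to spell out the perf/W arithmetic from those figures (note the 9W covers the whole APU, so the GPU's real share is smaller and its efficiency correspondingly better):

Code:
#include <cstdio>

int main()
{
    const double cpu_gflops = 10.0, cpu_watts = 130.0 / 4.0;  // one Bloomfield core of a 130W quad
    const double gpu_gflops = 13.0, gpu_watts = 9.0;          // charge the GPU for the entire 9W APU

    std::printf("CPU: %.2f GFLOPS/W\n", cpu_gflops / cpu_watts);               // ~0.31
    std::printf("GPU: %.2f GFLOPS/W (lower bound)\n", gpu_gflops / gpu_watts); // ~1.44, ~4.7x the CPU
}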
 

Durante

Member
I'm saying there are good use cases for GPGPU in very specific and limited workloads, but you can't expect to offload everything from the CPU to the GPU, because in order to even hope to match the performance of the 360 and PS3 CPUs, you would have to reduce the actual graphics workload to almost nothing. In essence I'm saying what a number of us have been saying all along: GPGPU is not a magic solution to all the WiiU's shortcomings. It's not even a half measure solution.
GPGPU is not a magic solution, but it is a solution that should be applicable in many (but not all) the circumstances where Cell's SPEs were used.

Which brings us back to my original point: when we compare Wii U's GPU to 360 and in particular PS3, do we need to account for the 100+ "missing" CPU GFlops? If we do refer to GPGPU in the CPU comparison (which I think most people agree on), then I think we have to. Otherwise it's a matter of having your cake and eating it too.
 

z0m3le

Banned
GPGPU is not a magic solution, but it is a solution that should be applicable in many (but not all) the circumstances where Cell's SPEs were used.

Which brings us back to my original point: when we compare Wii U's GPU to 360 and in particular PS3, do we need to account for the 100+ "missing" CPU GFlops? If we do refer to GPGPU in the CPU comparison (which I think most people agree on), then I think we have to. Otherwise it's a matter of having your cake and eating it too.

http://www.youtube.com/watch?v=xa5aHGZnGC0

^ That link is just a benchmark for AMD's Athlon II X4 640 CPU (a quad core); it hits 37 GFLOPS, and oddly enough, it has no problem playing 360 ports.

In fact, Crysis 3 requires a 2.8GHz dual-core processor. I couldn't quickly find the GFLOPS for such a CPU, but I did find a figure for this 3.2GHz Athlon II X2 260: 23.5 GFLOPS.

Since this can run Crysis 3, I see no reason that the Wii U would have to make up for Xenon's theoretical numbers.
 
There are no magic solutions. In terms of efficiency, though, since you brought it up in your original post, allow me:

A single core from a Bloomfield@3.2GHz here produces 10 GFLOPS in a basic matmul synthetic test (same test I used for the Broadway/Ontario comparison earlier). In a similar matmul test (not even CL-based but GL-based, deduct points for rasterizer/TMU overhead) an 80-shader Evergreen@400MHz produces 13 GFLOPS. The Bloomfield (a Xeon W3565) is rated at 130W TDP, which equates to 32.5W per core. The Evergreen is part of a 9W APU (2x Bobcats + 80-shader Evergreen). While both tests are synthetic, the GPU one actually multiplies two matrix vectors worth of 4MB each, so there are memory access patterns involved. The CPU test multiplies 128 bytes worth of matrices. Even at such favorable conditions for the CPU test, the power-efficiency advantage of the GPU is overwhelming.

You can say that matmul is too rudimentary a workload, and is already largely delegated to GPUs nowadays, and yet gamecode still does tons of matmuls today, for stuff ranging from physics, to skeletal animations, to visibility culls.

OK, but that 80 shader Evergreen's theoretically capable of around 64 GFLOPs at that frequency, right? That Bloomfield core has a theoretical maximum of, what, 18GFLOPs? So it takes more than 3 times as many theoretical FLOPs on the GPU to be only 30% faster than the CPU.

GPGPU is not a magic solution, but it is a solution that should be applicable in many (but not all) the circumstances where Cell's SPEs were used.

Which brings us back to my original point: when we compare Wii U's GPU to 360 and in particular PS3, do we need to account for the 100+ "missing" CPU GFlops? If we do refer to GPGPU in the CPU comparison (which I think most people agree on), then I think we have to. Otherwise it's a matter of having your cake and eating it too.

Well, the quick and dirty math above would suggest you'd basically need a 300 GFLOP GPU to make up for a missing 100+ GFLOPs of CPU performance. That would leave, optimistically, about 100GFLOPs for the WiiU's GPU to use for actual graphics rendering.
 

z0m3le

Banned
OK, but that 80 shader Evergreen's theoretically capable of around 64 GFLOPs at that frequency, right? That Bloomfield core has a theoretical maximum of, what, 18GFLOPs? So it takes more than 3 times as many theoretical FLOPs on the GPU to be only 30% faster than the CPU.



Well, the quick and dirty math above would suggest you'd basically need a 300 GFLOP GPU to make up for a missing 100+ GFLOPs of CPU performance. That would leave, optimistically, about 100GFLOPs for the WiiU's GPU to use for actual graphics rendering.

Interesting, so what about my post above yours? PCs use CPUs that don't push Xenon FLOPS numbers, but they clear those ports with flying colors... What gives?
 

Oblivion

Fetishing muscular manly men in skintight hosery
Just to be clear, the Wii-U GPGPU is just simply the GPU, right? It's not a separate chip distinct from the CPU and GPU, right?
 

Oblivion

Fetishing muscular manly men in skintight hosery
Yup.
It's the same GPU that people are betting to be a <500 GFLOP chip.

Lame to the max.


Also, I finished reading that DF comparison article on Black Ops, and I found it fascinating how they said that while the Wii-U wasn't in any way superior to the 360 version, it matched and in some cases exceeded certain areas when compared to the PS3 version. Found that pretty weird considering everyone said Cell was a beast.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
OK, but that 80 shader Evergreen's theoretically capable of around 64 GFLOPs at that frequency, right?
Correct.

That Bloomfield core has a theoretical maximum of, what, 18GFLOPs? So it takes more than 3 times as many theoretical FLOPs on the GPU to be only 30% faster than the CPU.
I don't know what the theoretical Bloomfield MADD performance is. But you missed one detail in the comparison: while the CPU test is approaching the ideal conditions for the CPU, the GPU test is not approaching the ideal conditions for the GPU. As I said, the GPU test is not even CL-based - there's no use of local data shares, everything goes exclusively via the TMUs, and the rasterizer likely uses some FLOPs as well. One cannot take the test results as the upper bounds of the GPU performance at matmul, and ergo, cannot make estimates of the CPU/GPU hypothetical performance ratios the way you do. I brought it up to show that despite any test condition disparities for the GPU side, it still trounces the CPU in things like power efficiency.

Well, the quick and dirty math above would suggest you'd basically need a 300 GFLOP GPU to make up for a missing 100+ GFLOPs of CPU performance. That would leave, optimistically, about 100GFLOPs for the WiiU's GPU to use for actual graphics rendering.
The quick and dirty math above was made on the wrong foot ; )
 

ozfunghi

Member
Which brings us back to my original point: when we compare Wii U's GPU to 360 and in particular PS3, do we need to account for the 100+ "missing" CPU GFlops? If we do refer to GPGPU in the CPU comparison (which I think most people agree on), then I think we have to. Otherwise it's a matter of having your cake and eating it too.

So... do we believe every or most game(s) currently available for 360 has the CPU maxed out? In a direct comparison with 360, we are seeing launch ports look as good on WiiU, but generally running worse (framerate). Could this be because the WiiU version was a launch port (with all its implications) while the 360 version was the lead platform that developers had been working on for over 7 years? Or is the only possible explanation that the CPU is too weak to handle it? I very much doubt that. And even if that were the case, why assume WiiU would need to sacrifice an entire 100 GFLOPS of GPU performance in order to bridge the gap?

So in how many cases would the WiiU really need to rely on GPGPU functions just to run as well as the 360 version, to the point that it couldn't also look a fair bit better? 20%? 5%?
 

pottuvoi

Banned
Also, I finished reading that DF comparison article on Black Ops, and I found it fascinating how they said that while the Wii-U wasn't in any way superior to the 360 version, it matched and in some cases exceeded certain areas when compared to the PS3 version. Found that pretty weird considering everyone said Cell was a beast.
If RSX/GDDR3 bandwidth is the limiting factor in actual ROP/shader operations, there is nothing that Cell can do to help (i.e. transparent surfaces).
 

Durante

Member
Isn't calling something "a gpgpu" incorrect anyway? Shouldn't it just be a 'GPU with GPGPU functionality'?
YES! I've more or less given up arguing this point, since everyone uses it these days, but it really used to annoy me. Maybe things like a modern Tesla that doesn't even have a graphics output should be called a "GPGPU", but certainly not normal GPUs that are occasionally used to perform some general purpose computations.


Also, I finished reading that DF comparison article on Black Ops, and I found it fascinating how they said that while the Wii-U wasn't in any way superior to the 360 version, it matched and in some cases exceeded certain areas when compared to the PS3 version. Found that pretty weird considering everyone said Cell was a beast.
Call of Duty does not seem like a particularly CPU-intensive franchise.
 
Call of Duty does not seem like a particularly CPU-intensive franchise.

It's moreso than some, but not much (hell, Broadway could run it half-decently). I say, not by the next IW installment but the next Treyarch installment, we should see performance improve.
 

Thraktor

Member
I think docking 50-100Gflops for "GPGPU" functionality is fair enough, particularly when comparing to PS3, as the SPE tasks are the type of things most likely to be offloaded to the GPU on Wii U. That said, from my reading on SPE usage by studios like Guerrilla and Naughty Dog, it seems most of it's dedicated to tasks which would normally fall under plain old GPU functionality, particularly lighting.

I suppose I'd be interested to get a quote from a developer saying "we're using about X% of our GPU power for physics, etc.", to give us a better idea of the real-world efficiency for these things.
 

ozfunghi

Member
A question for Blu, Alstrong etc... considering the size of the gpu and taking into account eDRAM size, power consumption etc... what range is likely for the SPU/ALU amount?
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
A question for Blu, Alstrong etc... considering the size of the gpu and taking into account eDRAM size, power consumption etc... what range is likely for the SPU/ALU amount?
I'll kindly leave that to Al, as my fabbing knowledge is too cursory to allow me to produce an estimate that would not be wildly off the mark. I could try to produce an educated guess based on past embedded parts I have worked with, but my memory is not my forte these days, so I can't even recall the fabbing nodes of those parts :/
 

ikioi

Banned
I'm saying there are good use cases for GPGPU in very specific and limited workloads, but you can't expect to offload everything from the CPU to the GPU, because in order to even hope to match the performance of the 360 and PS3 CPUs, you would have to reduce the actual graphics workload to almost nothing. In essence I'm saying what a number of us have been saying all along: GPGPU is not a magic solution to all the WiiU's shortcomings. It's not even a half measure solution.

I agree, GPGPU is not a magical solution to all situations.

IMHO, however, you are not giving enough credit to the capabilities of modern GPU architectures to handle many of these tasks more efficiently and faster than Xenon and Cell. Also, there seems to be some confusion about what defines a GPGPU task.

Xenon and Cell were frequently used by developers to assist in graphics-related processing: lighting, SIMD work, depth of field, AA, post-processing effects, shadowing, the list goes on. These tasks are not necessarily GPGPU, many are absolutely not, but rather standard GPU tasks that developers offloaded to Cell and Xenon. So Nintendo bringing these tasks back onto the GPU does not mean it's being done via GPGPU or that there will be a significant performance hit in doing so.

Modern GPU architecture has been designed to allow the GPU to simultaneously handle diverse workloads like those above. Yes, the GPU would take a hit having to do all these tasks that Cell and Xenon traditionally assisted with, but Nintendo and ATI can easily offset that by utilizing modern architecture: increased SIMD and shader cores, SRFs, ROPs, etc. Modern GPU architecture is quite complex, with multiple pipelines, multiple cores for various tasks, and also increased programmability to tap raw power. It's not as simple as saying that if the Wii U's GPU needs to do the SIMD processing that Cell and Xenon did, the GPU is going to take a performance hit to its graphical abilities. It all depends on the Wii U's GPU architecture and how well it's able to handle the diverse range of tasks it's been loaded with. All we know is that the architecture and technology exist to build a GPU that is more than capable of handling a diverse workload without sacrificing X to process B. Whether Nintendo and ATI implemented it in the Wii U is something I cannot answer.

Modern GPU architecture is designed around the idea of being able to process and load itself up with multiple different tasks: to handle physics, lighting, depth of field, and SIMD work, all while doing the traditional GPU grunt work. Go look at modern PC games; you'll find that for the most part the GPUs are already handling this stuff and more.

The real question is how well ATI and Nintendo have designed the GPU architecture of the Wii U. If designed correctly, the Wii U's GPU should easily be able to handle all the GPU-assist tasks Cell and Xenon did, as well as the traditional GPU processing. There's no question that the Wii U's CPU is not going to be able to match Cell and Xenon at these tasks, but we do NOT know how capable the GPU is at them. We can, however, look at modern PC GPU architectures from Nvidia and ATI to gauge the likelihood, and they tell us that modern GPUs are more than capable of handling all the GPU-related tasks Cell and Xenon assist with.

The architecture Nintendo appear to have gone with for the Wii U is all about efficiency. The Xbox 360's and PS3's processors handled I/O, audio, security, SIMD, and assisted the GPU with graphical processing, and then did 'typical' processing on top. With the Wii U, Nintendo have offloaded sound, I/O, security, and the operating system to their own dedicated processors/silicon. They also appear to have offloaded SIMD and graphics-related tasks back onto the GPU. As such, does the Wii U's CPU need to be as beefy as Cell and Xenon when it seems like it's not doing anywhere near the same level of work?

Teardowns have shown the Wii U's CPU is around 1/3rd the transistor count of the Xenon CPU in the Xbox 360. Does anyone know the transistor count of the Wii U's GPU vs the Xbox 360's and PS3's? That's something I'd be very keen to find out.

If we combined the transistor count of the Wii U's CPU, DSP, I/O processor, ARM security processor, and ARM OS processor, I wonder how it would compare to Xenon and Cell in pure transistor count. Simply put, the Wii U's architecture is radically different from the Xbox 360's and PS3's; both of those systems favoured using the CPU heavily to assist with everything from security, I/O, sound and the operating system through to graphics. Nintendo have left the CPU to do very specific tasks, with other sub-processors or silicon to do the rest.


I think docking 50-100Gflops for "GPGPU" functionality is fair enough, particularly when comparing to PS3, as the SPE tasks are the type of things most likely to be offloaded to the GPU on Wii U. That said, from my reading on SPE usage by studios like Guerrilla and Naughty Dog, it seems most of it's dedicated to tasks which would normally fall under plain old GPU functionality, particularly lighting.

I suppose I'd be interested to get a quote from a developer saying "we're using about X% of our GPU power for physics, etc.", to give us a better idea of the real-world efficiency for these things.

Exactly my point. Some of the heaviest use of Cell we've seen in games has been for GPU related tasks. Tasks that with modern architecture a GPU would handle on its own.
 
YES! I've more or less given up arguing this point, since everyone uses it these days, but it really used to annoy me. Maybe things like a modern Tesla that doesn't even have a graphics output should be called a "GPGPU", but certainly not normal GPUs that are occasionally used to perform some general purpose computations.

Generally I think the term "GPGPU" refers to a technique, not a device. GPGPU is just the act of using a GPU for non-graphics computations. It doesn't require anything special from the GPU.
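For illustration, this is roughly the most minimal form the technique takes on PC: a plain OpenCL 1.x host program running a non-graphics kernel (a particle integration step) on whatever GPU is present. Generic desktop code, nothing to do with any console SDK, and error handling stripped for brevity.

Code:
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSrc =
    "__kernel void integrate(__global float* pos, __global const float* vel, float dt)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    pos[i] += vel[i] * dt;\n"
    "}\n";

int main()
{
    const size_t n = 1 << 20;
    std::vector<float> pos(n, 0.0f), vel(n, 1.0f);

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

    cl_mem dPos = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), pos.data(), nullptr);
    cl_mem dVel = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), vel.data(), nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "integrate", nullptr);

    float dt = 1.0f / 60.0f;
    clSetKernelArg(k, 0, sizeof(cl_mem), &dPos);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dVel);
    clSetKernelArg(k, 2, sizeof(float), &dt);

    // Not a single graphics call anywhere: just "run this function over a big array".
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dPos, CL_TRUE, 0, n * sizeof(float), pos.data(), 0, nullptr, nullptr);

    std::printf("pos[0] after one step: %f\n", pos[0]);   // 1/60
    return 0;   // resource cleanup omitted
}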
 

Durante

Member
Generally I think the term "GPGPU" refers to a technique, not a device. GPGPU is just the act of using a GPU for non-graphics computations. It doesn't require anything special from the GPU.
Yes, exactly, which is why it still vaguely annoys me when people use it to refer to hardware parts. Though it is true that modern GPU architectures have been tweaked explicitly to improve performance for GPGPU workloads. Nvidia started with that a bit earlier than AMD did.

Exactly my point. Some of the heaviest use of Cell we've seen in games has been for GPU related tasks. Tasks that with modern architecture a GPU would handle on its own.
Indeed, and it would likely handle them well. But it still doesn't handle them "for free" -- that was my point.
 

Donnie

Member
Teardowns have shown the Wii U's CPU is around 1/3rd the transistor count of the Xenon CPU in the Xbox 360. Does anyone know the transistor count of the Wii U's GPU vs the Xbox 360's and PS3's? That's something I'd be very keen to find out.

Wasn't 90nm Xenon 165mm²? Surely it's well under 90mm² on a 45nm process? (Technically it should be 41.25mm² in a perfect world.)

How big is Vejle? (360's 45nm combined CPU/GPU)
 

Thraktor

Member
The most recent 45nm IBM CPU that I can find die size and transistor count info for is BlueGene/Q, which squeezes in roughly 4 million transistors per mm², so for Espresso's ~32.76mm² we're looking at around 130 million transistors. For comparison, Xenon is 165 million transistors, and Cell is 241 million transistors.

AMD's 40nm GPUs seem to get around 6.25 million transistors per mm², so for Latte's 156.21mm², you're looking at a total just shy of a billion transistors, around 975 million or so. This includes GPU, eDRAM, an ARM chip and a DSP, and we don't know the exact breakdown between those components, but we expect the GPU and eDRAM to take up the vast majority of the die. For comparison, the Xenos GPU consists of two dies totalling 337m transistors and the RSX GPU is a 300 million transistor part.

Edit: The above assumes that Latte is a 40nm part, which is certainly the most likely scenario, but we don't have any hard evidence one way or the other on the manufacturing process of the Latte die. In comparison, IBM have confirmed that Espresso is a 45nm part.
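Written out, the arithmetic behind those estimates (density figures as above):

Code:
#include <cstdio>

int main()
{
    const double ibm_45nm_density = 4.0;    // million transistors per mm^2 (BlueGene/Q)
    const double amd_40nm_density = 6.25;   // million transistors per mm^2 (AMD 40nm GPUs)

    std::printf("Espresso (32.76mm^2):  ~%.0f million transistors\n", 32.76  * ibm_45nm_density);  // ~131
    std::printf("Latte    (156.21mm^2): ~%.0f million transistors\n", 156.21 * amd_40nm_density);  // ~976
}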
 
The most recent 45nm IBM CPU that I can find die size and transistor count info for is BlueGene/Q, which squeezes in roughly 4 million transistors per mm², so for Espresso's ~32.76mm² we're looking at around 130 million transistors. For comparison, Xenon is 165 million transistors, and Cell is 241 million transistors.

AMD's 40nm GPUs seem to get around 6.25 million transistors per mm², so for Latte's 156.21mm², you're looking at a total just shy of a billion transistors, around 975 million or so. This includes GPU, eDRAM, an ARM chip and a DSP, and we don't know the exact breakdown between those components, but we expect the GPU and eDRAM to take up the vast majority of the die. For comparison, the Xenos GPU consists of two dies totalling 337m transistors and the RSX GPU is a 300 million transistor part.

Edit: The above assumes that Latte is a 40nm part, which is certainly the most likely scenario, but we don't have any hard evidence one way or the other on the manufacturing process of the Latte die. In comparison, IBM have confirmed that Espresso is a 45nm part.

Any ideas how many transistors 32MB of eDRAM would be?
 

Thraktor

Member
Any ideas how many transistors 32MB of eDRAM would be?

We had some discussions on that a few pages ago. From info from Renesas, the eDRAM itself should come to 16.1mm² (if I recall correctly), but apparently there's a large overhead required for wiring/control/etc, so it could be 2x or more of that figure once the overhead's taken into account. Somewhere around 35mm² would probably be a good bet. By my calculations above, that'd be about 219 million transistors.
 
We had some discussions on that a few pages ago. From info from Renesas, the eDRAM itself should come to 16.1mm² (if I recall correctly), but apparently there's a large overhead required for wiring/control/etc, so it could be 2x or more of that figure once the overhead's taken into account. Somewhere around 35mm² would probably be a good bet. By my calculations above, that'd be about 219 million transistors.

So depending on the DSP and ARM chip, we could be looking at at least twice the number of transistors for the GPU.
 

Donnie

Member
We had some discussions on that a few pages ago. From info from Renesas, the eDRAM itself should come to 16.1mm² (if I recall correctly), but apparently there's a large overhead required for wiring/control/etc, so it could be 2x or more of that figure once the overhead's taken into account. Somewhere around 35mm² would probably be a good bet. By my calculations above, that'd be about 219 million transistors.

Memory tends to take up less die space than logic, so probably more like 250-260m or so. Your estimate of the number of transistors on the chip was pretty similar to my thoughts, BTW. Though I was thinking just over 1 billion, mostly because of eDRAM taking up less space per transistor. The fact that WiiU has so much eDRAM should give it a higher ratio of transistors to die size than a traditional AMD GPU.

So depending on the DSP and ARM chip, we could be looking at at least twice the number of transistors for the GPU.

Likely 2-3x as many; Xenos is about 250m transistors if you remove the eDRAM (including the logic on the daughter die as part of the 250m, obviously).

If we assume the 975m figure is correct (obviously it's just an estimate, but for argument's sake): even if we removed 300m transistors for eDRAM (which seems over the top) and 50m for the ARM chip and DSP (no chance they'll be even half that size), you're still left with about 2.5x as many transistors as Xenos.
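In other words:

Code:
#include <cstdio>

int main()
{
    const double latte_total = 975.0;   // the density-based estimate above, in millions
    const double edram       = 300.0;   // deliberately generous eDRAM allowance
    const double arm_dsp     = 50.0;    // deliberately generous ARM + DSP allowance
    const double xenos_logic = 250.0;   // Xenos minus its eDRAM

    const double gpu_logic = latte_total - edram - arm_dsp;                                 // 625
    std::printf("GPU logic: ~%.0fm (~%.1fx Xenos)\n", gpu_logic, gpu_logic / xenos_logic);  // ~2.5x
}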
 

Thraktor

Member
So depending on the DSP and ARM chip, we could be looking at at least twice the number of transistors for the GPU.

The DSP and ARM chips should be very small indeed. A dual-core ARM Cortex A5 (which many suspect as the ARM chip in question) is 1.8mm² at 40nm. The DSP is more difficult to figure out, as there are a wide variety of DSPs offered by many different companies, but CEVA (whose DSP is in the 3DS) make a DSP called the TeakLite III, which might be suitable for Wii U and fits in 0.47mm² at 65nm, so even less than that at 40nm. All in all, we might be looking at 735 million transistors or so for the GPU itself.

Also, keep in mind that the Xenos transistor count includes eDRAM too.

Edit: Incidentally, Turks Pro (although a Northern Islands part) comes to 716 million transistors with a 480:24:8 configuration. Given that I'd expect a slightly more ROP-heavy design (due to the extra screen), and as R700 is a little less transistor-intensive than Northern Islands, a 480:24:12 configuration seems reasonable to me for Latte.

Edit 2: Using my handy chart, that comes to 528 Gflops, a 13.2GT/s texture fillrate and a 6.6GP/s pixel fillrate.
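For reference, the standard peak-rate formulas behind those numbers; the quoted figures imply an assumed 550MHz clock, which is itself just a guess:

Code:
#include <cstdio>

int main()
{
    const double clock_ghz = 0.55;              // assumed clock implied by the figures above
    const int alus = 480, tmus = 24, rops = 12; // the 480:24:12 configuration

    std::printf("Shader peak:  %.0f GFLOPS\n", alus * 2 * clock_ghz); // MADD = 2 FLOPs/ALU/clock -> 528
    std::printf("Texture rate: %.1f GT/s\n",   tmus * clock_ghz);     // 13.2
    std::printf("Pixel rate:   %.1f GP/s\n",   rops * clock_ghz);     // 6.6
}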
 

Argyle

Member
Also, SPUs are heavily used for non-graphics tasks in games and should get quite decent performance over what is possible with the PPU. (Which apparently is quite weak on PS3.)

I'm not saying they are useless, just not as useful as it might seem on paper. From what I've heard it's often difficult to reach even Xenon levels.

Slightly OT, but I agree with pottuvoi, this is completely not true - if you have the same code running on an SPU vs. one of the threads on a Xenon, the SPU will smoke the Xenon, every single time.

I see people call the SPUs "DSPs", and that kind of work is indeed something they are good at, but they are surprisingly fast at running just about any code you throw at them, in my experience.
 
Slightly OT, but I agree with pottuvoi, this is completely not true - if you have the same code running on an SPU vs. one of the threads on a Xenon, the SPU will smoke the Xenon, every single time.

Care to elaborate why this should be so? For instance, the SPUs lack branch prediction and a Xenon core can reach more GFLOPs than a SPU.

I see people call the SPUs "DSPs", and that kind of work is indeed something they are good at, but they are surprisingly fast at running just about any code you throw at them, in my experience.

Sorry, I just find that hard to believe. No OoOE, lack of hardware-based branch prediction, limited 256KB local storage, difficult memory transfers.. running general purpose code on Cell that is not specifically optimized for its architecture should lead to weak performance in many cases.
 

Thraktor

Member
Care to elaborate why this should be so? For instance, the SPUs lack branch prediction and a Xenon core can reach more GFLOPs than a SPU.

Sorry, I just find that hard to believe. No OoOE, lack of hardware-based branch prediction, limited 256KB local storage, difficult memory transfers.. running general purpose code on Cell that is not specifically optimized for its architecture should lead to weak performance in many cases.

I think we're running a little off topic here, but my understanding is (and I'll stress that I've never actually coded on Cell, I'm just basing this on what I've read about the architecture) that coding for Cell is centred around optimising for that 256K local store. If your code is only operating on that 256K of SRAM (which for the most part it should be), then you're in a situation that's functionally equivalent to having a 100% cache hit rate, which means OoOE and branch prediction are of little value. Of course the limitation of the architecture is that you have to optimise around that 256K, but optimising around memory constraints is something game programmers have been doing since Space War.
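Schematically, the streaming pattern people describe looks something like this. It's just a sketch: dma_get/dma_wait stand in for the real asynchronous MFC DMA requests plus a tag wait, and here they're synchronous stubs so the thing compiles and runs anywhere.

Code:
#include <cstring>
#include <cstddef>

enum { CHUNK = 16 * 1024 };   // per-buffer size; must fit comfortably in the 256K local store

// Stand-ins: on a real SPU these would be asynchronous DMA requests plus a tag-status wait.
static void dma_get(void* local, const void* src, std::size_t size, int /*tag*/) { std::memcpy(local, src, size); }
static void dma_wait(int /*tag*/) {}

static void process(float* data, std::size_t count)          // placeholder for the real work
{
    for (std::size_t i = 0; i < count; ++i) data[i] *= 2.0f;
}

// Double-buffered loop: fetch chunk N+1 while crunching chunk N, entirely out of local store.
// Assumes total_bytes is a multiple of CHUNK, to keep the sketch short.
void stream_job(const float* src, std::size_t total_bytes)
{
    static float buf[2][CHUNK / sizeof(float)];
    std::size_t offset = 0;
    int cur = 0;
    dma_get(buf[cur], src, CHUNK, cur);                        // prime the first transfer

    while (offset < total_bytes) {
        const int next = cur ^ 1;
        const std::size_t next_off = offset + CHUNK;
        if (next_off < total_bytes)                            // kick off the next DMA early
            dma_get(buf[next], reinterpret_cast<const char*>(src) + next_off, CHUNK, next);

        dma_wait(cur);                                         // make sure our chunk has landed
        process(buf[cur], CHUNK / sizeof(float));              // work with a "100% hit rate"

        offset = next_off;
        cur = next;
    }
}

int main()
{
    static float data[4 * CHUNK / sizeof(float)] = {};         // four chunks of dummy input
    stream_job(data, sizeof(data));
}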

Memory tends to take up less die space than logic, so probably more like 250-260m or so. Your estimate of the number of transistors on the chip was pretty similar to my thoughts, BTW. Though I was thinking just over 1 billion, mostly because of eDRAM taking up less space per transistor. The fact that WiiU has so much eDRAM should give it a higher ratio of transistors to die size than a traditional AMD GPU.

That's a fair point. We're probably around the 1 billion mark, give or take 50 to 100 million.
 
I think we're running a little off topic here, but my understanding is (and I'll stress that I've never actually coded on Cell, I'm just basing this on what I've read about the architecture) that coding for Cell is centred around optimising for that 256K local store. If your code is only operating on that 256K of SRAM (which for the most part it should be), then you're in a situation that's functionally equivalent to having a 100% cache hit rate, which means OoOE and branch prediction are of little value. Of course the limitation of the architecture is that you have to optimise around that 256K, but optimising around memory constraints is something game programmers have been doing since Space War.

Having no cache misses doesn't save you from pipeline stalls, though (same goes for data dependencies). Of course you can still use branch hints and implement software pipelining, but again that's where lots of manual tweaking starts. And we were talking about "just about any code thrown at them".

Mind you, I won't deny that a single SPU can be quite powerful given the right task and/or lots of optimizing, and of course Cell as a whole can be more powerful than Xenon by a fair amount. But what I've heard, and what the reality of many multi-platform games shows, is that for typical CPU game code, Cell isn't better often enough. It's better at supporting the weak RSX though - no question!


And sorry if this is too much off-topic here. I'll stop if anyone feels distracted by this.
 

Argyle

Member
Sorry, I just find that hard to believe. No OoOE, lack of hardware-based branch prediction, limited 256KB local storage, difficult memory transfers.. running general purpose code on Cell that is not specifically optimized for its architecture should lead to weak performance in many cases.

I think we're running a little off topic here, but my understanding is (and I'll stress that I've never actually coded on Cell, I'm just basing this on what I've read about the architecture) that coding for Cell is centred around optimising for that 256K local store. If your code is only operating on that 256K of SRAM (which for the most part it should be), then you're in a situation that's functionally equivalent to having a 100% cache hit rate, which means OoOE and branch prediction are of little value. Of course the limitation of the architecture is that you have to optimise around that 256K, but optimising around memory constraints is something game programmers have been doing since Space War.

This is exactly why - the trick is to get your data parceled out properly, but once you do, plain old C++ code will run faster on the SPU than on a Xenon thread. The good thing is that doing the work to ensure you are working on small chunks of data at a time for the SPU also tends to optimize the Xenon version as well (it should sit in cache better, so it runs better than a naive implementation that ends up traversing all over main memory).
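In spirit it's just this kind of structure (completely made-up example, not from any real codebase): the work is expressed as small contiguous batches, so on PS3 each batch becomes one SPU job with its data DMA'd into local store, and on 360 the same batches stay hot in L2.

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

struct Particle { float px, py, pz, vx, vy, vz; };

static void update_batch(Particle* p, std::size_t n, float dt)   // the part you'd hand to an SPU job
{
    for (std::size_t i = 0; i < n; ++i) {
        p[i].px += p[i].vx * dt;
        p[i].py += p[i].vy * dt;
        p[i].pz += p[i].vz * dt;
    }
}

void update_particles(std::vector<Particle>& all, float dt)
{
    const std::size_t batch = 1024;    // sized so one batch fits in local store / L2
    for (std::size_t i = 0; i < all.size(); i += batch) {
        const std::size_t n = std::min(batch, all.size() - i);
        update_batch(&all[i], n, dt);
    }
}

int main()
{
    std::vector<Particle> particles(10000, Particle{0, 0, 0, 1, 1, 1});
    update_particles(particles, 1.0f / 60.0f);
}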
 

mrklaw

MrArseFace
Yes, exactly, which is why it still vaguely annoys me when people use it to refer to hardware parts. Though it is true that modern GPU architectures have been tweaked explicitly to improve performance for GPGPU workloads. Nvidia started with that a bit earlier than AMD did.

Indeed, and it would likely handle them well. But it still doesn't handle them "for free" -- that was my point.


I don't know how modern GPUs are laid out, so please educate me if needed. But aren't most GPU pipelines overkill for compute tasks?

So if someone (like Nintendo) wanted to specifically have some additional computing horsepower on tap, suited towards gaming, could they have a more stripped down section of GPU for that?

Alternatively, if you weren't expecting any compute-only tasks on your GPU, wouldn't you balance the rendering parts (ROPs?) with the vertex/texturing parts? So you wouldn't have an excess of vertex/texture units. But if you did want to encourage GPGPU usage, you might have an otherwise disproportionate number of processing units compared to ROPs.
 

ikioi

Banned
AMD's 40nm GPUs seem to get around 6.25 million transistors per mm², so for Latte's 156.21mm², you're looking at a total just shy of a billion transistors, around 975 million or so. This includes GPU, eDRAM, an ARM chip and a DSP, and we don't know the exact breakdown between those components, but we expect the GPU and eDRAM to take up the vast majority of the die. For comparison, the Xenos GPU consists of two dies totalling 337m transistors and the RSX GPU is a 300 million transistor part.

So it sounds like the Wii U's actual GPU, excluding the ARM, DSP, and eDRAM, still has around 3x or more the transistors of the GPU in the Xbox 360 and PS3, combined with more modern architecture to boot.

It also sounds like the total transistor count is higher in the Wii U than in the Xbox 360 and PS3.

I firmly believe we are yet to see any game take full advantage of the Wii U's architecture. The GPU is clearly designed to pick up a lot of the graphical tasks Xenon and Cell were used for, and it sounds like it's a lot more capable chip. The lack of power on the CPU seems to be in part because the GPU, DSP, and various ARM processors handle a lot of the tasks Cell and Xenon were taxed with. So for developers, particularly 3rd parties, their games and engines would need moderate to significant reworking to take full advantage of the Wii U's horsepower.
 

Thraktor

Member
Having no cache misses doesn't save you from pipeline stalls, though (same goes for data dependencies). Of course you can still use branch hints and implement software pipelining, but again that's where lots of manual tweaking starts. And we were talking about "just about any code thrown at them".

Mind you, I won't deny that a single SPU can be quite powerful given the right task and/or lots of optimizing, and of course Cell as a whole can be more powerful than Xenon by a fair amount. But what I've heard, and what the reality of many multi-platform games shows, is that for typical CPU game code, Cell isn't better often enough. It's better at supporting the weak RSX though - no question!


And sorry if this is too much off-topic here. I'll stop if anyone feels distracted by this.

My guess is that the cache miss penalties on Xenon are going to hurt you a lot more than pipeline stalls on a Cell SPE.

Anyway, while we are a little off-topic, it does relate to how developers are going to have to approach compute tasks on Latte, as there you're going to be doing a lot of data parcelling as well. That actually brings me to a question I have for people: do you think there's going to be more eDRAM on the GPU die than just the 32MB "MEM1"? When designing the Gamecube, Nintendo evidently went on a bit of a 1T-SRAM binge, as not only was it used for the main memory, but also Flipper's embedded framebuffer and texture cache. With Wii U, their memory of choice seems to be eDRAM, as they're using it both for their CPU cache and shoving a large chunk of it on-die with the GPU. Could they be doing the same thing with Latte?

The R700 by default has quite a few layers of SRAM on board. By my count there's:

Register memory banks
Local Data Shares
Global Data Share
Vertex Cache
Instruction Cache
Constant Cache
L1 Texture Cache
L2 Texture Cache

And possibly more that I'm not aware of. The 32K Global Data Share will probably be replaced outright by the MEM1 eDRAM, but the rest of them are still going to need to be there (and will total a couple of megabytes altogether by my reckoning). Do people think Nintendo are going to go on an eDRAM binge in the same way they crammed in 1T-SRAM everywhere with the Gamecube? If so, which memory pools are feasible and likely to be replaced with eDRAM? If I recall correctly, Matt once said something about register memory being increased in Latte over standard R700, but would that not be a situation where SRAM's latency would outweigh eDRAM's density?
 

Argyle

Member
Hmmm. I don't think this is true.

Shrug, it has been true for all the code I have written, as well as just about all of my colleagues' code that I am aware of. What is the context of that slide? I'm sure you could make the SPU code run slower by doing something silly like implementing a software cache or otherwise not optimizing for the local store...
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
My guess is that the cache miss penalties on Xenon are going to hurt you a lot more than pipeline stalls on a Cell SPE.

Anyway, while we are a little off-topic, it does relate to how developers are going to have to approach compute tasks on Latte, as there you're going to be doing a lot of data parcelling as well. That actually brings me to a question I have for people: do you think there's going to be more eDRAM on the GPU die than just the 32MB "MEM1"? When designing the Gamecube, Nintendo evidently went on a bit of a 1T-SRAM binge, as not only was it used for the main memory, but also Flipper's embedded framebuffer and texture cache. With Wii U, their memory of choice seems to be eDRAM, as they're using it both for their CPU cache and shoving a large chunk of it on-die with the GPU. Could they be doing the same thing with Latte?

The R700 by default has quite a few layers of SRAM on board. By my count there's:

Register memory banks
Local Data Shares
Global Data Share
Vertex Cache
Instruction Cache
Constant Cache
L1 Texture Cache
L2 Texture Cache

And possibly more that I'm not aware of. The 32K Global Data Share will probably be replaced outright by the MEM1 eDRAM, but the rest of them are still going to need to be there (and will total a couple of megabytes altogether by my reckoning). Do people think Nintendo are going to go on an eDRAM binge in the same way they crammed in 1T-SRAM everywhere with the Gamecube? If so, which memory pools are feasible and likely to be replaced with eDRAM? If I recall correctly, Matt once said something about register memory being increased in Latte over standard R700, but would that not be a situation where SRAM's latency would outweigh eDRAM's density?
I'm ultra skeptical about the register file using anything but (multiported) SRAM. But I already disclaimed any proficiency in the subject, so I'll shut up : )
 