
Wii U CPU |Espresso| Die Photo - Courtesy of Chipworks

Orionas

Banned
[Image: C10234F5_PkgDecap.jpg — decapped Wii U MCM package photo]


A question of mine: what is the little third chip? A guy on another forum said it's "a helper GPU" that the original Wii also had; I think he recalled a name like "Starlet". Is that accurate, or is it something else?
 

Jaagen

Member
I asked about this because I'm wondering about the feasibility of a future Nintendo product that has the Wii U's chipset on the actual Gamepad... with the game disc-data streaming from the console or even over the internet. Call it 'Wii Cloud', if you will.

It is higher than the 3DS. I think the whole 3DS power consumption must be around 5 watts. iPad 4 is sub 10 watts for the whole system I believe.

[Image: C10234F5_PkgDecap.jpg — decapped Wii U MCM package photo]


A question of mine: what is the little third chip? A guy on another forum said it's "a helper GPU" that the original Wii also had; I think he recalled a name like "Starlet". Is that accurate, or is it something else?

The theory is that it's EEPROM.
 

wsippel

Banned
A question of mine: what is the little third chip? A guy on another forum said it's "a helper GPU" that the original Wii also had; I think he recalled a name like "Starlet". Is that accurate, or is it something else?
It's flash memory. The Wii had it as well, but I believe Nintendo used an EEPROM instead of flash back then. No idea what's stored in there. Maybe license keys for eShop games or something.
 

Thraktor

Member
[Image: C10234F5_PkgDecap.jpg — decapped Wii U MCM package photo]


A question of mine: what is the little third chip? A guy on another forum said it's "a helper GPU" that the original Wii also had; I think he recalled a name like "Starlet". Is that accurate, or is it something else?

It's NOR flash, basically a replacement for the EEPROM on the Wii GPU package. Starlet, by the way, was actually on the Hollywood die itself.
 

Donnie

Member
Yeah, Anand from Anandtech said the Intel HD2500 graphics were about half as powerful as the 360's, and the HD4000 was more powerful. So with Haswell, even integrated graphics at a low enough power level for a tablet will likely be more powerful.

Haswell's GT3e with embedded DRAM on the GPU (sound familiar?) should post even bigger leads.

How did Anand come to the conclusion that the HD2500 is about half as powerful as the 360? Isn't that GPU about 53 GFLOPS when clocked at its maximum 1100MHz (for the mobile version)? With the HD4000 being around 140 GFLOPS. Or am I missing something?
 

AlStrong

Member
The package size is the same, but that's no way to tell whether the die has shrunk or not (particularly as there are actually two dies in the package). The Wii Mini has both noticeably lower power consumption and a noticeably smaller cooling system, and a die shrink of just Broadway wouldn't be able to explain that.

Surely I'm not the only one noticing the column of capacitors & a bunch of resistors in the top right missing for Wii Mini. :p
 

tipoo

Banned
How did Anand come to the conclusion that the HD2500 is about half as powerful as the 360? Isn't that GPU about 53 GFLOPS when clocked at its maximum 1100MHz (for the mobile version)? With the HD4000 being around 140 GFLOPS. Or am I missing something?

HD4000 = 2 ROPs, 25.6 GB/s memory bandwidth, 1.15 GHz 16-EU GPU cores, ~256 GFLOPS single precision IIRC? That would put it just above the 360 (ignoring the higher efficiency of the newer shaders completely).
Maybe you're thinking of the old HD3000?

Think about this: even the GPU in the new iPad is hitting 77 GFLOPS, so 53 for the 2500 sounds way too low. Half of 256 I would believe, if it scales perfectly with the EU count.

http://www.anandtech.com/show/6472/ipad-4-late-2012-review/4
 
The package size is the same, but that's no way to tell whether the die has shrunk or not (particularly as there are actually two dies in the package). The Wii Mini has both noticeably lower power consumption and a noticeably smaller cooling system, and a die shrink of just Broadway wouldn't be able to explain that.
Pardon me for being stubborn; I know it's an MCM, but without seeing a size reduction I'm not convinced about a core shrink. On the surface it looks the same.

Granted, going from 17W to 13W is a 4W drop; that's about the entire 90nm CPU's consumption, and a die shrink would have shaved 1W off it at most. Then again, the logic seems simplified: no SD card reader, way fewer capacitors, and the DDR3 could be low-voltage now. They could also have shrunk just one of the chips in there, the one with the 24MB of 1T-SRAM and the sound chip, as it would be the simpler one to do. The DVD drive and its controller chips might also have improved (and that might be where the majority of the energy is being saved). The other argument is that the other consoles were also 90nm at launch and both drew around 200W; now they're in the 70W ballpark after going from 90nm to 45nm. A 4W reduction is measly by comparison, and I'm sure the other consoles could cut that much without touching the CPU (the X360's HANA chip, for instance).


I also doubt that, if said changes were made, they were specific to or coincided with the Wii Mini launch; they're probably in every recently manufactured Wii, perhaps ever since the revision without GC backwards compatibility. Waiting for the Wii Mini would be too late in the cycle for it to make sense, especially considering sales were really dimming at that point.

But anyway, this (my stubbornness, that is) detracts from my point, which is that I'd sure like to see whether the die for the shrunk Broadway is identical to the launch one; the scenario of engineering something in between seems like a really clever thing to do, even though there's little to support it.

EDIT: Just remembered, the Wii Mini seems to be missing 480p output, so I'm guessing they took out the AV encoder chip that handled that. That's an extra chip missing, and one that does something that could take a little wattage.
 
Thraktor, on close analysis I think it's safe to say that Hollywood really didn't shrink:



The size of the capacitors and the RAM chip are consistent; Hollywood stays the same. Broadway is the only one being die-shrunk there.

I don't know about the core being shrunk. The core itself has always been small. I'd hazard a guess that it was the packaging that shrunk, but not the core.

[Image: 20061124willchip2.jpg — Wii chip package photo]
 

Donnie

Member
HD4000 = 2 ROPs, 25.6 GB/s memory bandwidth, 1.15 GHz 16-EU GPU cores, ~256 GFLOPS single precision IIRC? That would put it just above the 360 (ignoring the higher efficiency of the newer shaders completely).
Maybe you're thinking of the old HD3000?

Think about this: even the GPU in the new iPad is hitting 77 GFLOPS, so 53 for the 2500 sounds way too low. Half of 256 I would believe, if it scales perfectly with the EU count.

http://www.anandtech.com/show/6472/ipad-4-late-2012-review/4

I was going with:

EU * 4 [dual-issue x 2 SP] * 2 [multiply + accumulate] * clock speed

Which I'm pretty sure was correct for Gen 6 Intel graphics, Gen 7 must be different?
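
Plugging numbers into that formula as a quick sanity check (the 8 FLOPs per EU per clock factor is just the assumption above, not a confirmed Gen7 figure):

Code:
# Formula above: EU count * 4 (dual-issue x 2 SP) * 2 (multiply + accumulate) * clock.
# The 8 FLOPs/EU/clock factor is the post's assumption, not a confirmed Gen7 spec.
def est_gflops(eu_count, clock_ghz, flops_per_eu_per_clock=8):
    return eu_count * flops_per_eu_per_clock * clock_ghz

print(est_gflops(6, 1.1))    # HD2500, 6 EUs @ 1.1 GHz  -> ~52.8 (the "53" above)
print(est_gflops(16, 1.15))  # HD4000, 16 EUs @ 1.15 GHz -> ~147.2 (the "~140" above)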
 

tipoo

Banned

krizzx

Junior Member
So, I have a few questions about the performance.

1. Do we know how many instructions it executes per cycle?

2. When it executes them, say 3 per cycle for example, is that 3 on each individual core?

3. What are the physics capabilities of this CPU? I ask because I remember people saying that the Wii U's CPU wouldn't be able to do physics very well, but I remember a launch Wii game that did them exceptionally well. http://www.youtube.com/watch?v=3xJXvFqhCk0 http://www.youtube.com/watch?v=v-F3UgiEexY http://www.youtube.com/watch?v=402w0VWnMDk
If we go back to pure "at a glance" performance increase, the CPU should be 6X stronger minimum.


4. What would the Wii U CPU "not" be able to do that the 360 CPU could and vice versa?

5. Is it possible to up the clock on the CPU by voltage stepping? In other words, could Nintendo release a firmware update that ran the CPU at a higher clock?

6. How much of a problem would the CPU be for AI? I recall Samurai Warriors 3 and Sengoku Basara 3 running extremely well on the Wii but Warriors Orochi Hyper running terribly, though it didn't appear to be doing much more than those 2 games.

I wonder just how much of the problems we've seen in Wii U ports are the result of specs and how much are the result of developer effort (or lack thereof).
 
1. Do we know how many instructions it executes per cycle?
It's capable of issuing three instructions per clock cycle into six independent execution units. But it's not that linear, never is.


So here:




• Up to four instructions can be fetched from cache per clock cycle
• As many as six instructions can execute per clock (including two integer instructions)
• Single-clock-cycle execution for most instructions

• Maximum three instruction dispatch per clock cycle
• Processes one branch per cycle and can resolve two speculations

• Completion unit retires as many as two instructions per clock
Source: http://raid-faq.narod.ru/doc/750_ts.pdf

That's documentation for a regular PPC750 part, so paired singles might introduce a difference by doing 2x32-bit SIMD and being able to either dispatch or retire more per clock, I dunno.
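
For a rough idea of what paired singles buy you in raw terms, here's a back-of-the-envelope peak number (assuming the commonly reported ~1.24 GHz clock and one paired-single multiply-add per core per cycle; an estimate, not an official spec):

Code:
# Back-of-the-envelope Espresso peak FP: 3 cores, ~1.24 GHz (commonly reported,
# not official), paired singles = 2 lanes, multiply-add = 2 FLOPs per lane.
cores = 3
clock_ghz = 1.24
flops_per_core_per_clock = 2 * 2
print(cores * clock_ghz * flops_per_core_per_clock)  # ~14.9 GFLOPS peak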
2. When it executes them, say for 3 per cycle for exmaple, is that 3 on each individual core?
Yes.

How it behaves with multithreaded code though is anyone's guess, and by that I mean the intricacies of doing so, because the CPU architecture originally wasn't meant for it. But it's probably pretty straightforward, otherwise we would have heard about it by now.
3. What are the physics capabilities of this CPU? I ask because I remember people saying that the Wii U's CPU wouldn't be able to do physics very well, but I remember a launch Wii game that did them exceptionally well.

If we go back to pure "at a glance" performance increase, the CPU should be 6X stronger minimum.
It depends on the software you're using and how it is optimized. You basically have two competing approaches: Havok doesn't overly rely on the GPU, although it does support it (more on that in a bit), while PhysX by nature does rely on the GPU but can run on the CPU as well.

Anyway, theoretically, using the CPU for physics is mostly a waste; not because it can't handle it, but because it's not as effective at it as other parts. Physics calculations are floating-point math, so it's mostly down to the FPU/math co-processor.

Ever since Shader Model 2, physics and particle acceleration on GPUs has been a somewhat viable possibility and something often talked about; those were limited proof-of-concept experiments that led to Shader Model 3 onwards having actual customizations for it.

You can discern two types of physics (and particle calculations) though: gameplay/precision ones (i.e. having to interact with and take into account the game world geometry and other objects) and independent/imprecise/effect ones (stuff without collision detection, basically; they do their thing but aren't necessarily aware of other objects or physics calculations going on). Make sense?


Havok FX split execution like this (CPU doing precision physics, GPU doing independent ones); PhysX tries to do everything on the GPU but can do things via the CPU, albeit with a penalty; some games used it on the Wii, and that was obviously CPU-only.

Then you have politics: Havok was bought by Intel (and thus optimizing for the CPU has been a core intention ever since), and PhysX was bought by none other than Nvidia (so GPU performance is their core objective). It also helps that the i5/i7 series of CPUs increased floating point performance in a huge way, an advantage the Wii U lacks (making most physics calculations better suited to be offloaded to the GPU if possible).

PhysX on the PC dropped support for AMD cards, making it so physics has to run on the CPU, in a not-so-classy Nvidia move (it can be patched around, but I don't know the situation for consoles).
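
A purely conceptual sketch of that split (not Havok or PhysX API, just the idea of routing collision-aware physics one way and effect-only physics the other):

Code:
# Toy routing of physics work as described above: "gameplay" objects need
# collision against world geometry (CPU-style path), effect-only particles
# can be batched off as an independent job (GPU-style path). Conceptual only.
def split_physics(objects):
    gameplay = [o for o in objects if o["needs_collision"]]
    effects  = [o for o in objects if not o["needs_collision"]]
    return gameplay, effects

print(split_physics([{"id": 1, "needs_collision": True},
                     {"id": 2, "needs_collision": False}]))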
4. What would the Wii U CPU "not" be able to do that the 360 CPU could and vice versa?
GFLOP throughput is different, so it can basically do fewer calculations. Thankfully the GPU can pick up more of those calculations, and with less penalty than on the X360.
5. Is it possible to up the clock on the CPU by voltage stepping? In other words, could Nintendo release a firmware update that ran the CPU at a higher clock?
Even if they could, don't hold your breath.
6. How much of a problem would the CPU be for AI? I recall Samurai Warriors 3 and Sengoku Basara 3 running extremely well on the Wii but Warriors Orochi Hyper running terribly, though it didn't appear to be doing much more than those 2 games.
The game code was most certainly a wreck, as they're not exactly tech gods whose games push the envelope (quite the opposite). And most likely, on top of being sloppy, it was optimized for the 2-way SMT nature of the PS3/X360; that, or optimized for floating point performance, seeing as I believe the PS3 was the lead platform for that one and it lacks general purpose grunt in no small way; they wouldn't be the first ones stuffing AI into the SPEs.

Samurai Warriors 3 was based on their PS2-era tech, and Sengoku Basara 3 is Capcom on MT Framework: way better coding talent and tech (probably a bigger budget too).

And one can't stress enough the fact that both of those examples were built from the ground up for the Wii (even if they didn't stay exclusive, it was the lead platform). The same can't be said for Warriors Orochi 3's late porting shenanigans.
I wonder just how much of the problems we've seen in Wii U ports are the result of specs and how much are the result of developer effort (or lack thereof).
In regards to current generation porting, it's mostly a matter of: the more you optimize/write around the quirks of a certain architecture, the harder it is to port something.

And the current gen platforms are not only quirky, they're also very well known to developers, so at this point code is very optimized for them (mostly for the X360, but I digress). It's literally years of optimization down the drain when going to very different architecture choices like the ones at hand, and a lot of launch ports haven't taken that into account or taken the time to neuter it. Also notice that most of the problems we're seeing are most prominent on console-only products being ported over.

For instance, Need for Speed: Most Wanted is better off being based on the PC version than on the PS3/X360 implementations, seeing as CPU-wise the architecture is more traditional; hence Criterion not going through the pains the Tekken and Warriors Orochi teams went through.
 
For instance, Need for Speed: Most Wanted is better off being based on the PC version than on the PS3/X360 implementations, seeing as CPU-wise the architecture is more traditional; hence Criterion not going through the pains the Tekken and Warriors Orochi teams went through.
I know the PS4 and supposedly Durango are x86 based, so porting between the two and PC is more straightforward. Do you think the Wii U is much more difficult?
 
I know the PS4 and supposedly Durango are x86 based, so porting between the two and PC is more straightforward. Do you think the Wii U is much more difficult?
It shouldn't be; the CPU is very straightforward, and that means it's very industry standard, which means everybody is coding primarily with that kind of design in mind.

On top of all that it's very what-you-see-is-what-you-get: with good coding, performance is very predictable, with no problems like chronic cache misses or an FPU lacking in precision (no code misbehaving or needing coherency checks), and with the pipeline being so short, cycles are really fast and in the event of a stall or pipeline flush it doesn't take much time to clear.

Other than good coding, the only thing really specific in there is paired singles, needed to achieve the best possible floating point performance. On the other hand it's missing some modern commonalities like 256-bit wide AVX (SIMD) and the newfound focus on multithreading.

AMD's Jaguar has both cluster-based multithreading and AVX. Multithreading can't be counted on as standard in PC code because CPUs as recent as the Core 2 Duo (and most AMD CPUs) lack it, and AVX would really help Wii U floating point performance (as it does for the Core i5/i7 and AMD Jaguar). These are closed platforms though, so specific optimization is viable.

That (floating point performance) and the number of cores available (3 vs 8) really set these systems apart; porting up should be easy, porting down depends. Code expecting floating point performance is going to have a hard time, and so is code meant for more than 3 cores.

Other than that it's good. There's talk of the PS4/X720 reserving two cores for the OS, so that brings it down to 3 vs 6. Of course, the Wii U being x86 would also have helped; recompiling for another architecture is not an issue, but if somewhere down the road third-party support doesn't get off the ground and middleware PPC support gets deprecated, the Wii U will forever be bundled with current gen platforms and technology, despite the PPC part at hand having more in common with the next gen x86 solutions than with the PS3/X360's in-order execution + SMT. Developers doing middleware recompilations on their own is possible of course, but most simply won't bother.

All this to say: not a problem "if" the Wii U manages to keep third-party support (and, as the lowest common denominator, gets to be the lead platform) throughout the generation. Otherwise the competition will probably top it by at least twice the general purpose performance, making late/afterthought ports potentially hard to accommodate. Parity between the other two competitors also doesn't help, I guess.
 
It shouldn't be; the CPU is very straightforward, and that means it's very industry standard, which means everybody is coding primarily with that kind of design in mind.

On top of all that it's very what-you-see-is-what-you-get: with good coding, performance is very predictable, with no problems like chronic cache misses or an FPU lacking in precision (no code misbehaving or needing coherency checks), and with the pipeline being so short, cycles are really fast and in the event of a stall or pipeline flush it doesn't take much time to clear.

Other than good coding, the only thing really specific in there is paired singles, needed to achieve the best possible floating point performance. On the other hand it's missing some modern commonalities like 256-bit wide AVX (SIMD) and the newfound focus on multithreading.

AMD's Jaguar has both cluster-based multithreading and AVX. Multithreading can't be counted on as standard in PC code because CPUs as recent as the Core 2 Duo (and most AMD CPUs) lack it, and AVX would really help Wii U floating point performance (as it does for the Core i5/i7 and AMD Jaguar). These are closed platforms though, so specific optimization is viable.

That (floating point performance) and the number of cores available (3 vs 8) really set these systems apart; porting up should be easy, porting down depends. Code expecting floating point performance is going to have a hard time, and so is code meant for more than 3 cores.

Other than that it's good. There's talk of the PS4/X720 reserving two cores for the OS, so that brings it down to 3 vs 6. Of course, the Wii U being x86 would also have helped; recompiling for another architecture is not an issue, but if somewhere down the road third-party support doesn't get off the ground and middleware PPC support gets deprecated, the Wii U will forever be bundled with current gen platforms and technology, despite the PPC part at hand having more in common with the next gen x86 solutions than with the PS3/X360's in-order execution + SMT. Developers doing middleware recompilations on their own is possible of course, but most simply won't bother.

All this to say: not a problem "if" the Wii U manages to keep third-party support (and, as the lowest common denominator, gets to be the lead platform) throughout the generation. Otherwise the competition will probably top it by at least twice the general purpose performance, making late/afterthought ports potentially hard to accommodate. Parity between the other two competitors also doesn't help, I guess.

I just wanted to point out the bolded. Correct me if I'm wrong, but you can't say that Jaguar has 8 multi-threaded cores vs Espresso's 3 single-threaded cores, can you? Jaguar doesn't have "real" multi-threading, it just pairs two cores together. So wouldn't the Jaguar in the PS4 really have 4 CBM cores (8 single-threaded physical cores conjoined into pairs of 2) vs Espresso's 3 single-threaded cores?

I'm in no way implying that the Espresso is suddenly just as capable as the Jaguar 4-core CBM in total output, but not referencing this makes the gap look larger than it is. An 8-core CBM Jaguar would actually have 16 physical cores, right? That's a huge difference.

I guess what I'm saying is that the PS4 has either 4 multithreading cores or 8 single-thread cores, so if you're comparing advantages you can't list both at the same time. Then again, my knowledge of AMD chips is solely GAF based, lol.
 

krizzx

Junior Member
It's capable of issuing three instructions per clock cycle into six independent execution units. But it's not that linear, never is.


So here:

Source: http://raid-faq.narod.ru/doc/750_ts.pdf

That's documentation for a regular PPC750 part, so paired singles might introduce a difference by doing 2x32-bit SIMD and being able to either dispatch or retire more per clock, I dunno.

Yes.

How it behaves with multithreaded code though is anyone's guess, and by that I mean the intricacies of doing so, because the CPU architecture originally wasn't meant for it. But it's probably pretty straightforward, otherwise we would have heard about it by now.

It depends on the software you're using and how it is optimized. You basically have two competing approaches: Havok doesn't overly rely on the GPU, although it does support it (more on that in a bit), while PhysX by nature does rely on the GPU but can run on the CPU as well.

"three instructions per clock cycle into six independent execution units." I do not understand this. 3x6? Does does this mean 3 instructions into each of the 6 units or does this mean that it can execute 6 times per cycle?

Also, is this the documentation of what the Nintendo processor versions can do? I thought it had customizations made to it beyond the standard processor.
 
I just wanted to point out the bolded. Correct me if I'm wrong, but you can't say that Jaguar has 8 multi-threaded cores vs Espresso's 3 single-threaded cores, can you?
Hmm, that wasn't my point, but it can somewhat be claimed and yet somewhat it shouldn't.

It shouldn't be, because it's not multithreading as we know it; I doubt you can hope to get a ~30% boost, nor that there's much headroom for 2-way threading to work here and make the CPU architecture that much more efficient.

But you're supposed to get double the threads for it, so it's either 8 threads or 16 real threads.
Jaguar doesn't have "real" multi-threading, it just pairs two cores together. So wouldn't the Jaguar in the PS4 really have 4 CBM cores (8 single-threaded physical cores conjoined into pairs of 2) vs Espresso's 3 single-threaded cores?
That seems true (I only don't say it is because I don't know what CBM stands for; cluster something? sorry for that), but other than the intricacy of the architecture being essentially dual-core pairs, paired into blocks, it still provides you with 8 cores at all times, albeit conjoined; and Espresso provides 3.

I still reckon CMT is an advantage in an 8-core configuration, but like everything it might not always be advantageous to use; even Intel's Hyper-Threading can actually slow things down if you use the second thread to run full-load tasks on it; the second thread is meant for complementary overhead stuff.

This implementation is a mystery in that regard: is the second thread for complementary use, or is it like splitting the CPU load in half? Because on regular SMT you can at best get a ~30% boost; if this is only dividing throughput, that means you get double the threads but not necessarily more headroom for it.
I'm in no way implying that the Espresso is suddenly just as capable as the Jaguar 4-core CBM in total output, but not referencing this makes the gap look larger than it is. An 8-core CBM Jaguar would actually have 16 physical cores, right? That's a huge difference.
The Jaguar is still an 8 core solution at all times. 8 cpu blocks would mean 16 cpu logic units though, albeit paired.
I guess what I'm saying is that the PS4 has either 4 multithreading cores or 8 single-thread cores, so if you're comparing advantages you can't list both at the same time. Then again, my knowledge of AMD chips is solely GAF based, lol.
No, no. 8 threads at all times, 16 with CMT enabled.

At least if the CMT implementation is the same as on the Opterons; still, it would never decrease cpu threads.
 
"three instructions per clock cycle into six independent execution units." I do not understand this. 3x6? Does does this mean 3 instructions into each of the 6 units or does this mean that it can execute 6 times per cycle?
No, look at the diagram.

The 6 execution units are a mere detail. It fetches up to 4 instructions per clock from cache; if you fetch four, one will be left in the queue for the next cycle, since the processor can only be fed three at once. The instructions then get dispatched (one of them can be a conditional branch, the others speculative; as I understand it, speculative meaning branch prediction, so doing work ahead of time, so that later you might get to complete two branches per clock?). Internally it has 6 execution units to accommodate them; you can either fill them with your 3 instructions, or they might be busy completing some other instruction, since not all instructions complete in one cycle (I'm not too certain of this part). Either way, it seems like in the end the CPU can only write back (output) two completed instructions per clock.
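
If it helps, here's a toy model of just those per-cycle limits (fetch 4 / dispatch 3 / retire 2), ignoring dependencies and execution unit availability, to show why retirement ends up being the ceiling:

Code:
# Toy model of the per-clock limits quoted from the 750 datasheet above:
# fetch up to 4, dispatch up to 3, retire up to 2. Ignores dependencies,
# stalls and which execution units are actually free.
FETCH, DISPATCH, RETIRE = 4, 3, 2

def cycles_to_run(n):
    fetched = dispatched = retired = cycles = 0
    while retired < n:
        cycles += 1
        fetched = min(n, fetched + FETCH)
        dispatched = min(fetched, dispatched + DISPATCH)
        retired = min(dispatched, retired + RETIRE)
    return cycles

print(cycles_to_run(100))  # 50 cycles: the 2-per-clock retire limit dominates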
Also, is this the documentation of what the Nintendo processor versions can do? I thought it had customizations made to it beyond the standard processor.
Yes, that would be the paired singles. I've looked into the PPC 750CL part (which has them) but saw no clear difference in specs.

I don't know if that changes anything on instruction throughput per cycle.
 
Hmm, that wasn't my point, but it can somewhat be claimed and yet somewhat it shouldn't.

It shouldn't be, because it's not multithreading as we know it; I doubt you can hope to get a ~30% boost, nor that there's much headroom for 2-way threading to work here and make the CPU architecture that much more efficient.

But you're supposed to get double the threads for it, so it's either 8 threads or 16 real threads. That seems true (I only don't say it is because I don't know what CBM stands for; cluster something? sorry for that), but other than the intricacy of the architecture being essentially dual-core pairs, paired into blocks, it still provides you with 8 cores at all times, albeit conjoined; and Espresso provides 3.

I still reckon CMT is an advantage in an 8-core configuration, but like everything it might not always be advantageous to use; even Intel's Hyper-Threading can actually slow things down if you use the second thread to run full-load tasks on it; the second thread is meant for complementary overhead stuff.

This implementation is a mystery in that regard: is the second thread for complementary use, or is it like splitting the CPU load in half? Because on regular SMT you can at best get a ~30% boost; if this is only dividing throughput, that means you get double the threads but not necessarily more headroom for it. The Jaguar is still an 8-core solution at all times. 8 CPU blocks would mean 16 CPU logic units though, albeit paired. No, no. 8 threads at all times, 16 with CMT enabled.

At least if the CMT implementation is the same as on the Opterons; still, it would never decrease CPU threads.
Gotcha, I was under the impression that jaguar in PS4 has four CPU clusters with 2 cores each with the hyper threading being per cluster. There are multiple threads per cluster, but only because there are two single threaded cores in that cluster. Essentially just AMDs way of claiming hyper threading without actually having it, so to speak.
CBM was my uneducated abbreviation of "cluster based multi threading"

Essentially, they can say they have cluster based multi threading because each cluster does have two threads... but each core only has one.

That was my impression, anyway.
 
Gotcha, I was under the impression that jaguar in PS4 has four CPU clusters with 2 cores each with the hyper threading being per cluster. There are multiple threads per cluster, but only because there are two single threaded cores in that cluster. Essentially just AMDs way of claiming hyper threading without actually having it, so to speak.
CBM was my uneducated abbreviation of "cluster based multi threading"

Essentially, they can say they have cluster based multi threading because each cluster does have two threads... but each core only has one.

That was my impression, anyway.
Doesn't work that way.

This is very interesting though:

The Opteron 6276 with CMT disabled has:

• 8 modules
• 8 threads
• 4 ALUs per module
• 2 ALUs per thread (the ALUs can not be shared between threads, so disabling CMT disables half the threads, and as a result also half the ALUs)
• 16 ALUs in total

With CMT enabled, this becomes:

• 8 modules
• 16 threads (double)
• 4 ALUs per module
• 2 ALUs per thread
• 32 ALUs in total (double)

So nothing happens, really. Since CMT doesn’t share the ALUs, it works exactly the same as the usual SMP approach. So you would expect the same scaling, since the execution units are dedicated per thread anyway. Enabling CMT just gives you more threads.

(...)

With single-threading, each thread has more ALUs with SMT than with CMT. With multithreading, each thread has less ALUs (effectively) than CMT.

And that’s why SMT works, and CMT doesn’t: AMD’s previous CPUs also had 3 ALUs per thread. But in order to reduce the size of the modules, AMD chose to use only 2 ALUs per thread now. It is a case of cutting off one’s nose to spite their face: CMT is struggling in single-threaded scenario’s, compared to both the previous-generation Opterons and the Xeons.
Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

I highlighted the end because, like last gen, it seems developers might have to optimize for specific features that change the whole code layout in order to compensate for an architectural shortcoming on PS4/X720 (in this case the enabled-CMT setting/more threads). That's an advantage for unoptimized code on Espresso, but then again, having more CPU threads can be useful. It just apparently won't result in much of a performance boost in general terms compared to other design decisions that could have been taken; but as their code struggles (against previous solutions with more ALUs per thread), they pretty much have to go down that way.
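
Just to restate the quoted article's accounting in one place (these are the article's numbers, nothing new):

Code:
# Thread/ALU accounting from the quoted article: 8 modules, 4 ALUs per module,
# 2 ALUs per thread, and ALUs not shared between threads.
modules = 8
alus_per_thread = 2
for cmt_enabled in (False, True):
    threads = modules * (2 if cmt_enabled else 1)
    alus = threads * alus_per_thread
    print(cmt_enabled, threads, alus)  # False: 8 threads, 16 ALUs / True: 16 threads, 32 ALUs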
 
Doesn't work that way.

This is very interesting though:

Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

I highlighted the end because, like last gen, it seems developers might have to optimize for specific features that change the whole code layout in order to compensate for an architectural shortcoming on PS4/X720 (in this case the enabled-CMT setting/more threads). That's an advantage for unoptimized code on Espresso, but then again, having more CPU threads can be useful. It just apparently won't result in much of a performance boost in general terms compared to other design decisions that could have been taken; but as their code struggles (against previous solutions with more ALUs per thread), they pretty much have to go down that way.
Awesome, thanks for the clarification. I had misunderstood. That is interesting.
 

krizzx

Junior Member
Now, I need to ask a very basic question about this.

Besides there being 3 cores, exactly what has been added to or removed from Broadway? What are the differences?
 

MDX

Member
I'm in no way implying that the Espresso is suddenly just as capable as the Jaguar 4-core CBM in total output, but not referencing this makes the gap look larger than it is. An 8-core CBM Jaguar would actually have 16 physical cores, right? That's a huge difference.

Espresso doesn't need to do multi-threading because it has eDRAM.
 
Gotcha, I was under the impression that jaguar in PS4 has four CPU clusters with 2 cores each with the hyper threading being per cluster. There are multiple threads per cluster, but only because there are two single threaded cores in that cluster. Essentially just AMDs way of claiming hyper threading without actually having it, so to speak.
CBM was my uneducated abbreviation of "cluster based multi threading"

Essentially, they can say they have cluster based multi threading because each cluster does have two threads... but each core only has one.

That was my impression, anyway.

No. The PS4 has two 4-core modules for a total of 8 physical cores. It neither supports nor claims any kind of simultaneous multithreading/hyperthreading.

Doesn't work that way.

This is very interesting though:

Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

I highlighted the end because, like last gen, it seems developers might have to optimize for specific features that change the whole code layout in order to compensate for an architectural shortcoming on PS4/X720 (in this case the enabled-CMT setting/more threads). That's an advantage for unoptimized code on Espresso, but then again, having more CPU threads can be useful. It just apparently won't result in much of a performance boost in general terms compared to other design decisions that could have been taken; but as their code struggles (against previous solutions with more ALUs per thread), they pretty much have to go down that way.

Neither the PS4 nor Durango is expected to support SMT/"CMT". Espresso offers no advantage here. In each system a single physical core only supports a single thread. And in the case of the PS4, each Jaguar core is faster than an Espresso core, and there are more of them running at a higher frequency.

Espresso doesn't need to do multi-threading because it has eDRAM.

Multi-threading has nothing to do with the eDRAM.
 
Neither the PS4 nor Durango is expected to support SMT/"CMT". Espresso offers no advantage here. In each system a single physical core only supports a single thread. And in the case of the PS4, each Jaguar core is faster than an Espresso core, and there are more of them running at a higher frequency.
SMT is not CMT. I've committed that error before, thinking they should be similar; for marketing purposes they might be, but in the end they're not.

It's as I said: if it's a Jaguar/Steamroller design, then AFAIK it's supposed to have a 2-ALUs-per-thread design with 2-way CMT. No getting around that.
 

krizzx

Junior Member
No. The PS4 has two 4-core modules for a total of 8 physical cores. It neither supports nor claims any kind of simultaneous multithreading/hyperthreading.



Neither the PS4 nor Durango is expected to support SMT/"CMT". Espresso offers no advantage here. In each system a single physical core only supports a single thread. And in the case of the PS4, each Jaguar core is faster than an Espresso core, and there are more of them running at a higher frequency.



Multi-threading has nothing to do with the eDRAM.

Two 4-core modules? That would mean it has 2 processors. That thing is going to guzzle a lot of gas.

Still, I want some "real world" performance estimate of what Espresso can do. It does 3 instructions per cycle and has 3 cores, so that would mean it can do 9 per cycle by the current estimate. That would make the Wii U's CPU equivalent to a normal 3.6 GHz tri-core processor at minimum.

I feel it would be better to use an Espresso core for physics than Latte. I honestly don't see why so much CPU power is needed these days. What actually happens on screen hasn't changed that much from what you would see going on back in the PS2/Xbox 1/GC days. A prime example would be the Dynasty/Samurai Warriors games.

If you programmed one core for physics, one for A.I., and another for miscellaneous processing then that would free up the GPU for purely processing graphics at its fullest. I don't see how there would be any issues.
 

i-Lo

Member
Sorry if this has been asked before, but what's the average and maximum power draw (in watts) that can be expected of this CPU?
 

krizzx

Junior Member
Sorry if this has been asked before, but what's the average and maximum power draw (in watts) that can be expected of this CPU?

I believe someone said 7 watts earlier in the thread.

The processor is "extremely" powerful for the amount of energy it uses. I believe this is what the Most Wanted developer meant. I'd dare say that, watt for watt, it is the one if not the most efficient processor on earth.

You get far more out of it than you put in. I don't see where people get their complaints from. So what if it isn't isn't calculating floating points on six units at once? Its pretty close in performance and it is doing it at less than 1/5 the power draw.

I'd say Nintendo needs to stick with this design and keep improving it. The next processor they release should be 4 tri-core espresso's on a single die or 2 cores with hyper threading. That combined with a moderate clock boost, some extra cache and a few enhanced features would put it ahead of the processors rumored to be in the Durango and the one announced for the PS4.
 

krizzx

Junior Member
Doesn't work that way.

This is very interesting though:

Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

I highlighted the end because, like last gen, it seems developers might have to optimize for specific features that change the whole code layout in order to compensate for an architectural shortcoming on PS4/X720 (in this case the enabled-CMT setting/more threads). That's an advantage for unoptimized code on Espresso, but then again, having more CPU threads can be useful. It just apparently won't result in much of a performance boost in general terms compared to other design decisions that could have been taken; but as their code struggles (against previous solutions with more ALUs per thread), they pretty much have to go down that way.

I've been wondering: just how many instructions did the PS3 and 360 CPUs do per cycle "baseline" (i.e. not counting other hardware features)?
 

Thraktor

Member
Multi-threading has nothing to do with the eDRAM.

It sort-of does. Many multi-threading implementations are simply there to hide latency, and more cache means less latency to hide.

Of course, that's not to say multi-threading wouldn't help, but we're basically talking about a completely different CPU in that case anyway.
 
I've been wondering: just how many instructions did the PS3 and 360 CPUs do per cycle "baseline" (i.e. not counting other hardware features)?
It's dual issue. I don't know how many instructions it can possibly hope to retire per cycle, but it's either one or two; I'm leaning towards one, since the whole point of 2-way SMT in there is that when one pipeline/issue stalls the other gets full priority, so unless they function concurrently (as in 50% overhead per thread) it's most likely one.

PPC750 issues 3, retires 2.

This is not the be all end all though; if you had a PPC750 clocked at the same speed as a PPC970 (G5) the PPC750 would actually perform better in general purpose; but the PPC970 can issue 8 and retire 5. The time a cycle takes to complete varies, PPC pipeline is longer so cycles take longer to complete and that's why they opted to make it retire more per cycle to compensate for longer cycles per clock.

It has to do with the number of pipeline stages (the fewer there are, the faster a cycle completion will be, assuming the same MHz), branch prediction type/effectiveness and cache misses. It's said cache misses amount to a whopping 5% occurrence on PS3/X360; that means the pipeline clogs and takes a penalty (cycles where it's not available again, until it clears away that pipeline crash; think of it as an accident on a highway). Anyway, everything helps, or rather in the PS3 and X360's case everything is not really helping; it doesn't help that their pipelines are ~30 stages long, for instance: cycles take more time to complete and cache miss "accidents" take more time to resolve too (and the fact they happen 5% of the time is huge).

But all this to say: they're different beasts, so it's hard to compare, but it's easy to see the PPC750 is hugely more efficient per clock.
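
To put the cache-miss point in rough numbers — effective cycles per instruction = base CPI + miss rate × miss penalty — with the 5% figure from above and purely illustrative penalties:

Code:
# Rough effect of cache misses on a pipeline: effective CPI = base + rate * penalty.
# The 5% miss rate is the figure mentioned above; the penalties are illustrative.
def effective_cpi(base_cpi, miss_rate, miss_penalty_cycles):
    return base_cpi + miss_rate * miss_penalty_cycles

print(effective_cpi(0.5, 0.05, 40))  # long pipeline, big penalty  -> 2.5 CPI
print(effective_cpi(0.5, 0.05, 20))  # short pipeline, smaller hit -> 1.5 CPI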
I believe someone said 7 watts earlier in the thread.
Yes, certainly no more than 7W at full load.
 

Clefargle

Member
Does the recent talk of no UE4 on the Wii U mean that it's not capable of running it, or that Epic considers it too costly to scale it to Espresso? How do the DirectX capabilities of the system translate to UE4?
 
More like not worth the man-hours spent on doing it for the projected income, IMHO, since they said companies can later move their UE4 games to the Wii U if they want.
 

Hoo-doo

Banned
Does the recent talk of no UE4 on the Wii U mean that it's not capable of running it, or that Epic considers it too costly to scale it to Espresso? How do the DirectX capabilities of the system translate to UE4?

Not worth the investment by Epic to support it.
I'm sure it could technically support UE4, albeit in a stripped-down version.
I mean, it's supposed to run on phones and tablets eventually.
 
Does the recent talk of no UE4 on the Wii U mean that it's not capable of running it, or that Epic considers it too costly to scale it to Espresso? How do the DirectX capabilities of the system translate to UE4?

I honestly think it's laziness; wasn't it a stated goal of UE4 that it would scale from smartphones to high-end computers?

Personally, I think it's stupid on Epic's part. Fine. Don't give it flagship UE4 games, but get your engine out on as many devices as possible.
 

Hoo-doo

Banned
I honestly think it's laziness; wasn't it a stated goal of UE4 that it would scale from smartphones to high-end computers?

Personally, I think it's stupid on Epic's part. Fine. Don't give it flagship UE4 games, but get your engine out on as many devices as possible.

Laziness is the absolute worst fucking argument/defense/insult in these kind of threads.
It's good business, simple as that. If the Wii-U ends up selling gangbusters you'll see the support.

These guys are not leaning back in their office chairs with their feet up saying "Well, i'm supposed to port the UE4 engine to Wii-U, but i'd rather take a nap."
 

tkscz

Member
Laziness is the absolute worst fucking argument/defense/insult in these kind of threads.
It's good business, simple as that. If the Wii-U ends up selling gangbusters you'll see the support.

These guys are not leaning back in their office chairs with their feet up saying "Well, i'm supposed to port the UE4 engine to Wii-U, but i'd rather take a nap."

This would be hilarious if that were true.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
It's dual issue. I don't know how many instructions it can possibly hope to retire per cycle, but it's either one or two; I'm leaning towards one, since the whole point of 2-way SMT in there is that when one pipeline/issue stalls the other gets full priority, so unless they function concurrently (as in 50% overhead per thread) it's most likely one.

PPC750 issues 3, retires 2.

This is not the be all end all though; if you had a PPC750 clocked at the same speed as a PPC970 (G5) the PPC750 would actually perform better in general purpose; but the PPC970 can issue 8 and retire 5. The time a cycle takes to complete varies, PPC pipeline is longer so cycles take longer to complete and that's why they opted to make it retire more per cycle to compensate for longer cycles per clock.

It has to do with the number of pipeline stages (the fewer there are, the faster a cycle completion will be, assuming the same MHz), branch prediction type/effectiveness and cache misses. It's said cache misses amount to a whopping 5% occurrence on PS3/X360; that means the pipeline clogs and takes a penalty (cycles where it's not available again, until it clears away that pipeline crash; think of it as an accident on a highway). Anyway, everything helps, or rather in the PS3 and X360's case everything is not really helping; it doesn't help that their pipelines are ~30 stages long, for instance: cycles take more time to complete and cache miss "accidents" take more time to resolve too (and the fact they happen 5% of the time is huge).

But all this to say: they're different beasts, so it's hard to compare, but it's easy to see the PPC750 is hugely more efficient per clock. Yes, certainly no more than 7W at full load.
It seems to me your terminology is going astray here. A cycle is a CPU clock cycle, i.e. cycles = clocks. The relationship between clock and pipeline length is normally this: the higher the clock, the less work a pipeline stage can afford to do, ergo the greater the number of pipeline stages. The lower the clock, the more you can afford to do per stage, ergo the fewer stages. The big thing about the G3 in the Wii U is that it is among the shortest-pipeline architectures out there that reach this clock (1.25GHz). A short pipeline is generally better than a long one, at the same clock, for general purpose code, i.e. Espresso should perform better at GP than most other similarly-clocked CPUs out there with a similar super-scalar architecture. I don't know where you get the impression a 750 would outperform a 970 at GP, though - the latter is much more 'super'-scalar than the former, to put it metaphorically.
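
To illustrate the trade-off with some purely illustrative stage counts: a mispredicted branch costs roughly a pipeline refill, so the wall-clock cost is about stages divided by clock.

Code:
# Rough mispredict cost in wall-clock time ~ pipeline stages / clock.
# Stage counts and clocks here are illustrative, not exact figures.
def mispredict_cost_ns(stages, clock_ghz):
    return stages / clock_ghz

print(mispredict_cost_ns(4, 1.25))  # short G3-style pipeline @ 1.25 GHz -> 3.2 ns
print(mispredict_cost_ns(24, 3.2))  # long in-order pipeline @ 3.2 GHz   -> 7.5 ns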
 

Mithos

Member
Laziness is the absolute worst fucking argument/defense/insult in these kind of threads.
It's good business, simple as that. If the Wii-U ends up selling gangbusters you'll see the support.

If the Wii U starts selling gangbusters we might see Epic improving upon the UE3 engine, and by that I mean optimizing it above and beyond the "it's working" state it's currently at.
 
It seems to me your terminology is going astray here. A cycle is a CPU clock cycle, i.e. cycles = clocks.
By "the clock" I meant cycles per second.

You're right though, I must have been high when I wrote that. Thanks for pointing it out.
I don't know where you get the impression a 750 would outperform a 970 at GP, though - the latter is much more 'super'-scalar than the former, to put it metaphorically.
Benchmarks. At the same frequency the PPC970 loses slightly against both G3 and G4 CPUs in GP. Of course, the G5/970 was much more scalable; that was the whole point of it.

It's one of the reasons Apple never used them in laptops even when 1.5 GHz 30W parts were available; the G4 @ 1.67 GHz more than matched it.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Benchmarks. At the same frequency the PPC970 loses slightly against both G3 and G4 CPUs in GP. Of course, the G5/970 was much more scalable; that was the whole point of it.

It's one of the reasons Apple never used them in laptops even when 1.5 GHz 30W parts were available; the G4 @ 1.67 GHz more than matched it.
The 970 was more performant per-cycle than any of the previous PPC designs. It did not support L3, though, which some of the G4s did (not all G4s had L3 in Apple's machines). That may have affected a benchmark or two.

Here's a very good G5 manual by Apple themselves: https://developer.apple.com/hardwaredrivers/ve/g5.html

Essentially, outside of the max dispatch gradation (i.e. 2+1 for G3, 3+1 for G4, 4+1 for G5), the out-of-order facilities of the G5 are in an entirely different ballpark compared to its predecessors. And the G4's AltiVec block as found in the 7400s used by Apple was actually in-order (only one 7400 version was an exception to that, and IIRC Apple never used it in their designs).
 
I honestly think it's laziness; wasn't it a stated goal of UE4 that it would scale from smartphones to high-end computers?

Personally, I think it's stupid on Epic's part. Fine. Don't give it flagship UE4 games, but get your engine out on as many devices as possible.

I think it's more that UE4 is optimized for x86 architecture and isn't optimized for PPC. So why would someone looking to make a Wii U game spend the money to license UE4 when, say, Retro Studios on behalf of Nintendo is working on a variety of Wii U engines for developers to use, and until those are available, UE3.9 works just fine?
 

Rolf NB

Member
If the G5 gets beaten by any of its predecessors in anything at the same clock rate, it's either due to higher branch mispredict latency, or because you're exploiting a fancy SIMD extension that your particular tested G5 doesn't support. It's never because of "longer clock cycles". That isn't even possible under the presented scenario (=same clock rate).
 