
Wii U CPU |Espresso| Die Photo - Courtesy of Chipworks

Durante

Member
So I have read every single page of this thread over the last 2 months. Please correct me if I'm wrong, but is this a simplistic condensation of the consensus this thread has reached: "Wii U is marvelously efficient but builds off of withered tech that would allow smexy visuals if a dev were willing to code for it explicitly"?
I would call that a euphemism.

If the primary goal was efficiency (in terms of performance/Watt) then it would be possible to build a machine that achieves higher performance at Wii U's TDP, e.g. using modern mobile hardware -- particularly in terms of CPU. The need for hardware compatibility with what is essentially a 15-year-old platform also prevents it from achieving "marvelous" efficiency.

In fact, it remains to be seen if it is any more efficient than the upcoming console platforms. Depending on how much power those will consume on average, I can see them easily being more efficient (while also obviously being much more powerful).


Edit: Regarding the current discussion, I really don't see what makes forums any less reliable than, say, a personal blog. Anyone can have a blog.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Forums are great for many things. However, relying on them for accurate information is a terrible idea.
There are two kinds of information when it comes to reliability: information that is verifiable and information that is not. The first kind can come from anonymous sources, etc. - it does not matter, as long as it's actually verifiable. The second kind is a matter of the source's reputation - nobody here can verify it, but since such-and-such said so, it must be so.

While forums (or any other contemporary information exchange hub) can be a hit-and-miss source for information of the second kind, forums are as good as any source for information of the first kind.

Nobody goes to a forum for medical diagnoses or investment advice, or at least I hope not.
Actually, tons of people go to forums every day for medical advice. I'd assume fewer people seek investment advice, but in general fewer people care about investments than about health issues.

The information provided in some of the threads on this forum has been nothing but pure, verifiable factual data, collected with a sufficient amount of effort by qualified people. Whether everybody reading that info has been able to interpret it as such is another matter altogether.

The Lottes thing really bugged me because the egotistical douchebaggery of some anonymous forum types, including a couple at B3D, shot down an interesting topic. Lottes deleted the blog and stopped talking.
And look at the reaction to some random tweets from a completely unknown EA software engineer who was only answering some questions on Twitter to see what I mean. If people were less weird on forums, he would probably have answered more questions on his Twitter.
The random EA software engineer did not disclose any information. He expressed his personal opinion (which he's surely entitled to) in a non-professional manner (which he's surely not supposed to do). He realized his mistake, though it was too late for all practical purposes.

None of the serious information posted on GAF so far has been retracted. It's sitting here, verifiable. If you had the ability and time you could decap Latte and Espresso, see what those look like on the inside and make some educated guesses like those made by fellow gaffers. Likewise, you can get the test code I've posted, run it on the respective machines and get the results gaffers have posted.

Now, if you're looking for purchase advice, the above might not be much good to you - 'information' of the sort the EA guy just retracted could be of more use. If you decided to take it at face value, that is.
 
Xbox was PPC, not x86 like PC btw. It also required much work from developers to understand. That's why the best performing games came at the end.

Sorry to be a little nitpicky, but Xbox 1 was Intel x86 based. Unless you're referring to the 360 in which case, carry on.

Code:
CPU 	733 MHz x86 Intel Celeron/Pentium III Custom Hybrid
 

QaaQer

Member
The problem is that he made some throwaway comments without any context whatsoever. We don't know if he ever actually worked on the platform or even just read the documentation. We don't know why he thinks it's "crap", we don't know how exactly it's supposed to be "weaker than 360", and his "3rd parties don't make money on Nintendo platforms, only Mario and Zelda sell" statement is factually wrong and reeks of bias.

Twitter and context? Yeah, not gonna happen.

Is it better that we never hear anything that doesn't go through a PR agency first? And maybe if there hadn't been the shitstorm, someone could have asked for clarification or more info.

As far as what sells on Ninty consoles: http://en.wikipedia.org/wiki/List_of_best-selling_Wii_video_games - I think you need to get to #15 for a non-Nintendo property. The 360 looks to have more 3rd party opportunities: http://en.wikipedia.org/wiki/List_of_best-selling_Xbox_360_video_games

But like I said, people like finding reasons to get indignant and angry. Sucks to be the lightning rod for that kind of thing.
 
From what I understand, it is a lot different from the x86 setup of the other guys. Even Nintendo is having problems.
Not really.

It's different, but at least it's the same nature this time around: general purpose driven.

Last gen, the PPC in the X360/PS3 wasn't really tuned for general purpose work, which is what you typically want a CPU for the most: to perform "general" stuff you can't perform anywhere else.

Alas, PPC to x86 can be pretty abstract these days and mostly down to compiling; the thing you have to watch for is the SIMD capability of Espresso, which isn't that great.
Xbox was PPC, not x86 like PC btw. It also required much work from developers to understand. That's why the best performing games came at the end.
That really wasn't the issue, trust me.

The issue was that these PPC CPUs were in-order (CPUs haven't been like that for a while), focused on 2-way execution to negate that disadvantage, and surpassed 32 pipeline stages without cache miss prediction. They were crap compared to what you had on PC and sacrificed everything so they could have a better FPU, and in the end you had to code differently for them to get the best performance out; otherwise the performance achievable with regular code wouldn't be above a regular 1 GHz PowerPC in most cases. Here are some DMIPS figures:

Xbox 1 XCPU: 951.64 DMIPS @ 733 MHz
Pentium III: 1124.311 DMIPS @ 866 MHz
GC Gekko: 1125 DMIPS @ 486 MHz
Wii Broadway: 1687.5 DMIPS @ 729 MHz

Pentium 4A: 1694.717 DMIPS @ 2 GHz
PS3 Cell PPE: 1879.630 DMIPS @ 3.2 GHz (sans SPEs; SPEs are not meant for Dhrystone/general purpose code)
X360 Xenon: 1879.630 DMIPS * 3 = 5638.90 DMIPS @ 3.2 GHz (each 3.2 GHz core performing the same as the PS3's)
PowerPC G4: 2202.600 DMIPS @ 1.25 GHz
AMD Bobcat: 2662.5 DMIPS * 2 = 5325 DMIPS @ 1 GHz
Wii U Espresso: 2877.32 DMIPS * 3 = 8631.94 DMIPS @ 1.24 GHz (again, final performance taking into account 3 fully accessible cores)
Pentium 4 3.2 GHz: 3258.068 DMIPS
8-core Bobcat: 4260 DMIPS * 8 = 34080 DMIPS @ 1.6 GHz (said CPU doesn't exist, but best case scenario Jaguar is supposed to perform only 20% better; that would be 5112 DMIPS per core, or 40896 DMIPS for 8 cores, so it's probably somewhere in between; again, taking into account 6 fully accessible cores, as rumours suggest 2 cores are reserved for the OS)
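Since the Espresso figure above is extrapolated rather than measured (as clarified later in the thread), here's a minimal back-of-the-envelope sketch of the arithmetic, assuming Dhrystone scales linearly with clock from the Wii Broadway score; none of this is measured on real Wii U hardware:

Code:
#include <stdio.h>

/* Hypothetical check of the Espresso entry above: scale Broadway's
   Dhrystone score linearly by clock, then multiply by the three
   accessible cores. Assumes identical per-clock behaviour. */
int main(void) {
    const double broadway_dmips = 1687.5;  /* Wii Broadway @ 729 MHz */
    const double broadway_mhz   = 729.0;
    const double espresso_mhz   = 1243.0;  /* reported Wii U clock   */
    const int    cores          = 3;

    double per_core = broadway_dmips * (espresso_mhz / broadway_mhz);
    printf("Espresso, per core: %.2f DMIPS\n", per_core);         /* 2877.31 */
    printf("Espresso, 3 cores:  %.2f DMIPS\n", per_core * cores); /* 8631.94 */
    return 0;
}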

As you can see, PS3 and X360 take a beating per core from a 1.25 GHz PPC; they were also ridiculously bad at running 64-bit code, which negates the disadvantage of Espresso being a 32-bit design/implementation.

Then you have blu's benchmarks, which I believe take floating point performance into account, seeing as he optimized code for paired singles; so it's not just Dhrystone, but also Whetstone (or a Whetstone-like bench, I really haven't followed that much :x).

That need to do things in a way they usually aren't done is mostly not there for both Wii U and PS4/X720 ("mostly" because if you're coding specifically for something you can always put it closer to the metal than you otherwise would, and every CPU has its quirks, but the nature behind it is the same). Of course, Nintendo going x86 would have been an advantage, but as-is it doesn't weigh much; it's equally hard to pull X360/PS3 CPU code onto the Wii U as onto the next-gen platforms (where the best case scenario is the 8 Jaguar cores being able to match the floating point performance of Xenon/CELL). So code should be more straightforward this time around.

Even porting from Xenon to another PPC, like the 750/Gekko/Broadway/Espresso (G3), 7400 (G4) or 970 (G5), isn't a walk in the park; shared architecture is beside the point when they are so different.
 

QaaQer

Member
Not really.

It's different, but at least it's the same nature this time around: general purpose driven.

Last gen, the PPC in the X360/PS3 wasn't really tuned for general purpose work, which is what you typically want a CPU for the most: to perform "general" stuff you can't perform anywhere else.

Alas, PPC to x86 can be pretty abstract these days and mostly down to compiling; the thing you have to watch for is the SIMD capability of Espresso, which isn't that great.
That really wasn't the issue, trust me.

The issue was that these PPC CPUs were in-order and surpassed 32 pipeline stages without cache miss prediction. They were crap compared to what you had on PC, they sacrificed everything so they could have a better FPU, and in the end you had to code differently for them.

That's mostly not there for both Wii U and PS4/X720 ("mostly" because if you're coding specifically for something you can always put it closer to the metal than you would, and every CPU has its quirks, but the nature behind it is the same).

Even porting from Xenon to another PPC, like the 750 (G3), 7400 (G4) or 970 (G5), wasn't a walk in the park.

woosh. thanks though.
 

QaaQer

Member
Edit: Regarding the current discussion, I really don't see what makes forums any less reliable than, say, a personal blog. Anyone can have a blog.

Accountability, if the blog has the person's name attached to it. I can find out who Hector Martin or Timothy Lottes is, and weigh their words accordingly.

Certainly there are very intelligent people on forums who are honest and know their stuff, but plebs like me have no way of knowing who they are.
 

wsippel

Banned
As far as what sells on Ninty consoles: http://en.wikipedia.org/wiki/List_of_best-selling_Wii_video_games - I think you need to get to #15 for a non-Nintendo property. The 360 looks to have more 3rd party opportunities: http://en.wikipedia.org/wiki/List_of_best-selling_Xbox_360_video_games
That has less to do with 3rd parties not selling well and more with Nintendo titles selling exceptionally well. And according to those lists, outside of Call of Duty, no 3rd party game on the 360 sold more than Just Dance 2 did on the Wii.
 

QaaQer

Member
That has less to do with 3rd parties not selling well and more with Nintendo titles selling exceptionally well. And according to those lists, outside of Call of Duty, no 3rd party game on the 360 sold more than Just Dance 2 did on the Wii.

Not gonna argue, except to say the EA guy's point was that Nintendo software dominates software sales on Nintendo platforms.
 

krizzx

Junior Member
Which is not what he said; he says only Mario and Zelda make money, which is a flat-out lie. I'm sure Ubisoft and Activision are currently pissing themselves laughing at this guy.

Agreed. Last I checked, the Wii had 140 million-sellers and most of them had nothing to do with Nintendo.

Sonic games sold better on the Wii than on the 360/PS3. The best-selling game on a single platform in Grasshopper Manufacture's history was on the Wii.

The claims that only Mario and Zelda sell are outright lies. Only 2 of the 7 highest selling games on the console have Mario in them.

Capcom had 3 million-sellers on the Wii
Konami had 4 million-sellers on the Wii
Warner Bros. had 4 million-sellers on the Wii
LucasArts had 6 million-sellers on the Wii
Disney had 10 million-sellers on the Wii
SEGA had 10 million-sellers on the Wii
Ubisoft had 14 million-sellers on the Wii
Activision had 15 million-sellers on the Wii
Electronic Arts had 15 million-sellers on the Wii... which is ironic, and far more than they should have

His comment was a lie even where EA was concerned.

Also, the console maker's software is "supposed to be" the best selling on the hardware. Nintendo's sales should be the highest, because they release the highest quality products. They had 35 million-sellers on the Wii. Also, the top 15 best-selling games in history were all published by Nintendo. That alone would make me want to have my product on their system, preferably published by them.

EA is going to lose more out of this than Nintendo will.
Not really.

It's different, but at least it's the same nature this time around: general purpose driven.

Last gen, the PPC in the X360/PS3 wasn't really tuned for general purpose work, which is what you typically want a CPU for the most: to perform "general" stuff you can't perform anywhere else.

Alas, PPC to x86 can be pretty abstract these days and mostly down to compiling; the thing you have to watch for is the SIMD capability of Espresso, which isn't that great.
That really wasn't the issue, trust me.

The issue was that these PPC CPUs were in-order (CPUs haven't been like that for a while), focused on 2-way execution to negate that disadvantage, and surpassed 32 pipeline stages without cache miss prediction. They were crap compared to what you had on PC and sacrificed everything so they could have a better FPU, and in the end you had to code differently for them to get the best performance out; otherwise the performance achievable with regular code wouldn't be above a regular 1 GHz PowerPC in most cases. Here are some DMIPS figures:

Xbox 1 XCPU: 951.64 DMIPS @ 733 MHz
Pentium III: 1124.311 DMIPS @ 866 MHz
GC Gekko: 1125 DMIPS @ 486 MHz
Wii Broadway: 1687.5 DMIPS @ 729 MHz

Pentium 4A: 1694.717 DMIPS @ 2 GHz
PS3 Cell PPE: 1879.630 DMIPS @ 3.2 GHz (sans SPEs; SPEs are not meant for Dhrystone/general purpose code)
X360 Xenon: 1879.630 DMIPS * 3 = 5638.90 DMIPS @ 3.2 GHz (each 3.2 GHz core performing the same as the PS3's)
PowerPC G4: 2202.600 DMIPS @ 1.25 GHz
AMD Bobcat: 2662.5 DMIPS * 2 = 5325 DMIPS @ 1 GHz
Wii U Espresso: 2877.32 DMIPS * 3 = 8631.94 DMIPS @ 1.24 GHz (again, final performance taking into account 3 fully accessible cores)
Pentium 4 3.2 GHz: 3258.068 DMIPS
8-core Bobcat: 4260 DMIPS * 8 = 34080 DMIPS @ 1.6 GHz (said CPU doesn't exist, but best case scenario Jaguar is supposed to perform only 20% better; that would be 5112 DMIPS per core, or 40896 DMIPS for 8 cores, so it's probably somewhere in between)

As you can see, PS3 and X360 take a beating per core from a 1.25 GHz PPC; they were also, for 64-bit CPUs, ridiculously bad at running 64-bit code.

Then you have blu's benchmarks, which I believe take floating point performance into account, seeing as he optimized code for paired singles; so it's not just Dhrystone, but also Whetstone (or a Whetstone-like bench, I really haven't followed that much :x).

That need to do things in a way they usually aren't done is mostly not there for both Wii U and PS4/X720 ("mostly" because if you're coding specifically for something you can always put it closer to the metal than you otherwise would, and every CPU has its quirks, but the nature behind it is the same). Of course, Nintendo going x86 would have been an advantage, but as-is it doesn't weigh much; it's equally hard to pull X360/PS3 CPU code onto the Wii U as onto the next-gen platforms (where the best case scenario is the 8 Jaguar cores being able to match the floating point performance of Xenon/CELL). So code should be more straightforward this time around.

Even porting from Xenon to another PPC, like the 750/Gekko/Broadway/Espresso (G3), 7400 (G4) or 970 (G5), isn't a walk in the park; shared architecture is beside the point when they are so different.

Whoa, I didn't know Espresso had that much higher performance. Did you also take into account the finding from the last few pages that Espresso has a register increase?
http://www.neogaf.com/forum/showpost.php?p=57397996&postcount=565
From 38 to 48 physical registers
From 38 to 64 integer registers
How much would this affect performance, were this the case?
 
Whoa, I didn't know Espresso had that much higher performance. Did you also take into account the finding from the last few pages that Espresso has a register increase?
http://www.neogaf.com/forum/showpost.php?p=57397996&postcount=565
From 38 to 48 physical registers
From 38 to 64 integer registers
How much would this affect performance, were this the case?
Yeah, I think some charts are in order as a means to put things into perspective, because you can have the numbers but not visualize the difference; that's especially true for blu's benchmarks, which I looked into yesterday.

As for the added logic, there's no way to know; if that's the case we'll have to wait until someone can run some tests. I also don't know if wsippel is talking about Gekko/Broadway and Espresso, or specifically Espresso.

This is because we don't have a lot of technical documentation on them, nor "proper" professional core shots; perhaps such changes in logic are the paired singles implementation and are present in past iterations of the chip, I really dunno.
 

efyu_lemonardo

May I have a cookie?
man, the Pentium 4 was a crap architecture...

was it really just meant to allow Intel to push out chips with higher clocks in order to fool the layman consumer into thinking they performed better?

I mean I understand the theoretical benefits of having such a design but surely it was clear enough early on that unless the CPU was handling extremely predictable computations there would be no benefit to having such a long pipeline?

Did Intel not believe graphics cards would support many non-gaming SIMD applications such as video playback, etc.?
 

Rolf NB

Member
From what I understand, it is a lot different from the x86 setup of the other guys. Even Nintendo is having problems.

Xbox was PPC, not x86 like PC btw. It also required much work from developers to understand. That's why the best performing games came at the end.
OG Xbox's CPU was a Celeron 600A (on-die cache) with a sped-up FSB. Or call it a Pentium III. Same core architecture anyway.
http://en.wikipedia.org/wiki/Xbox_(console)#Technical_specifications

Xbox 360 switched to PPC, yes.

edit: whoah, so beaten, nevermind.
 
Whoa, I didn't know Espresso had that much higher performance. Did you also take into account the finding from the last few pages that Espresso has a register increase?
http://www.neogaf.com/forum/showpost.php?p=57397996&postcount=565
From 38 to 48 physical registers
From 38 to 64 integer registers
How much would this affect performance, were this the case?
No; those numbers for Espresso, if you look closely at them, are obtained by factoring in only the number of cores and the clock increase, because those tests haven't been run on the Wii U CPU.

Even if those changes could affect the performance of a Dhrystone test, since this is speculation based only on the number of cores and the clock increase, they are not considered here.

Both the increase in registers and the huge increase in L2 cache are there to make the CPU more efficient on a per-clock basis, not to raise its peak performance. So in real games, with code big enough to fill the bigger caches, and thanks to the increase in registers (registers are the closest bits of memory in a CPU; increasing their number can have a noticeable impact on real-world performance), the difference against their Wii ancestors will surely be much bigger.

Of course, that's not to speak for the Xbox 360 and PS3 CPUs. On integer tests, the PS3 CPU was nearly as weak as the Wii's even in theoretical peaks, not counting the fact that the Wii CPU was much more efficient to begin with.
 
Orly? People run zlib on SPEs, you know?
That's data compression and decompression. It's an application right up its alley, just as encoding and decoding video is, and the like; stuff that is faster to do on floating point than on general purpose hardware.

SPEs can be used in creative ways by injecting and controlling code via the PPE, so they actually had to help negate the advantage in DMIPS that Xenon had; but in the end that's not what we're measuring here.
 
No; those numbers for Espresso, if you look closely at them, are obtained by factoring in only the number of cores and the clock increase, because those tests haven't been run on the Wii U CPU.
Yes, precisely.

Only going by the Gekko's efficiency per clock and scaling up.

With the increased cache, and it being eDRAM, performance per MHz could have gone as high as 2.71 DMIPS/MHz (up from 2.32 DMIPS/MHz on Gekko/Broadway); that's more noticeable the higher the clock is (it could mean 3368.5 DMIPS instead of the projected/listed 2877.3 DMIPS @ 1243 MHz).

Said improvement was seen on the PPC 476FP with pretty superficial changes to the architecture, most of them being upgrades to the FP unit (obviously) and more L2 cache (up to 1 MB of it, per core); it also has SMP, so it's fair to say they might have lifted some implementation decisions from it and retrofitted them into the 750CL. Incidentally, performance on that line had previously been in line with the PPC 750's, at ~2.3 DMIPS/MHz.

But of course, I didn't flag such possible increments: even if they're somewhat likely to some extent from the increased cache alone, it didn't seem right to speculate; whereas if one had 3 Broadways @ 1.243 GHz, this would be the performance to expect.
 

Rolf NB

Member
That's for data compression and decrompression. It's an application right down it's alley; just as encode and decoding video is and the like; stuff that is faster to do on floating point than on general purpose.

SPE's can be used in creative ways via injecting and controling code via the PPE so they actually had to help negate the advantage in dmips Xenon had; but in the end that's not what we're measuring here.
zlib is a pure integer task. Other pure integer tasks that run with great efficiency on SPEs include text processing, XML processing, whole IP stacks including DNS, and indeed the Dhrystone benchmark.

SPEs don't need "injecting" help. They can load code themselves. They can compute the 501st prime number, multiply by pi (but only on a Sunday, otherwise divide), round down to the nearest multiple of 4, use the result as an address and jump to it. Their ISAs are aggressively Turing complete. They can do anything and everything a general purpose CPU can do, because they are general purpose CPUs.

The only separation point between an SPE and any other general purpose CPU is how main memory access works. But just to be clear, an SPE can access main memory, all of it, all by itself.
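For concreteness, here's roughly what that main memory access looks like from the SPE side; a minimal sketch assuming the standard Cell SDK MFC intrinsics from spu_mfcio.h (the buffer size, tag choice and function name are illustrative, not from any shipped codebase):

Code:
#include <spu_mfcio.h>

/* SPE-side fetch: DMA a chunk of main memory into the 256KB local
   store, then block until the transfer completes. The effective
   address would be handed over by the PPE when the SPE is spawned. */
static char buffer[16384] __attribute__((aligned(128)));

void fetch_chunk(unsigned long long ea) {        /* ea = main-memory address */
    const unsigned int tag = 1;                  /* DMA tag group (0-31)     */
    mfc_get(buffer, ea, sizeof(buffer), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);                /* select our tag           */
    mfc_read_tag_status_all();                   /* wait for completion      */
    /* buffer[] is now ordinary local data; from here on the SPE runs
       plain general purpose code against it. */
}

Once the data sits in local store the SPE really is just another CPU, which is Rolf NB's point; blu's counterpoint below is that every last byte has to be staged this way, 16KB per transfer, into 256KB of local store.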
 
man, pentium 4 was a crap architecture...

was it really just meant to allow intel to push out chips with higher clocks in order to fool the layman consumer into thinking they performed better?
You have to take into account when it happened; there was this GHz gold rush in 1997-99 where everyone was trying to get there, but having to deal with short-pipeline architectures made it difficult. I remember Intel had to recall all the original 1.13 GHz Pentium III CPUs because they were unstable, and on top of it all the processor's microcode had been tampered with; benchmark-wise it just behaved like a 1 GHz Pentium III.

This CPU fiasco happened because AMD was able to reach 1.2 GHz on their end; Intel only managed to after a core shrink and partial redesign (Tualatin).

Scaling was tight and Intel wanted out. There was the customer perception part of the situation, yes, but they also thought they could eventually scale it up as high as 10 GHz; and at that speed they could negate whatever per-clock deficit.

Right at the same time, IBM was developing GuTS, short for GigaHertz Unit Test Site (later called Rivina), which was a lengthened-pipeline, in-order design:

IBM, like any large technology company does research. In the following year (1997), long before GHz or 64 bit CPUs arrived on the desktop IBM developed an experimental 64 bit PowerPC which ran at 1GHz. Its snappy title was guTS (GigaHertz unit Test Site) [guTS].

The guTS and a later successor were designed to test circuit design techniques for high frequency, not low power. However since it was only for research the architecture of the CPU was very simple, unlike other modern processors it is in-order and can only issue a single instruction at a time. The first version only implemented part of the PowerPC instruction set, a later version in 2000 implemented it all.

It turns out that the power consumption problem has become so bad that if you want high clock frequency you now have to simplify, there is simply no choice in the matter. If you don’t simplify the CPU will consume so much power it will become very difficult to cool and thus the clock speed will be limited. Both IBM and Intel have discovered this rather publicly, try buying 3GHz G5 or a 4GHz P4.

When a low power, high clocked general purpose core was required for the Cell, this simple experimental CPU designed without power constraints in mind turned out to be perfect. The architecture has since been considerably modified, the now dual issue, dual-threaded PPE is a descendant of the guTS.

The XBox360’s “Xenon” [Xbox360] processor cores also appear to be derived from the guTS processor although they are not quite the same as the PPE. In the Cell the PPE uses the PowerPC instruction set and acts as a controller for the more specialised SPEs. The Xenon cores uses a modified version of the PowerPC instruction set with additional instructions and a beefed up 128 register VMX unit.
Source: http://www.blachford.info/computer/Cell/Cell4_v2.html

There you go: from failed experiment to mass-produced component. The Pentium 4 and this console generation's CPUs are no different, bar the fact that the latter kept the simple mindset of the original, lacking stuff like the cache miss prediction that saved the Pentium 4's ass all the time.

But my point was, this was a test phase (the Pentium 4 and this IBM CPU) in order to get us somewhere else; Intel managed to implement hyper-threading on them first too because they had the ceiling overhead to do so. It was an important test phase in its own right, even if it was based on trial and error and often wrong decisions.
I mean I understand the theoretical benefits of having such a design but surely it was clear enough early on that unless the CPU was handling extremely predictable computations there would be no benefit to having such a long pipeline?

Did Intel not believe graphics cards would support many non-gaming SIMD applications such as video playback, etc.?
I don't think that was it, as the Pentium 4 has a pretty weak FPU. They thought they could scale it; that would be the advantage.

And well, it did scale up better, but it performed significantly worse per clock.
zlib is a pure integer task. Other pure integer tasks that run with great efficiency on SPEs include text processing, XML processing, whole IP stacks including DNS, and indeed the Dhrystone benchmark.
I believe it can still be treated as microcode of sorts, and the SPE is good for that; it's like a DSP, a unit for running code in cascade that happens to be programmable and can be manipulated by the CPU.

My point is that SPEs are not general purpose CPUs, and they're not meant to, say, run an operating system; they're meant for running complementary tasks. They're not the brains there, and general purpose performance on them is probably not even measurable.
SPEs don't need "injecting" help. They can load code themselves. They can compute the 501st prime number, multiply by pi (but only on a Sunday, otherwise divide), round down to the nearest multiple of 4, use the result as an address and jump to it. Their ISAs are aggressively Turing complete. They can do anything and everything a general purpose CPU can do, because they are general purpose CPUs.
I don't buy it. Sorry to be stubborn, but they're not general purpose, or at least not very efficient at it. I've read plenty about them over the years, tech papers included, and that's not the idea I have of their capabilities; they were made to do things that, if you had more cores/performance on the PPE, you'd never attempt. But the principle is still solid: it's an architecture meant for SIMD performance that sacrificed quite a bit in return; SPEs are not PPEs, nor are they good at general purpose (for starters, the cache is pitiful).
The only separation point between an SPE and any other general purpose CPU is how main memory access works. But just to be clear, an SPE can access main memory, all of it, all by itself.
It can, but not directly; but that's kind of beside the point, which is that you'd be hard pressed to run DMIPS on it, and even if you could pull something off, performance would be utter crap.

There's a reason why most code struggles on a PPE + 7 SPEs as opposed to 3 PPEs (Xenon): even if you're really clever there, if you're not coding to the system's strengths and sidestepping the "don'ts", you're better off with the 3 PPEs, their reduced general purpose performance notwithstanding.

Otherwise it would be simple: CELL would blow Xenon out of the water in GP, and it doesn't; not by a blind mile.

But it's better at some tasks that Xenon ends up having to do nonetheless (zlib is probably one of them; sound processing and physics too).


EDIT: I'll go out for a bit, so I'll probably take a while to answer.
 

lord pie

Member
lostinblue said:

I'm curious how much SPU programming you have done? They are exactly as Rolf NB says.

Also, a friend has pointed out that your Dhrystone MIPS values are dramatically different from those on Wikipedia. Where did you get the Wii U CPU results? I wasn't aware that homebrew was widely available for it.
 
I'm curious how much SPU programming you have done? They are exactly as Rolf NB says.

Also, a friend has pointed out that your Dhrystone MIPS values are dramatically different from those on Wikipedia. Where did you get the Wii U CPU results? I wasn't aware that homebrew was widely available for it.
The Wii U figure is extrapolated from the Wii's scores.
 

wsippel

Banned
I'm curious how much SPU programming you have done? They are exactly as Rolf NB says.

Also, a friend has pointed out that your Dhrystone MIPS values are dramatically different from those on Wikipedia. Where did you get the Wii U CPU results? I wasn't aware that homebrew was widely available for it.
I don't do programming; I read tech papers and articles. That said, I usually understand the nature, advantages, disadvantages and the like of devices well enough to speak about them.

But no, I'm not sugar-coating myself. I still don't feel like I'm walking on glass on this one; it's simply down to conjecture. I know what they are, he knows it as well; I'm just insisting I don't think they're much good for GP, in fact I take it as a given. They can pull zlib, though; that's fine, it doesn't go against my reasoning.

Regarding the difference in numbers you're pointing out, I'm well aware of it; bear in mind these are Dhrystone v2.1 MIPS, hence DMIPS. You run the benchmark and that's what comes out. It's indicative of Dhrystone performance and it falls in line with other benches as well (like the Geekbench ones, for instance); only the PPE is benchable via regular means, though. Every bench corroborates this as a real world scenario tbh: the PPE, like a single Xenon core, only manages a little over double the GP performance of the Pentium III in the Xbox. We heard as much from devs early on.

Written from a tablet with autocorrect; I've checked, but there might still be errors.
The wiiu is extrapolated from the Wii's scores
Yes, because DMIPS/MHz is an accurate figure to calculate this stuff with. I was far from the first dude here doing that. Said figure (2.32 DMIPS/MHz) is the Wii U worst case scenario, the best case being 2.71 DMIPS/MHz.
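For anyone wondering about the mechanics: the benchmark reports Dhrystones per second, and dividing by 1757 (the VAX 11/780 reference score, the traditional "1 MIPS" machine) gives DMIPS; divide again by the clock for DMIPS/MHz. A quick sketch, with the raw score below back-derived from Broadway's listed 1687.5 DMIPS purely for illustration:

Code:
#include <stdio.h>

/* Converting a raw Dhrystone result into the figures quoted here.
   The dhrystones_per_sec value is hypothetical, chosen to match
   Broadway's listed score; it is not a real measurement. */
int main(void) {
    const double dhrystones_per_sec = 2964938.0;  /* illustrative raw score */
    const double clock_mhz          = 729.0;      /* Broadway's clock       */

    double dmips = dhrystones_per_sec / 1757.0;   /* VAX 11/780 reference   */
    printf("DMIPS:     %.1f\n", dmips);              /* ~1687.5 */
    printf("DMIPS/MHz: %.2f\n", dmips / clock_mhz);  /* ~2.31   */
    return 0;
}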
 

MDX

Member
Nintendo and IBM signed a billion dollar deal to develop a CPU for the GameCube and future consoles. Does this deal extend as far as the Wii U? I think so. If so, it means that Nintendo and IBM have been thinking long term, compared to Sony and MS, who are thinking short term as they take an ad hoc approach to console design.

This long term planning has paid off for Nintendo with backwards compatibility and pushing the boundaries of efficiency without sacrificing decent, reliable performance. And most likely keeping production costs relatively low.

In conclusion, I think it's foolish to underestimate Espresso. This CPU is definitely not an off-the-shelf solution if we consider it's an evolution since the GameCube. A lot of thought and design went into making the product.
 

efyu_lemonardo

May I have a cookie?
You have to take into account when it happened; there was this GHz gold rush in 1997-99 where everyone was trying to get there, but having to deal with short-pipeline architectures made it difficult. I remember Intel had to recall all the original 1.13 GHz Pentium III CPUs because they were unstable, and on top of it all the processor's microcode had been tampered with; benchmark-wise it just behaved like a 1 GHz Pentium III.

This CPU fiasco happened because AMD was able to reach 1.2 GHz on their end; Intel only managed to after a core shrink and partial redesign (Tualatin).

Scaling was tight and Intel wanted out. There was the customer perception part of the situation, yes, but they also thought they could eventually scale it up as high as 10 GHz; and at that speed they could negate whatever per-clock deficit.

Right at the same time, IBM was developing GuTS, short for GigaHertz Unit Test Site (later called Rivina), which was a lengthened-pipeline, in-order design:

Source: http://www.blachford.info/computer/Cell/Cell4_v2.html

There you go: from failed experiment to mass-produced component. The Pentium 4 and this console generation's CPUs are no different, bar the fact that the latter kept the simple mindset of the original, lacking stuff like the cache miss prediction that saved the Pentium 4's ass all the time.

But my point was, this was a test phase (the Pentium 4 and this IBM CPU) in order to get us somewhere else; Intel managed to implement hyper-threading on them first too because they had the ceiling overhead to do so. It was an important test phase in its own right, even if it was based on trial and error and often wrong decisions.
I don't think that was it, as the Pentium 4 has a pretty weak FPU. They thought they could scale it; that would be the advantage.

And well, it did scale up better, but it performed significantly worse per clock.


EDIT: I'll go out for a bit, so I'll probably take a while to answer.

Thanks for the detailed reply; it is definitely interesting from a historical point of view to understand Intel's behavior. I guess even the Larrabee project was a result of this same theory about scalability introducing benefits later down the line. From a purely academic point of view, I agree there is no better way to learn and make progress.

But the idealist in me can't help but be angry at Intel's anti-competitive behavior during those years, and the resulting damage to AMD... I guess this is the inevitable result of combining research and business...
 

efyu_lemonardo

May I have a cookie?
Sorry for double posting, my phone makes it really difficult to edit.

Just wanted to point out to lostinblue that the quotes on SPEs etc didn't come from me...
 

krizzx

Junior Member
No; those numbers for Espresso, if you look closely at them, are obtained by factoring in only the number of cores and the clock increase, because those tests haven't been run on the Wii U CPU.

Even if those changes could affect the performance of a Dhrystone test, since this is speculation based only on the number of cores and the clock increase, they are not considered here.

Both the increase in registers and the huge increase in L2 cache are there to make the CPU more efficient on a per-clock basis, not to raise its peak performance. So in real games, with code big enough to fill the bigger caches, and thanks to the increase in registers (registers are the closest bits of memory in a CPU; increasing their number can have a noticeable impact on real-world performance), the difference against their Wii ancestors will surely be much bigger.

Of course, that's not to speak for the Xbox 360 and PS3 CPUs. On integer tests, the PS3 CPU was nearly as weak as the Wii's even in theoretical peaks, not counting the fact that the Wii CPU was much more efficient to begin with.

Then this confirms my theory that Espresso gets tremendously higher performance if programming favors integers over floating point.

This also makes sense with the PS3 score, because the Cell's SPEs are pure floating point units, if my memory is correct. They are basically 1/3 of what makes an actual core.
 
I've lost blu's benchmark, can someone provide a link?
Here you go.
Thanks for the detailed reply; it is definitely interesting from a historical point of view to understand Intel's behavior. I guess even the Larrabee project was a result of this same theory about scalability introducing benefits later down the line. From a purely academic point of view, I agree there is no better way to learn and make progress.
Yes, Larrabee's quite outlandish.

It's possible they'll manage to salvage something from it though.
But the idealist in me can't help but be angry at Intel's anti-competitive behavior during those years, and the resulting damage to AMD... I guess this is the inevitable result of combining research and business...
Of course, and customers were the ones doing the beta testing; I remember the first Pentium 4's disappointing performance and, of course, the mobile toaster variants.

But for all the damage they did to AMD, they also did it to themselves; it was AMD's golden era/chance to shine, and they did shine. I miss those days tbh; nowadays AMD is the CPU you buy on a budget knowing it can't possibly match Intel's best effort, and that's a shame.

And Intel is going easy on them as well, by not making octo-core i5/i7s (whilst they do make Xeons); they actually released the same CPU for desktop but disabled 2 cores. Intel being aggressive would just mean they felt AMD on their tail, and hence couldn't play nice.
Sorry for double posting, my phone makes it really difficult to edit.
I know, I was on a tablet a while ago too; it feels convoluted.
 

Argyle

Member
This also makes sense with the PS3 score, because the Cell's SPEs are pure floating point units, if my memory is correct. They are basically 1/3 of what makes an actual core.

This is not true at all. They are general purpose processors.

In my experience they have overall been faster than the PPE (that is, each one can potentially outperform the PPE depending on how your code accesses memory).
 
This is not true at all. They are general purpose processors.

In my experience they have overall been faster than the PPE (that is, each one can potentially outperform the PPE depending on how your code accesses memory).
Well, if you can develop for it, could you look into running DMIPS or matmul on the SPEs?

Those results could go into the listings.
 

krizzx

Junior Member
This is not true at all. They are general purpose processors.

In my experience they have overall been faster than the PPE (that is, each one can potentially outperform the PPE depending on how your code accesses memory).

I don't understand how this relates to what I said. I'm primarily talking about the hardware makeup, not its performance. I get my understanding from these.

http://www.lanl.gov/orgs/hpc/salishan/salishan2005/dougjoseph.pdf

I was talking about how all of the components that make up a full, true CPU core are physically not present in the Cell's SPEs, primarily the integer units.
[images: us__en_us__ibm100__cell_broadband__spe__620x350.jpg, figure1.gif]

Espresso should run circles around the Cell in integer performance.
 

Argyle

Member
Well, if you can develop for it, could you look into running DMIPS or matmul on the SPEs?

Those results could go into the listings.

Someday, when I have way too much free time...:)

Seriously I'd consider doing it, if I ever do I will let you guys know.

I don't understand how this relates to what I said. I'm primarily talking about the hardware makeup, not its performance. I get my understanding from these.

http://www.lanl.gov/orgs/hpc/salishan/salishan2005/dougjoseph.pdf

I was talking about how all of the components that make up a full, true CPU core are physically not present in the Cell's SPEs, primarily the integer units.


Espresso should run circles around the Cell in integer performance.

Well... you should have posted a picture that shows a block diagram of the SPE (see also page 5 of the PDF you linked):

[image: arch.gif]


Fixed point = integer, in case you were wondering.
 

krizzx

Junior Member
Well... you should have posted a picture that shows a block diagram of the SPE (see also page 5 of the PDF you linked):

[image: arch.gif]


Fixed point = integer, in case you were wondering.

I see. I guess I misread it, though it still heavily favors floating point from what I read, compared to the PPC 750.
[image: powerpc.4.png]


Espresso should still run circles around it in Integer computation.
 
Espresso should still run circles around it in Integer computation.
From what I understand you're right, but it depends on some conditions.

CELL is a successor to programmable custom DSPs meant to run code in cascade, or to the PS2's Vector Units; meaning it's strong in floating point (I reckon; you had some cascade-programmable chips pulling DivX decoding at 100/133 MHz when a general purpose CPU needed 700 MHz) and doesn't take much silicon (compared to cores meant for general purpose work), and that makes it so you probably happen to be right in most scenarios; but...


They could be faster doing small tasks that get repeated a lot; for instance, I can see them being faster at zlib per core. They behave like a dedicated chip (or chips, if you chain them) placed there specifically for that task, as opposed to the more general approach of having a good out-of-order general purpose CPU that has to parse stuff through in a timely manner.

Of course you can manipulate SPEs to pull multiple tasks (as if the PPE were the branch predictor), which is what I understand devs like Guerrilla and Naughty Dog did, but that's precisely the catch: I can't picture SPEs running an OS or something. They're clearly meant to take load off the CPU on stuff that suits them; and if it doesn't suit them... they can still try to power through it by chaining nonetheless.

I could picture them not even being able to pull a DMIPS score, though (or pulling a really low one, seeing as it's meant to measure general purpose performance), so I have no idea what to expect performance-wise (and, in the end, how that fares against 3 PPEs/Xenon). But that's a very interesting thing to delve into.
 
Of course you can manipulate SPEs to pull multiple tasks (as if the PPE were the branch predictor), which is what I understand devs like Guerrilla and Naughty Dog did, but that's precisely the catch: I can't picture SPEs running an OS or something. They're clearly meant to take load off the CPU on stuff that suits them; and if it doesn't suit them... they can still try to power through it by chaining nonetheless.

You're aware the PS3 OS largely runs on a reserved SPE, right?
 
You're aware the PS3 OS largely runs on a reserved SPE, right?
I know they have one SPE reserved for OS tasks, yes. But I don't know about the "largely". I understand even things using the SpursEngine (which is CELL stripped of the PPE and left with 4 SPEs) use an off-die CPU to run the operating system/menus and the like (and generally tell them what to do).

I don't imagine the OS running solely, or largely, on the SPE. From the way I've seen it since forever, I guessed it was probably pulling the same thing Core Image pulls on Mac OS: using dedicated resources to keep the system fluid and/or perform helper tasks (and perhaps security, checking for authorizations and the like, the way Nintendo used the embedded ARM in the Wii, or how AMD plans on using ARM cores in this year's x86 CPUs); I certainly don't imagine it running the kernel, handling input/output, or managing devices.


I also realize I tend to see things very black and white. That said, black and white is not necessarily a bad way to see things; but if you really needed to run some basic operating system on a GPU, you probably could at this point; it just doesn't make much sense, and perhaps the same could be said for SPEs. My sense, though, tells me the OS's core of operation makes more sense on the PPE, using the SPE as a complement, so I highly doubt they had to go to that extent or wanted to.


I don't even know what I was searching for just now (probably PS3+SPE+OS or so), but I hit this randomly without even looking for it specifically, so:

High Moon: (...) comparing it directly to the Xbox 360, you know the Xbox 360 has three general purpose processors in it. But they’re more like the typical processors that you might see in a PC or Macintosh… With the big general purpose processors, we can write the software traditionally the way we’ve done it in the past, so we don’t have to change things so much.

What Sony did was the Cell processor is it really embeds about seven processors, one of those being the general purpose core and the other six being these real dedicated specific-use type of processors that are extremely fast. But seeing that they’re not general purpose, they’re a little bit more challenging for programmers to get under control and to write software for.

With these Cell processors and these small processors called the SPEs, we really have to not only write software different but we have to think about how we’re solving problems in a completely different light.
Source: http://www.destructoid.com/what-mak...a-developer-speaks-up-and-tell-us-30126.phtml

I didn't even know the developer, but that's pretty much the image/opinion I had of Cell ever since the first tech papers/descriptions/analyses came around. I really don't imagine SPEs running an OS all by themselves; helping, yes. The rest just kind of feeds into the core of the topic of conversation right now.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
zlib is a pure integer task. Other pure integer tasks that run with great efficiency on SPEs include text processing, XML processing, whole IP stacks including DNS, and indeed the Dhrystone benchmark.

SPEs don't need "injecting" help. They can load code themselves. They can compute the 501st prime number, multiply by pi (but only on a Sunday, otherwise divide), round down to the nearest multiple of 4, use the result as an address and jump to it. Their ISAs are aggressively Turing complete. They can do anything and everything a general purpose CPU can do, because they are general purpose CPUs.
Agreed. And yet...

The only separation point between an SPE and any other general purpose CPU is how main memory access works. But just to be clear, an SPE can access main memory, all of it, all by itself.
There's the catch, isn't it? Having to DMA every single bit of data that a general-purpose CPU could just use makes it rather special, no? Or, looking at it from a different perspective: when was the last time you worked on a general purpose CPU with 256KB of address space? ; )
 

pottuvoi

Banned
I was talking about how all of the components that make up a full, true CPU core are physically not present in the Cell's SPEs, primarily the integer units.


Espresso should run circles around the Cell in integer performance.
SPEs use floating point units to calculate integers and performance is comparable to floats. (source)
If I remember correctly there are limitations, but given a fitting task, a single SPE will run circles around Espresso even on an integer task.
There's the catch, isn't it? Having to DMA every single bit of data that a general-purpose CPU could just use makes it rather special, no? Or, looking at it from a different perspective: when was the last time you worked on a general purpose CPU with 256KB of address space? ; )
Of course, but some of us remember those 256KB multitasking-capable machines with fondness. ;)
 
damn I love this thread, just finished an advanced digital design class last semester and being able to actually just learn about all this stuff from reading these posts feels gud man.
 
Nintendo and IBM signed a billion dollar deal to develop a CPU for the GameCube and future consoles. Does this deal extend as far as the Wii U? I think so. If so, it means that Nintendo and IBM have been thinking long term, compared to Sony and MS, who are thinking short term as they take an ad hoc approach to console design.

This long term planning has paid off for Nintendo with backwards compatibility and pushing the boundaries of efficiency without sacrificing decent, reliable performance. And most likely keeping production costs relatively low.

In conclusion, I think it's foolish to underestimate Espresso. This CPU is definitely not an off-the-shelf solution if we consider it's an evolution since the GameCube. A lot of thought and design went into making the product.

At this point I believe the returns have diminished, as PowerPC has proven to scale poorly performance-per-watt-wise. Take away the BC and they could have gotten better performance, and even better efficiency, had they gone x86 for instance, even if it meant going with a bigger unit.

I think Sony actually thought things through this time. And theirs isn't an "off-the-shelf" part either.
 

AzaK

Member
At this point I believe the returns have diminished, as PowerPC has proven to scale poorly performance-per-watt-wise. Take away the BC and they could have gotten better performance, and even better efficiency, had they gone x86 for instance, even if it meant going with a bigger unit.

I think Sony actually thought things through this time. And theirs isn't an "off-the-shelf" part either.
As mentioned before, they probably did it to save having to retool and relearn. What they should have done, though, is make a version with more cores or, if it was financially viable, add decent SIMD. They seem to have bet on compute, but that's going to, firstly, take a while for people to get to grips with, and secondly, take precious resources away from the mediocre GPU.

It does feel a bit like not seeing the wood for the trees sometimes.
 
As mentioned before, they probably did it to save having to retool and relearn. What they should have done, though, is make a version with more cores or, if it was financially viable, add decent SIMD. They seem to have bet on compute, but that's going to, firstly, take a while for people to get to grips with, and secondly, take precious resources away from the mediocre GPU.

It does feel a bit like not seeing the wood for the trees sometimes.

Yeah. At any rate, being way behind the other two in performance is one thing, but to also remain on the same architecture while the other two move to x86 really makes them the odd one out. It's a shame, because the PS4/720 will likely have much easier cross-portability with the PC than their predecessors did.

Then again, as far as ease goes, Nintendo has had problems adapting to HD development anyway. Going x86 would have made their platform easier for 3rd parties to port to. Of course, that has never been their strong suit.
 

AzaK

Member
Yeah. At any rate, being way behind the other two in performance is one thing, but to also remain on the same architecture while the other two move to x86 really makes them the odd one out. It's a shame, because the PS4/720 will likely have much easier cross-portability with the PC than their predecessors did.

Then again, as far as ease goes, Nintendo has had problems adapting to HD development anyway. Going x86 would have made their platform easier for 3rd parties to port to. Of course, that has never been their strong suit.

A decent compiler will do all the work for you as far as getting code running goes, so I don't see that as too much of a problem in and of itself; I just would have liked better FP and overall performance to help with ports. It seems to hold its own reasonably well (NFS player count notwithstanding), but with the new platforms sporting decent SIMD and 8 cores, something a little more modern in that regard would have been nice.
 

z0m3le

Banned
I don't think the architecture is a problem. If Nintendo was really worried about CPU or GPU performance, they simply wouldn't have used such old process nodes. Espresso at 28nm could likely hit 1.6 GHz quite easily with much lower power consumption, and even adding a 4th core wouldn't have been hard. Offloading SIMD to the GPU is where the industry seems to be going; the main reason it hasn't happened yet is that Intel's capability in that regard has been slow to mature, but they will get there over the next few years.

Likewise, moving the GPU down to 28nm would allow it to have as many as 640 ALUs at similar (though somewhat higher) power consumption. They could probably stay below 50 watts and get 0.9 to 1 TFLOPS of performance from that chip (quick arithmetic below).

The RAM could also have been doubled at little cost to both TDP and price... I have a theory that they were building the Wii U with the idea of turning it into a handheld later on. Basically, the architecture Nintendo is building for their next console and handheld (it will be shared) would be based on the Wii U; fitting what the Wii U is right now into something that can be a handheld in 2017-2018 is possible thanks to the very old 45nm and 40nm processes of the CPU and GPU. At least that's the only thing that truly makes sense, given those processes were used when 32nm was cheap, available, and would have helped them reach their goals far more easily/quickly.
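For what it's worth, the 0.9-1 TFLOPS ballpark above checks out arithmetically if you assume the usual 2 FLOPs (one multiply-add) per shader ALU per cycle for AMD-style parts; the clocks in this sketch are my own guesses, not figures from the post:

Code:
#include <stdio.h>

/* Rough sanity check: peak GFLOPS = ALUs * 2 FLOPs/cycle * GHz.
   640 ALUs and the two clocks below are assumptions for illustration. */
int main(void) {
    const int alus = 640;
    const double clocks_ghz[] = { 0.70, 0.78 };

    for (int i = 0; i < 2; i++) {
        double gflops = alus * 2.0 * clocks_ghz[i];
        printf("%d ALUs @ %.2f GHz -> %.0f GFLOPS\n",
               alus, clocks_ghz[i], gflops);  /* ~896 and ~998 GFLOPS */
    }
    return 0;
}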
 

krizzx

Junior Member
I'm still having trouble understanding where the limits of the CPU are. If they could port Call of Duty, Skylanders and NFS Carbon to the Wii, then why are they having problems on a CPU that's better all around?

There are too many things not adding up about this CPU rhetoric.
 
I'm still having trouble understanding where the limits of the CPU are. If they could port Call of Duty, Skylanders and NFS Carbon to the Wii, then why are they having problems on a CPU that's better all around?

There are too many things not adding up about this CPU rhetoric.

Good point.
 