
Wii U CPU |Espresso| Die Photo - Courtesy of Chipworks

krizzx

Junior Member
So what you're saying is that you've entered this thread without reading even the first post, and yet you've had enough time to publish insulting messages aimed at other people who actually care about this.

Look, no one is going to respond to this BS you've posted here, because no one wants to waste time explaining what's been explained dozens of times.
If you want to participate in this thread, do it with a logical approach, but this is no place for console wars like the one you're desperately trying to start.







Has anyone here seen any recent Project C.A.R.S. changelogs? Those gave some good insight into the CPU's functionality (and the GPU's, for that matter).
 

Blizzard

Banned
Has anyone here seen any recent Project C.A.R.S. changelogs? Those gave some good insight into the CPU's functionality (and the GPU's, for that matter).
It may be unlikely, since the last time a Project C.A.R.S. member posted (NDA-protected) information, I think their membership got revoked.

It's the same kind of thing as having access to a devkit -- you have information, but you're not legally allowed to share it.
 

krizzx

Junior Member
It may be unlikely, since the last time a Project C.A.R.S. member posted (NDA-protected) information, I think their membership got revoked.

It's the same kind of thing as having access to a devkit -- you have information, but you're not legally allowed to share it.

Aw, well that is regrettable. Project C.A.R.S. almost single-handedly thwarted nearly every attempt to play down the capabilities of the hardware to the lowest estimates, aside from Shin'en.

Don't we have some of the major devs for the game on this site, though? Maybe someone could PM them a few questions.
 

prag16

Banned
Has this been posted here yet? Some insane stuff. The dev tools sound horrid, though there is room for speedup of code.

http://www.eurogamer.net/articles/digitalfoundry-2014-secret-developers-wii-u-the-inside-story
What I don't get is the Green Hills stuff. The tools are touted as great and super fast for debugging, etc. But this guy says they're shit.

And the Green Hills deal didn't come about until March 2012. Are we to believe devs of a launch game didn't get their initial kits and tools until 8 months before launch??
 

krizzx

Junior Member
I find a lot that was said in that article questionable.

Code optimised for the PowerPC processors found in the Xbox 360 and PlayStation 3 wasn't always a good fit for the Wii U CPU, so while the chip has some interesting features that let the CPU punch above its weight, we couldn't fully take advantage of them. However, some code could see substantial improvements that did mitigate the lower clocks - anything up to a 4x boost owing to the removal of Load-Hit-Stores, and higher IPC (instructions per cycle) via the inclusion of out-of-order execution.

There are so many statements made that contradict others. This statement right here pretty much says that it had the power (contrary to what they were saying right before), but that they simply did not manage to take advantage of it, which was more due to them and time constraints than the console's hardware.

Take the thing about code on the CPU seeing a 4x boost. This is something that I was expecting from the CPU, since it didn't have the overhead that code for the PS3/360 CPUs had to deal with.

A lot of what is said in the article seems to be skewed by Eurogamer themselves to paint a picture they'd prefer to be seen. The biggest thing the dev kept restating was that there was a huge lack of communication and updates that prevented the hardware from being fully utilized properly. Eurogamer seemed to spin this at the end into the Wii U itself being incapable hardware, which is far from what the "secret developer" was really saying, from what I read.

This info about the CPU, though, seems to be exactly what I and a few others theorized long ago.


If the CPU could get a 4x boost over the 360/PS3 in some areas with only slightly optimized code, then wouldn't that point to the CPU being much more next-gen than the article tries to write it off as, when used properly?

Though, could one of the more professionally versed posters go into detail on the specifics of this?
 
No. It doesn't matter if it can do certain things 4 times as fast if most of the things you actually need to do are running significantly slower. The net result is still a performance loss.
 

NBtoaster

Member
If the CPU could get a 4x boost over the 360/PS3 in some areas with only slightly optimized code, then wouldn't that point to the CPU being much more next-gen than the article tries to write it off as, when used properly?

Some code could get a 4x boost, which mitigated the lower clocks. A 4x boost from its previous bad performance, not over the PS360.
 

krizzx

Junior Member
No. It doesn't matter if it can do certain things 4 times as fast if most of the things you actually need to do are running significantly slower. The net result is still a performance loss.

I did not read where he said that anywhere. How did you come to this conclusion? The bottom line he kept reinforcing was that there wasn't enough support to fully utilize the hardware properly and that the code simply wasn't that compatible to begin with; the biggest problem was that the CPUs were too inherently different.

You couldn't take code made for the PS3/360, put it on the Wii U's CPU, and get optimal performance. They managed to get a huge boost with some alterations, but they couldn't do more due to the lack of communication and documentation from Nintendo.

Code optimised for the PowerPC processors found in the Xbox 360 and PlayStation 3 wasn't always a good fit for the Wii U CPU

That is what I read.

The overall tone of what the dev was stating was that they couldn't even figure out how to use the Wii U due to lack of support. He did not paint the hardware as weak at all.

He said the GPU was much more capable and that the RAM gave them no problems whatsoever, despite presumptions that it caused problems.
I've also seen some concerns about the utilisation of DDR3 RAM on Wii U, and a bandwidth deficit compared to the PS3 and Xbox 360. This wasn't really a problem for us. The GPU could fetch data rapidly with minimal stalls (via the EDRAM) and we could efficiently pre-fetch, allowing the GPU to run at top speed.
The only complaint he made was about the CPU, and on that end it was mostly compatibility and support issues, not power.

Also, if I'm not mistaken, 90%+ of what he is talking about was pre-launch on unfinished hardware. None of this stuff is an issue now, according to what every current Wii U dev I've read about is stating, which makes me wonder why this old news has come up all of a sudden and is being treated as if these were current issues at Nintendo and for the Wii U when they don't appear to be.


Is it just me, or are only the few most negative things said in the article getting 100% of the focus?
 
Why do you pretend to ask questions if you are just going to imagine answers that absolve Nintendo of any responsibility AND/OR refuse to acknowledge any evidence that does not conform to your delusional opinion of the WiiU hardware?
 

krizzx

Junior Member
Why do you pretend to ask questions if you are just going to imagine answers that absolve Nintendo of any responsibility AND/OR refuse to acknowledge any evidence that does not conform to your delusional opinion of the WiiU hardware?



I posted exact quotes from the article that directly corroborated my statement about the CPU, and you respond with a personal attack about something I have no recollection of doing?

Exactly what about my opinion is delusional and where am I absolving Nintendo of any responsibility?
 

TKM

Member
Article complains of CPU weaknesses in several places:

So a basic comparison/calculation makes the Wii U look, on paper at least, significantly slower than an Xbox 360 in terms of raw CPU. This point was raised in the meeting, but the Nintendo representatives dismissed it saying that the "low power consumption was more important to the overall design goals" and that "other CPU features would improve the performance over the raw numbers".

Some people even built custom PC rigs with under-clocked CPUs to try and gauge performance of their code on these machine. Again, the almost universal answer was that it wasn't going to be powerful enough to run next-gen engines and it might even struggle to do current-gen (PS3 and X360) titles.

As far as the CPU optimisations went, yes we did have to cut back on some features due to the CPU not being powerful enough. As we originally feared, trying to support a detailed game running in HD put a lot of strain on the CPUs and we couldn't do as much as we would have liked. Cutting back on some of the features was an easy thing to do, but impacted the game as a whole. Code optimised for the PowerPC processors found in the Xbox 360 and PlayStation 3 wasn't always a good fit for the Wii U CPU, so while the chip has some interesting features that let the CPU punch above its weight, we couldn't fully take advantage of them.

On the GPU side, the story was reversed. The GPU proved very capable and we ended up adding additional "polish" features as the GPU had capacity to do it.

Here, they praise the GPU as very capable, and imply the CPU was the opposite of that.
 

krizzx

Junior Member
Article complains of CPU weaknesses in several places:









Here, they praise the GPU as very capable, and imply the CPU was the opposite of that.

He spoke of the CPU's assumed power, porting issues, compatibility problems and strengths, not so much weakness. Nowhere did he say the CPU couldn't outright do something. He said they had difficulty doing certain things on the CPU, which he later attributed to the lack of Nintendo's support, not an absolute deficiency in the hardware, if I recall correctly.

Let me double check.

The first two quotes you posted were just people's assumptions based on the paper specs. Only the third one detailed actually working with the CPU.

Yeah, in that third one he specifically attributed those problems to the compatibility of the code and their own issues with making effective use of the CPU, not so much the CPU's weakness.

As far as the CPU optimisations went, yes we did have to cut back on some features due to the CPU not being powerful enough. As we originally feared, trying to support a detailed game running in HD put a lot of strain on the CPUs and we couldn't do as much as we would have liked. Cutting back on some of the features was an easy thing to do, but impacted the game as a whole. Code optimised for the PowerPC processors found in the Xbox 360 and PlayStation 3 wasn't always a good fit for the Wii U CPU, so while the chip has some interesting features that let the CPU punch above its weight, we couldn't fully take advantage of them. However, some code could see substantial improvements that did mitigate the lower clocks - anything up to a 4x boost owing to the removal of Load-Hit-Stores, and higher IPC (instructions per cycle) via the inclusion of out-of-order execution.

You are omitting the bolded and the underlined, which do not say the CPU is weak at all. He said they couldn't even use the CPU fully, so how could it be weak when it was admittedly never fully utilized by the dev?
 
Some code could get a 4x boost, which mitigated the lower clocks. A 4x boost from its previous bad performance, not over the PS360.
My memory is fuzzy...

So they opted for a somewhat old architecture and a relatively weak-performing CPU (TDP aside) for backwards compatibility?

It's incredible how badly priorities were handled in the design stages.
 

wsippel

Banned
My memory is fuzzy...

So they opted for a somewhat old architecture and a relatively weak-performing CPU (TDP aside) for backwards compatibility?

It's incredible how badly priorities were handled in the design stages.
No, they opted for the chip because it's a very efficient design and an architecture their development teams are very comfortable with. Easy BC was an added bonus.
 
No, they opted for the chip because it's a very efficient design and an architecture their development teams are very comfortable with. Easy BC was an added bonus.

Yeah, no.
Hardware and software backwards compatibility was one of the main advertised points of the WiiU.

The same story with the 3DS. Saying backwards compatibility is just a bonus for Nintendo is nonsense.
 

BaBaRaRa

Member
My memory is fuzzy...

So they opted for a somewhat old architecture and a relatively weak-performing CPU (TDP aside) for backwards compatibility?

It's incredible how badly priorities were handled in the design stages.

To be fair, they'll have had 10 years' worth of software libraries, tool chains, documentation, optimisations, hacks, general knowledge, etc.

I can imagine being in that meeting and that sounding like a pretty compelling argument compared to throwing all of it out for likely substandard netbook x86 cores. Backwards compatibility will have been the cherry on top.

Again, just echoes of the insular thinking. All these ideas will have been great within Nintendo meeting rooms.

Edit: just like they said
 

wsippel

Banned
Yeah, no.
Hardware and software backwards compatibility was one of the main advertised points of the WiiU.

The same story with the 3DS. Saying backwards compatibility is just a bonus for Nintendo is nonsense.
Nintendo considered implementing BC using a separate chip. That's what they did on 3DS. BC is important to them, but there are different ways to handle that.
 

Argyle

Member
To be fair, they'll have had 10 years' worth of software libraries, tool chains, documentation, optimisations, hacks, general knowledge, etc.

I can imagine being in that meeting and that sounding like a pretty compelling argument compared to throwing all of it out for likely substandard netbook x86 cores. Backwards compatibility will have been the cherry on top.

Again, just echoes of the insular thinking. All these ideas will have been great within Nintendo meeting rooms.

Edit: just like they said

Hahah, and if the article is correct - it's like they threw all that stuff out the window anyway (for sure they went ahead and changed up the tool chains, right?)...
 

wsippel

Banned
Hahah, and if the article is correct - it's like they threw all that stuff out the window anyway (for sure they went ahead and changed up the tool chains, right?)...
They switched from Freescale CodeWarrior to GHS MULTI, as CodeWarrior focuses on ColdFire, QorIQ and PowerQUICC cores these days. Also, MULTI produces the fastest and most secure code for ppc750 cores. So yes, they switched to a different toolchain. All the other benefits still apply, though.
 
Nintendo considered implementing BC using a separate chip. That's what they did on 3DS. BC is important to them, but there are different ways to handle that.

Then you would have the expensive situation of the early PlayStation 3 with its PS2 hardware.

It wasn't an accident that the WiiU hardware is able to run in a Wii mode; it was well planned.
 

BaBaRaRa

Member
Then you would have the expensive situation of the early PlayStation 3 with its PS2 hardware.

It wasn't an accident that the WiiU hardware is able to run in a Wii mode; it was well planned.

It may not have been that expensive, as it may only have needed the Wii CPU on a separate chip coupled with the GPU shiv. It wouldn't have to be an entire Wii inside the alternate WiiU.

Anyway, expensive or otherwise, I still reject the notion that the CPU ended up like it did ONLY for backwards compatibility. Rather, that was one of many reasons that seemed good at the time (a lot of which still stand, I'm sure).
 
No, they opted for the chip because it's a very efficient design and an architecture their development teams are very comfortable with. Easy BC was an added bonus.
NO?... So they had the right priorities in mind. So everything that followed the console launch was part of their planning:

Here is where we see how having a similar CPU architecture paid off with a sustained number of releases after launch, except it was the opposite.

That same architecture also helped to strengthen Nintendo's 3rd-party relationships and boosted their catalogue. After all, the architecture has been around since the GameCube days and was well documented. Yet no, it didn't happen that way.

Going with that PowerPC architecture also helped to bridge and homogenize both console and handheld development. Except no; after all, their mobile devices have been using ARM parts for years.

wsippel, it seems you are right after all, and you can disregard my previous post entirely like you did, because it was a bunch of nonsense.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
NO?... So they had the right priorities in mind. So everything that followed the console launch was part of their planning:

Here is where we see how having a similar CPU architecture paid off with a sustained number of releases after launch, except it was the opposite.

That same architecture also helped to strengthen Nintendo's 3rd-party relationships and boosted their catalogue. After all, the architecture has been around since the GameCube days and was well documented. Yet no, it didn't happen that way.

Going with that PowerPC architecture also helped to bridge and homogenize both console and handheld development. Except no; after all, their mobile devices have been using ARM parts for years.

wsippel, it seems you are right after all, and you can disregard my previous post entirely like you did, because it was a bunch of nonsense.
No offense, but the nonsense is entirely in your court.

Exactly where did the anonymous dev complain about the architecture? If anything, he was giving it credit compared to the PPE in the ps360. He complained about dev tools and performance, and the latter is a purely quantitative measure. Had WiiU ended up with 8 Espressos, and the ps4/xbone with 3 Bobcats (which is what was on the table when WiiU was finalized), would you have complained about x86 as well?
 

Vanillalite

Ask me about the GAF Notebook
Is it just me, or are only the few most negative things said in the article getting 100% of the focus?

Negative should be the focus. But as I said in that thread, it's a harsh reality check on Nintendo's setup and support for 3rd-party devs.

The actual hardware info was thin and rather all over the place. There wasn't really much that could be taken from that article other than some really vague inferences, both positive and negative, which we all already knew about anyway before that article.
 
No offense, but the nonsense is entirely in your court.

Exactly where did the anonymous dev complain about the architecture? If anything, he was giving it credit compared to the PPE in the ps360. He complained about dev tools and performance, and the latter is a purely quantitative measure. Had WiiU ended up with 8 Espressos, and the ps4/xbone with 3 Bobcats (which is what was on the table when WiiU was finalized), would you have complained about x86 as well?
No place? Which in the end does nothing to take away from what I said. To be clear, Nintendo opted for the architecture and a low-end part because of a series of priorities that seem to have ended up hurting them more than helping them. Hence the part of my post where I said Nintendo had their priorities in the wrong order.

And I proceeded to explain why. Do you remember this is the CPU thread and not the DF article one? Although I admit what compelled me to post here was the article exposing a bit more clearly the type of problems devs had at launch with the CPU.

I think I explained myself clearly here, blu.
 

wsippel

Banned
NO?... So they had the right priorities in mind. So everything that followed the console launch was part of their planning:

Here is where we see how having a similar CPU architecture paid off with a sustained number of releases after launch, except it was the opposite.

That same architecture also helped to strengthen Nintendo's 3rd-party relationships and boosted their catalogue. After all, the architecture has been around since the GameCube days and was well documented. Yet no, it didn't happen that way.

Going with that PowerPC architecture also helped to bridge and homogenize both console and handheld development. Except no; after all, their mobile devices have been using ARM parts for years.

wsippel, it seems you are right after all, and you can disregard my previous post entirely like you did, because it was a bunch of nonsense.
I don't remember saying a thing about 3rd parties or handhelds. I don't really get your point regarding third parties in the first place. Console developers have more than a decade of experience with PowerPC, it's no alien technology to them, and they wouldn't support a hypothetical x86 based Wii U either, so what difference would it make?
 
I don't remember saying a thing about 3rd parties or handhelds. I don't really get your point regarding third parties in the first place. Console developers have more than a decade of experience with PowerPC, it's no alien technology to them, and they wouldn't support a hypothetical x86 based Wii U either, so what difference would it make?
Well, you disqualified the entire post with the "no" answer. That there are other reasons for them to go with that CPU doesn't mean what was stated was false. Like BaBaRaRa said in a respectful and informative answer, no less.

Also, the claim was that they didn't prioritize correctly. After that I just explained myself a bit more in depth.

Oh, and the chances of developers supporting Nintendo and the Wii U in the first place would have increased if the CPU in question had more performance headroom, even on the PowerPC architecture. But the chances of support could only have increased further if that hypothetical higher-performance part had been x86-based. You know, like the other 2 competing platforms on the market.

But the above doesn't hold even the slightest bit of sense, I'm guessing, right? XD
 

wsippel

Banned
Well, you disqualified the entire post with the "no" answer. That there are other reasons for them to go with that CPU doesn't mean what was stated was false. Like BaBaRaRa said in a respectful and informative answer, no less.
If you say "they did X because of Y", when Y was at best a (minor) part of the reasons they did X, your statement is wrong. My post wasn't intended to be disrespectful or hostile or anything.
 
Okay, this is not strictly Espresso related, more Gekko/Broadway/Espresso related, but if someone could explain it to me it would be great.

So we have this bit of text from the DF article:
However, some code could see substantial improvements that did mitigate the lower clocks - anything up to a 4x boost owing to the removal of Load-Hit-Stores, and higher IPC (instructions per cycle) via the inclusion of out-of-order execution.

So, that "removal of Load-Hit-Stores" refers to the memory instructions that store or load results to and from the L1/L2 caches, doesn't it?
From what I understand, on the PPE cores of the Xbox 360 or Cell, those kinds of instructions had to be executed in order, and the integer, FPU or VMX execution units had to stall a bit until they were resolved and the data from the caches was retrieved and stored in the registers.
Is that true? If that's the case, this could explain why some code could execute as much as 4 times faster on the Espresso than on the Xenon PPE.

If this is true, how does that feature compare to other CPUs, and specifically to Bobcat/Jaguar? I mean, is it also included in the Bobcat/Jaguar CPUs, for example?
Thank you!
 

Argyle

Member
Okay, this is not strictly Espresso related, more Gekko/Broadway/Espresso related, but if someone could explain it to me it would be great.

So we have this bit of text from the DF article:


So, that "removal of Load-Hit-Stores" refers to the memory instructions that store or load results to and from the L1/L2 caches, doesn't it?
From what I understand, on the PPE cores of the Xbox 360 or Cell, those kinds of instructions had to be executed in order, and the integer, FPU or VMX execution units had to stall a bit until they were resolved and the data from the caches was retrieved and stored in the registers.
Is that true? If that's the case, this could explain why some code could execute as much as 4 times faster on the Espresso than on the Xenon PPE.

If this is true, how does that feature compare to other CPUs? I mean, is it also included in the Bobcat/Jaguar CPUs, for example?
Thank you!

A good explanation of the load-hit-store penalty (written by Burger Becky) is here:

http://www.gamasutra.com/view/feature/132084/sponsored_feature_common_.php?print=1

I have no idea if this penalty exists on the Jaguar. Most likely not, at least not in the same form that it exists on the PPC cores used on the 360/PS3.
 

wsippel

Banned
Okay, this is not strictly Espresso related, more Gekko/Broadway/Espresso related, but if someone could explain it to me it would be great.

So we have this bit of text from the DF article:

So, that "removal of Load-Hit-Stores" refers to the memory instructions that store or load results to and from the L1/L2 caches, doesn't it?
From what I understand, on the PPE cores of the Xbox 360 or Cell, those kinds of instructions had to be executed in order, and the integer, FPU or VMX execution units had to stall a bit until they were resolved and the data from the caches was retrieved and stored in the registers.
Is that true? If that's the case, this could explain why some code could execute as much as 4 times faster on the Espresso than on the Xenon PPE.

If this is true, how does that feature compare to other CPUs, and specifically to Bobcat/Jaguar? I mean, is it also included in the Bobcat/Jaguar CPUs, for example?
Thank you!
Seems to be a problem specific to in-order cores that could cause a significant stall on PPEs, and a very real issue:

Ask any Xbox 360 performance engineer about Load-Hit-Store and they usually go into a tirade. The sequence of a memory read operation (The Load), the assignment of the value to a register (The Hit), and the actual writing of the value into a register (The Store) is usually hidden away in stages of the pipeline so these operations cause no stalls. However, if the memory location being read was one recently written to by a previous write operation, it can take as many as 40 cycles before the "Store" operation can complete.
http://www.gamasutra.com/view/feature/132084/sponsored_feature_common_.php

OoOE cores shouldn't have that problem, be it Espresso or Jaguar.
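To make the pattern concrete, here's a minimal C sketch of the store-then-reload sequence being described. This is purely illustrative, not code from any actual game, and the function name is made up:

Code:
#include <stdint.h>

/* Classic load-hit-store shape: on PowerPC, moving a value between the float
   and integer register files has to round-trip through memory, so the FPU
   stores a value and an integer load immediately reads that same address.
   On the in-order PPE/Xenon that reload can stall for dozens of cycles until
   the store drains (the "as many as 40 cycles" above); an out-of-order core
   with store forwarding, like Espresso or Jaguar, hides most of it. */
uint32_t float_bits(float f)
{
    union { float f; uint32_t u; } tmp;
    tmp.f = f;     /* the store                                    */
    return tmp.u;  /* the load that hits the still-in-flight store */
}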
 
Thank you very much for the links and the information!

Edit: If that's not asking a bit too much, I have a question regarding the pipeline depth/width of Gekko/Broadway compared to Bobcat, and the DMIPS results of both CPUs.
Bobcat has, like the Gekko, two ALUs that, like the Gekko's, have different capabilities (in both CPUs one can execute any instruction while the other handles every instruction except divisions and multiplications), and from what I understand, on Bobcat integer divisions are resolved using part of the FPU circuitry and carry a higher penalty.
The Bobcat has better OoOE circuitry, but from what I've read it is strictly dual-issue, while the Gekko is dual-issue except that in the case of a branch instruction it can issue up to 3 instructions per cycle (I've looked for that feature in the Bobcat CPUs, but either it's something so basic that it's a given nowadays and not explained, or the 2 issued instructions per cycle already include branch instructions).

From what wsippel explained on previous pages, the Gekko also has more integer registers than the Bobcat, and also a much shorter pipeline, which helps increase performance per clock (fewer cycles wasted when the pipeline stalls).
To make things even worse (for the Bobcat), its L1 cache is only 2-way set-associative, while on the Gekko it was 8-way set-associative.
But despite that, it seems that the Bobcat has higher performance when it comes to integer calculations per cycle.

Could that be due to the more advanced OoOE system of the Bobcat? Or is there something else that I've missed or interpreted wrong? Maybe the micro-ops of the Bobcat are more powerful than the instructions found in the PPC instruction set (meaning that what is a single instruction on the Bobcat has to be replicated with multiple instructions on the Gekko/Broadway architecture)?

I doubt that the bigger L2 cache has much impact on Dhrystone tests like those (it's 512 KB on the Bobcat vs 256 KB on the Gekko/Broadway, but those tests are, as far as I know, fully resolved within the 32+32 KB of L1 cache).

The Bobcat information I have is from here, so I doubt it's wrong:
http://www.agner.org/optimize/microarchitecture.pdf

Edit 2: Besides the comparison between Gekko/Broadway and Bobcat, what things do you think could be changed in the Espresso that would be easy to implement without breaking BC? Maybe bigger OoOE circuitry that could reorder more than 6 instructions (so to speak, making the instruction queue bigger than 6 entries) could be feasible? More registers, in order to reduce the number of times some data has to be retrieved from the L1 data cache?

I know that I'm asking a ton of questions... sorry!!! XD
 
Hey, I know I'm asking too much and maybe those last questions were difficult, but now I've seen that the L1 of Gekko, which is only 32KB, is 8-way set-associative.
I remember that back when it was confirmed that Espresso's 512KB of L2 cache on the small cores was only 4-way set-associative, you said that for that amount of memory it was fine and that there wouldn't be much of a benefit if it were, let's say, 8-way set-associative, but then here's my question:
Why does the L1 cache have 8-way associativity?
I mean, I know that the lower the memory level, the more important it is to make the most of every bit of it, but if 512 KB is small enough that 4-way vs 8-way set-associative wouldn't make much of a difference, isn't 8-way associativity a bit overkill for just 32KB of data?

Thanks!
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Hey, I know I'm asking too much and maybe those last questions were difficult, but now I've seen that the L1 of Gekko, which is only 32KB, is 8-way set-associative.
I remember that back when it was confirmed that Espresso's 512KB of L2 cache on the small cores was only 4-way set-associative, you said that for that amount of memory it was fine and that there wouldn't be much of a benefit if it were, let's say, 8-way set-associative, but then here's my question:
Why does the L1 cache have 8-way associativity?
I mean, I know that the lower the memory level, the more important it is to make the most of every bit of it, but if 512 KB is small enough that 4-way vs 8-way set-associative wouldn't make much of a difference, isn't 8-way associativity a bit overkill for just 32KB of data?

Thanks!
On one hand, higher associativity requires very expensive tricks and/or paying a latency penalty. On the other hand, lower associativity brings higher eviction rates of (potentially needed) data from cache. Both things said, L1 caches are ultra critical for the performance of pretty much any modern performance cpu (just imagine accessing any data that is not in a register taking dozens of cycles), so cpu designers go all-out to make L1 as efficient as possible, and that means minimizing bad evictions.
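If it helps to see it in code, here's a toy model of an N-way lookup using Gekko's 32KB / 8-way L1 figures (the 32-byte line size is assumed, and the layout is purely illustrative, nothing like the real hardware):

Code:
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE  32                                 /* bytes per line (assumed) */
#define CACHE_SIZE (32 * 1024)                        /* 32KB L1                  */
#define WAYS       8                                  /* associativity            */
#define SETS       (CACHE_SIZE / (LINE_SIZE * WAYS))  /* 128 sets                 */

typedef struct { uint32_t tag[WAYS]; bool valid[WAYS]; } CacheSet;
static CacheSet cache[SETS];

/* One lookup indexes a single set, then compares the tag against all WAYS
   entries (in parallel, in hardware). More ways = more comparators and a wider
   way-select path (the latency/complexity cost), but an address can live in
   more places, so useful lines get evicted less often (the benefit). */
bool cache_hit(uint32_t addr)
{
    uint32_t set = (addr / LINE_SIZE) % SETS;
    uint32_t tag = addr / (LINE_SIZE * SETS);
    for (int way = 0; way < WAYS; ++way)
        if (cache[set].valid[way] && cache[set].tag[way] == tag)
            return true;
    return false;
}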
 
On one hand, higher associativity requires very expensive tricks and/or paying a latency penalty. On the other hand, lower associativity brings higher eviction rates of (potentially needed) data from cache. Both things said, L1 caches are ultra critical for the performance of pretty much any modern performance cpu (just imagine accessing any data that is not in a register taking dozens of cycles), so cpu designers go all-out to make L1 as efficient as possible, and that means minimizing bad evictions.
Thanks! But why would the higher associativity translate into bigger latencies? I thought that this was all a hardware thing and that less associativity meant a simpler memory controller but that there were no other penalties except for the more complex hardware structure and the transistors spent building it.

In that case, is it possible that the 2-way associativity of the Bobcat L1 could be an advantage in simpler tests like Dhrystone, which can fit entirely in the L1, but on the other hand a disadvantage in real game scenarios (with more L1 cache misses)?

Thank you!
 

HTupolev

Member
Seems to be a problem specific to in-order cores that could cause a significant stall on PPEs, and a very real issue:

http://www.gamasutra.com/view/feature/132084/sponsored_feature_common_.php

OoOE cores shouldn't have that problem, be it Espresso or Jaguar.
It seems like that would depend on the code, and on the complexity of the OoO capabilities. According to the article, you'd occasionally have to work around a stall by cleverly re-ordering things forty cycles out of the way (assuming that the code even makes that possible, that's some pretty hardcore OoO shenanigans).

The issue might be bandaided a little with OoOE, but it seems like the root problem is that the cache takes ages to clean itself up following a write, which is an especially large problem if the different execution components don't share registers.

Maybe the best* solution is to make sure that you have a secondary task to run "in parallel" on the same thread for forty cycles? Lol...

*"Best"

Thanks! But why would the higher associativity translate into bigger latencies? I thought that this was all a hardware thing and that less associativity meant a simpler memory controller but that there were no other penalties except for the more complex hardware structure and the transistors spent building it.
The exact characteristics are going to be implementation-dependent, but it may very well take (meaningfully) longer for logic to propagate through said "more complex hardware structure."
 
Thanks! But why would the higher associativity translate into bigger latencies? I thought that this was all a hardware thing and that less associativity meant a simpler memory controller but that there were no other penalties except for the more complex hardware structure and the transistors spent building it.

IIRC, the higher the desired associativity, the more transistors need to be placed in series, and the circuit path might get longer. Kind of like when you multiplex between a larger number of input signals than a single multiplexer can handle: you not only need to place more of them in parallel but also some behind those in order to merge the results.
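Rough toy sketch of what I mean: an 8:1 select built out of 2:1 muxes ends up with three levels of logic in series instead of one (nothing like the real circuit, just to show why the path gets longer):

Code:
#include <stdint.h>

static uint32_t mux2(uint32_t a, uint32_t b, int sel) { return sel ? b : a; }

/* Selecting among 8 inputs as a tree of 2:1 muxes: level 1 has four muxes,
   level 2 has two, level 3 has one, so the signal crosses three mux delays
   end to end instead of one. Wider selection = deeper tree = longer path. */
static uint32_t mux8(const uint32_t in[8], int sel)
{
    uint32_t a = mux2(in[0], in[1], sel & 1);
    uint32_t b = mux2(in[2], in[3], sel & 1);
    uint32_t c = mux2(in[4], in[5], sel & 1);
    uint32_t d = mux2(in[6], in[7], sel & 1);
    uint32_t e = mux2(a, b, (sel >> 1) & 1);
    uint32_t f = mux2(c, d, (sel >> 1) & 1);
    return mux2(e, f, (sel >> 2) & 1);
}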

In that case, is it possible that the 2-way associativity of the Bobcat L1 could be an advantage in simpler tests like Dhrystone, which can fit entirely in the L1, but on the other hand a disadvantage in real game scenarios (with more L1 cache misses)?


Possibly.
 

OryoN

Member
Thanks! But why would the higher associativity translate into bigger latencies? I thought that this was all a hardware thing and that less associativity meant a simpler memory controller but that there were no other penalties except for the more complex hardware structure and the transistors spent building it.

I did some research on this a while back, and while I don't remember all the technical details, I think I at least remembered the concept behind the trade-off of higher/lower associativity. An analogy could go like this:

Having more drawers for your socks could be a good thing when it comes to storage. But having 8 drawers means you may potentially have to spend more time searching for and retrieving a particular pair of socks to wear than if you had just 2 drawers. That's where the latency comes in. Nintendo/IBM must have believed that 4-way (up from 2-way in Gekko?) struck the perfect balance for Espresso, in addition to other enhancements.

Hopefully someone can explain the actual details with relevant terms.
 

joesiv

Member
Hmm... seems like most of that article comes down to developers having code optimized for the long-pipeline, high-clock CPUs that were in the PS360. Makes sense: it's the end of the generation, and they were the only ones in this "HD" game. All engines are going to be optimized from the top down for that type of CPU.

Nintendo coming in "late" with a CPU that's totally different, and requires different optimizations, isn't going to win any awards for its efficiency if it can't take that existing code and run it as is. It's the wrong time to ask developers to go back and re-optimize for this type of CPU. They could, but even if they did, it wouldn't have happened for launch games; this took many years for the PS360, and it would be the same for the WiiU (unless they had historical roots developing for NGC/Wii).

Devs are focusing on making engines for next-gen architectures (PS4/XBone), so there is less time to give to optimizing their old code branches, which are probably already at the end of the road.
 
IIrc the higher the desired associativity the more transistors need to be placed in series, and the circuit path might get longer. Kind of like when you multiplex between a larger number of input signals than a single multiplexer can handle, you not only need to place more of them in parallel but also some behind those in order to merge the results.
But wouldn't that kind of "latency" be there for the same reason pipelines got longer?
I mean, since the circuit is bigger the clock can't be as high (like with pipelines made of bigger stages in comparison with longer pipelines made of smaller stages), but in terms of "latency at a given speed", from what you say there wouldn't be any loss.
For example, the simpler associativity could mean being able to reach, say, 15 MHz while the higher one could put the limit at 10 MHz, but at the same speed, from what you say, the latencies would be the same, wouldn't they?
 

HTupolev

Member
But wouldn't that kind of "latency" be there for the same reason pipelines got longer?
This isn't really analogous to pipelining.

Pipelining involves breaking up a large operation into smaller components. Since each component can be executed quickly, you can clock things faster. Your front-to-back latency is generally going to be higher than the original circuit (since you have to clock according to the slowest stage, and since there can be overhead on buffering between stages), but your throughput can be higher (at least if your task is not perfectly sequential).

Lowering cache associativity, on the other hand, means replacing a large search operation with a smaller one. You're not breaking a big search into small pieces to enable faster clocking; you're replacing a big search with a smaller search which probably just plain takes less time.
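A quick numeric sketch of that trade-off, with made-up delays (10ns of logic split into 5 stages, 0.5ns of latch overhead per stage), just to show throughput going up while front-to-back latency also goes up:

Code:
#include <stdio.h>

int main(void)
{
    double op_delay = 10.0;  /* ns of combinational logic, unpipelined (assumed) */
    int    stages   = 5;
    double overhead = 0.5;   /* ns of latch/buffer overhead per stage (assumed)  */

    double clk_flat  = op_delay;                      /* 10.0 ns per result      */
    double clk_piped = op_delay / stages + overhead;  /*  2.5 ns per result      */
    double lat_piped = clk_piped * stages;            /* 12.5 ns front-to-back   */

    printf("unpipelined: one result every %.1f ns, latency %.1f ns\n", clk_flat, op_delay);
    printf("pipelined:   one result every %.1f ns, latency %.1f ns\n", clk_piped, lat_piped);
    return 0;
}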
 

tipoo

Banned
From what wsippel explained on previous pages, the Gekko also has more integer registers than the Bobcat, and also a much shorter pipeline, which helps increase performance per clock (fewer cycles wasted when the pipeline stalls)

Just wanted to say a couple of things. I'm not sure how long you were thinking the Espresso pipeline is, but I may as well take this opportunity to remind everyone that it's 4 stages only for integer; FP adds 3 more stages, for 7 in total.


And there's this prevailing notion that shorter always = better performance; by this thinking the best processor would be one with a 1-stage pipeline (i.e. no pipelining), but of course it would not be. It's so much more complicated. A longer pipeline reduces the number of gates required to implement each stage (as each stage is smaller), which reduces the propagation delay of each step, which increases maximum frequency. This is independent of IPC. Jaguar's IPC INCREASED while its pipeline stage count also did, so saying shorter pipeline = higher IPC is clearly false.

The only area where a longer pipeline reduces performance is branch misprediction, as the pipeline needs to stall while it gets flushed. However, in this case and in many cases, you can get a sizeable increase in max frequency with only a minor increase in branch misprediction penalty. In Jaguar's case, AMD stated 10% higher frequency from a one-stage-longer pipeline, so the branch misprediction penalty has gone from 13 to 14 cycles, or a 7.7% bigger penalty. Given that CORRECT branch prediction rates were ~85-90% even in the Pentium days (and I don't know what they are with something modern, maybe up to 95%?), that 7.7% increase only occurs on 10-15% of branches, so you can do the math. Never mind, I'll do it: that's about 1.155% of the code.
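If anyone wants to sanity-check that back-of-the-envelope number, here's a throwaway C version of the same arithmetic, using the 13/14-cycle and 10-15% figures above (my own sketch, not official AMD data):

Code:
#include <stdio.h>

int main(void)
{
    double penalty_old     = 13.0;  /* cycles per mispredict, quoted above */
    double penalty_new     = 14.0;
    double mispredict_rate = 0.15;  /* assume 15% of branches mispredict   */

    /* Extra cost averaged over all branches, and the relative growth of the
       misprediction penalty (7.7% of 15% of branches is roughly 1.15% of the code). */
    double extra_cycles = (penalty_new - penalty_old) * mispredict_rate;
    double rel_growth   = (penalty_new / penalty_old - 1.0) * mispredict_rate;

    printf("extra cycles per branch: %.3f\n", extra_cycles);
    printf("relative penalty growth: %.2f%% (vs a 10%% clock gain)\n", rel_growth * 100.0);
    return 0;
}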

Long story short: 10% higher frequency, smaller than 10% penalty = net gain in performance. It's about finding the optimal point between reducing stage propagation delays (the returns get smaller the more stages you add) and the increase in branch misprediction cost (which grows linearly with pipeline length). Prescott didn't work for many reasons, not just pipeline length. Just saying longer pipeline = always bad is missing so much.


Rant mode off.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
And there's this prevailing notion that shorter always = better performance; by this thinking the best processor would be one with a 1-stage pipeline (i.e. no pipelining), but of course it would not be. It's so much more complicated. A longer pipeline reduces the number of gates required to implement each stage (as each stage is smaller), which reduces the propagation delay of each step, which increases maximum frequency. This is independent of IPC. Jaguar's IPC INCREASED while its pipeline stage count also did, so saying shorter pipeline = higher IPC is clearly false.

The only area where a longer pipeline reduces performance is branch misprediction, as the pipeline needs to stall while it gets flushed. However, in this case and in many cases, you can get a sizeable increase in max frequency with only a minor increase in branch misprediction penalty. In Jaguar's case, AMD stated 10% higher frequency from a one-stage-longer pipeline, so the branch misprediction penalty has gone from 13 to 14 cycles, or a 7.7% bigger penalty. Given that CORRECT branch prediction rates were ~85-90% even in the Pentium days (and I don't know what they are with something modern, maybe up to 95%?), that 7.7% increase only occurs on 10-15% of branches, so you can do the math. Never mind, I'll do it: that's about 1.155% of the code.

Long story short: 10% higher frequency, smaller than 10% penalty = net gain in performance. It's about finding the optimal point between reducing stage propagation delays (the returns get smaller the more stages you add) and the increase in branch misprediction cost (which grows linearly with pipeline length). Prescott didn't work for many reasons, not just pipeline length. Just saying longer pipeline = always bad is missing so much.

Rant mode off.
Frankly, I don't know which end to start from addressing your rant, so I'll start at random.

Branch predictors' success rate is critically dependent on the nature of the task and the data set. Success rates of 90% (and above) are really 'best case scenario' territory - some usecases might happily reach that, while others might be a long way south of there. Branch predictors are not magical by any means; their usefulness is as prone to failure as is the usefulness of, say, CPU caches - sometimes it's sub-optimal code which can 'break' them, but other times it's just the nature of the task. And branch mispredictions are not the sole disaster that can happen to a modern pipeline - pipeline flushes occur at virtually every mis-speculation, which can be anything from a memory ordering conflict in an SMP system, through self-modifying code (not unheard of in JIT/virtualisation), to your familiar branch mispredictions (yes, they do form the bulk of the issue).

But your idea that pipeline flushes are the only issue with long pipelines is just wrong. Pipeline bubbles caused by all kinds of penalties are by far the worst offenders, since they usually happen orders of magnitude more often than flushes, and the weaker the reorder logic of the cpu is, the more the bubbles. Of course, by Murphy's law, the longer the pipeline, the larger the bubbles, so longer parts of the pipeline twiddle their thumbs.

Let me just conclude that AMD managed to increase Jaguar's IPC despite the increase in pipeline length. But of course, I'm curious to hear your version of what Prescott's greatest issue was.
 

tipoo

Banned
Frankly, I don't know which end to start from addressing your rant, so I'll start at random.

Branch predictors' success rate is critically dependent on the nature of the task and the data set. Success rates of 90% (and above) are really 'best case scenario' territory - some usecases might happily reach that, while others might be a long way south of there. Branch predictors are not magical by any means; their usefulness is as prone to failure as is the usefulness of, say, CPU caches - sometimes it's sub-optimal code which can 'break' them, but other times it's just the nature of the task. And branch mispredictions are not the sole disaster that can happen to a modern pipeline - pipeline flushes occur at virtually every mis-speculation, which can be anything from a memory ordering conflict in an SMP system, through self-modifying code (not unheard of in JIT/virtualisation), to your familiar branch mispredictions (yes, they do form the bulk of the issue).

But your idea that pipeline flushes are the only issue with long pipelines is just wrong. Pipeline bubbles caused by all kinds of penalties are by far the worst offenders, since they usually happen orders of magnitude more often than flushes, and the weaker the reorder logic of the cpu is, the more the bubbles. Of course, by Murphy's law, the longer the pipeline, the larger the bubbles, so longer parts of the pipeline twiddle their thumbs.

Let me just conclude that AMD managed to increase Jaguar's IPC despite the increase in pipeline length. But of course, I'm curious to hear your version of what Prescott's greatest issue was.


Well, being designed to perform well at clock speeds in excess of 4GHz, and hitting 150W barriers well before that, which caused all sorts of unexpected compromises and throttling, was part of it. The hyper-long pipeline wasn't ideal, true, but that doesn't mean shorter is always better either. I believe I linked an engineering study early in this thread showing 11-14 stages being most optimal for most code regardless of clock speed, and IIRC that's mostly where Intel targets (Haswell is a 14-stager).

Even if those were best case scenarios, that's from back in the Pentium 4 days, and even generation to generation back then they were improving the misprediction rate. I don't know what it is now, but I'd be interested to see it too.


About your talk of pipeline flushes, modern processors don't even flush every stage in the pipeline in a misprediction, just the relevant stages now. So something with a XX stage pipeline may flush half of them if half are irrelevant, or any other fraction. I forget what this is called, but I think anything post Core 2 has it. I'll try to find the name.



Anyway, even if some of my rambling did need correction, you agree that pipeline length isn't black and white, which was my point, right? There's a balance to be struck, and crazy short isn't necessarily ideal just as crazy long isn't; that was what I was getting at. This "shorter pipeline always = higher IPC" notion is what I was attacking.
 
Lowering cache associativity, on the other hand, means replacing a large search operation with a smaller one. You're not breaking a big search into small pieces to enable faster clocking; you're replacing a big search with a smaller search which probably just plain takes less time.
Yes, and this is why I say that 8-way set associativity may have a lower limit when it comes to achievable clocks (because the search can't be broken up into smaller chunks of circuitry), but at a given, achievable speed (which will be lower due to the larger size of the circuitry) I don't see why it would have more latency in terms of cycles spent waiting (not in milliseconds, which are speed-dependent).

tipoo said:
And there's this prevailing notion that shorter always = better performance; by this thinking the best processor would be one with a 1-stage pipeline (i.e. no pipelining), but of course it would not be. It's so much more complicated.
What people say is that at a given speed (and of course not considering other hardware differences), the shorter the pipeline the better the performance, and as far as I know that's true.

Then it all comes down to balancing the number of stages with the achievable speed, but seeing how long the Jaguar pipeline is and how low-clocked it has ended up being (it's obvious that low power and low heat were the priorities there), I seriously doubt that design approach is better than the relatively high-clocked CPU they ended up with (a 4-stage integer pipeline vs the 20+ of the Jaguar one).

tipoo said:
The hyper-long pipeline wasn't ideal, true, but that doesn't mean shorter is always better either. I believe I linked an engineering study early in this thread showing 11-14 stages being most optimal for most code regardless of clock speed, and IIRC that's mostly where Intel targets (Haswell is a 14-stager).
No, 11-14 stages was the sweet spot in terms of performance vs. achievable speed and heat. Fewer than that would prevent the CPU from being clocked as high, and more than that would cripple the performance to a point where the higher speed wouldn't be worth it.

That being said, at a given speed (both at 1.243 GHz, to use an example with the same clock as the WiiU CPU), a 4-stage pipeline will always perform better than an 11-stage pipeline, ignoring all the other factors involved in CPU performance.
 
Yes, and this is why I say that 8-way set associativity may have a lower limit when it comes to achievable clocks (because the search can't be broken up into smaller chunks of circuitry), but at a given, achievable speed (which will be lower due to the larger size of the circuitry) I don't see why it would have more latency in terms of cycles spent waiting (not in milliseconds, which are speed-dependent).

The cache will be clocked the same as the other parts of the CPU. The only question is: how many clock cycles will a cache hit cost? This can differ between different CPUs, and cache associativity can be one reason for that.
 
The cache will be clocked the same as the other parts of the CPU. The only question is: how many clock cycles will a cache hit cost? This can differ between different CPUs, and cache associativity can be one reason for that.
Yes, but my reasoning is: "since the WiiU CPU is designed for low clocks (very low, in fact) thanks to its short pipelines, what on another high-clocked design wouldn't be feasible without increasing the cycles spent on a cache hit could be affordable on the Espresso due to the whole CPU being slow (this is why I compared it with pipelining, because the reason for these latencies is that the circuit is bigger and the clock potentially can't go as high as on a simpler design)".

I'm not saying that this is the case for sure, just trying to figure out if that's a possibility given the low clocks the Espresso is working at.

Regards!
 

tipoo

Banned
Some more bits about the processor in here

http://fail0verflow.com/blog/2014/console-hacking-2013-omake.html

ie
Oh, and that whole out-of-order execution debate? The confusion arose due to the myth that the PPC750 is in-order. It’s a superscalar core: it does dispatch up to 3 instructions per cycle and they can complete independently (and the results are placed in a completion queue). That qualifies as out-of-order. It’s not particularly wide and definitely isn’t nearly as aggressively out-of-order as modern Intel and POWER cores are, though. The Espresso is just as out-of-order as the Broadway and previous members of the 750 family. There’s no upgrade there: it’s a (simple) out-of-order core and it always was. Go read the PPC750CL User’s Manual if you want all the gory details (it also has information on the formerly-Nintendo-proprietary stuff like Paired Singles, DMA, Locked L1, Write-Gather Pipe, etc.).
 

krizzx

Junior Member
I think I asked this earlier, but what are the benefits/problems of having the developer manually pick which code runs on which core? No one has ever spoken on this as far as I can remember.

Wouldn't there be some benefit to not having the overhead and iffiness of the CPU auto-delegating tasks? That would be one less process the CPU needs to execute.
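For anyone wondering what "manually picking the core" even looks like in code, here's a generic POSIX/Linux sketch, purely for illustration; the actual Wii U SDK call is under NDA and isn't shown here, and the core number and function names are made up:

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Some workload the developer has decided belongs on a particular core. */
static void *audio_work(void *arg)
{
    (void)arg;
    /* ... mixing / decoding ... */
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_set_t set;

    pthread_create(&t, NULL, audio_work, NULL);

    CPU_ZERO(&set);
    CPU_SET(2, &set);                              /* pin the thread to core 2 (arbitrary choice) */
    pthread_setaffinity_np(t, sizeof(set), &set);  /* explicit, not auto-delegated                */

    pthread_join(t, NULL);
    return 0;
}

The upside is predictability (you always know which core is carrying, say, the audio mix each frame); the downside is that nothing rebalances the work for you if one core ends up overloaded.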
 