
WiiU technical discussion (serious discussions welcome)

Donnie

Member
What is more noticeable, a 30% increase in fuel efficiency in a Prius or a 30% increase in fuel efficiency in a full-size pickup? You will be saving hundreds if not thousands of dollars more per year on the pickup in fuel costs.

Don't do the car comparison; it's always a bad sign, usually flawed, and this is no exception.

Increasing the Wii U's performance 30% will make a negligible impact on the graphics. Increasing some monster 200 watt next-gen platform's performance 30% could see dramatic increases in scene complexity which the Wii U could only dream of.

I can't believe you're arguing that adding 30% performance to two consoles would make a negligible difference in scene complexity for one and a dramatic difference for the other :O Seriously, the difference will be very similar. The scene complexity on the 200W system will increase by more polys etc., but the scene complexity will have been higher to begin with; again, both will increase by 30%. If anything, it's arguable that the scene with lower complexity will show a more noticeable difference due to diminishing returns. But this is now really a totally different argument.

Fact is, a 30% increase is a 30% increase no matter the console, no more dramatic on one than on the other.
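To make the relative-vs-absolute point concrete, here's a minimal worked example; the poly counts are made-up illustration numbers, not actual figures for WiiU or any other console:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical scene complexities; illustration only, not real figures. */
    double small_scene = 1.0e6;   /* polygons on the weaker box (assumed) */
    double big_scene   = 5.0e6;   /* polygons on the 200W box (assumed)   */
    double boost       = 0.30;    /* the 30% increase being debated       */

    /* Both scale by the same relative factor; only the absolute delta differs. */
    printf("Small scene: %.1fM -> %.2fM polys (+%.2fM)\n",
           small_scene / 1e6, small_scene * (1 + boost) / 1e6, small_scene * boost / 1e6);
    printf("Big scene:   %.1fM -> %.2fM polys (+%.2fM)\n",
           big_scene / 1e6, big_scene * (1 + boost) / 1e6, big_scene * boost / 1e6);
    return 0;
}
```

Both scenes grow by the same 30%, but the heavier scene gains more polygons in absolute terms, which is really what the two sides of this argument are talking past each other about.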
 

Log4Girlz

Member
But... aren't you doing this comparison backwards?

I mean, let's look at it the other way. Remove 30% of the WiiU's performance, that would be a pretty significant step back would it not? Far from negligible, just like adding 30% is not negligible.

If you remove 30% of the performance of a Wii U game, it wouldn't be able to run at the resolution it does on the TV screen. So instead you run it on the tablet, and you get the game looking the same regardless :p
 
Wait, where are you getting it could have 30% more power when DF says right here:

One thing that did stand out from our Wii U power consumption testing - the uniformity of the results. No matter which retail games we tried, we still saw the same 32w result and only some occasional jumps higher to 33w. Those hoping for developers to "unlock" more Wii U processing power resulting in a bump higher are most likely going to be disappointed, as there's only a certain amount of variance in a console's "under load" power consumption.
 

AlStrong

Member
I guess this could be tested by checking whether the slowdown is dependent on the amount of screen area covered by blended effects or not.

http://www.lensoftruth.com/infocus-call-of-duty-black-ops-ii-review-and-wii-u-analysis-video/

The first minute or so of the first video should be somewhat indicative.

The alpha stuff is the most puzzling remaining aspect of Wii U performance to me at this point. One potential explanation that was put forth is that doing correct alpha blending using traditional methods would require polygon sorting on the CPU, which could be the reason for the slowdown.

Highly doubt there's performance to spare for an OIT implementation. :p
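For anyone following along, the "traditional method" being referred to is roughly this: sort the transparent polygons back-to-front on the CPU each frame and submit them in that order (a painter's-algorithm style pass). A minimal sketch, with a hypothetical TransparentPoly struct and submission call, not anything from an actual SDK:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    float view_z;      /* depth of the polygon's centroid in view space
                          (assuming +Z points away from the camera)      */
    /* ... vertex data, material handle, etc. ... */
} TransparentPoly;

/* Back-to-front: larger view-space depth (farther away) draws first. */
static int cmp_back_to_front(const void *a, const void *b)
{
    float za = ((const TransparentPoly *)a)->view_z;
    float zb = ((const TransparentPoly *)b)->view_z;
    return (za < zb) - (za > zb);
}

void draw_transparent(TransparentPoly *polys, size_t count)
{
    /* CPU-side sort every frame: this is the per-frame cost being
       debated above. O(n log n) on a few hundred entries.          */
    qsort(polys, count, sizeof *polys, cmp_back_to_front);

    for (size_t i = 0; i < count; ++i) {
        /* submit_poly(&polys[i]);  hypothetical GPU submission call */
    }
}

int main(void)
{
    TransparentPoly polys[3] = { {1.0f}, {5.0f}, {3.0f} };
    draw_transparent(polys, 3);
    printf("draw order (far to near): %.0f %.0f %.0f\n",
           polys[0].view_z, polys[1].view_z, polys[2].view_z);
    return 0;
}
```

The per-frame cost being debated is essentially that qsort call plus the cache traffic of touching every transparent polygon on the CPU.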
 

Donnie

Member
Wait, where are you getting it could have 30% more power when DF says right here:

From my point of view it's a theoretical discussion, not a claim that WiiU will be 30% more powerful. But I think the discussion came about because of the difference between the current power draw and Iwata's 40W comment.

Also, that quote only points out that there's a certain percentage of variance to expect from a console under full load; it doesn't say that WiiU is fixed at 33W.
 
It's a theoretical discussion, not a claim that WiiU will be 30% more powerful somehow. Read the thread if you want to know how it came about.

Also, that quote only points out that there's a certain percentage of variance to expect from a console under full load; it doesn't say that WiiU is fixed at 33W.

I realize it's theoretical, that's why I said "could have" not "will have".
 

Donnie

Member
I realize it's theoretical, that's why I said "could have" not "will have".

No, you misunderstand me: it's entirely theoretical, with the focus of the discussion being the comparison of percentage difference vs wattage difference. As far as my argument goes, WiiU is merely an example.
 

wsippel

Banned
The WiiU has 4 USB 2.0 ports, each of which can supply up to 2.5W (5V x max. 500mA).
Four USB ports make that up to 4 x 2.5W = 10W.
Having four 2.5W bus-powered USB devices connected during gameplay isn't exactly "realistic", though. ;)

Well, I guess we'll have to wait and see. It's probably nothing. Though there's apparently no power management in place at all right now, so maybe Nintendo is planning to implement this down the road and bump the clocks a little while they're at it. Developers couldn't access the second 3DS core until several months after launch either.
 

Log4Girlz

Member
Having four 2.5W bus-powered USB devices connected during gameplay isn't exactly "realistic", though. ;)

Well, I guess we'll have to wait and see. It's probably nothing. Though there's apparently no power management in place at all right now, so maybe Nintendo is planning to implement this down the road and bump the clocks a little while they're at it. Developers couldn't access the second 3DS core until several months after launch either.

Hopefully they also get another 256 or 512 MB of RAM.
 

z0m3le

Banned
Having four 2.5W bus-powered USB devices connected during gameplay isn't exactly "realistic", though. ;)

Well, I guess we'll have to wait and see. It's probably nothing. Though there's apparently no power management in place at all right now, so maybe Nintendo is planning to implement this down the road and bump the clocks a little while they're at it. Developers couldn't access the second 3DS core until several months after launch either.

Could this be what developers were supposedly locked out of until recently? It's an interesting problem either way; 32 watts from a 75 watt power supply is odd in 2012, IMO.
 

AmFreak

Member
Having four 2.5W bus-powered USB devices connected during gameplay isn't exactly "realistic", though. ;)

Well, I guess we'll have to wait and see. It's probably nothing. Though there's apparently no power management in place at all right now, so maybe Nintendo is planning to implement this down the road and bump the clocks a little while they're at it. Developers couldn't access the second 3DS core until several months after launch either.

DF measured 32W and a spike "just north of 33W", so use a 2.5" HDD and something else and you are at 38W, or north of 39W, or in other words "roughly 40W".
 

wsippel

Banned
DF measured 32W and a spike "just north of 33W", so use a 2.5" HDD and something else and you are at 38W, or north of 39W, or in other words "roughly 40W".
That still sounds more like peak consumption - a USB hard disk and the internal optical drive don't run at the same time under normal conditions, for example. Theoretical peak consumption using bus-powered USB devices and everything, according to Iwata, is actually 75W:

The Wii U is rated at 75 watts of electrical consumption.
Please understand that this electrical consumption rating is measured at the maximum utilization of all functionality, not just of the Wii U console itself, but also the power provided to accessories connected via USB ports.
However, during normal gameplay, that electrical consumption rating won't be reached.
Depending on the game being played and the accessories connected, roughly 40 watts of electrical consumption could be considered realistic.
Anyway, yeah, it's probably nothing.
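Putting the thread's numbers in one place, here's a small sketch of the arithmetic; the measured figures are from the DF article and Iwata's statement quoted above, while the USB draw is the theoretical bus-power maximum rather than anything measured:

```c
#include <stdio.h>

int main(void)
{
    double measured_low   = 32.0;        /* DF's steady "under load" reading, W */
    double measured_spike = 33.0;        /* the occasional jump DF observed     */
    double usb_port_max   = 5.0 * 0.5;   /* 5V x 500mA = 2.5W per USB 2.0 port  */
    int    usb_ports      = 4;
    double psu_rating     = 75.0;        /* Iwata's rated maximum               */

    double with_all_usb = measured_spike + usb_ports * usb_port_max;

    printf("DF measured:             %.0f-%.0f W\n", measured_low, measured_spike);
    printf("Per-port USB maximum:    %.1f W\n", usb_port_max);
    printf("Console + 4 USB devices: %.1f W (cf. Iwata's 'roughly 40 watts')\n", with_all_usb);
    printf("Headroom to PSU rating:  %.1f W\n", psu_rating - with_all_usb);
    return 0;
}
```

That lands in the low 40s, which is presumably where Iwata's "roughly 40 watts" figure comes from, with plenty of margin left to the 75W rating.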
 

TunaLover

Member
Not sure if this has been asked, but every game on Wii U runs with forced v-sync. Was that mandated by Nintendo, or is it just the way the system works?
 

AmFreak

Member
I thought Nintendo specifically said that the external HDD would need a power supply? I may be mistaken.

It was only an example; it could be any USB device.
Here it says you can use whatever HDD you like, but Nintendo can't guarantee functionality without a Y-cable or an externally powered one. Which makes sense, because some drives, basically all 3.5" drives, need more power than one USB port can provide.
 

heyf00L

Member
No, they said that if it didn't use a Y cable/external power source they couldn't guarantee it would operate correctly, IIRC.

Correct. This is typical for 2.5" drives and laptops. I've had some that work fine with power from one USB port and others that need the Y-cable. The ones that need a Y-cable tend to come with one.
 

Thraktor

Member
Going back a few pages here, but I was doing a bit of reading yesterday, and thought I might reply to this:

What with the OoO execution, short pipeline and large cache, it really looks like the kind of CPU you'd end up with if you wanted the best pathfinding performance possible within very small die size and thermal limits.

...or probably one of the newer ARM cores at this point. Or Intel's mobilized Atom coming from the other end and presumably AMD's low end cores. But I'm guessing other competitive solutions would cost more and not have the benefit of BC (without another added cost) and familiarity, which may have avoided some additional development transition costs.

Actually, the 750 series has a short pipeline even by the standards of short pipeline CPUs. Here's a (partial) set of data on a few of the chips you mentioned (and Xenon for comparison):

Espresso - Wii U
Cores - 3
Multithreaded - No
Clock Speed - 1.25GHz
Pipeline length - 4 stages (+2 for floating point)
L2 Cache - 2MB (Core 1), 512KB (each for Cores 0 & 2)
Out of order execution

Jaguar - AMD (speculated for PS4 and possibly next XBox)
Cores - 2 or 4 (in theory a custom chip could be higher)
Multithreaded - No
Clock speed - up to ~2GHz
Pipeline length - 14 stages (+3 for floating point, not sure about SIMD)
L2 cache - 512KB per core (shared)
Out of order execution

Atom - Intel
Cores - up to 2
Multithreaded - Yes (2 per core)
Clock speed - up to 2.13GHz
Pipeline length - 16 stages
L2 cache - up to 1MB (shared)
In order execution

Cortex-A15 - ARM
Cores - up to 4
Multithreaded - No
Clock speed - up to 2.5GHz
Pipeline length - 15 stages (+up to 10 for floating point/SIMD)
L2 cache - up to 4MB (shared)
Out of order execution

Xenon - XBox 360
Cores - 3
Multithreaded - Yes (2 per core)
Clock speed - 3.2GHz
Pipeline length - 23 stages (+4 to +14 for SIMD)
L2 Cache - 1MB (shared)
In order execution

Corrections from those more knowledgeable than me would be appreciated, but hopefully this helps illustrate why Espresso's clock speed is as low as it is. Longer pipelines allow higher clocks, but also tend to mean fewer instructions per clock (IPC) and higher penalties for branch misprediction and for code with a large degree of dependencies between close operations (which cause pipeline bubbles). Since the XBox 360 was released there's been a trend away from the long-pipeline, high-clock architecture of Pentium 4 style chips towards shorter-pipeline, high-IPC designs. Even by the standard of today's short-pipeline designs, though, Espresso has an exceptionally short pipeline. It's actually about as short a pipeline as you could possibly get, and even though it's only running at 1.25GHz, Espresso is probably the highest-clocked CPU with a 4-stage pipeline you're ever likely to see.
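As a rough back-of-the-envelope illustration of why pipeline depth matters for branchy code, here's a toy model of effective cycles per instruction where the only stall source is branch misprediction; the base CPI, branch fraction and misprediction rate are assumptions for illustration, not measured figures for any of these chips:

```c
#include <stdio.h>

/* Toy model: effective cycles per instruction when the only stall source is
   branch misprediction, whose penalty is taken to be roughly the pipeline
   depth (the whole pipe has to be flushed and refilled). */
static double effective_cpi(double base_cpi, int pipeline_stages,
                            double branch_fraction, double mispredict_rate)
{
    return base_cpi + branch_fraction * mispredict_rate * (double)pipeline_stages;
}

int main(void)
{
    /* Assumed workload: 20% branches, 10% of them mispredicted. */
    double br = 0.20, miss = 0.10;

    printf("4-stage  (Espresso-like): %.2f cycles/instruction\n",
           effective_cpi(1.0, 4, br, miss));
    printf("23-stage (Xenon-like):    %.2f cycles/instruction\n",
           effective_cpi(1.0, 23, br, miss));
    return 0;
}
```

Even with the worse CPI, the deeper pipeline's higher clock can still win on straight-line code, which is exactly the trade-off described above.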
 

Durante

Member
Nah. Sorting a few hundred polygons on the CPU doesn't take much power. Old software renderers lacking z-buffers did it without a problem while also rasterizing the scene, e.g. Quake before GLQuake running on a Pentium with MMX™.
Yeah, I also don't consider it particularly likely, but it's one of the few explanations beyond the obvious anyone has tried so far, so I thought it worth bringing up.

In the clip AlStrong linked to above it looks like "traditional" alpha-related drops -- tied to the amount and screen size of the blending effects. Which is still surprising to me, since the eDRAM, with the bandwidth numbers thrown around in this thread, shouldn't let that happen.
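For a sense of scale on that last point, here's a crude estimate of the blending traffic for a heavy full-screen effect; the resolution, overdraw and bytes-per-blend figures are all assumptions for illustration:

```c
#include <stdio.h>

int main(void)
{
    double width  = 1280, height = 720;  /* assumed render resolution            */
    double layers = 4.0;                 /* assumed overdraw of blended layers   */
    double bytes_per_blend = 4.0 * 2.0;  /* RGBA8 framebuffer read + write       */
    double fps    = 60.0;

    double bytes_per_frame = width * height * layers * bytes_per_blend;
    double gb_per_sec = bytes_per_frame * fps / 1e9;

    printf("Blending traffic: ~%.1f GB/s\n", gb_per_sec);
    return 0;
}
```

Even with generous overdraw that comes out well below the eDRAM bandwidth figures that have been quoted in this thread, which is why the slowdowns are puzzling.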
 

beje

Banned
Corrections from those more knowledgeable than me would be appreciated, but hopefully this helps illustrate why Espresso's clock speed is as low as it is. Longer pipelines allow higher clocks, but also tend to mean fewer instructions per clock (IPC) and higher penalties for branch misprediction and for code with a large degree of dependencies between close operations (which cause pipeline bubbles). Since the XBox 360 was released there's been a trend away from the long-pipeline, high-clock architecture of Pentium 4 style chips towards shorter-pipeline, high-IPC designs. Even by the standard of today's short-pipeline designs, though, Espresso has an exceptionally short pipeline. It's actually about as short a pipeline as you could possibly get, and even though it's only running at 1.25GHz, Espresso is probably the highest-clocked CPU with a 4-stage pipeline you're ever likely to see.

I'm a bit of a layman when it comes to CPU architecture. Does a shorter pipeline mean higher "performance per megahertz", to put it in a simplified way? Or does it require extra expertise from the coders to take advantage of the shorter pipeline?
 
Been wondering about the memory setup, and which memory pool is accessed first. Is it the 1GB for game data and graphics, which then gets moved to the 32MB of embedded memory in chunks?
 

Thraktor

Member
I'm a bit of a layman when it comes to CPU architecture. Does a shorter pipeline mean higher "performance per megahertz", to put it in a simplified way?

Sort of. What Intel found with the Pentium 4 was that the extra-long pipeline caused so many problems that it effectively cancelled out the benefits of the high clock speed. Hence they switched over to a lower-clocked, shorter-pipeline Core architecture, where the "performance per megahertz", as it were, was so much better that it didn't need such high clock speeds.

I wouldn't necessarily say that Espresso's 4 stage pipeline is going to make it super-efficient, though. It's likely that an 8-12 stage design could have given them twice the clock-speed with only a small hit to per-clock performance. The sweet spot these days seems to be within the 10-20 stage range, giving good IPC without sacrificing too much in the way of clock speed.

Or does it require extra expertise from the coders to take advantage of the shorter pipeline?

If anything, it's the other way round. To get good performance out of long-pipeline processors, you need to structure your code in a way which avoids pipeline stalls and bubbles. With a pipeline as short as Espresso's, though, you can pretty much do whatever you want, as a maximum 6-cycle penalty is fairly negligible in the scheme of things.

It's also the kind of thing which affects different kinds of code differently (as do all aspects of CPU architectures). The reason for my original comment was that short pipelines are particularly helpful for pathfinding algorithms, as they're very "branchy".
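To illustrate what "branchy" means here, a typical grid-pathfinding neighbour check is little more than a chain of data-dependent conditionals; the Node layout and function below are hypothetical, just a sketch of the access pattern:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical node layout for a grid-based pathfinder. */
typedef struct {
    uint8_t walkable;   /* 0 = wall                        */
    uint8_t closed;     /* already finalised by the search */
    float   g;          /* best path cost found so far     */
} Node;

/* One neighbour check in an A*-style expansion.  Nearly every line is a
   conditional branch whose outcome depends on map data, which is why a
   short pipeline (small misprediction penalty) suits this kind of code. */
static bool consider_neighbour(Node *grid, int width, int x, int y,
                               float tentative_g)
{
    Node *n = &grid[y * width + x];

    if (!n->walkable)        return false;  /* blocked cell      */
    if (n->closed)           return false;  /* already finalised */
    if (tentative_g >= n->g) return false;  /* no improvement    */

    n->g = tentative_g;                     /* relax the edge    */
    return true;   /* caller would push this node onto the open list */
}

int main(void)
{
    Node grid[4] = { {1, 0, 99.f}, {1, 0, 99.f}, {0, 0, 99.f}, {1, 1, 99.f} };
    /* Try to relax the node at (1,0) on a 2-wide grid with cost 3.5. */
    printf("improved: %d\n", consider_neighbour(grid, 2, 1, 0, 3.5f));
    return 0;
}
```

Almost every line is a branch whose outcome depends on map data the predictor can't easily guess, so a short pipeline's small misprediction penalty pays off.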
 

beje

Banned
Sort of. What Intel found with the Pentium 4 was that the extra-long pipeline caused so many problems that it effectively cancelled out the benefits of the high clock speed. Hence they switched over to a lower-clocked, shorter-pipeline Core architecture, where the "performance per megahertz", as it were, was so much better that it didn't need such high clock speeds.

I wouldn't necessarily say that Espresso's 4 stage pipeline is going to make it super-efficient, though. It's likely that an 8-12 stage design could have given them twice the clock-speed with only a small hit to per-clock performance. The sweet spot these days seems to be within the 10-20 stage range, giving good IPC without sacrificing too much in the way of clock speed.

If anything, it's the other way round. To get good performance out of long-pipeline processors, you need to structure your code in a way which avoids pipeline stalls and bubbles. With a pipeline as short as Espresso's, though, you can pretty much do whatever you want, as a maximum 6-cycle penalty is fairly negligible in the scheme of things.

It's also the kind of thing which affects different kinds of code differently (as do all aspects of CPU architectures). The reason for my original comment was that short pipelines are particularly helpful for pathfinding algorithms, as they're very "branchy".

Very informative, thanks! I knew about the Pentium 4 to Core switch, but I didn't really know what exactly was causing the trouble. I guess it's Nintendo being Nintendo again, as it sounds like a CPU that requires very little code optimization.
 

Thraktor

Member
Very informative, thanks! I knew about the Pentium 4 to Core switch, but I didn't really know what exactly was causing the trouble. I guess it's Nintendo being Nintendo again, as it sounds like a CPU that requires very little code optimization.

Well, I think Nintendo's desire for 100% BC is the main factor. Rejigging a more modern core (say a 476FP) for full Broadway BC would probably be more hassle than it's worth, and it seems they didn't want to go the heterogeneous cores route.
 
I'm a bit of a layman when it comes to CPU architecture. Does a shorter pipeline mean higher "performance per megahertz", to put it in a simplified way? Or does it require extra expertise from the coders to take advantage of the shorter pipeline?

It has already been answered, but here's a short explanation of what pipelining is:

Imagine a CPU instruction (for example the multiplication of two values).
Without pipelining, the whole instruction is executed in one clock cycle.

With pipelining, the execution of instructions is separated into several stages. The more stages you have, the less work needs to be done per stage; and the less work there is to be done per stage, the higher the clock rate can be set (that's why the P4's long pipeline made such high clock rates possible so early).
Each single instruction still needs to run through all the stages, of course (giving no performance advantage so far). The point is: as soon as an instruction arrives in the second stage of the pipeline, the following instruction can already be pushed into the first stage, and so on. In the end, once the pipeline is filled, one instruction is finished every clock cycle (with each clock cycle, every in-flight instruction is pushed one step further through the pipeline). Which, in the best case, gives you the full advantage of the higher clock rate.

As you can imagine, this is much more complex and not without drawbacks in reality. Each pipeline stage introduces a bit of overhead. And sometimes (or quite often, depending on the application), the pipeline needs to be stalled or large portions of it cleared. Keywords if you want to read up on this are: hazards (data, structural, control) and branch prediction.
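A quick numeric illustration of the ideal-case gain from the textbook model above (no hazards, no flushes; the instruction count and stage count are arbitrary examples):

```c
#include <stdio.h>

int main(void)
{
    double n = 1000.0;   /* instructions to execute         */
    double s = 14.0;     /* pipeline stages (example depth) */

    /* Textbook ideal case: pipelining shortens the clock period by ~s,
       and a full pipe retires one instruction per (short) cycle.
       Time without pipelining: n long cycles = n * s short cycles.
       Time with pipelining:    s short cycles to fill + (n - 1) more.  */
    double speedup = (n * s) / (s + (n - 1.0));

    printf("Ideal speedup from a %.0f-stage pipeline over %.0f instructions: %.1fx\n",
           s, n, speedup);
    return 0;
}
```

With 1000 instructions through a 14-stage pipeline that works out to roughly a 13.8x speedup in the ideal case; hazards and flushes are what eat into it in practice.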
 

beje

Banned
It has already been answered, but here's a short explanation of what pipelining is:

Imagine a CPU instruction (for example the multiplication of two values).
Without pipelining, the whole instruction is executed in one clock cycle.

With pipelining, the execution of instructions is separated into several stages. The more stages you have, the less work needs to be done per stage; and the less work there is to be done per stage, the higher the clock rate can be set (that's why the P4's long pipeline made such high clock rates possible so early).
Each single instruction still needs to run through all the stages, of course (giving no performance advantage so far). The point is: as soon as an instruction arrives in the second stage of the pipeline, the following instruction can already be pushed into the first stage, and so on. In the end, once the pipeline is filled, one instruction is finished every clock cycle (with each clock cycle, every in-flight instruction is pushed one step further through the pipeline). Which, in the best case, gives you the full advantage of the higher clock rate.

As you can imagine, this is much more complex and not without drawbacks in reality. Each pipeline stage introduces a bit of overhead. And sometimes (or quite often, depending on the application), the pipeline needs to be stalled or large portions of it cleared. Keywords if you want to read up on this are: hazards (data, structural, control) and branch prediction.

Oh, I understand. So one of the most common drawbacks of a long pipeline, I guess, would be waiting for the result of an operation that has been sent into the pipeline and is still stuck in there.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
No real idea, other than maybe it was something to do with the typical memory access pattern for this work. Just trying to think about why they'd bother to even make it asymmetrical. Do you have any ideas on why they would even do this?

Another possibility is that it is the core that the OS will preempt, and games don't have access to all of the cache (to keep the OS thread from screwing with the cache for the game thread).
The cache asymmetry being related to a hypothetical OS-dedicated core was one of the earliest theories speculated about in some of the earlier WUSTs. My current opinion, though, is that the cacheful core is just as accessible to games as the rest of the cores. It is just the go-to core for any task that simply needs hefty cache - as simple as that (yes, it did take me some pondering to reach that opinion).

It's actually about as short a pipeline as you could possibly get, and even though it's only running at 1.25GHz, Espresso is probably the highest-clocked CPU with a 4 stage pipeline you're ever likely to see.
Indeed.

Oh, I understand. So one of the most common drawbacks of a long pipeline, I guess, would be waiting for the result of an operation that has been sent into the pipeline and is still stuck in there.
It's more complicated than that - most contemporary pipelines have early-out mechanisms for various ops, allowing for the results of various computations to be fed back into the pipeline without having to send those down the entire pipeline length. But generally yes - the shorter the pipeline, the easier it is to get it to a sustained level of performance closer to optimal - operations' latencies are better, penalties - smaller, etc. Last but not least, it's easier for compilers to produce better code.
 

Thraktor

Member
The cache asymmetry being related to a hypothetical OS-dedicated core was one of the earliest theories speculated about in some of the earlier WUSTs. My current opinion, though, is that the cacheful core is just as accessible to games as the rest of the cores. It is just the go-to core for any task that simply needs hefty cache - as simple as that (yes, it did take me some pondering to reach that opinion).

It does seem you're right on this one, but I still think 2MB is a lot of cache for one single-threaded core on a system with one gig of game-accessible RAM. That nearly matches the cache-crazy Power7+, which has 2.5MB of L3 per hardware thread (80MB in all), and is designed for running through data sets scaling up to the petabytes. And outside of IBM's server chips (their brand-new 32nm ones at that), nothing I can find is close to 2MB of cache per thread.

Of course, that's not to say it's not potentially useful, and I'm sure something like Autodesk Kynapse will run very nicely on that core. It's just a little odd, is all.
 
Is the power usage being recorded at the plug? If it is, wouldn't that mean that the system uses even less power, due to PSU inefficiency?
 

AzaK

Member
It does seem you're right on this one, but I still think 2MB is a lot of cache for one single-threaded core on a system with one gig of game-accessible RAM. That nearly matches the cache-crazy Power7+, which has 2.5MB of L3 per hardware thread (80MB in all), and is designed for running through data sets scaling up to the petabytes. And outside of IBM's server chips (their brand-new 32nm ones at that), nothing I can find is close to 2MB of cache per thread.

Of course, that's not to say it's not potentially useful, and I'm sure something like Autodesk Kynapse will run very nicely on that core. It's just a little odd, is all.

Noob question incoming... Could this cache be accessible (even partitioned) by the GPU at all, so that CL stuff could be placed in it, or would that typically go into the eDRAM?
 

Thraktor

Member
Noob question incoming... Could this cache be accessible (even partitioned) by the GPU at all, so that CL stuff could be placed in it, or would that typically go into the eDRAM?

In theory yes (Xenos can apparently do that with Xenon's cache), but there's really no point in hogging the CPU's cache when you've got 32MB of eDRAM right there on die next to you.
 

Earendil

Member
So I'm guessing that if a processor with a long pipeline stalls, this has a greater overall effect on performance than if a processor with a short pipeline stalls. Am I correct in assuming that?
 

MDX

Member
So I'm guessing that if a processor with a long pipeline stalls, this has a greater overall effect on performance than if a processor with a short pipeline stalls. Am I correct in assuming that?

You mean something like this:


The situation is called "branch mis-prediction." The entire pipeline must be reloaded and all the steps done on those 30 instructions will be lost. Furthermore, the power required to execute those 30 instructions will be wasted as heat. The situation will be even more wasteful if a memory access was required by those lost instructions.
 

AzaK

Member
In theory yes (Xenos can apparently do that with Xenon's cache), but there's really no point in hogging the CPU's cache when you've got 32MB of eDRAM right there on die next to you.

I'm not sure how contention would be managed between the GPU and CPU if this was done, but I assume it'd be a really bad idea if you want performant code. I was just wondering if it'd then free up another meg or so in the eDRAM for textures and draw lists etc. Wii U seems to need as much as it can get of everything :)
 

ozfunghi

Member
There was another measurement of the Wii U's power draw taken the day after the US launch. I don't remember where it was from, but it stated that just running the dashboard consumed only 1W less than ZombiU at 32 or 33W. Obviously you would expect that difference to be greater, even if that game isn't going to win any prizes for good looks.
 

JohnB

Member
Power ISA. More specifically the instruction set of the PPC 750CL, or some superset thereof.

Thanks; this thread has taken me (way) back in time to when I did Z80 assembler, and reminds me of how chips can be used to produce great stuff.
 

wsippel

Banned
There was another measurement of the Wii U's power draw taken the day after the US launch. I don't remember where it was from, but it stated that just running the dashboard consumed only 1W less than ZombiU at 32 or 33W. Obviously you would expect that difference to be greater, even if that game isn't going to win any prizes for good looks.
And that 1W difference probably comes from mass storage access. I'm pretty sure power management was mentioned in one of the LinkedIn profiles I found a while ago, so my guess is that it simply isn't enabled in the current firmware.
 