
WiiU technical discussion (serious discussions welcome)

OniShiro

Banned
Matt, when you said the GPU clock was "a little slower" than 600MHz... was that to be taken literally or not? Many seem to think you really meant "little" when you said "little" and are hoping for it to be somewhere between 550-599 MHz, for instance, while it could also be interpreted as "it's going to be slower, guys", which could mean anything as low as 400 MHz.

The console uses 35W. It will probably be less than 550MHz. A 4870M at 600MHz uses 65W.
 
Direction split? So it can't read at the full BW (20-odd GB/s). Is the BW fixed for read and fixed for write at some portion of the total?


And Matt, Lherre, that wasn't fair to do to wsippel :)

Or me! I've wasted a lot of time assuming multipliers as well! :p
 

Durante

Member
Thanks for the summary, it puts the Wii U under an interesting lens. I am intrigued by the fact that MEM1 is not just for framebuffer purposes and that local render targets and other operations can be performed without having to spend time and bandwidth resolving out to main RAM. What you are saying there implies that the front display buffer resides in MEM2 though, correct (like on GCN and Wii)? The CPU and GPU both being able to write and read data at high speed and low latency on MEM1 could open very interesting possibilities, unless there is some weird bottleneck.

When FlexIO specs were revealed, many coders were salivating at the prospect of CPU and GPU talking over a fast two-way bus with over 20 GB/s of bandwidth each way... then it was revealed that the CPU could not read or write anywhere near that speed, only the GPU could... you seem to be saying that this is not the case on Wii U, and knowing Nintendo's history it shouldn't be.
Yeah, it would be really important to get GPU <-> eDRAM and CPU <-> eDRAM bandwidth and latency (or at least some idea of whether they are similar or an order of magnitude different) to envisage possible usage scenarios.

What currently stumps me with regard to the GPU's connection to the eDRAM is the fact that some games slow down in scenes with heavy alpha blending. Given that I expect even the worst ports to put their framebuffers into the eDRAM I wonder what might cause this.
 
Considering the die size, 320 seems really low, or a waste of space.

I'm of the opinion that there is much more eDRAM overhead there that we are not accounting for. It's also possible that there is some redundant logic on the chip to account for yields.

Blu, I'd be interested in hearing your thoughts on the eDRAM config. You were once kind enough to give me a thorough explanation of how Flipper's texture cache acted to reduce latency. Given your knowledge of the subject, might it be necessary to divide Wii U's eDRAM into 32 macros in order to emulate the texture cache in BC mode?

Edit: But perhaps we have evidence to the contrary. Wii U games are reportedly experiencing slowdown in scenes utilizing alpha blending. Might this point towards a slower connection than the one between Xenos' eDRAM and ROPS?
 

Thraktor

Member
I believe the NXP chip was an interesting possibility, but BC does kind of throw a monkey wrench into that idea. Could Nintendo have just bought the IP from Macronix at some point and be free to make their own changes?

Yep, it's even possible that Nintendo owned the IP outright from the start, as it was apparently a highly customised design.

This definitely leaves some more possibilities open. It did somewhat irk me that if Nintendo stuck to clean multipliers, it would seem to arbitrarily keep some clocks extra low (RAM and CPU mainly), and only because you're working with the least important part of your system (the DSP) as a base. I disagree with your assessment that there are 640 ALUs on the GPU (I think half that more likely), but when I lowered my expectations in that regard, I figured they'd be able to crank the clock slightly higher (close to 600 MHz). Just to grab a number out of the air, 575 MHz with 320 SPUs would get you to 368 GFLOPS, just about 1.5x Xenos.
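Just to spell out the napkin math behind that 368 figure (a quick sketch; it assumes the usual 2 FLOPs per SPU per clock for these VLIW5 parts, and the ~240 GFLOPS commonly quoted for Xenos):

```python
# Rough GFLOPS estimate for a VLIW5-style GPU:
# each SPU is assumed to do one multiply-add (2 FLOPs) per cycle.
def gflops(spus: int, clock_mhz: float, flops_per_spu_per_cycle: int = 2) -> float:
    return spus * flops_per_spu_per_cycle * clock_mhz / 1000.0

print(gflops(320, 575))   # 368.0 GFLOPS, ~1.5x Xenos' ~240 GFLOPS
print(gflops(640, 575))   # 736.0 GFLOPS for the 640-SPU case
```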

As already mentioned, 320 would result in a much smaller die, even with massive eDRAM overhead and some seriously transistor-intensive customisation of the GPU architecture. It's possible that we're looking at 480, if Nintendo have added a lot of transistors in there. Though I still think an RV740 base is likely; have a look at this RV770 die shot (I couldn't find one for the RV740). Removing the GDDR5 and PCIe interfaces from the R700 architecture frees up quite a bit of space on the die.

The system RAM, I'd run as high as possible (800 MHz) and the CPU twice that for a clean 1.6 GHz. As for the eDRAM, I'd still be surprised if it wasn't running at the same speed as the GPU, but perhaps it's also running at 800 MHz in order to match the bandwidth of the main memory pool (assuming there is a 128-bit connection from CPU to GPU). Just a huge guess here, but wouldn't it be better for the eDRAM to have to wait around for the GPU a few extra cycles (and it can even be serving the CPU in that time) than the other way around?

My point on the identical speeds of the DDR3 and eDRAM is that you could have a single memory controller, operating at 800MHz, controlling access to both. Thus the CPU could just have a bus to that memory controller, rather than two separate busses for MEM1 and MEM2. Similarly the ARM cores and DSP could run all memory access through the same controller.

On the synchronised clocks side of things, there's very little reason for the CPU and DDR3 to be highly synchronised. Total random access latency for DDR3 is somewhere around 50ns. If you have CPU and RAM on asynchronous clocks, you increase that latency to perhaps 51ns. It's not a big enough difference to make artificially reducing your CPU clock worthwhile.
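To put a rough number on that 50ns vs 51ns point (a toy sketch only; real synchroniser costs vary, but the penalty is on the order of a clock period, and the 800MHz DRAM clock here is just the figure I assumed above):

```python
# DDR3 random-access latency vs. the cost of an asynchronous clock crossing.
ddr3_latency_ns = 50.0                       # ballpark total random access latency
dram_clock_hz = 800e6                        # assumed DDR3 I/O clock
crossing_penalty_ns = 1e9 / dram_clock_hz    # worst case ~ one clock period = 1.25ns

print(f"sync:  {ddr3_latency_ns:.2f} ns")
print(f"async: {ddr3_latency_ns + crossing_penalty_ns:.2f} ns")   # ~51.25 ns
```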

With eDRAM, though, the benefit is (largely) in the very low latencies involved. If you had eDRAM which operated at single-cycle latency, then asynchronous clocks between GPU and eDRAM could double that to two cycles (even if the eDRAM's operating at a higher clock), which is a significant increase. This isn't going to be much of an issue for the eDRAM's use as a framebuffer, which is largely concerned with high-bandwidth writes, but once you start pushing compute loads to the GPU it could certainly become an issue.

Consider, for example, Unreal Engine 4. UE4's main innovation is a lighting system based on sparse voxel octree global illumination (SVOGI). The final part of SVOGI consists of running cone-traces over the octree to determine the second-bounce illumination over the scene. This is heavily latency-bound code which is intended to be run on the GPU, and it's occurred to me recently that, if it is true that Epic have decided to support Wii U with UE4, it's likely because they've figured out a way to keep chunks of the octree in the eDRAM during the cone-traces, benefitting from the incredibly low latency the eDRAM provides. Asynchronous GPU/eDRAM clocks would give extra bandwidth for traditional GPU tasks, but it would significantly hinder latency-bound GPU code like this, which is likely to become more and more common as the generation progresses.

What currently stumps me with regard to the GPU's connection to the eDRAM is the fact that some games slow down in scenes with heavy alpha blending. Given that I expect even the worst ports to put their framebuffers into the eDRAM I wonder what might cause this.

Is it possible that this is something as simple as a poorly-optimised API?

Edit: But perhaps we have evidence to the contrary. Wii U games are reportedly experiencing slowdown in scenes utilizing alpha blending. Might this point towards a slower connection than the one between Xenos' eDRAM and ROPS?

The bandwidth between the GPU and daughter die is much more important in Xenos; it's the bottleneck that Wii U's eDRAM would have to exceed. Based on my calculations from above (and assuming the link between GPU and ROPs/eDRAM is 32GB/s on Xenos, which I know is disputed), the worst case scenario is a 60% increase in bandwidth over that.
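Spelling that worst case out (using the disputed 32GB/s Xenos figure and the lowest entry from my earlier eDRAM numbers, a 1024-bit interface at 400MHz):

```python
# Worst-case Wii U eDRAM bandwidth vs. Xenos' GPU -> daughter-die link.
xenos_link_gbps = 32.0                        # disputed figure for Xenos
wiiu_worst_gbps = 1024 / 8 * 0.4              # 1024-bit bus at 400MHz = 51.2 GB/s

print(wiiu_worst_gbps)                        # 51.2
print(wiiu_worst_gbps / xenos_link_gbps - 1)  # ~0.6, i.e. a 60% increase
```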
 

MDX

Member
Seems like Nintendo is just throwing everybody curve balls.

Whatever technology or new console design Nintendo has come up with, they probably had to take into consideration how it would scale for the next console.

They probably won't design their next console from scratch again until 4K becomes the norm.
 

Panajev2001a

GAF's Pleasant Genius
Ok, a heretical thought occurred to me on my way back home from work today re the asymmetric caches. I think I'll toss it here.

What if one of the CPUs has so much more cache than the others because... it just makes sense from an application's point of view? Let me explain.

Imagine you have a fixed silicon budget - you have 3 cores, and a fixed amount of cache to go with those at a given fab node. Your first thought would be to make it all symmetrical, yes? Well, perhaps not. Because you know two things:

1. Your system RAM is of less-than-stellar characteristics (and even if BW was much better, latency would not have been).
2. More often than not, games treat their multiple cores as domain workers - e.g. core 0 does the physics, core 1 does the draw calls, core 2 does the game logic, core 3 does the sound (blessed be those DSP-less systems), etc. Now, of those domains, not all workloads tend to have equal memory access patterns - some of them can do with more streaming-style access patterns (i.e. easier to handle with smaller caches), whereas others can be a cache's worst nightmare. By making things absolutely symmetrical you'd give some of those domains more cache than they'd actually need, while at the same time depriving other domains of the cache they could make good use of.

So, think again - would you be going symmetrical?

With only three cores and no SMT for any of them, I would think going with asymmetric caches could be a very smart thing if you cannot afford a more decent cache hierarchy for each core of your SMP solution. This is the essence of engineering: delivering a result with the resources you have, not the ones you would like to have.
 

AlStrong

Member
Yeah, I didn't really know what, if any, overhead was required. It would certainly affect the calculations, although it's hard to say by how much without numbers from Renesas.

Overhead should be somewhat similar since you're mostly just making room for the read/write ports, power supply, word line/column decode etc.

I would imagine the 64Mb/256bit macros would maximise density/minimise overhead.
The problem with that, though, is that you're then limiting the bus connection. You're still getting overhead for the voltage & access. Beyond that, there are going to be some design accommodations for a larger array in terms of power anyway (accessing something within a 256K-tall column is going to be a bit more difficult than something much shorter); eDRAM integration is going to be more than just the array anyway; lots of design considerations. :)

edit: There will be implications for performance as well - capacitance/reliability considerations @ a given clock. You'd probably want a fair chunk of redundancy too... There'd still be overhead to accommodate access to the other relevant processors (multiple ports for simultaneous access).

Ultimately, I doubt you'd get near the quoted cell density overall (for the space considered to be "eDRAM" vs GPU-proper etc).

Ah, thanks.
 

AlStrong

Member
Is it possible that this is something as simple as a poorly-optimised API?

Not likely. Since most console games are using lower res alpha, they would have to completely botch access to the depth buffer, like DX9, and then some (how RTs are handled). :p Shouldn't be an issue for a console.

It's a fairly straightforward operation. Draw scene, downsize depth, draw particles to the off-screen buffer, compare, upscale & merge back into the scene.

Otherwise, the effect of resolution is pretty much linear in a high bandwidth scenario (fillrate, pixel shading costs are next), but you're reducing bandwidth consumption anyway when reducing particle res.
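A quick illustration of that linear scaling (napkin math only; the overdraw factor here is invented purely for the example):

```python
# Relative alpha-blend cost for particles at full vs. reduced resolution.
# Blending cost scales roughly linearly with pixels touched (pixels * overdraw).
def blend_cost(width: int, height: int, overdraw: int) -> int:
    return width * height * overdraw

full = blend_cost(1280, 720, overdraw=8)   # particles drawn at native 720p
half = blend_cost(640, 360, overdraw=8)    # same particles at half res per axis

print(full / half)   # 4.0 -> roughly a quarter of the fill/bandwidth for the particle pass
```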
 

Argyle

Member
Cache thrashing is equivalent to core idling. Particularly on cores where the context switch is not such a heavy op, the waiting for the re-populating of the caches with the relevant data can be anything but cheap. Gone are the times when ALU prowess would win the day - sustained CPU throughput is primarily about data access patterns today, ALU coming second.

While this is correct, I think that the thousands of cycles lost to cache thrashing (and I don't disagree that this can add up, a classic death by a thousand cuts scenario) is not as bad as the millions of cycles lost when your physics core goes idle for lack of work and your AI core ends up bottlenecking the frame. We may have to agree to disagree here.

Avoiding hitting main mem is definitely a sound priority, but by going ps3-style on the WiiU you would under-utilize the platform - as you note WiiU cores are not SPUs - the latter have much less local memory, and much higher ALU rates (not to mention I don't expect U-CPU's conduit to GPU's eDRAM to hit SPU scratchpad levels of performance in either BW or latency). IMO WiiU won't be the ideal port buddy for ps3-tailored pipelines. But then again, I don't see many platforms that would be good ps3 port buddies (yes, I'm aware of the Frostbite pipeline, no, I don't expect it to run equally well on the ps3 and 360 without a good dose of platform specialization).

The point of starting with the PS3 code is that the data locality problem has already been solved, as the data has already been parceled out into small chunks that can easily fit into cache or EDRAM scratchpad. It may not be necessary to even use the EDRAM scratchpad if each core's cache can be controlled effectively, although it may still be beneficial to preload the next job's data into scratchpad (as I'm assuming it's going to be quicker to warm the cache from EDRAM than main mem) if you already know what the next job is going to be.

Honestly though (and don't take this the wrong way) - I'm not really seeing a counter-proposal from you here, so if this underutilizes the platform (how?), can you explain to me how you would ideally break up the work for the Wii U?

No, cores have their own L2 caches. You might be able to effectively use GPU's eDRAM as L3 shared cache scratchpad, which I'd argue would be a better approach than playing SPU-style isolated scratchpads in the same pool.

We may also be trying to answer different questions in our heads here. I guess I am hearing that all the multiplatform games run horribly, which shocked me because I guess I bought into the hype that it would be a little bit more powerful than PS3/Xbox 360. So I am coming at it from the question of "well, if you had to port an existing engine over" - which is pretty much the situation that everyone who is not working at Nintendo is in - what would you do?

If you wanted to write something that only ran on the Wii U, sure, I don't disagree that would be a better approach. But again, unless you're Nintendo or you're working on a contract for Nintendo, I'm not sure who is going to be in that situation.

What currently stumps me with regard to the GPU's connection to the eDRAM is the fact that some games slow down in scenes with heavy alpha blending. Given that I expect even the worst ports to put their framebuffers into the eDRAM I wonder what might cause this.

Here's a thought - what if it's not the GPU that is the bottleneck in that situation after all? Usually the thing that causes alpha blending madness is a ton of particles being drawn because something is blowing up, maybe it is actually the CPU choking on all the particles? (I'm on vacation = on crappy hotel internet, have not looked at the videos illustrating the slowdown in various games to see what is going on)
 

Matt

Member
Matt, when you said the GPU clock was "a little slower" than 600MHz... was that to be taken literally or not? Many seem to think you really meant "little" when you said "little" and are hoping for it to be somewhere between 550-599 MHz, for instance, while it could also be interpreted as "it's going to be slower, guys", which could mean anything as low as 400 MHz.

I chose my words carefully.
 
So if Orbis/Durango are the PSVita, does that make the Wii U the 3DS? If so, that's fine by me. Also, has it been "decided" how many SPUs it has? I'm guessing 480 (96 groups of 5D shaders) and a clock speed of around 525MHz. On a semi-related note, how big is the die size of the Radeon HD 4770? I know that the Radeon Mobility HD 4830 (which is also RV740) is only 136.89mm² and has the same number of shaders as the 4770.
 

Thraktor

Member
So if Orbis/Durango are the PSVita, does that make the Wii U the 3DS? If so, that's fine by me. Also, has it been "decided" how many SPUs it has? I'm guessing 480 (96 groups of 5D shaders) and a clock speed of around 525MHz. On a semi-related note, how big is the die size of the Radeon HD 4770? I know that the Radeon Mobility HD 4830 (which is also RV740) is only 136.89mm² and has the same number of shaders as the 4770.

The 4770 and 4830M (and 4860M for that matter) are all the same die, codenamed RV740. Hence, they're all 137mm².

On the subject of SPUs, it's far from decided. Further up this page I posted a calculation based on the die size of the GPU that led me to believe that it's a 640 SPU part, and based on the RV740 in particular. However, the facts before me seem to have changed, so I must change my mind. Given info from AlStrong on the eDRAM overhead, and Matt's choice of words regarding the clock speeds, I now think something like 480 would be a more likely bet.
 
so, to a layman, how's the Wii U look under the hood, is it really as bad as some are making it out to be?

Quite honestly, we don't know. We only have vague ideas such as "the CPU doesn't have much raw power, so it will need some masterful optimization to get things running" and "the GPU is, so far, pretty good but performance is bottlenecked by the CPU", and the RAM is just "meh".
 

Reallink

Member
Real talk here guys, how far is optimization really going to get us? I mean it takes like 20 seconds to boot the browser and 30 or 40 to load video apps. Are we ever really going to see that improved to the 1-5 second range most people expect from a $350 box of 2012 technology? Is it even conceivable that Nintendo would go to market with software THAT poorly optimized, or do you think performance is at such a (low) point that a lot of it has to be down to an anemic CPU or whatever.
 

USC-fan

Banned
No way a GPU is using 30 watts of the 34 watts the system is using. That is impossible. In consoles, most games use the same amount of power. R700 at 40nm is 12 GFLOPS per watt, and if you move to 28nm you may move to 15. Really high end, the Wii U GPU is using 25 watts, so around 300 GFLOPS at 40nm and 375 GFLOPS at 28nm. At best you are looking at about 1.5x PS360. But you also have to factor in the second screen.

The biggest increase in performance is most likely not shown on paper with the gpu.

Real talk here guys, how far is optimization really going to get us? I mean it takes like 20 seconds to boot the browser and 30 or 40 to load video apps. Are we ever really going to see that improved to the 1-5 second range most people expect from a $350 box of 2012 technology? Is it even conceivable that Nintendo would go to market with software THAT poorly optimized, or do you think performance is at such a (low) point that a lot of it has to be down to an anemic CPU or whatever.
No one can really answer that. If the bottleneck is the flash storage, then there is no way around that. You are only as fast as your slowest part.
 

JordanN

Banned
Quite honestly, we don't know. We only have vague ideas such as "the CPU doesn't have much raw power, so it will need some masterful optimization to get things running" and "the GPU is, so far, pretty good but performance is bottlenecked by the CPU", and the RAM is just "meh".
Assuming your reasons are what I think they are, I think Shin'en's point can't be stressed enough: the Wii U can't be judged at face value alone.


"The CPU and GPU are a good match. As said before, today’s hardware has bottlenecks with memory throughput when you don’t care about your coding style and data layout. This is true for any hardware and can’t be only cured by throwing more megahertz and cores on it. Fortunately Nintendo made very wise choices for cache layout, ram latency and ram size to work against these pitfalls. Also Nintendo took care that other components like the Wii U GamePad screen streaming, or the built-in camera don’t put a burden on the CPU or GPU."
 

AzaK

Member
I chose my words carefully.
Hmm. We do appreciate any info you have, Matt, but if you're going to come in and risk your job or some such anyway, are you able to let us know the clock?

But from what you said it's sounding quite nice actually.

Real talk here guys, how far is optimization really going to get us? I mean it takes like 20 seconds to boot the browser and 30 or 40 to load video apps. Are we ever really going to see that improved to the 1-5 second range most people expect from a $350 box of 2012 technology? Is it even conceivable that Nintendo would go to market with software THAT poorly optimized, or do you think performance is at such a (low) point that a lot of it has to be down to an anemic CPU or whatever.
Yeah, this is my concern. The load times are horrendous, and I really don't know what the hell they are doing. I hope it's not slow flash, because if that's the case it ain't really going to get much better, but I'd be very surprised if Nintendo let shit that bad get through. I guess it's possible Nintendo are running debug code or decompressing everything to save space, but that's probably a stretch.

It's bad though, real bad, and it will severely hamper my enjoyment of it if it remains. It will make some of the selling points a real chore to use.
 

Kai Dracon

Writing a dinosaur space opera symphony
Assuming your reasons are what I think they are, I think Shin'en's point can't be stressed enough: the Wii U can't be judged at face value alone.


"The CPU and GPU are a good match. As said before, today’s hardware has bottlenecks with memory throughput when you don’t care about your coding style and data layout. This is true for any hardware and can’t be only cured by throwing more megahertz and cores on it. Fortunately Nintendo made very wise choices for cache layout, ram latency and ram size to work against these pitfalls. Also Nintendo took care that other components like the Wii U GamePad screen streaming, or the built-in camera don’t put a burden on the CPU or GPU."

At the end of the day, after all the kvetching and hand-wringing, Manfred / Shin'en has a real game out on Wii U that looks amazing and runs at a locked, high frame rate. Shin'en has a proven track record for ripping through the nuts and bolts of whatever hardware they're working with.

In short, Manfred's comments make sense if you assume that Nintendo is not completely idiotic and made up of dumbdumbs who can't design any kind of useful hardware.

The problem is that a whole lot of people seem to instantly buy into the notion that Nintendo is entirely stupid and there can be no logic or reason to any of their decisions.
 

AzaK

Member
At the end of the day, after all the kvetching and hand-wringing, Manfred / Shin'en has a real game out on Wii U that looks amazing and runs at a locked, high frame rate. Shin'en has a proven track record for ripping through the nuts and bolts of whatever hardware they're working with.

In short, Manfred's comments make sense if you assume that Nintendo is not completely idiotic and made up of dumbdumbs who can't design any kind of useful hardware.

The problem is that a whole lot of people seem to instantly buy into the notion that Nintendo is entirely stupid and there can be no logic or reason to any of their decisions.

Yup, but those people just won't let it go. Also the Trine 2 devs have said that their additional content for the Wii U director's cut couldn't quite work on current consoles.
 

Gahiggidy

My aunt & uncle run a Mom & Pop store, "The Gamecube Hut", and sold 80k WiiU within minutes of opening.
Wow, I should have kept up with these threads. I feel totally behind on the Wii U speculation talk.
 
So if I am reading it right, there is a post on Beyond 3D which seems like actual insider info stating that GPU7's configuration is unique within its family of cards. Ideas, anyone?
 

TunaLover

Member
So if I am reading it right, there is a post on Beyond 3D which seems like actual insider info stating that GPU7's configuration is unique within its family of cards. Ideas, anyone?
What is this post?
I've been so lost about hardware specs lately; it seems we got some new hints?
 

sinnergy

Member
At the end of the day, after all the kvetching and hand-wringing, Manfred / Shin'en has a real game out on Wii U that looks amazing and runs at a locked, high frame rate. Shin'en has a proven track record for ripping through the nuts and bolts of whatever hardware they're working with.

In short, Manfred's comments make sense if you assume that Nintendo is not completely idiotic and made up of dumbdumbs who can't design any kind of useful hardware.

The problem is that a whole lot of people seem to instantly buy into the notion that Nintendo is entirely stupid and there can be no logic or reason to any of their decisions.

This. End.
 

NBtoaster

Member
At the end of the day, after all the kvetching and hand-wringing, Manfred / Shin'en has a real game out on Wii U that looks amazing and runs at a locked, high frame rate. Shin'en has a proven track record for ripping through the nuts and bolts of whatever hardware they're working with.

In short, Manfred's comments make sense if you assume that Nintendo is not completely idiotic and made up of dumbdumbs who can't design any kind of useful hardware.

The problem is that a whole lot of people seem to instantly buy into the notion that Nintendo is entirely stupid and there can be no logic or reason to any of their decisions.

There are a lot of problems with the assumption that exclusives looking good must mean that all games are equally capable of looking good on the hardware. Every single game and engine has different demands. There are thousands of different graphics and optimisation techniques that have different limitations, conflicts and requirements.

Exclusives will naturally exploit strengths of the hardware while attempting to reduce the impact of weaknesses. But ports won't, because it may limit game design or draw attention to weaknesses on other hardware, or be too time consuming or costly. Not every problem can be completely optimised away either.


Inevitably a 3D Mario or Zelda is going to come out on the Wii U and will look and run gorgeously. But they will not invalidate every problem other devs have had with the hardware unless they've been designed exactly the same way.
 
What is this post?
I've been so lost about hardware specs lately; it seems we got some new hints?
http://forum.beyond3d.com/showpost.php?p=1682228&postcount=3523

Yup, new hints: Not an existing R700 configuration. Probably closer to 600 MHz than 500 (if I read Matt's post right). And the clocks are not synched, so in theorizing GPU clock speed, we don't have to worry about how that would affect the CPU or RAM clocks, since they're not going the clean-multiples route.

I think that's just saying that it's not an exact match to an existing chip. So even if the RV740 was the starting point, it's not 640:32:16 shaders/TMUs/ROPs. Not really new info, just warning people off who are focusing on trying to match it exactly to an off the shelf part instead of just using them for reference.

EDIT: what the hell happened to this thread over the last dozen posts?

I think it does allow us to rule some things out. The RV730 was a popular candidate recently as was the RV770, but if that poster speaks the truth, both of those core configurations can be eliminated.

Typical Nintendo. I can see them going with some oddball hitherto unseen assembly of shaders, TMUs, and ROPS. Maybe something like 360:36:12.
 

z0m3le

Banned
No way a GPU is using 30 watts of the 34 watts the system is using. That is impossible. In consoles, most games use the same amount of power. R700 at 40nm is 12 GFLOPS per watt, and if you move to 28nm you may move to 15. Really high end, the Wii U GPU is using 25 watts, so around 300 GFLOPS at 40nm and 375 GFLOPS at 28nm. At best you are looking at about 1.5x PS360. But you also have to factor in the second screen.

The biggest increase in performance is most likely not shown on paper with the gpu.

No one can really answer that. If the bottleneck is the flash storage, then there is no way around that. You are only as fast as your slowest part.

I'm just going to talk about wattage performance of the HD4800 series. I don't really want to argue about whether Nintendo is using this or not; since the Wii U is out, what we decide here won't make a bit of difference, and this is all speculation anyway. Thanks ahead of time.

So, the HD4870 is 8 GFLOPS/watt and runs at 750MHz, while the HD4830 (desktop) is 7.75 GFLOPS/watt and uses 640 shaders. These are both 55nm parts; moving to 40nm for the HD4770 allowed it to reach 12 GFLOPS/watt with 640 shaders at 750MHz. That's a 50% increase in power efficiency. Now let's look at the mobile end.

http://www.notebookcheck.net/AMD-ATI-Mobility-Radeon-HD-4850.13975.0.html
AMD has claimed a current consumption of about 45-65 Watt for the cards part of the mobile HD 4850 to HD 4870 range.

Those are 55nm parts; the HD4850M is 800 GFLOPS @ 45W, making it about 17.8 GFLOPS per watt. So applying that 50% gain to a 40nm mobile part (like the HD4830M) would yield ~26.7 GFLOPS per watt, and at 25 watts = ~668 GFLOPS.

This is based on what the R700 series does in the desktop numbers, and assumes the same efficiency scaling applies to the mobile parts when moving down to the 40nm process.
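Putting those numbers in one place (my own back-of-envelope only, reusing the ~50% efficiency jump seen between the 55nm and 40nm desktop parts):

```python
# GFLOPS-per-watt scaling from a 55nm mobile part to a hypothetical 40nm mobile R700.
hd4850m_gflops, hd4850m_watts = 800.0, 45.0
eff_55nm_mobile = hd4850m_gflops / hd4850m_watts    # ~17.8 GFLOPS/W

scale_55_to_40 = 12.0 / 8.0                         # HD4770 vs HD4870 desktop: +50%
eff_40nm_mobile = eff_55nm_mobile * scale_55_to_40  # ~26.7 GFLOPS/W

print(eff_40nm_mobile * 25)                         # ~667 GFLOPS at 25W (the ~668 figure above)
```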
 

AzaK

Member
seriously?

Of course. Personally if it gets to the point that someone's essentially telling us, then why not tell us? Up to him, but no harm in asking. I won't lose sleep over it.

Whether we ever know or not, I'm taking Matt's words positively, and it will help focus discussions a bit more.
 
It probably is some oddball mixture. I hope they didn't skimp too much on the ROPs though.

Were ROPS a bottleneck this generation? I've read that these days it's usually the SPUs that are, but I have been wondering if going with 8 ROPS like on Xenos would be enough given the addition of rendering to the Gamepad.
 
We had fillrate problems on the PS3 at 1080p. We were a small shop though, so we couldn't really take advantage of the SPUs other than whatever pre-packaged libraries Sony supplied with the SDK.

I see. Thanks for the reply. I have an inkling Nintendo settled on 8 for some reason, but who really knows at this point (well, besides Matt and lherre). I believe I read somewhere that the ROPS in the R700 series had improved efficiency, though.

Interestingly, the ROPs in those cards are tightly coupled with the L2 cache. If we look at Xenos, I don't think it had an L2. Its ROPS were packed on the daughter die w/ the eDRAM. I've been wondering this for a while now. Might Nintendo just strike the L2 cache and let the eDRAM feed the L1 directly?
 
It's my gut feeling also that they went with a lower number, which is why I mentioned hoping they didn't skimp. The gamepad has a smidgen under half the pixels of 720p, so you'd need 12 to keep parity with the PS360 when taking into account the extra screen. (ignoring any efficiency improvements)

Well, they said that they're supporting 2 Gamepads, right? 320:32:16 is a configuration we have yet to see. That configuration probably doesn't make sense for any other purpose than rendering multiple independent scenes. And actually, I'm not qualified enough to say it even makes sense for that. haha, I think that's enough speculation for one day. Maybe the Wii U specs will come to me in a dream vision.
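For what it's worth, the parity math above works out like this (a sketch assuming the GamePad is driven at its usual 854x480 and both screens are rendered every frame):

```python
# ROP parity estimate: extra fill needed to drive the GamePad alongside a 720p TV image.
tv_pixels = 1280 * 720         # 921,600
gamepad_pixels = 854 * 480     # 409,920 -- a smidgen under half of 720p

extra_fill = 1 + gamepad_pixels / tv_pixels   # ~1.44x total fill
print(8 * extra_fill)                         # ~11.6 -> 12 ROPs to keep parity with Xenos' 8
```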
 

AzaK

Member
Well, they said that they're supporting 2 Gamepads, right? 320:32:16 is a configuration we have yet to see. That configuration probably doesn't make sense for any other purpose than rendering multiple independent scenes. And actually, I'm not qualified enough to say it even makes sense for that. haha, I think that's enough speculation for one day. Maybe the Wii U specs will come to me in a dream vision.

Is there typically a ratio that's adhered to?
 
Is there typically a ratio that's adhered to?

Huh? What? I'm sleeping AzaK!

http://en.wikipedia.org/wiki/Compar...essing_units#Radeon_R700_.28HD_4xxx.29_series

I think it's just one of those things where games are usually bottlenecked by shader operations rather than fillrate, so they scale their cards accordingly.

Edit: Beaten

When supporting two I believe you're getting half the framerate on each. So fillrate would be unaffected.
Right you are. I am under the impression that is more a wireless bandwidth issue, but even so, they wouldn't ever have to render 60fps to each Gamepad.
 

AzaK

Member
^^ & ^

Thanks guys. While I have your undivided attention, do any of you have a good source/link etc. for understanding modern graphics tech/architecture, eDRAM's role, etc.? This whole discussion has got me interested in learning some stuff but I don't know where to start. I'm a programmer, just not in the graphics space.
 

disap.ed

Member
I'm of the opinion that there is much more eDRAM overhead there that we are not accounting for. It's also possible that there is some redundant logic on the chip to account for yields.

Judging by the numbers of the 1T-SRAM (Link), the overhead is close to 2. So we are speaking of ~30mm² for the eDRAM if the Renesas eDRAM has a bit cell size of 0.06 micron. Still leaves ~120mm² for the GPU alone if you keep in mind the ARM cores and the DSP. This is in the ballpark of the Turks chips (118mm², 480:24:8, HD6xxx series but still VLIW5 @ 40nm).

My problem with this is that with 480 shader units @ close to 600 MHz we are at >550 GFLOPS, which would mean ~20 GFLOPS/watt. These numbers were only achieved by mobile chips, though, and so likely quite expensive.
It is possible, though, that the 40nm process is mature enough by now that this isn't a problem anymore.
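Spelling out that area estimate (a rough sketch using the 0.06µm² cell size and the ~2x overhead factor inferred from the 1T-SRAM numbers):

```python
# Rough eDRAM macro area for 32MB at a 0.06 um^2 bit cell with ~2x array overhead.
bits = 32 * 1024 * 1024 * 8            # 32MB -> 268,435,456 bits
cell_um2 = 0.06                        # assumed bit cell size
overhead = 2.0                         # overhead factor from the 1T-SRAM comparison

area_mm2 = bits * cell_um2 * overhead / 1e6
print(area_mm2)                        # ~32 mm^2, i.e. the "~30mm²" ballpark above
```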

64Mb/256bit -> 1024bit interface for 32MB -> 51.2GB/s to 102.4GB/s
8Mb/256bit -> 8192bit interface for 32MB -> 409.6GB/s to 819.2GB/s
8Mb/128bit -> 4096bit interface for 32MB -> 204.8GB/s to 409.6GB/s

(The bandwidth ranges are based on a clock range of 400MHz to 800MHz)

Bandwidth of 400-800GB/s would certainly be massive overkill (as a comparison, the highest bandwidth on any currently available consumer GPU is the Radeon 7970's 288GB/s, and that's targeting much higher resolutions than the Wii U's 720p standard). The 1024-bit interface at a high clock or the 4096-bit at a low clock are probably the most likely, and the clock is likely to be either equal to, or a clean multiple of, the GPU's clock. For reference, the on-die interface between Xenos' ROPs and eDRAM is 4096-bit at 500MHz, for 256GB/s of bandwidth.

So because the GPU's clock lies right in the middle of the eDRAM's clock range, I guess it is a given that it runs at the same frequency.
This would mean >70GB/s @ the 1024-bit interface, which is nothing to write home about really, but I guess the latency makes the difference.
~300GB/s @ the 4096-bit interface would certainly be a whole different story.
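For completeness, this is how those figures fall out of interface width and clock (same formula also reproduces the Xenos reference number; the 575MHz clock is just the speculative figure from earlier in the thread):

```python
# eDRAM bandwidth in GB/s = bus width (bits) / 8 * clock (GHz).
def bandwidth_gbps(bus_bits: int, clock_mhz: float) -> float:
    return bus_bits / 8 * clock_mhz / 1000.0

print(bandwidth_gbps(1024, 575))   # ~73.6 GB/s (the ">70GB/s" case)
print(bandwidth_gbps(4096, 575))   # ~294 GB/s  (the ~300GB/s case)
print(bandwidth_gbps(4096, 500))   # 256 GB/s   (Xenos ROPs <-> eDRAM reference)
```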
 

z0m3le

Banned
Judging by the numbers of the 1T-SRAM (Link), the overhead is close to 2. So we are speaking of ~30mm² for the eDRAM if the Renesas eDRAM has a bit cell size of 0.06 micron. Still leaves ~120mm² for the GPU alone if you keep in mind the ARM cores and the DSP. This is in the ballpark of the Turks chips (118mm², 480:24:8, HD6xxx series but still VLIW5 @ 40nm).

My problem with this is that with 480 shader units @ close to 600 MHz we are at >550 GFLOPS, which would mean ~20 GFLOPS/watt. These numbers were only achieved by mobile chips, though, and so likely quite expensive.
It is possible, though, that the 40nm process is mature enough that this isn't a problem anymore.

Yes, as I posted above, the HD4850M is 45 watts and 800 GFLOPS @ 55nm, making it ~17.8 GFLOPS/watt. Given the assumed 40nm nature of the chip (it could be lower, but that's highly unlikely), 20+ GFLOPS/watt is a given for the mobile process of that series.

It's also worth noting that the Wii U sells at a very small loss, so the chance that the GPU is somewhat costly is actually high. Flash memory and DDR3 RAM aren't going to cost them much, and the CPU should be fairly cheap as well; after you consider that it's an MCM, that brings down wattage and costs. So a mobile GPU of the R700 series might actually be the best fit.

The 25-30 watt estimate for the GPU (under heavy load) would put the GFLOPS somewhere between these numbers:
25W x 20 GFLOPS/W = 500 GFLOPS
30W x 20 GFLOPS/W = 600 GFLOPS
25W x 26 GFLOPS/W = 650 GFLOPS
30W x 26 GFLOPS/W = 780 GFLOPS

I'm just doing mobile R700 GPU math @ 40nm. These should be fairly safe assumptions, by the way, but it doesn't mean the Wii U is using a mobile part, just that these numbers are possible for R700 given that these chips actually exist. (The HD 4830M uses ~28-30 watts with 768 GFLOPS.)
 

z0m3le

Banned
I think 500-550 GFLOPS is the closer guess.

Rather than trying to guess the Wii U's GPU based on basically nothing, I'm trying to guess the range from the only known thing about the GPU, which is the wattage. I do have a question though. The HD 4770 uses GDDR5; if that was removed, along with any other controllers the chip wouldn't need, what would its power draw be? I'm asking because the Wii U doesn't have those things in its GPU, so its wattage draw would be better than a desktop GPU's anyway, right?

Completely random number, but say the extra components take up 25% of the wattage of the GPU. You would then have a GPU that gives you 16 GFLOPS per watt; taking this exact card and only giving it 30 watts would give you 480 GFLOPS. So while everything is a guess, something like this would help a bit more with targeting those numbers, I'd assume.
 

dumbo

Member
Real talk here guys, how far is optimization really going to get us? I mean it takes like 20 seconds to boot the browser and 30 or 40 to load video apps. Are we ever really going to see that improved to the 1-5 second range most people expect from a $350 box of 2012 technology?

This has little to do with the hardware technology/optimization, and a lot to do with the O/S.

Hardware-wise, the system should be able to load the browser 'instantly'. (just leave an N-MB WebKit browser loaded in that massive 1GB of system RAM)

Software-wise, I don't think Nintendo has a particularly long and glorious history of writing modern operating systems :(.

Alternatively, maybe Nintendo actually use that 1GB of system RAM for something else.
 
The problem is that a whole lot of people seem to instantly buy into the notion that Nintendo is entirely stupid and there can be no logic or reason to any of their decisions.

Except it's the same logic that resulted in many low quality ports for PS3.
What's the point of having some clever features when only your internal studios will have the time and budgets to use them?

And anyway we wouldn't even need this debate if Nintendo weren't obsessed with such a low power target.

The world wouldn't end if they designed a console with a 70-80W TDP, but the clocks on the CPU/GPU would be much more generous, and costs would likely be lower since they wouldn't have to fish for low-power parts.
 