
WiiU technical discussion (serious discussions welcome)

japtor

Member
That's what's being discussed, some think it's in there as a 1:1 copy, some think its functions are just mixed into the rest of the Wii U GPU. I would think a 1:1 copy would be small enough now to put in there to ensure perfect compatibility.
Well people think it's mixed in cause of this:
http://iwataasks.nintendo.com/interviews/#/wiiu/console/0/2
Shiota Yes. The designers were already incredibly familiar with the Wii, so without getting hung up on the two machines' completely different structures, they came up with ideas we would never have thought of. There were times when you would usually just incorporate both the Wii U and Wii circuits, like 1+1. But instead of just adding like that, they adjusted the new parts added to Wii U so they could be used for Wii as well.

Iwata And that made the semiconductor smaller.
There's probably Wii-only guts in there, sure, but it sounds like they replicated functionality through the newer circuits as well, so it's not just a known 1:1 thing anymore. Could be a little of it is gone, could be a lot, and the new equivalent/replacement stuff could be bigger or smaller than what was there on the Wii.
 

tipoo

Banned
Well people think it's mixed in cause of this:
http://iwataasks.nintendo.com/interviews/#/wiiu/console/0/2

There's probably Wii-only guts in there, sure, but it sounds like they replicated functionality through the newer circuits as well, so it's not just a known 1:1 thing anymore. Could be a little of it is gone, could be a lot, and the new equivalent/replacement stuff could be bigger or smaller than what was there on the Wii.



It's possible, maybe even likely. But what he said could be checked off just by what we know of how the CPU works too, right? In Wii mode it just uses Core 0 with half the cache disabled, no unique Wii core needed. Could that be what he was referring to?
 
I'm more interested in the tessellation unit; can we find it and, through comparison, conclude whether it's the Gen2 or Gen3 implementation?
 

Earendil

Member
I'm more interested in the tessellation unit; can we find it and, through comparison, conclude whether it's the Gen2 or Gen3 implementation?

Could that be item 'V' on Thraktor's chart? wsippel said that block has been on all of AMD's most recent chips, so it might be the tessellator.
 
I don't really want to have to read through 15 pages' worth of arguing and trolling like the first 5 pages of the new thread.

Can someone give me a rundown of what has been confirmed? Last I read, the GPU was meant to be 352 GFLOPS rather than 176 GFLOPS. Has anything else been worked out from the pic?

Cheers.
 

tipoo

Banned
I don't really want to have to read through 15 pages' worth of arguing and trolling like the first 5 pages of the new thread.

Can someone give me a rundown of what has been confirmed? Last I read, the GPU was meant to be 352 GFLOPS rather than 176 GFLOPS. Has anything else been worked out from the pic?

Cheers.

There's an additional bank of 4MB eDRAM which is more tightly packed than the 32MB main one, meaning lower latency. There's also a 1MB SRAM part, which might be even faster; however, it's oddly placed in a corner away from the DRAM connectors, so it may not be used as a cache. It's closer to the CPU than anything else on the die, so it may be a scratchpad between the two (but that's just my personal guess).

There are 16 texture mapping units (TMUs) and 8 ROPs.
 
There's an additional bank of 4MB eDRAM which is more tightly packed than the 32MB main one, meaning lower latency. There's also a 1MB SRAM part, which might be even faster; however, it's oddly placed in a corner away from the DRAM connectors, so it may not be used as a cache. It's closer to the CPU than anything else on the die, so it may be a scratchpad between the two (but that's just my personal guess).

There are 16 texture mapping units (TMUs) and 8 ROPs.

Thanks for the info. Also, just seen what the Chipworks guy said in the OP; sounds promising, certainly not the 'cheap way out' many people seem to have been accusing Nintendo of! :)
 
So the embedded DRAM is 1T-SRAM? I guess it makes sense; the console needs very low-latency RAM in order to run Wii (and GameCube) games without an emulation layer.
 
I think there is some faulty info being spread around here, but that happens when things move this fast.

-It appears the smaller eDRAM pool is 2 MB and not the previously speculated 4.
-Both the 32 MB and 2 MB pools are eDRAM. There is no 1T-SRAM on the die.
-There is an additional 1 MB pool of normal 6-transistor SRAM on the upper left-hand side of the die pic, to the left of the smaller eDRAM pool.
 

Schnozberry

Member
MoSys has probably licensed it. See https://twitter.com/marcan42

"32MB 1T-SRAM MEM1, 2MB 1T-SRAM MEM0/EFB, 1MB SRAM ETB"

(note that 1T-SRAM is not actually SRAM but DRAM, i.e. dynamic rather than static RAM, though with performance similar to SRAM.)

Or Marcan is simply wrong, which is the case here, since Chipworks said it was eDRAM. Marcan was probably just assuming based on the GameCube's and Wii's use of 1T-SRAM.
 
Or Marcan is simply wrong, which is the case here, since Chipworks said it was eDRAM. Marcan was probably just assuming based on the GameCube's and Wii's use of 1T-SRAM.

Sorry about the late edit in my previous post... But 1T-SRAM is a type of eDRAM, and some manufacturers just call 1T-SRAM eDRAM without going into details. I figured Marcan had run some memory tests and based his tweet on that.
 

Schnozberry

Member
Sorry about the late edit in my previous post... But 1T-SRAM is a type of eDRAM, and some manufacturers just call 1T-SRAM eDRAM without going into details. I figured Marcan had run some memory tests and based his tweet on that.

For both the Wii and GameCube, there were MoSys press releases announcing the licensing of the patent and its use in the hardware. I can't find any such release for the Wii U. In fact, the only links for MoSys and Wii U I can find on Google are GAF Wii U speculation threads.
 
For both the Wii and GameCube, there were MoSys press releases announcing the licensing of the patent and its use in the hardware. I can't find any such release for the Wii U. In fact, the only links for MoSys and Wii U I can find on Google are GAF Wii U speculation threads.

Well, they had already licensed the technology to NEC/Renesas, so why would they make a new press release for the same tech? And the company seems to have moved on anyway; from Wikipedia: "In 2012, MoSys discontinued its IP core businesses in order to concentrate solely on its line of Bandwidth Engine ICs."

edit: Looks like MoSys sold the 1T-SRAM patents to Invensas Corporation. Maybe Nintendo could have chosen another eDRAM tech because of that... Well, there must be other eDRAM technologies that give similar or better performance than 1T-SRAM these days; the tech is already quite old.
 
Having said that...

"Héctor Martín @marcan42
By the way, our resident silicon expert agrees that it's 1T-SRAM, not eDRAM (or that it's 1T-SRAM marketed as eDRAM, if you will).

(makes sense, just look at all those tiny banks - that's exactly what 1T-SRAM looks like)"
 

Earendil

Member
Having said that...

"Héctor Martín @marcan42
By the way, our resident silicon expert agrees that it's 1T-SRAM, not eDRAM (or that it's 1T-SRAM marketed as eDRAM, if you will).

(makes sense, just look at all those tiny banks - that's exactly what 1T-SRAM looks like)"

Is he talking about the whole amount of eDRAM on the chip? Or just the little bit in the top left that we're all confused about?
 

OryoN

Member
Well, wasn't the big deal about eDRAM - for IBM at least - that it uses 1 transistor per cell in order to reach those densities? Wouldn't that effectively be the same idea behind MoSys's 1T-SRAM (which isn't truly "SRAM" to begin with, just DRAM + enhancements to behave similarly, iirc)? I vaguely remember reading this in an IBM eDRAM doc. Maybe it's IBM's eDRAM here, even though Renesas fabs the GPU?

Also:

Shiota:
The designers were already incredibly familiar with the Wii, so without getting hung up on the two machines' completely different structures, they came up with ideas we would never have thought of. There were times when you would usually just incorporate both the Wii U and Wii circuits, like 1+1. But instead of just adding like that, they adjusted the new parts added to Wii U so they could be used for Wii as well.

Could this be the reason why one of the shader cores (N4) is slightly larger than the rest? Maybe it can also perform TEV-like functions?
 

Donnie

Member
Well, GC was modest on its chips, but had a great design overall.
Wii U on the other hand has really modest chips (a CPU smaller than the one on the GC tells us everything about it) and an awful bottlenecked design. With only 12.8GB/s of total RAM bandwidth its performance will be between the first Xbox and the Xbox 360, and that's all there is to it.

The difference between WiiU and Durango/Orbis will be bigger and more exaggerated than the one existing between the Wii and Xbox360/PS3.

Regards!

The thing is you don't know enough about GameCube's design to realise how much you're contradicting yourself here. Not to mention a complete lack of understanding of what constitutes a well-designed console.

You say GameCube made up for its modest raw power with a great design, then claim Wii U is an awful bottlenecked design because it has less total main memory bandwidth than the 360. Yet GameCube had less main memory bandwidth than the PS2 and far less than the Xbox (PS2 had about 30% more and Xbox about 150% more!).

Main memory bandwidth isn't the be-all and end-all of what makes a well-designed system, and it's only a bottleneck if the designers of the system allow it to be. There's a reason why every one of Wii U's chips is within touching distance of the others on the same package, and there's a reason why they have such an extraordinary amount of very fast local memory: to allow those chips to do their jobs with as little access to main memory as possible. GameCube did the same thing to an extent, and that's one of the reasons it could do so much with what appeared to be so little. Wii U just takes things a bit further down a very similar path.
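(For reference, going by the commonly cited peak figures, that's roughly 2.6GB/s for GameCube's 1T-SRAM main memory versus about 3.2GB/s for PS2's RDRAM and 6.4GB/s for Xbox's DDR, i.e. roughly 1.2x and 2.5x GameCube's number respectively.)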
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
*Random thinking out loud probably not related to the actual Wii U GPU*

If you have all that memory embedded right on the GPU and accessible to the shader units with low latency, do you need conventional ROP hardware at all? Or can you just do blending in the shader like a PowerVR / Tegra chip? Perhaps with mini-rops for Z / stencil test?
That's an interesting supposition, definitely not without a reason.
 

wsippel

Banned
I almost expected this, but the white SRAM-looking things are apparently ROMs, so the DSP is located in block X. Also, Marcan thinks Y is Starbucks. What's weird is that the SRAM next to it seems too small for the TCM. No idea. Also, the dark orange SRAM-looking things are supposedly dual-port SRAM.
 
This has turned out to be absolutely fascinating. Whatever the real-world implications in terms of suitability for multi-platform development, power comparisons with Durango/Orbis etc., it's clear that this is a really interesting little chip.

Good stuff, all.
 

efyu_lemonardo

May I have a cookie?
I almost expected this, but the white SRAM-looking things are apparently ROMs, so the DSP is located in block X. Also, Marcan thinks Y is Starbucks. What's weird is that the SRAM next to it seems too small for the TCM. No idea. Also, the dark orange SRAM-looking things are supposedly dual-port SRAM.

Do you mean the things that are in blocks A, B, D and X?
So would that mean those blocks require higher-bandwidth memory than the others?
 
The thing is you don't know enough about GameCube's design to realise how much you're contradicting yourself here. Not to mention a complete lack of understanding of what constitutes a well-designed console.

You say GameCube made up for its modest raw power with a great design, then claim Wii U is an awful bottlenecked design because it has less total main memory bandwidth than the 360. Yet GameCube had less main memory bandwidth than the PS2 and far less than the Xbox (PS2 had about 30% more and Xbox about 150% more!).

Main memory bandwidth isn't the be-all and end-all of what makes a well-designed system, and it's only a bottleneck if the designers of the system allow it to be. There's a reason why every one of Wii U's chips is within touching distance of the others on the same package, and there's a reason why they have such an extraordinary amount of very fast local memory: to allow those chips to do their jobs with as little access to main memory as possible. GameCube did the same thing to an extent, and that's one of the reasons it could do so much with what appeared to be so little. Wii U just takes things a bit further down a very similar path.
That post of mine was from a few weeks ago and since then I informed myself a LOT about this and yes, it was total bullshit from top to bottom.

I almost expected this, but the white SRAM-looking things are apparently ROMs, so the DSP is located in block X. Also, Marcan thinks Y is Starbucks. What's weird is that the SRAM next to it seems too small for the TCM. No idea. Also, the dark orange SRAM-looking things are supposedly dual-port SRAM.
But if Y is Starbucks, isn't X a bit too large to be the DSP? Or do you mean that block X contains the DSP plus other hardware features, in which case it would be better to break block X into two separate blocks (one for the DSP and the rest for whatever else is there)?
 

wsippel

Banned
*Random thinking out loud probably not related to the actual Wii U GPU*

If you have all that memory embedded right on the GPU and accessible to the shader units with low latency, do you need conventional ROP hardware at all? Or can you just do blending in the shader like a PowerVR / Tegra chip? Perhaps with mini-rops for Z / stencil test?
Interesting theory. Considering this is Nintendo, not much would surprise me. Maybe even a scanline/Z hybrid? A scanline renderer would use the shader units much more efficiently, as far as I understand, so Nintendo would get away with fewer shader ALUs. Also, such an approach might possibly explain why there's apparently no tearing in any Wii U game...
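
To make the ROP question a bit more concrete, here's a rough C++ sketch of the difference, purely illustrative and not anything we actually know about Latte (all the names are made up): a conventional ROP does a fixed-function read-modify-write on the framebuffer, while "blending in the shader" would mean the shader itself fetches the destination pixel (cheap if the framebuffer sits in low-latency on-die memory), applies whatever blend it likes, and writes it back.

#include <algorithm>
#include <cstdint>

// Hypothetical RGBA8 pixel type, purely for illustration.
struct Pixel { uint8_t r, g, b, a; };

// Roughly what a fixed-function ROP does per fragment: read dst, apply one of a
// handful of hardwired blend equations (here classic alpha blending), write back.
Pixel ropAlphaBlend(Pixel src, Pixel dst) {
    auto mix = [&](int s, int d) {
        return static_cast<uint8_t>((s * src.a + d * (255 - src.a)) / 255);
    };
    return { mix(src.r, dst.r), mix(src.g, dst.g), mix(src.b, dst.b), 255 };
}

// "Blend in the shader": the shader reads the destination pixel itself and can
// use any blend formula at all, so no fixed-function blend unit is needed.
void shaderStyleBlend(Pixel* framebuffer, int width, int x, int y, Pixel src) {
    Pixel& dst = framebuffer[y * width + x];
    dst.r = static_cast<uint8_t>(std::min(255, src.r + dst.r)); // additive
    dst.g = static_cast<uint8_t>((src.g * dst.g) / 255);        // multiplicative
    dst.b = std::max(src.b, dst.b);                             // max
    dst.a = 255;
}

The catch with doing it in the shader is ordering and read-after-write hazards when overlapping fragments hit the same pixel, which is basically what the writeup linked a few posts down gets into.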


Do you mean the things that are in blocks A, B, D and X?
So would that mean those blocks require higher-bandwidth memory than the others?
A, B, C, D, F, G, O, R and X, as far as I can tell. No idea what it means, actually.


But if Y is Starbucks, isn't X a bit too large to be the DSP. Or what you mean is that block X contains the DSP + other hardware features, in which case would be better to break block X in two separate blocks? (one for the DSP and the rest for whatever is there).
I don't think X is just a DSP, but the stuff on the right side of that block seems to be DSP logic.
 

efyu_lemonardo

May I have a cookie?
A, B, C, D, F, G, O, R and X, as far as I can tell. No idea what it means, actually.

Ah, OK. Thanks for the clarification; trying to learn as I go along. So block R has only dual-port SRAM? Hopefully that information can be used to identify it later down the line.

Also, might as well ask a monumentally noobish question while I'm at it, and get it out of the way :p
All the empty space we're seeing in between the SRAM on each block, is that due to the process used to photograph the chip? In reality these areas are occupied by a layer of connections that has been removed. Is that accurate at all?
So more empty space would mean more connections?

edit:
*Random thinking out loud probably not related to the actual Wii U GPU*

If you have all that memory embedded right on the GPU and accessible to the shader units with low latency, do you need conventional ROP hardware at all? Or can you just do blending in the shader like a PowerVR / Tegra chip? Perhaps with mini-rops for Z / stencil test?
I don't presume to understand it all, but here is a writeup that seems to examine such an implementation, and the challenges and shortcomings it could create. Scroll down to where it says "Aside: Why no fully programmable blend?" and be sure to look at the comments as well. Hope this helps.
 

ozfunghi

Member
I don't presume to understand it all, but here is a writeup that seems to examine such an implementation, and the challenges and shortcomings it could create. Scroll down to where it says "Aside: Why no fully programmable blend?" Hope this helps.

His reply below also seems interesting (easier for mortals).
 

Donnie

Member
That post of mine was from a few weeks ago and since then I informed myself a LOT about this and yes, it was total bullshit from top to bottom.

Oh sorry, I read the post on my phone and didn't notice the date it was posted. Wouldn't have replied if I had.
 

EDarkness

Member
Well Darksiders 2 has tearing even on the gamepad :(

It does have some tearing, but I honestly believe that they didn't really do much with the Wii U version. Though, having watched my roommate play through the game on the 360, I think the Wii U version looks better.
 

Earendil

Member
I don't presume to understand it all, but here is a writeup that seems to examine such an implementation, and the challenges and shortcomings it could create. Scroll down to where it says "Aside: Why no fully programmable blend?" and be sure to look at the comments as well. Hope this helps.

Thanks for the link. My GPU knowledge is severely outdated (what little I even remember), so I'm going to try and find some time today to read the whole series.

EDIT:

I started reading it and realized that I'm so far behind the curve these days that I'm going to actually start working on something hands-on first. I have Unity installed; is that a good option to get back into game programming, or is there a better choice?
 

Kai Dracon

Writing a dinosaur space opera symphony
I have actually noticed tearing ONLY on the gamepad, while there was none on the TV (or at least it was much less noticeable). Pretty weird.

Is it possible the tearing that happens in DS2 on the gamepad is actually an issue with the streaming breaking down and not actual tearing in the output of the game engine?

Since the streaming tech is based around software breaking up the image into tiles, I wonder if something about the visuals in DS2 strain it and some tiles are lagging, giving the impression of tearing.
 
Is it possible the tearing that happens in DS2 on the gamepad is actually an issue with the streaming breaking down and not actual tearing in the output of the game engine?

Since the streaming tech is based around software breaking up the image into tiles, I wonder if something about the visuals in DS2 strain it and some tiles are lagging, giving the impression of tearing.

Nah, I can only play it for short spurts on my TV. The tearing gives me a headache. ...eye strain...
 

Thraktor

Member
Tearing on the gamepad occurs when the stream drops packets. If you're getting tearing on the gamepad but not the TV, then try to reduce interference or create a better line of sight between the console and gamepad.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
I think it's time we got our heads clear of all the Latte pondering for a sec (so we can return to it with renewed strengths in a moment).

Some time ago I tried to do a rudimentary analysis of how a ppc750cl compared to a bobcat in a very vector-heavy fp scenario - mat4 multiplication. Unfortunately back then the toolchain on my Wii was not quite up to the task, so the notorious paired singles were left untested (the test was entirely scalar on the ppc, and 4-way SIMD on the bobcat). Well, today I'm able to mend that situation, thanks to some wonderful advancements in gcc compiler tech and my head-scratching with cross-toolchains (buildroot rocks, btw).

new broadway compiler: g++ (Buildroot 2012.11.1) 4.6.3
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -mpowerpc -mcpu=750 -mpaired -DSIMD_FP32_2WAY -funroll-loops -O3 -DNDEBUG

Long story short, here's the same computation we did back then, taking into account the time that the test took to run on the two platforms, and normalizing that per clock (bobcat@1.33GHz vs ppc750cl@729MHz):

$ echo "scale=4; 6.10496 / 4.53022 / (1333 / 729)" | bc
.7369

I think the above gives some food for thought.
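
For those asking what's actually being measured: the bc line just divides the two run times and then divides out the clock ratio, giving per-clock relative throughput. The kernel itself is nothing exotic; a minimal sketch of the kind of mat4-multiply loop involved (not the actual test code, just the shape of it, with made-up names) would be:

#include <cstdio>

// A 4x4 single-precision matrix; plain scalar C++ so the autovectorizer can map
// the inner loop onto paired singles (750CL, -mpaired) or SSE (Bobcat).
struct mat4 { float m[4][4]; };

static mat4 mul(const mat4& a, const mat4& b) {
    mat4 r = {};
    for (int i = 0; i < 4; ++i)
        for (int k = 0; k < 4; ++k) {
            const float aik = a.m[i][k];
            for (int j = 0; j < 4; ++j)   // contiguous inner loop, vectorizes well
                r.m[i][j] += aik * b.m[k][j];
        }
    return r;
}

int main() {
    mat4 a = {}, c = {};
    for (int i = 0; i < 4; ++i) { a.m[i][i] = 1.0f; c.m[i][i] = 2.0f; }
    // Time this loop on each platform (identity matrix keeps the values stable).
    for (int n = 0; n < 10000000; ++n)
        c = mul(a, c);
    printf("%f\n", c.m[0][0]);            // keep the result live
    return 0;
}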
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Hmm... all of a sudden the PPC's doing pretty well in comparison to the Bobcat; the Bobcat's only 73% as powerful as the PPC750cl when normalized per clock. How can this be?
Paired singles utilized well this time (autovectorization via intrinsic vector types) and overall more efficient code generated by the newer compiler are what constitute the changes at first glance. I can post the asm code for both CPUs if somebody is interested.
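
To illustrate what I mean by intrinsic vector types, gcc's generic vector extension looks roughly like this (just a sketch, not the benchmark code):

// An 8-byte, 2-wide float vector that the compiler can map onto the 750CL's
// paired-single registers with -mpaired (or half an SSE register on the Bobcat).
typedef float v2f __attribute__((vector_size(8)));

static inline v2f madd2(v2f acc, v2f a, v2f b) {
    return acc + a * b;   // element-wise multiply-add; ideally a single ps_madd
}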
 

Earendil

Member
I think it's time we got our heads clear of all the Latte pondering for a sec (so we can return to it with renewed strengths in a moment).

Some time ago I tried to do a rudimentary analysis of how a ppc750cl compared to a bobcat in a very vector-heavy fp scenario - mat4 multiplication. Unfortunately back then the toolchain on my Wii was not quite up to the task, so the notorious paired singles were left untested (the test was entirely scalar on the ppc, and 4-way SIMD on the bobcat). Well, today I'm able to mend that situation, thanks to some wonderful advancements in gcc compiler tech and my head-scratching with cross-toolchains (buildroot rocks, btw).

new broadway compiler: g++ (Buildroot 2012.11.1) 4.6.3
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -mpowerpc -mcpu=750 -mpaired -DSIMD_FP32_2WAY -funroll-loops -O3 -DNDEBUG

Long story short, here's the same computation we did back then, taking into account the time that the test took to run on the two platforms, and normalizing that per clock (bobcat@1.33GHz vs ppc750cl@729MHz):

$ echo "scale=4; 6.10496 / 4.53022 / (1333 / 729)" | bc
.7369

I think the above gives some food for thought.

I tried running this through Google Translate, but it just gave me "mubbity mubbity moo". I think I missed something.
 
I think it's time we got our heads clear of all the Latte pondering for a sec (so we can return to it with renewed strengths in a moment).

Some time ago I tried to do a rudimentary analysis of how a ppc750cl compared to a bobcat in a very vector-heavy fp scenario - mat4 multiplication. Unfortunately back then the toolchain on my Wii was not quite up to the task, so the notorious paired singles were left untested (the test was entirely scalar on the ppc, and 4-way SIMD on the bobcat). Well, today I'm able to mend that situation, thanks to some wonderful advancements in gcc compiler tech and my head-scratching with cross-toolchains (buildroot rocks, btw).

new broadway compiler: g++ (Buildroot 2012.11.1) 4.6.3
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -mpowerpc -mcpu=750 -mpaired -DSIMD_FP32_2WAY -funroll-loops -O3 -DNDEBUG

Long story short, here's the same computation we did back then, taking into account the time that the test took to run on the two platforms, and normalizing that per clock (bobcat@1.33GHz vs ppc750cl@729MHz):

$ echo "scale=4; 6.10496 / 4.53022 / (1333 / 729)" | bc
.7369

I think the above gives some food for thought.


Oooooh now I've gone cross-eyed...
 

Thraktor

Member
I think it's time we got our heads clear of all the Latte pondering for a sec (so we can return to it with renewed strengths in a moment).

Some time ago I tried to do a rudimentary analysis of how a ppc750cl compared to a bobcat in a very vector-heavy fp scenario - mat4 multiplication. Unfortunately back then the toolchain on my Wii was not quite up to the task, so the notorious paired singles were left untested (the test was entirely scalar on the ppc, and 4-way SIMD on the bobcat). Well, today I'm able to mend that situation, thanks to some wonderful advancements in gcc compiler tech and my head-scratching with cross-toolchains (buildroot rocks, btw).

new broadway compiler: g++ (Buildroot 2012.11.1) 4.6.3
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -mpowerpc -mcpu=750 -mpaired -DSIMD_FP32_2WAY -funroll-loops -O3 -DNDEBUG

Long story short, here's the same computation we did back then, taking into account the time that the test took to run on the two platforms, and normalizing that per clock (bobcat@1.33GHz vs ppc750cl@729MHz):

$ echo "scale=4; 6.10496 / 4.53022 / (1333 / 729)" | bc
.7369

I think the above gives some food for thought.

If I'm not mistaken, doesn't Bobcat do 128-bit SIMD through two passes of a 64-bit ALU? Not to take anything away from your results, but it's worth remembering in comparison to Jaguar, which has a proper 128-bit SIMD unit.
 

v1oz

Member
This has turned out to be absolutely fascinating. Whatever the real-world implications in terms of suitability for multi-platform development, power comparisons with Durango/Orbis etc., it's clear that this is a really interesting little chip.

Good stuff, all.
What's turned out to be fascinating about it, apart from the relative lack of power?
 