
WiiU "Latte" GPU Die Photo - GPU Feature Set And Power Analysis

Status
Not open for further replies.

Thraktor

Member
wiiudie_800.jpg


Full-size photo here (very big)

A version of the die photo with each of the components outlined and labelled:


What's going on here?

A few days ago, wsippel noticed that Chipworks had Wii U die photos up for sale on their website, and at $200 a piece (for each of the CPU, GPU and NOR dies), some of us in the Wii U Technical Discussion Thread decided to chip in a few dollars each to buy the GPU photo, with the aim of a few of us (Fourth Storm, wsippel, Blu, Durante and myself) deciphering it and posting up our results on GAF. Chipworks, though, decided to be amazingly kind and helpful, and sent an email back to us offering not only to do a higher quality polysilicon die photo for us at their expense (as they felt their existing shot didn't give us the detail we needed), but to allow us to post the full-res photo up here on GAF for all you lovely folks to enjoy!

(We've learnt from Chipworks that this kind of photo would usually cost about $2500 to do, so it really is incredibly generous of them to do it for us for free.)

***What follows is speculation and deduction based on what we see in the die photo. Our analysis is ongoing, so please don't jump to any hasty conclusions***

What am I looking at?

The die is exactly 11.88 x 12.33mm (146.48mm²). Chipworks believe that it's "fabricated in a 40 nm advanced CMOS process at TSMC". It carries Renesas die markings, but no AMD die markings (although there is an AMD marking on the MCM heat-spreader). This is unexpected, as it was widely reported that the GPU was originally based on AMD's R700 line, and Nintendo publicly referred to it as a Radeon-based GPU. As the die appears to be very highly customised (it looks very different to other R700-based GPUs), the markings (or lack thereof) may indicate that the customisations were not done by AMD, but rather by Nintendo and Renesas.

In addition to the usual GPU components, the die includes a large eDRAM pool (accessible to both CPU and GPU), and it is understood that one or more ARM cores are also on-die, as well as a DSP.

Layout

As can be seen in the labelled version of the die photo above, there are three main sections of the chip, which I will deal with separately. The first of these comprises the memory pools (the two eDRAM pools and one SRAM pool), the second the off-chip interfaces (around the edges of the chip), and the third the GPU logic (the sections labelled A-Y).

Memory

The large orange block on the left is 32MB of eDRAM, known as MEM1. It's 40.72mm², and takes up 27.8% of the die. As this appears to be a non-standard eDRAM configuration from Renesas, the interface (and hence the bandwidth) is not immediately obvious. This pool of memory is accessible from both the CPU and GPU, and would be expected to have a high-bandwidth, low-latency interface to each. The MEM1 pool also serves a purpose in Wii mode, by replacing the 24MB of 1T-SRAM.

The smaller orange block above it is also eDRAM, and is referred to as MEM0. It's 4.25mm², and appears to be 2MB in size. In Wii mode it is used for the embedded framebuffer, and in Wii U mode it "is used as fast general purpose RAM".
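For anyone who wants to check the area figures, here is a minimal Python sketch that recomputes the die area and each eDRAM pool's share of it from the numbers quoted above. The function names are made up for illustration, and the 2MB capacity for MEM0 is our estimate rather than a confirmed figure.

```python
# Die dimensions and pool figures as quoted in the post.
# Capacities are the thread's estimates, not confirmed specs.
DIE_WIDTH_MM = 11.88
DIE_HEIGHT_MM = 12.33

# name -> (area in mm^2, capacity in MB)
POOLS = {
    "MEM1": (40.72, 32),
    "MEM0": (4.25, 2),
}


def die_area_mm2(width_mm=DIE_WIDTH_MM, height_mm=DIE_HEIGHT_MM):
    """Total die area from its outer dimensions."""
    return width_mm * height_mm


def pool_stats(area_mm2, capacity_mb, total_mm2):
    """Return (share of die in percent, density in MB per mm^2)."""
    share_pct = 100.0 * area_mm2 / total_mm2
    density = capacity_mb / area_mm2
    return share_pct, density


total = die_area_mm2()
print(f"die area: {total:.2f} mm^2")
for name, (area, capacity) in POOLS.items():
    share, density = pool_stats(area, capacity, total)
    print(f"{name}: {share:.1f}% of die, {density:.2f} MB/mm^2")
```

On these figures MEM1 comes out at about 27.8% of the die and MEM0 at about 2.9%. The apparent density works out to roughly 0.79 MB/mm² for MEM1 against 0.47 MB/mm² for MEM0, which may hint that the two pools use different cell layouts, assuming the 2MB estimate and the outlined areas are right.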

To the left of the smaller eDRAM pool is a pool of SRAM, understood to be 1MB in size, which seems to be used as a texture cache in Wii mode. Its purpose in Wii U mode is unclear; it may also serve as a cache. Its use as a texture cache in Wii U mode would be puzzling, though, as it is on the opposite corner of the die from the DDR3 interface, and seemingly far from the texture units.

Interfaces

The interface running around the lower right corner of the die is the DDR3 memory interface (the DDR3 is known as MEM2).

Running along the top and left sides of the die, along with a small section on the upper right side of the die, are general purpose I/O (GP I/O). The GP I/O is likely dedicated in large part to communication with the CPU, but may also be used for lower-bandwidth off-chip communication, such as the Blu-Ray drive or SD card slot.

On the bottom left of the die there are two high-speed I/O (HS I/O) interfaces, such as SERDES (serialiser/deserialiser), which are used to achieve very high bandwidth over relatively few wires. Proposed applications of these include:

- Communication with the hardware that handles video transmission to the gamepad
- Communication with the CPU (to provide high-bandwidth/low-latency eDRAM access)
- USB interfaces
- SATA interface
- Flash memory interface
- HDMI

There are also two blocks on the right side of the chip above the DDR3 interface that are currently unknown. These may be part of the DDR3 interface, or may be I/O elements in and of themselves.

GPU Logic

(This is somewhat of a misnomer, as there's an ARM CPU and a DSP in there, but we're not certain where either of them are.)

The GPU logic consists of 40 blocks, which are apparently of 25 different types. They are labelled A-Y, and repeated blocks are numbered. The small orange/black units on these blocks are SRAM cells, and the type, quantity and location of the SRAM cells are central clues when it comes to discovering which blocks contain which components.

The blocks labelled N1-N8 appear to contain the SPUs. Judging by their size relative to other 40nm VLIW5 GPUs, it seems that they each contain 40 SPUs, giving a total of 320. As well as the changes in their grouping (VLIW5 SPUs are usually grouped in 20s), there seem to be changes to their register files (the SRAM cells around them).

It is generally assumed that the blocks labelled J1-J4 are the texture unit bundles (four in each for a total of 16). They are adjacent to neither the DDR3 interface nor the SRAM "cache", which would be an unusual, but certainly not impossible, placement for texture units. That they are located next to the SPUs, however, makes sense.

The location of the ROP bundles is unknown, but blocks U1 and U2 have been proposed as possibilities, due to their size and location. It is also possible that the ROP bundles are separated out into their constituent components, so may reside in asymmetrical blocks (or sets of blocks).

The ARM core (referred to as "Starbuck") is believed to be very similar (or even identical) to the "Starlet" ARM core on the Wii's GPU die. As such, it should be very small, possibly <1mm². Marcan believes that Starbuck is block Y, which seems likely given the size and SRAM configuration.

There is almost certainly a DSP somewhere on the die, although we know little to nothing about it. Like the ARM core, it should be pretty small.

Other potential functions of blocks on the GPU logic:

  • Command Processor and Thread Scheduler (not necessarily the same block)
  • Trisetup and rasterizer (R800 dropped that and delegated the workload to SPs)
  • Global Data Share (traditionally not very large, and likely encased nicely by some of the numerous embedded pools, in a much larger size)
  • A bunch of caches (vertex, texture) which could be really tiny or not so much (again, memory pools ahoy)
  • DMA engines
  • Ring buses
  • Tessellator (likely still sitting in fixed-function silicon)

It is likely that at least a couple of the blocks are used for Wii BC (see further discussion on this below).

It is worth noting at this stage that a large portion of the GPU logic is still unexplained. Even accounting for everything we know should be on there, there are a significant number of blocks left. There are a number of possibilities to consider. For one, it could simply be that there's obvious functionality we aren't considering. Otherwise there may be some customised units not usually present on GPUs. Alternatively, there's at least one crackpot theory that we're undercounting the SPUs, texture units and ROPs, chalking it up to an asymmetric shader design.

Wii Backwards Compatibility

The GPU is understood to provide full hardware level BC with Wii's GPU. Some of the components for this (e.g. MEM1 and MEM0) have already been explained, however the GPU logic itself needs to be accounted for. In considering this, the following comment from Ko Shiota, the Deputy General Manager of Nintendo's Product Development Department, is worth reading:

Shiota said:
Yes. The designers were already incredibly familiar with the Wii, so without getting hung up on the two machines' completely different structures, they came up with ideas we would never have thought of. There were times when you would usually just incorporate both the Wii U and Wii circuits, like 1+1. But instead of just adding like that, they adjusted the new parts added to Wii U so they could be used for Wii as well.

This seems a clear indication that there is not a full 1:1 copy of the Wii's Hollywood GPU on the die, but that at least some parts of its functionality are being handled by Wii U components.

Hollywood was about 72mm² on a 90nm process (inc embedded RAM and ARM), so even if there is a 1:1 copy, it would only be expected to take up around 10-20% of the space on a 146.48mm² 40nm die. Given Shiota's comments, the actual amount of GPU logic dedicated purely to Wii BC may be as low as 5-10%.
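As a rough sanity check on that estimate, here is a minimal Python sketch of ideal die-shrink scaling, where area is assumed to scale with the square of the node ratio. That is a best-case assumption: I/O, analogue blocks and embedded memory generally shrink less well than logic, so the real figure would likely sit above this floor. The function name is hypothetical.

```python
HOLLYWOOD_AREA_MM2 = 72.0   # approximate, on 90nm, as quoted above
LATTE_AREA_MM2 = 146.48     # measured die area, 40nm


def ideal_shrink_area(area_mm2, old_node_nm, new_node_nm):
    """Area after a shrink, assuming perfect (linear ratio)^2 scaling."""
    scale = (new_node_nm / old_node_nm) ** 2
    return area_mm2 * scale


shrunk = ideal_shrink_area(HOLLYWOOD_AREA_MM2, 90, 40)
share = shrunk / LATTE_AREA_MM2
print(f"Hollywood at 40nm (ideal): {shrunk:.1f} mm^2, {share:.1%} of Latte")
```

With perfect scaling, a full Hollywood copy would be a little over 14mm², or just under 10% of the die, which lines up with the bottom of the 10-20% range once imperfect scaling is allowed for.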

It is worth considering what Wii U components may provide BC for Hollywood functions. A possible candidate for this is block J1. If the blocks J1-J4 are indeed texture unit bundles, then J1 would seem to have some difference to the other three, due to its slightly larger size. This would be explained if J1 had extra hardware to allow it to also function as the texture unit for Wii mode.

Comparison Die Photos:

RV770 - Radeon HD4870, R700 series, 55nm
Llano - APU with Evergreen graphics, 32nm
Flipper - Gamecube GPU, 180nm
Latte/Flipper comparison (assuming both at the same manufacturing node)

Info Directly From Chipworks:

Jim Morrison said:
Been reading some of the comments on your thread and have a few of my own to use as you wish.

1. This GPU is custom.
2. If it was based on ATI/AMD or a Radeon-like design, the chip would carry die marks to reflect that. Everybody has to recognize the licensing. It has none. Only Renesas name which is a former unit of NEC.
3. This chip is fabricated in a 40 nm advanced CMOS process at TSMC and is not low tech
4. For reference sake, the Apple A6 is fabricated in a 32 nm CMOS process and is also designed from scratch. It’s manufacturing costs, in volumes of 100k or more, about $26 - $30 a pop. Over 16 months degrade to about $15 each
a. Wii U only represents like 30M units per annum vs iPhone which is more like 100M units per annum. Put things in perspective.
5. This Wii U GPU costs more than that by about $20-$40 bucks each making it a very expensive piece of kit. Combine that with the IBM CPU and the Flash chip all on the same package and this whole thing is closer to $100 a piece when you add it all up
6. The Wii U main processor package is a very impressive piece of hardware when its said and done.

Trust me on this. It may not have water cooling and heat sinks the size of a brownie, but its one slick piece of silicon. eDRAM is not cheap to make. That is why not everybody does it. Cause its so dam expensive

Randy from Chipworks said:

Annotated Die Photo From Marcan:

@marcan said:

Thanks

Thanks to Fourth Storm, wsippel, blu, Durante for helping organise and analyse, and all the folks who chipped in to buy the photo (who will be getting their money back from FS shortly!). And, of course, huge thanks to Chipworks for doing this for us!

Links

Chipworks write-up

Digital Foundry article

OP will be updated as we go
 

z0m3le

Banned
8 shader units with 20 alus in each = 160ALUs @ 550mhz = 176GFLOPs + 24gflops+ of fixed function shaders

Update: sorry I wasn't here to post earlier, the 176gflops is probably correct, since this part does seem to be vliw based. However there is almost certainly at minimum Hollywood inside this die as well considering how Wii u handles backwards compatibility. @550mhz that would give Hollywood 24gflops. Fixed functions are far better at doing their job than programmable shaders, but can do little else. It is more capable than 360, but it is impossible to really compare beyond that.

update 2: the more I research AMD GPUs, the more possible it seems that it is using more ALUs in each SPU. If it was based on VLIW5 R700, it would in fact be 176GFLOPs, but if it was based on something like Trinity's VLIW4, it would be 32 ALUs per SPU, leaving it with 282GFLOPs, and if it was still based on VLIW5, but a custom chip, it would have 40 ALUs per SPU like some people are speculating, leaving it with 352GFLOPs. Hollywood is likely still along for the ride, and that leaves it with 24GFLOPs+ of fixed function shaders. Sorry for jumping the gun, I was unaware that VLIW5 had chips with 40 ALUs per SPU, 20 was the norm.
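The three figures above all come from the same back-of-envelope formula, sketched here in Python. It assumes 8 shader blocks, a 550MHz clock, and 2 FLOPs per ALU per cycle (counting a multiply-add as two operations), which is the usual way peak numbers are quoted; none of these are confirmed for this chip, and the function name is just for illustration.

```python
SHADER_BLOCKS = 8          # blocks N1-N8 in the labelled die photo
CLOCK_MHZ = 550            # assumed GPU clock
FLOPS_PER_ALU_CYCLE = 2    # assumption: one multiply-add per cycle


def peak_gflops(alus_per_block, blocks=SHADER_BLOCKS, clock_mhz=CLOCK_MHZ):
    """Theoretical peak GFLOPS for a given ALU count per shader block."""
    total_alus = alus_per_block * blocks
    return total_alus * FLOPS_PER_ALU_CYCLE * clock_mhz / 1000


# The three configurations speculated about in this post.
for label, alus in [("VLIW5, 20 per block", 20),
                    ("VLIW4-style, 32 per block", 32),
                    ("custom VLIW5, 40 per block", 40)]:
    print(f"{label}: {peak_gflops(alus):.1f} GFLOPS")
```

These are theoretical peaks for the programmable shaders only, so they say nothing about any fixed-function hardware or about how much of that peak real code would reach.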
 

Durante

Member
This is what I currently have:

c10234f5_d813301_13195mk1y.png


Red are the (presumed) shader clusters. Blue are the (presumed) TMUs.

The rest of the colorization are simply self-similar components that may not be laid out exactly in square blocks, but are still recognizable.
 

Shaanyboi

Banned
So... judging by the other thread, this is.... bad?

But like... objectively poor, or just "oh internet, you hyperbolic drama queen, you" bad?
 

cyberheater

PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 PS4 Xbone PS4 PS4
Looking forward to the discussion on this.
 
Man, I always feel so dumb coming into these threads. That's why I never post in them.
Don't know why I keep clicking 'em though. lol
 

Kokonoe

Banned
Thread of the day. This is quite interesting, and as others have stated, thank you chipworks and fellow GAF members for doing this.
 

amaron11

Banned
"oh internet, you hyperbolic drama queen, you" bad?

Since it is STILL apparent that we know very little about how everything works together, it was just a waste of money.

There aren't going to be any definite statements made. Just more "it appears weak, but we don't know for sure".

The games will always be the proof in the pudding.
 
Durante said:
This is what I currently have:

c10234f5_d813301_13195mk1y.png

Red are the (presumed) shader clusters. Blue are the (presumed) TMUs.

The rest of the colorization are simply self-similar components that may not be laid out exactly in square blocks, but are still recognizable.

NOOOO! I was working on one D=!!

Daaammnnn yoooou.

I still love you =/
 

LeleSocho

Banned
I would like to thank ChipWorks and the GAF members who made this possible.
Can't wait to see the fully updated OP
 

Thraktor

Member
A bit of my own speculation from an email earlier:

Regarding the thing with 7 units in the bottom left, would it make sense as a 7-channel DSP? I've heard that the DSP is 8-channel, but don't know if we've got confirmation on it. A 7-channel DSP might make sense, with 5 surround channels and 2 gamepad channels (with the subwoofer channel simply comprising the low frequencies of the 5 surround channels).
 

Meelow

Banned
No doubt the reason why Nintendo doesn't give detailed specs. On paper, it looks underwhelming but in real world scenarios it's surprisingly competent. At this point, I couldn't care less how many FLOPs the GPU has. I have no doubt the Wii U is a highly specialized machine like the GameCube was and the games will look and perform amazing.

From the spec thread if this means anything.
 

AzaK

Member
Durante said:
This is what I currently have:

c10234f5_d813301_13195mk1y.png

Red are the (presumed) shader clusters. Blue are the (presumed) TMUs.

The rest of the colorization are simply self-similar components that may not be laid out exactly in square blocks, but are still recognizable.

Durante, can you use imgur at all by any chance? Our firewall here is blocking that as a torrent site :)

Meelow said:
From the spec thread if this means anything.

The problem is, if Nintendo has taken the "use fixed function custom shit to do anything good" route, third parties will treat it like the Wii - i.e. "Too hard and not worth the investment".

Also, Fourth Storm has said that Randy from Chipworks is going to give a bit of analysis on it.
 

kinggroin

Banned
So we know for sure, it's 20 ALUs per shader unit?


If so, yuck. YUCK.

Also, that actually makes some of what we've seen coming out, all the more impressive considering this news and a 44w power draw.
 

Pagusas

Elden Member
200 for a photo? might as well buy a wii u and tear it apart to look inside lol

you still wouldnt have the equipment to take such a photo.


And wow, the Wii U is even weaker than I thought it would be, and I was in the "1.5x current gen" group. Jesus Nintendo, you cheap SOBs.
 

deviljho

Member
Thank you Chipworks! And thank you to everyone else involved!

Best part from Chipworks:

Since (at the time of publication for this blog) we are only 1 day away from the release of Dead Space 3, we thought that the gaming community might be able to put their funds to better use.

LOL
 
This really is perfectly timed to scupper the "NeoGAF as a hive of scum and villainy" GTTV spot, isn't it?

Because I expect most major gaming sites to run with this.
 

Durante

Member
Shaanyboi said:
So... judging by the other thread, this is.... bad?

But like... objectively poor, or just "oh internet, you hyperbolic drama queen, you" bad?

If it's 8 shader clusters, then this by itself is half of the lowest end of all reasonable predictions, in terms of GFLOPs. So that's very bad. But there's quite a bit of "stuff" on the die which we don't know about yet. And even the shader clusters are not certain.

Oh, and by the way, chipworks are the real heroes here. I mean sure, they get some PR from it, but they still did awesome work for free specifically for us.
 
AzaK said:
Durante, can you use imgur at all by any chance? Our firewall here is blocking that as a torrent site :)

The problem is, if Nintendo has taken the "use fixed function custom shit to do anything good" route, third parties will treat it like the Wii - i.e. "Too hard and not worth the investment".

Also, Fourth Storm has said that Randy from Chipworks is going to give a bit of analysis on it.

IMGUR
 

pulsemyne

Member
What you take away from the picture is that it's a very custom design. Sadly that means that calculating its FLOPS could be very difficult. It also seems like there's a lot of fixed function stuff, so that makes the GFLOPS number even more pointless.
The whole thing does explain the ports problems though. It could take a while for devs to push the chip.
 