
WiiU technical discussion (serious discussions welcome)

StevieP

Banned
This has little to do with the hardware technology/optimization, and a lot to do with the O/S.

Hardware-wise, the system should be able to load the browser 'instantly' (just keep an N MB WebKit browser resident in that massive 1GB of system RAM).

Software-wise, I don't think Nintendo has a particularly long and glorious history of writing modern operating systems :(.

Alternatively, maybe Nintendo actually use that 1GB of system RAM for something else.

I had a thought. What if, just like the Wii, exiting the application basically "reboots" the system? Except on Wii you can tell it was a reboot.

FYI, the Wii did this, I believe, to try to prevent any unauthorized code from running. The Wii U could be doing something similar but more transparent to the user, aside from a longer load time.
 

Durante

Member
Not likely. Since most console games are using lower res alpha, they would have to completely botch access to the depth buffer, like DX9, and then some (how RTs are handled). :p Shouldn't be an issue for a console.

It's a fairly straightforward operation. Draw scene, downsize depth, draw particles to the off-screen buffer, compare, upscale & merge back into the scene.

Otherwise, the effect of resolution is pretty much linear in a high bandwidth scenario (fillrate, pixel shading costs are next), but you're reducing bandwidth consumption anyway when reducing particle res.
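To make those steps concrete, here's a rough CPU-side sketch of the half-res particle pass (the resolutions, the max-depth reduction and the blend factors are illustrative assumptions, not anyone's engine code):

Code:
// Rough CPU-side sketch of the low-res particle pass: downsize depth,
// draw particles off-screen against it, then upscale & merge.
// Resolutions, the max-depth reduction and the blend are illustrative only.
#include <algorithm>
#include <vector>

struct Color { float r, g, b, a; };

int main()
{
    const int W = 1280, H = 720;      // full-res scene
    const int w = W / 2, h = H / 2;   // half-res particle buffer

    std::vector<float> depth(W * H, 1.0f);        // scene depth (1.0 = far)
    std::vector<Color> scene(W * H, {0, 0, 0, 1});
    std::vector<float> lowDepth(w * h);
    std::vector<Color> lowBuf(w * h, {0, 0, 0, 0});

    // 1) Downsize depth: keep the farthest sample of each 2x2 block so the
    //    low-res particles are depth-tested conservatively.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float d = 0.0f;
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx)
                    d = std::max(d, depth[(y * 2 + dy) * W + (x * 2 + dx)]);
            lowDepth[y * w + x] = d;
        }

    // 2) Draw particles into the off-screen buffer, comparing against the
    //    downsized depth (one dummy full-screen "smoke" particle here).
    const float particleDepth = 0.5f;
    for (int i = 0; i < w * h; ++i)
        if (particleDepth < lowDepth[i])
            lowBuf[i] = {1.0f, 0.5f, 0.2f, 0.25f};

    // 3) Upscale & merge back into the scene (nearest-neighbour blend).
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            const Color& p = lowBuf[(y / 2) * w + (x / 2)];
            Color& s = scene[y * W + x];
            s.r = s.r * (1 - p.a) + p.r * p.a;
            s.g = s.g * (1 - p.a) + p.g * p.a;
            s.b = s.b * (1 - p.a) + p.b * p.a;
        }
    return 0;
}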
Yeah, I can't really think of any funny business going on there, which is why I'd like someone to do a more in-depth analysis of frame drops with lots of alpha blending. Does it exist at all (for now we only have vague reports), and if so, is it more or less pronounced than on 360?

So what are the chances of the WiiU being emulated on PC? Would the fact that it's an OoO CPU make it more difficult or impossible?
The chances are bad for quite some time into the future simply due to performance alone. However, it should be much more viable to emulate at some point than e.g. PS3 (because of Cell).
 

dumbo

Member
I had a thought. What if, just like the Wii, exiting the application basically "reboots" the system? Except on Wii you can tell it was a reboot.

On the PS3 you have the ribbon UI. It's a system application. You have a messaging application, which is part of this core UI; you can flick between them rapidly, AFAIR.

Then you have the PSN shop. It's an entirely separate application. When you select this from the ribbon UI the PS3 unloads the ribbon UI and loads the shop. (loooooonnnng delay)

Given the memory-constrained nature of the PS3, that's not a huge shock. The PSN shop is a large application and probably couldn't fit in memory at the same time as the ribbon UI.

---

The Wii-U seems to do the same thing with the browser. That might be due to 'runtime resources' or simply that Nintendo didn't have time to build the browser into their main UI.

Either way, it's a very strange choice for a 2GB console in 2012.
 

USC-fan

Banned
Yes, as I posted above, the HD4850M is 45 watts and 800 GFLOPs at 55nm, making it ~17.8 GFLOPs/watt. Given the assumed 40nm nature of the chip (it could be lower, but that's highly unlikely), 20+ GFLOPs/watt is a given for the mobile parts of that series.

It's also worth noting that the Wii U sells at a very small loss, so the chance that the GPU is somewhat costly is actually high. Flash memory and DDR3 RAM aren't going to cost them much, and the CPU should be fairly cheap as well; once you consider it's an MCM, that brings down wattage and costs further. So a mobile GPU of the R700 series might actually be the best fit.

Considering the 25-30 watt estimate for the GPU (under heavy load), the GFLOPs would land somewhere between these numbers:
25 W × 20 GFLOPs/W = 500 GFLOPs
30 W × 20 GFLOPs/W = 600 GFLOPs
25 W × 26 GFLOPs/W = 650 GFLOPs
30 W × 26 GFLOPs/W = 780 GFLOPs

I'm just doing mobile R700 GPU math @ 40nm. These should be fairly safe assumptions, btw, but it doesn't mean the Wii U is using a mobile part, just that these numbers are possible for R700 given that these chips actually exist. (The HD 4830M uses ~28-30 watts with 768 GFLOPs.)
That is not possible. You are using binned parts.

So it is not possible. That's why even at 28nm these cards don't get close to those numbers.
 
Assuming your reasons are what I think they are, I think Shin'en's point can't be stressed enough: the Wii U can't be judged at face value alone.


"The CPU and GPU are a good match. As said before, today’s hardware has bottlenecks with memory throughput when you don’t care about your coding style and data layout. This is true for any hardware and can’t be only cured by throwing more megahertz and cores on it. Fortunately Nintendo made very wise choices for cache layout, ram latency and ram size to work against these pitfalls. Also Nintendo took care that other components like the Wii U GamePad screen streaming, or the built-in camera don’t put a burden on the CPU or GPU."

The only consoles that Shin'en has developed for are Nintendo consoles. In a sense, they really don't know how good the hardware is compared to the other two. And besides, since they have been developing for Nintendo consoles since 1999, they have probably learned a few tricks that Nintendo used and got the game running super fast.
 
I'm just going to talk about the wattage performance of the HD4800 series. I don't really want to argue about whether Nintendo is using this or not; since the Wii U is out, what we decide here won't make a bit of difference, and this is all speculation anyway. Thanks ahead of time.

So, the HD4870 is 8 GFLOPs/watt and runs at 750MHz, while the HD4830 (desktop) is 7.75 GFLOPs/watt and uses 640 shaders. These are both 55nm parts; moving to 40nm for the HD4770 allowed it to reach 12 GFLOPs/watt, running 640 shaders at 750MHz. That's roughly a 50% increase in power efficiency. Now let's look at the mobile end.

http://www.notebookcheck.net/AMD-ATI-Mobility-Radeon-HD-4850.13975.0.html


Those are 55nm parts; the HD4850M is 800 GFLOPs @ 45 W, making it about 17.8 GFLOPs per watt. So a 40nm mobile part like the HD4830M would yield ~26.7 GFLOPs per watt, and at 25 watts that's ~668 GFLOPs.

This is based on what the R700 series does in the desktop numbers, and assumes the same efficiency scaling applies to the mobile parts when moving down to the 40nm process.
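As a quick sanity check on that scaling (the 1.5x efficiency factor and the 25-30 W budget are the only assumptions being exercised here):

Code:
// Sanity check of the scaling above: HD4850M GFLOPs/W at 55nm, a ~1.5x
// efficiency gain assumed for 40nm, applied to a 25-30 W budget.
#include <cstdio>

int main()
{
    const double gflops_per_watt_55nm = 800.0 / 45.0;               // ~17.8 (HD4850M)
    const double gflops_per_watt_40nm = gflops_per_watt_55nm * 1.5; // ~26.7 assumed

    const double watts[] = {25.0, 30.0};
    for (double w : watts)
        std::printf("%.0f W @ 40nm -> ~%.0f GFLOPs\n", w, w * gflops_per_watt_40nm);
    return 0;
}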
This is why I have chosen to go with the Radeon Mobility 4830 as the GPU base recently (with the number of shaders cut down from 640 to 480 to make room for the eDRAM and *possibly* the DSP, unless that little chip on the MCM IS the DSP). It's small and has good performance/watt.
 

z0m3le

Banned
That is not possible. You are using binned parts.

So it is not possible. That's why even at 28nm these cards don't get close to those numbers.

Actually, the PS4 rumor being pushed around in the rumor thread has it using an HD7970M.

Mobile parts, especially on older processes, are much easier to get.

This is why I have chosen to go with the Radeon Mobility 4830 as the GPU base recently (with the number of shaders cut down from 640 to 480 to make room for the eDRAM and *possibly* the DSP, unless that little chip on the MCM IS the DSP). It's small and has good performance/watt.

Considering how much room the GDDR5 controller takes up on the chip, and all of the other things that would be removed, there is plenty of room for the chip to stay intact, but it's really the performance we care about, not shader count.
 

The_Lump

Banned
The only consoles that Shin'en has developed for are Nintendo consoles. In a sense, they really don't know how good the hardware is compared to the other two. And besides, since they have been developing for Nintendo consoles since 1999, they have probably learned a few tricks that Nintendo used and got the game running super fast.


Could be. Although remember, just because Shin'en the company has mainly developed on Nintendo consoles doesn't mean everyone who works at Shin'en has only worked on Nintendo consoles. Not to mention they are privy to the same info we are (and then some, probably) regarding the other two. So I'm sure they aren't making a comparison without any knowledge of the subject.
 

USC-fan

Banned
Just stop, you have no clue what you're talking about.
Actually, the PS4 rumor being pushed around in the rumor thread has it using an HD7970M.

Mobile parts, especially on older processes, are much easier to get.



Considering how much room the GDDR5 controller takes up on the chip, and all of the other things that would be removed, there is plenty of room for the chip to stay intact, but it's really the performance we care about, not shader count.
 

z0m3le

Banned
Just stop, you have no clue what you're talking about.
Try to discuss the topic, not me. If it's so clear to you why I'm wrong about what I've said, you should be able to communicate that rather than ask me to stop discussing the topic.

Please take any problems with me to PM; this thread doesn't need to end up like the other one.
 

JordanN

Banned
The only consoles that Shin'en has developed for are Nintendo consoles. In a sense, they really don't know how good the hardware is compared to the other two. And besides, since they have been developing for Nintendo consoles since 1999, they have probably learned a few tricks that Nintendo used and got the game running super fast.
When you have statements like this,
"This is true for any hardware"
It doesn't matter if they developed for Nintendo only. It's universal.

The rest of your post attempts to write off Shin'en for either being "too good" or because "they developed for Nintendo only so their say is worthless", while ignoring the issue at hand. It's not their fault they chose to learn the hardware they're on, nor should it be used against them.
 

MDX

Member
It could very well be that Microsoft and Sony make use of DDR4 for their next consoles.

Some people will likely assume DDR4 is the better choice.
Is it? What hurdles will MS and Sony face by choosing this RAM?
Will it drive up the costs of their console or bottleneck their system?

I found this article very interesting:

The DDR4 memory interface will double the clock speed of earlier DDR3 devices, but some fundamental DRAM timing parameters will remain at the same number of nanoseconds, effectively doubling the number of memory clock cycles required for those timing parameters to elapse.

While every other part of the computing infrastructure seems to be getting faster, the latency trend for DRAM - as measured by DRAM clock cycles - has steadily increased over the last 3 generations of DRAM. The Read Latency and some other key timing parameters have increased from 2 clock cycles in DDR1 up to 11 clocks for high-speed DDR3-1600.

[Figure: DRAM read latency in clock cycles rising across DDR generations]


While the perception is that each successive generation of DDR DRAM is roughly twice as fast as the last, what's actually happening is that the DDR core timing is staying relatively constant as measured in nanoseconds and thus is increasing when measured in clock cycles. The doubling of frequency and bandwidth while keeping DRAM core timing constant is achieved in DRAM by exploiting parallelism within the DRAM array...
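To put rough numbers on that trend (the ~13.75 ns figure and the DDR4 row are illustrative assumptions, not quoted specs):

Code:
// If the core timing stays roughly constant in nanoseconds while the I/O
// clock doubles each generation, the same delay costs ever more cycles.
// The 13.75 ns figure and the DDR4 row are assumptions for illustration.
#include <cstdio>

int main()
{
    const double latency_ns = 13.75;
    struct Gen { const char* name; double io_clock_mhz; };
    const Gen gens[] = {
        {"DDR-400  ",  200.0},
        {"DDR2-800 ",  400.0},
        {"DDR3-1600",  800.0},
        {"DDR4-3200", 1600.0},
    };
    for (const Gen& g : gens)
        std::printf("%s: %.2f ns = %.1f clock cycles\n",
                    g.name, latency_ns, latency_ns * g.io_clock_mhz / 1000.0);
    return 0;
}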

Sony and MS might be faced with some serious latency. How will they counteract this?

The System Impact of increasing latency

Increasing latency while keeping all other things in the system equal will generally result in reduced CPU processing efficiency (as measured by the ratio of useful clock cycles to wait states) as the CPU needs to insert additional wait states to compensate for having to wait more clock cycles for DRAM data. This effect is well known and forces architectural changes in the CPU and the rest of the system to compensate for increased latency in the DRAM.

Some systems add on-chip cache memory to the CPU with less latency than external DRAM, and use that cache memory preferentially over external DRAM. The more cache memory that exists on chip, the fewer external DRAM transactions will occur, the CPU waits for DRAM less, and the CPU's efficiency is improved.

So, like Nintendo, can we expect eDRAM to be used? And more than what the Wii U is using?

The negative aspects of adding cache memory are mainly issues of cost - external DRAM is very inexpensive, with historical prices as low as $0.70 per billion bits, whereas on-chip memory can be significantly more expensive than off-chip DRAM. Also, there is a practical limit on how many bits of cache memory can exist on the CPU die.

We don't know for sure how the eDRAM is set up in the Wii U, but I maintain that the CPU and GPU have their own eDRAM.
From Iwata Asks:
CPU
data can be processed between the CPU cores and with the high-density on-chip memory much better, and can now be done very efficiently with low power consumption
This is what IBM has been saying since 2011.

GPU
The GPU itself also contains quite a large on-chip memory.
Keyword: ALSO

So each LSI has its own eDRAM for internal processing.

DDR4
Another commonly used technique to improve the efficiency of the CPU is to add (or increase the size of) an out-of-order execution pipeline in the CPU, such that read data for future commands may be fetched in advance of their execution, and write data storage may be delayed. This technique does increase CPU efficiency, but it comes at the expense of increased CPU complexity, area and power.

We all assume Nintendo has done this. Most likely Sony and MS will too.

The problem of DRAM latency is exacerbated by multi-core designs and SoC architectures where there are a number of clients competing for DRAM bandwidth - any client in the system is likely to experience increased latency simply because there are other masters who are already using the DRAM.

Interesting. Isn't there a rumor stating that the Wii U CPU has a "master core"?
Was this done to decrease the latency of having a multi-core chip?
But more importantly, how many cores do MS and Sony plan on having for their consoles?
Is 6 to 8 still the going bet for MS? And will Sony make use of SoC?

A theoretical method to improve the CPU latency would be to simply reduce the latency of the DRAM controller. While this is correct in theory - and low latency is a design goal of Cadence's memory controller IP solutions - too much simplification in the name of latency can reduce system performance.

Is this what Nintendo did? I don't think so, but some assume they probably did. I think Nintendo more likely did something similar to this:

An advanced memory controller - for example, Cadence's DDR4/ DDR3 memory controllers - will include a look-ahead queue or pipeline for upcoming transactions to allow the memory controller to prepare the DRAM for transactions in the pipeline.

But that might not be sufficient:

DDR DRAM requires a delay of tRCD between activating a page in DRAM and the first access to that page. At a minimum, the controller should store enough transactions so that a new transaction entering the queue would issue its activate command immediately and then be delayed by execution of previously accepted transactions by at least tRCD of the DRAM. At lower speeds of operation, for example DDR-800, the minimum amount of lookahead would be two cache lines; however with the increasing tRCD parameter of high-speed DDR4 at DDR-3200, most memory controllers would need a look-ahead queue storing a minimum of 6 cache line access requests to get full bandwidth out of the memory, as shown below.
[Figure: Minimum look-ahead requirement of high-speed DRAM to ensure full bandwidth. 16-bit and 32-bit systems assume 32-byte cache lines; 64-bit systems assume 64-byte cache lines.]
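Working the article's figures backwards (the tRCD values and the 64-bit bus / 64-byte cache line configuration are assumptions chosen to reproduce its 2-line and 6-line numbers):

Code:
// How many cache-line bursts fit inside tRCD at a given data rate?
// tRCD values (~12.5 ns and ~13.75 ns) and the 64-bit bus / 64-byte line
// configuration are assumed, chosen to reproduce the article's figures.
#include <cmath>
#include <cstdio>

static int min_lookahead(double data_rate_mts, double trcd_ns,
                         int line_bytes, int bus_bytes)
{
    const double beats   = double(line_bytes) / bus_bytes;  // transfers per cache line
    const double line_ns = beats / data_rate_mts * 1000.0;  // time to burst one line
    return int(std::ceil(trcd_ns / line_ns));                // lines needed to cover tRCD
}

int main()
{
    std::printf("DDR-800 : %d cache lines\n", min_lookahead( 800.0, 12.5,  64, 8));
    std::printf("DDR-3200: %d cache lines\n", min_lookahead(3200.0, 13.75, 64, 8));
    return 0;
}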

Another problem that is exacerbated by high-speed DRAM is the effect of the activate-to-activate delay of the same bank in DRAM - the so-called tRC delay. If the memory controller receives a transaction to a recently-accessed bank, the memory controller must delay the next activate command to that bank such that tRC is not violated.

So Sony and MS would also need an advanced memory controller, probably more robust than what the Wii U has. For example:

[Figure: back-to-back activates to different rows of the same DRAM bank must be separated by the tRC delay]

The figure above shows us that if we have back-to-back transactions to different rows in the same bank, the system would have to separate the two commands by as much as 72 clock cycles.
http://www.chipestimate.com/techtalk.php?d=2011-11-22


Sony and MS might have to implement a re-ordering controller.


I'm wondering how much the R&D costs would increase by implementing faster RAM like DDR4.
Because if they don't solve the latency problem, it sounds like they are going to have beefy, but constipated, machines.
 

Thraktor

Member
Interesting post, MDX, and it seems to confirm what I'd been thinking while I was researching possible RAM solutions for the Wii U; latency in general-purpose RAM is on a significant long-term upward trend. The only product (other than SRAM and pseudo-SRAM) bucking this trend is Micron's RLDRAM, which seems to be designed specifically for networking hardware, and is only available in much lower densities than DDR3, never mind the probable cost.

As far as PS4 and XBox3 are concerned, this is going to be a particular problem if the reports that they're using Jaguar-based CPUs are true. A quad-core Jaguar chip has just 2MB of L2 cache, or 512KB per core, which is equal to the lesser caches on the Wii U CPU, and is a quarter the amount that Wii U's mystery core gets. For heavily latency-limited code like pathfinding, that small cache combined with high-latency DDR4 is going to lead to a lot of wasted cycles (although at least Jaguar has an advantage over Xenon/Cell in this particular case, as it supports out of order execution).

If they go the Bulldozer (or derivative) route then things would be a bit better, as it's designed to have 2MB of L2 cache per "module" (a Bulldozer module is sort of halfway between a pair of cores and a single, multi-threaded core), or 1MB per thread, and an extra 8MB of shared L3 between the four modules, again 1MB per thread. Due to the large die-size of Bulldozer (315mm²), though, they'd have to slim the chip down to fit within cost and heat requirements, and that may mean dropping down to 2 modules/4 threads and keeping the per-thread cache quantities the same, or skimping on cache to keep as many threads as possible (SRAM cache is pretty transistor-intensive). Alternatively, the reason for the switch to Jaguar might simply be that they realised it made more sense to bulk up four Jaguar cores (with extra cache, for example) than to try to strip down an eight-threaded Bulldozer chip. Actually, Bulldozer's poor performance per watt and performance per mm² characteristics make it a pretty poor base for a console CPU, which should be built around those two properties.

I think I've wandered off-course a bit. Where was I? Oh, yes, latency can be a big issue. I strongly suspect that Nintendo have put a lot of emphasis on optimising the memory controller's real-world latency and throughput, possibly with help from AMD, or possibly with a third party such as Cadence.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
While this is correct, I think that the thousands of cycles lost to cache thrashing (and I don't disagree that this can add up, a classic death by a thousand cuts scenario) is not as bad as the millions of cycles lost when your physics core goes idle for lack of work and your AI core ends up bottlenecking the frame. We may have to agree to disagree here.
Well, it's all a matter of job granularity, apparently. The more irregular the job durations, the more the jobs' CPU affinity will need to yield, to avoid situations like the ones you refer to. I still think that even with badly irregular job durations, CPU affinity is something worth keeping track of in the job queues. But surely it can vary a lot by situation.
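As a toy illustration of affinity that yields rather than letting a core idle (my own sketch, not any real scheduler):

Code:
// Toy job queue that tracks per-job core affinity but yields it rather
// than let a core go idle. Single-threaded on purpose - it's only meant
// to show the policy, not to be a real scheduler.
#include <cstdio>
#include <deque>
#include <optional>

struct Job {
    int preferred_core;  // core whose caches likely still hold this job's data
    int id;
};

class AffinityQueue {
    std::deque<Job> jobs_;
public:
    void push(Job j) { jobs_.push_back(j); }

    // Prefer a job affine to `core`; otherwise take the oldest job of any
    // affinity, so irregular job durations never leave a core starving.
    std::optional<Job> pop(int core) {
        for (auto it = jobs_.begin(); it != jobs_.end(); ++it)
            if (it->preferred_core == core) { Job j = *it; jobs_.erase(it); return j; }
        if (jobs_.empty()) return std::nullopt;
        Job j = jobs_.front();
        jobs_.pop_front();
        return j;
    }
};

int main()
{
    AffinityQueue q;
    q.push({0, 100});
    q.push({1, 101});
    if (auto j = q.pop(1)) std::printf("core 1 runs job %d\n", j->id);  // its own job
    if (auto j = q.pop(2)) std::printf("core 2 runs job %d\n", j->id);  // steals job 100
    return 0;
}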

The point of starting with the PS3 code is that the data locality problem has already been solved, as the data has already been parceled out into small chunks that can easily fit into cache or EDRAM scratchpad. It may not be necessary to even use the EDRAM scratchpad if each core's cache can be controlled effectively, although it may still be beneficial to preload the next job's data into scratchpad (as I'm assuming it's going to be quicker to warm the cache from EDRAM than main mem) if you already know what the next job is going to be.
Exactly, warming the caches from eDRAM should be beneficial, but what I disagree with is the attempt to fit those jobs into the SPUs' shoes - those just don't match. There's absolutely no reason for the U-CPU jobs to confine themselves to localities of 256KB each - partitioning your "L3" that way would just be inefficient - you have to deal with unnecessary data redundancy (i.e. copies of the same data in each SPU locale), artificial data cutoff, etc.

Honestly though (and don't take this the wrong way) - I'm not really seeing a counter-proposal from you here, so if this underutilizes the platform (how?), can you explain to me how you would ideally break up the work for the Wii U?
I think that you are already getting my idea - keep a designated portion of eDRAM as a form of "L3" scratchpad, and let the U-CPU caches get populated from there. If possible, use some form of (GPU) DMA to bulk-populate the L3 itself.
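In sketch form the idea is just double-buffering the next job's data into the scratchpad while the current one runs - plain memcpy and a spare thread standing in for the eDRAM and the DMA engine here, since this is only meant to show the shape of it:

Code:
// Double-buffered "scratchpad" sketch: while the current job works out of
// the warm buffer, the next job's chunk is bulk-copied in. Plain memcpy on
// a spare thread stands in for the eDRAM and the DMA engine; the last chunk
// is skipped to keep the sketch short.
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

int main()
{
    const std::size_t CHUNK = 256 * 1024;                 // per-job working set
    const std::size_t JOBS  = 16;

    std::vector<unsigned char> main_ram(JOBS * CHUNK, 1); // "DDR3"
    std::vector<unsigned char> scratch[2] = {             // "eDRAM L3", double-buffered
        std::vector<unsigned char>(CHUNK),
        std::vector<unsigned char>(CHUNK)};

    long long sum = 0;
    std::memcpy(scratch[0].data(), main_ram.data(), CHUNK);   // prime the first chunk

    for (std::size_t job = 0; job + 1 < JOBS; ++job) {
        // "DMA" the next chunk into the other buffer...
        std::thread dma([&, job] {
            std::memcpy(scratch[(job + 1) % 2].data(),
                        main_ram.data() + (job + 1) * CHUNK, CHUNK);
        });
        // ...while the current job reads only from warm scratchpad memory.
        for (unsigned char b : scratch[job % 2]) sum += b;
        dma.join();
    }
    return int(sum & 0x7f);
}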

We may also be trying to answer different questions in our head here. I guess I am hearing that all the multiplatform games run horribly, which shocked me because I guess I bought into the hype that it would be a little bit more powerful than PS3/Xbox 360. So I guess I am coming from it from the question of "well, if you had to port an existing engine over" which is pretty much the situation that everyone who is not working at Nintendo is in, what would you do?
I'd port the pipeline as verbatim as possible and after spotting the bottlenecks I would cut off workloads from the hot locations, while considering the possibility to add some extra bells and whistles where there's unused headroom. Basically, I'd adjust the workloads without re-writing the pipeline. Of course, all that given the hypothetical power to decide what the project should end up looking like.

Anyhow, a small detour for this thread: a prima-vista per-clock comparison between Broadway (aka Gekko@729MHz) and my netbook's Ontario APU (Bobcat @1.333GHz with TurboBoost).

Basically, the test builds the same C++ matmul routine, trying to use as much as possible of the available CPU vector facilities, then times a large number of iterations over the same arguments (so with some luck all data are sitting in L1, but without explicitly assigning any thread affinities), while fooling the compiler into thinking said arguments change at each iteration.

Toolchains:

Broadway: g++ (Debian 4.4.5-8) 4.4.5
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -mpowerpc -mcpu=750 -mpaired -funroll-loops -O3 -DNDEBUG

Ontario: g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -msse3 -mfpmath=sse -DSIMD_FP32_4WAY -funroll-loops -O3 -DNDEBUG

And here's the code, along with its build script.

It's worth noting that Debian's gcc 4.4.5 has zilch support for ppc750's paired-singles SIMD when it comes to intrinsic vector types, (or at least I failed to trigger it), so Broadway's code is 100% scalar. Which is absolutely not the case with Ontario's code, even though it was built for SSE3. Apropos, Ontario does not feature AVX, which is something to be found across the newer APUs.

Test results (best times from each platform, across a few runs):

Broadway: 16.3517 s
Ontario: 4.53022 s

Clock-normalized results:
$ echo "scale=4; 16.3517 / 4.53022 / (1333 / 729)" | bc
1.9739

So, basically, well SIMD-fied SSE3 code on the Ontario is ~ 2x Broadway's scalar code, per-clock.

Feel free to run that on the actual U-CPU ;p
 

Thraktor

Member
Anyhow, a small detour for this thread: a prima-vista per-clock comparison between Broadway (aka Gekko@729MHz) and my netbook's Ontario APU (Bobcat @1.333GHz with TurboBoost).

Basically, the test builds the same C++ matmul routine, trying to use as much as possible of the available CPU vector facilities, then times a large number of iterations over the same arguments (so with some luck all data are sitting in L1, but without explicitly assigning any thread affinities), while fooling the compiler into thinking said arguments change at each iteration.

Toolchains:

Broadway: g++ (Debian 4.4.5-8) 4.4.5
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -mpowerpc -mcpu=750 -mpaired -funroll-loops -O3 -DNDEBUG

Ontario: g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
optimisation options: -fno-rtti -ffast-math -fstrict-aliasing -msse3 -mfpmath=sse -DSIMD_FP32_4WAY -funroll-loops -O3 -DNDEBUG

And here's the code, along with its build script.

It's worth noting that Debian's gcc 4.4.5 has zilch support for ppc750's paired-singles SIMD when it comes to intrinsic vector types, (or at least I failed to trigger it), so Broadway's code is 100% scalar. Which is absolutely not the case with Ontario's code, even though it was built for SSE3. Apropos, Ontario does not feature AVX, which is something to be found across the newer APUs.

Test results (best times from each platform, across a few runs):

Broadway: 16.3517 s
Ontario: 4.53022 s

Clock-normalized results:
$ echo "scale=4; 16.3517 / 4.53022 / (1333 / 729)" | bc
1.9739

So, basically, well SIMD-fied SSE3 code on the Ontario is ~ 2x Broadway's scalar code, per-clock.

Feel free to run that on the actual U-CPU ;p

So, you've demonstrated that a CPU with a SIMD unit runs SIMD code more efficiently than a CPU without a SIMD unit? Am I missing something here?

Edit: It's hardly surprising that you can't get GCC to compile for paired-singles, isn't that a data-type unique to Gekko/Broadway?
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
So, you've demonstrated that a CPU with a SIMD unit runs SIMD code more efficiently than a CPU without a SIMD unit? Am I missing something here?
Well, as far as I can tell I've demonstrated that a CPU with a 4-way SIMD unit runs some ultra-friendly SIMD code 2x faster than a CPU which runs the same code in scalar form. From where I sit that does not speak badly of the latter CPU, given it has 2-way SIMD which remained unused during the test.

edit: the paired-singles type is found in the ppc750CL, which is an off-the-shelf IBM CPU.
 

Thraktor

Member
Well, as far as I can tell I've demonstrated that a CPU with a 4-way SIMD unit runs some ultra-friendly SIMD code 2x faster than a CPU which runs the same code in scalar form. From where I sit that does not speak badly of the latter CPU, given it has 2-way SIMD which remained unused during the test.

Ah, that's a fair point. So given that you're multiplying a load of 4x4 matrices, one would expect a ~4x improvement with a 4-way SIMD unit over an FPU operating purely on scalars, correct? And here we're only seeing ~2x. Can you run the code on the Ontario without the DSIMD_FP32_4WAY flag set for further comparison?

edit: the paired-singles type is found in the ppc750CL, which is an off-the-shelf IBM CPU.

Thanks, hadn't realised that.

Edit:

Answering my own question.

Yes, the bus is bidirectional, so the theoretical max sustained read bandwidth for the Wii U would be the same as the total theoretical bandwidth. And higher than the Xbox 360's read bandwidth if it does turn out to be ~12GB/s.

So one of the secrets to getting the maximum bandwidth out of the Wii U GPU would be to avoid write operations to the main memory as much as possible. On the 360, render to texture requires rendering to the 10MB eDRAM, and then transferring that texture to the shared RAM. On the Wii U you'd want to keep everything on-package within the eDRAM to avoid having to use some of your bandwidth for writing.

You'll also optimize for maximum bandwidth utilization the opposite way from the 360. On the 360, any time you're writing but not reading, or reading but not writing you're wasting potential bandwidth. On the Wii U you'll want to do the opposite. Try not to do any writing while reading, and group your writes together in bursts to avoid the penalty from switching the bus direction.

I just re-read this, and apologies in advance if my general confusion with how memory busses work (or how game engines work, for that matter) produces gibberish in the following sentences, but do I understand this right?

- The XBox360's GDDR3 bus gives 11.2GB/s read bandwidth and 11.2GB/s write bandwidth at all times.
- The Wii U's DDR3 bus gives 12.8GB/s read bandwidth or 12.8GB/s write bandwidth at any given time (switching between the two).
- The bulk of high-bandwidth writes on both platforms (especially Wii U) will be targeted towards the eDRAM (framebuffer, render targets, etc.)
- The GDDR3/DDR3 pools are mostly the subject of read operations (textures, geometry, etc.)
- Therefore, even mildly optimised Wii U ports of 360 code should be able to get the same effective read performance from the DDR3 so long as they just shift writes to the eDRAM wherever possible.

Then comes this (regarding the poor framerates during alpha-heavy scenes):

Not likely. Since most console games are using lower res alpha, they would have to completely botch access to the depth buffer, like DX9, and then some (how RTs are handled). :p Shouldn't be an issue for a console.

It's a fairly straightforward operation. Draw scene, downsize depth, draw particles to the off-screen buffer, compare, upscale & merge back into the scene.

Otherwise, the effect of resolution is pretty much linear in a high bandwidth scenario (fillrate, pixel shading costs are next), but you're reducing bandwidth consumption anyway when reducing particle res.

Am I right in assuming (again excuse me for my lack of understanding of the finer points of these graphical techniques) that on the XBox360 these alphas are being drawn to an off-screen buffer on the GDDR3, and then read in as a texture onto the GPU, through the texture units? Then, for the Wii U port, they draw this off-screen buffer to the eDRAM, as there's space there now, but have to write it over to the DDR3 and read it back in as a texture because (and this is just my assumption) the texture units can only read from the DDR3, not the eDRAM. Because the Wii U bus is bidirectional, doing a load of these write/read combos per-frame would force the bus to keep switching direction, incurring switching penalties that didn't exist on the XBox360. These switching penalties push the read throughput on the bus down below XBox360 levels, and hence you start to get frame drops where you didn't on the XBox360.
 

Thraktor

Member
A matrix multiply isn't actually the best case scenario for SIMD because you're multiplying rows by columns to get your final result. And only one of those will be aligned with a proper SIMD 4-vector, with the other non-contiguous in memory. You have to burn cycles to move things around a bit before you can use SIMD.

Or to put it another way, idealized code for calculating each value in the result matrix would look something like this:

load column, simd_register0
load row, simd_register1
dotproduct simd_register0, simd_register1

...but because, if your columns are contiguous, your rows aren't, you end up with something more like this:

load column, simd_register0
load row.x, simd_register1.x
load row.y, simd_register1.y
load row.z, simd_register1.z
load row.w, simd_register1.w
dotproduct simd_register0, simd_register1

(above is pseudo-code and not meant to correspond to any particular SIMD implementation)

The non-SIMD version would be something like 8 loads and 4 multiply-adds, meaning the SIMD version is only half the number of instructions instead of the hoped-for quarter.

Ah yes, I'd forgotten that SIMD units don't have nice convenient 4x4 block registers to operate on rows and columns from. Would be useful, though; maybe I should get onto IBM...

More seriously, did you read my edit to the post above? Is my understanding of what you said accurate?
 

Thraktor

Member
So where would we put the performance of the CPU? Around the AMD Athlon II X3 400e?

It's too hard to tell. We don't know the clock speeds, we don't know what customisations (if any) there have been over Broadway, we don't know what SIMD functionality there is (if any§), we don't know how the interconnect operates, we don't know what's up with the mystery core, we don't even know if there is anything up with the mystery core.

I would say that it's very likely that, at the same clock speed, the mystery core would perform a large number of correlated A* searches over a large graph more quickly than a core of the AMD Athlon II X3 400e, but even that's a guess based entirely on the one thing we do know (the cache).

§ Excluding paired singles, of course.
 

AzaK

Member
This is pretty correct. Except that the 360 will need to write render targets from the eDRAM to the GDDR3 before they can be used by the GPU. So on the 360 that will add some high-bandwidth writes and reads.

The ROPs are integrated into the eDRAM on the 360. So can only draw to the eDRAM. Additionally, the 360 GPU only has read access to the GDDR3, it cannot read from the eDRAM. Offscreen surfaces must be drawn into the eDRAM, copied to the GDDR3 and then read in as a texture by the GPU.

I suspect the eDRAM on the Wii U is general purpose and the Wii U GPU has read/write access. So for the Wii U port everything could be kept on eDRAM. The situation is the reverse of what you're saying (probably).

So if they just use the same techniques as on the 360 (draw to eDRAM -> copy to GDDR3 -> draw from GDDR3), then that could seriously hurt the Wii U, correct? Not because it is actually having to move data per se, but because doing so while some other core (software audio, disk streaming) is reading from main RAM will kill the bus.
 

Thraktor

Member
This is pretty correct. Except that the 360 will need to write render targets from the eDRAM to the GDDR3 before they can be used by the GPU. So on the 360 that will add some high-bandwidth writes and reads.

The ROPs are integrated into the eDRAM on the 360. So can only draw to the eDRAM. Additionally, the 360 GPU only has read access to the GDDR3, it cannot read from the eDRAM. Offscreen surfaces must be drawn into the eDRAM, copied to the GDDR3 and then read in as a texture by the GPU.

I suspect the eDRAM on the Wii U is general purpose and the Wii U GPU has read/write access. So for the Wii U port everything could be kept on eDRAM. The situation is the reverse of what you're saying.

Yes, I was assuming that the XBox360 GPU couldn't read from the eDRAM. Drawing to the eDRAM and then writing it out to the GDDR3 is basically what I was assuming, but my point on the XBox360 is that this can be done on a relatively unsaturated write section of the bus without having to interrupt the read section of the bus, unlike if you do the same thing on the Wii U.

Do all reads on a GPU (or the R700 line in particular) have to come in through the texture units, or can the SIMD units read directly? Similarly with writes, do they all have to run through the ROPS? Because my assumption for the Wii U's GPU was basically:

Texture units: read access to DDR3
SIMD units: direct read/write access to both eDRAM and DDR3
ROPs: write access to eDRAM (and possibly DDR3)

I assumed that the texture units could only access the DDR3 because, well, the textures are almost always going to be there, and a load of extra wiring would be required to read textures from eDRAM. Of course if the texture units are the only way to read into the GPU, then my logic was flawed.

My assumption was then that the alpha code in all these engines is focussed on treating it as a texture, because that was the only way you could deal with it on PS360. And, because it works on the Wii U (if not well), they didn't see the point/have the resources to completely rewrite it in a paradigm more appropriate to the Wii U (i.e. the SIMD units directly operating on the buffers in eDRAM). So, they used the read-it-in-as-a-texture method, which causes the DDR3 bus switching penalties, which causes the DDR3 read throughput to drop, which causes the dropped frames.

However, if all reads have to come in through the texture units, then the texture units must be able to directly access the eDRAM, so they're just reading it in as a texture directly from the eDRAM, and the DDR3 bus is unaffected. Or, more succinctly, I was wrong.

What game is supposedly having a problem with transparencies?

Durante was claiming it was common across a few games. Ninja Gaiden was one, I think.


Edit: Actually, through this I was assuming that the DDR3 was performing largely continuous reads, but I suppose it wouldn't even matter, as a write/read combo would cause one bus switch penalty regardless of what the bus is otherwise doing, unless you managed to time it exactly in the middle of a write -> read switch that was occurring anyway.
 

Thraktor

Member
If the TMUs could not read from the eDRAM, then it would be difficult to emulate the 1 MB texture cache of Flipper/Hollywood, would it not?

R700 has its own texture cache architecture (see here [pdf], pages 34 on, particularly the diagram on page 40). I'm not sure of the sizes involved, but would a 1MB L2 cache not do the job?

On the other hand, that memory hierarchy may have been thrown out altogether during Nintendo and AMD's customisations of the chip.

Edit: I have a feeling that pdf will give me a sense of what I need to know, but I'm far too tired to make any sense of it at this hour. I'll have another crack tomorrow.
 

Fafalada

Fafracer forever
Thraktor said:
Ah yes, I'd forgotten that SIMD units don't have nice convenient 4x4 block registers to operate on rows and columns from. Would be useful, though; maybe I should get onto IBM...
Some do - PSP SIMD had a multi-directional register-stack that you could access in any order and orientation (scalar, vector or matrix). Dreamcast supported matrix-view of register stack as well, but IIRC single-direction only, and it only had half the registers of modern SIMDs.
But IBM's policy with SIMD has been for decades that "less is more" - they favor over-simplification over flexibility of data manipulation.

People tend to over-estimate the importance of SIMD code in terms of CPU performance though - it's fun to have this flexibility but it still does less in general purpose code than you'd imagine.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
A matrix multiply isn't actually the best case scenario for SIMD because you're multiplying rows by columns to get your final result. And only one of those will be aligned with a proper SIMD 4-vector, with the other non-contiguous in memory. You have to burn cycles to move things around a bit before you can use SIMD.

Or to put it another way, idealized code for calculating each value in the result matrix would look something like this:

load column, simd_register0
load row, simd_register1
dotproduct simd_register0, simd_register1

...but because, if your columns are contiguous, your rows aren't, you end up with something more like this:

load column, simd_register0
load row.x, simd_register1.x
load row.y, simd_register1.y
load row.z, simd_register1.z
load row.w, simd_register1.w
dotproduct simd_register0, simd_register1

(above is pseudo-code and not meant to correspond to any particular SIMD implementation)

The non-SIMD version would be something like 8 loads and 4 multiply-adds, meaning the SIMD version is only half the number of instructions instead of the hoped-for quarter.

Did you actually peek at the code? There are no sparse reads, argument matrices are read linearly, and no dot products are used (not every SIMD ISA has horizontal ops, so I always avoid them if the algorithm allows). In pseudo code, the matmul routine does:

Code:
for i in (0, 1, 2, 3):
    dst_row_i = arg0_row_i.xxxx * arg1_row_0
    dst_row_i += arg0_row_i.yyyy * arg1_row_1
    dst_row_i += arg0_row_i.zzzz * arg1_row_2
    dst_row_i += arg0_row_i.wwww * arg1_row_3
end

If the above seems too abstract, here's the actual SSE3 code (including the outer repetition loop; the actual matmul is in the .LBB3_2 inner loop) as generated by clang++ 2.9. It's less optimal than g++'s version (which is the one I used for the timings) but is far more readable due to the missing loop unrolling (beats me why clang fails there):

Code:
        .align  16, 0x90
.LBB3_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_2 Depth 2
        leal    1(%rbx), %ecx
        shlq    $6, %rcx
        movl    %ebx, %edx
        shlq    $6, %rdx
        movaps  ma+48(%rcx), %xmm0
        movaps  ma+32(%rcx), %xmm1
        movaps  ma+16(%rcx), %xmm2
        movaps  ma(%rcx), %xmm3
        xorl    %ecx, %ecx
        .align  16, 0x90
.LBB3_2:                                #   Parent Loop BB3_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movss   ma+4(%rdx,%rcx), %xmm4
        pshufd  $0, %xmm4, %xmm4        # xmm4 = xmm4[0,0,0,0]
        mulps   %xmm2, %xmm4
        movss   ma(%rdx,%rcx), %xmm5
        pshufd  $0, %xmm5, %xmm5        # xmm5 = xmm5[0,0,0,0]
        mulps   %xmm3, %xmm5
        addps   %xmm4, %xmm5
        movss   ma+8(%rdx,%rcx), %xmm4
        pshufd  $0, %xmm4, %xmm4        # xmm4 = xmm4[0,0,0,0]
        mulps   %xmm1, %xmm4
        addps   %xmm5, %xmm4
        movss   ma+12(%rdx,%rcx), %xmm5
        pshufd  $0, %xmm5, %xmm5        # xmm5 = xmm5[0,0,0,0]
        mulps   %xmm0, %xmm5
        addps   %xmm4, %xmm5
        movaps  %xmm5, -128(%rbp,%rcx)
        addq    $16, %rcx
        cmpq    $64, %rcx
        jne     .LBB3_2
# BB#3:                                 # %_ZN5matx43mulERKS_S1_.exit
                                        #   in Loop: Header=BB3_1 Depth=1
        movaps  -128(%rbp), %xmm0
        movaps  -112(%rbp), %xmm1
        movaps  -96(%rbp), %xmm2
        movaps  -80(%rbp), %xmm3
        movaps  %xmm3, ra+48(%rdx)
        movaps  %xmm2, ra+32(%rdx)
        movaps  %xmm1, ra+16(%rdx)
        movaps  %xmm0, ra(%rdx)
        addl    %ebx, %ebx
        decl    %eax
        jne     .LBB3_1

If you're worried by the swizzle op employed above (pshufd) - it is a low-latency op that is very easily hideable in a loop unroll. But the main sought effect of the loop unrolling would be the hiding of the load & multiplication latencies, which are the major source of latency in the code. Bottom line being, a proper matmul is an exemplary use case for SIMD.
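For anyone who'd rather read intrinsics than asm, here's a rough, self-contained sketch of that row-broadcast 4x4 matmul in SSE (my own illustration, not the actual test code, which isn't reproduced here):

Code:
// Self-contained sketch of the row-broadcast 4x4 matmul in SSE intrinsics.
// Row-major, 16-byte aligned matrices; diagonal test values in main().
#include <xmmintrin.h>
#include <cstdio>

struct alignas(16) mat4 { float m[4][4]; };

static void matmul(mat4& dst, const mat4& a, const mat4& b)
{
    const __m128 b0 = _mm_load_ps(b.m[0]);
    const __m128 b1 = _mm_load_ps(b.m[1]);
    const __m128 b2 = _mm_load_ps(b.m[2]);
    const __m128 b3 = _mm_load_ps(b.m[3]);

    for (int i = 0; i < 4; ++i) {
        // dst_row_i = a[i].xxxx*b0 + a[i].yyyy*b1 + a[i].zzzz*b2 + a[i].wwww*b3
        __m128 r =        _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_store_ps(dst.m[i], r);
    }
}

int main()
{
    mat4 a = {}, b = {}, c = {};
    for (int i = 0; i < 4; ++i) { a.m[i][i] = 2.0f; b.m[i][i] = 3.0f; }
    matmul(c, a, b);
    std::printf("c[0][0] = %f\n", c.m[0][0]);   // expect 6.000000
    return 0;
}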

But to answer Thraktor's perfectly valid question about how Ontario performs without the SIMD intrinsics - it finishes the test in 8.89305 s. IOW, the SIMD-fication of the code gives it a 1.963x boost. And for an extra point of reference, here's how a Cortex-A8 @800MHz performs at the same test:

A8@800MHz with 4-way SIMD: 19.1426 s
A8@800MHz scalar: 88.704 s

The huge gain seen in the SIMD case is largely due to the inadequacy of the A8's scalar FPU (aka VFPv3-lite).

People tend to over-estimate the importance of SIMD code in terms of CPU performance though - it's fun to have this flexibility but it still does less in general purpose code than you'd imagine.
Word.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
I confess I didn't peek at the actual code. It's been way long since I've had the time to get really low level on anything and my rust is really showing. I should probably make some time for myself to optimize an intersection test or something as a refresher.

Sorry for misleading anyone reading the thread with my incorrect example and babble.
Oh come on now, everybody makes the odd dubious assumption. You've been a great contributor to this thread, the subject of which is complex enough that we can expect to see tons of wrong assumptions popping up in the future (not counting any latent current ones ; )
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
No! I must be punished! I'm going to go through every post in the "awesome" fan art thread as penance.
Ok, that does sound a bit scary. Perhaps such a severe punishment is a bit..

FOR PETE'S, WHAT DID I CLICK!?

If I don't survive, tell my wife "hello".

EDIT: AAAAAIIIIEEEEEEEEHHAAHHHAHHHHAAAAAAAIIEEEEEEEEEEEE
Quick, somebody cut off his internet connection!
 

M3d10n

Member
There's something I just remembered which might cause an impact on performance for Wii U games if not dealt with properly. DirectX 10 level GPUs introduced "state objects", which allow groups of render states to be treated as GPU resources, like textures and vertex buffers. The idea was to reduce the amount of data being passed around when setting up different render states and increase opportunities for pipelining the GPU.

When using DirectX 10/11 on a PC, you are forced to use state objects: there is no immediate state switching. You're also advised to cache those state objects and avoid re-creating them every frame. The same goes for shader constants. Not using state and constant buffers properly can cause significant performance overhead when migrating DX9 code to DX10/11.
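The caching pattern itself is simple enough; a minimal sketch, with RasterDesc/RasterState standing in for the real D3D descriptions and device-created objects:

Code:
// "Create once, cache, reuse" for render state. RasterDesc/RasterState are
// stand-ins for the real D3D descriptions and device-created objects; the
// keyed cache is the part that matters.
#include <cstddef>
#include <memory>
#include <unordered_map>

struct RasterDesc {
    bool wireframe = false;
    bool cull_back = true;
    bool operator==(const RasterDesc& o) const {
        return wireframe == o.wireframe && cull_back == o.cull_back;
    }
};

struct RasterDescHash {
    std::size_t operator()(const RasterDesc& d) const {
        return (std::size_t(d.wireframe) << 1) | std::size_t(d.cull_back);
    }
};

struct RasterState { RasterDesc desc; };   // the expensive-to-create GPU object

class StateCache {
    std::unordered_map<RasterDesc, std::shared_ptr<RasterState>, RasterDescHash> cache_;
public:
    std::shared_ptr<RasterState> get(const RasterDesc& d) {
        auto it = cache_.find(d);
        if (it != cache_.end()) return it->second;               // reuse
        auto s = std::make_shared<RasterState>(RasterState{d});  // "create" once
        cache_.emplace(d, s);
        return s;
    }
};

int main()
{
    StateCache cache;
    RasterDesc solid;
    return cache.get(solid) == cache.get(solid) ? 0 : 1;  // same object both times
}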

Since the 360 and PS3 are DX9 parts (with a few DX10-style modifications like the lack of fixed function), I wonder how much of an impact this has on the launch 3rd party ports.
 
No! I must be punished! I'm going to go through every post in the "awesome" fan art thread as penance.

If I don't survive, tell my wife "hello".

EDIT: AAAAAIIIIEEEEEEEEHHAAHHHAHHHHAAAAAAAIIEEEEEEEEEEEE

AH GEYAD! AHHHRRRRAAAAGGGG! AAAAAAGGGGGNNNNNAAAAAABBBBBAAAAA!!! Mis ojos! MIS OJOS!!!!
 

I think this article may be wrong. It doesn't take into account that more MHz means faster cycles; it just assumes the same cycle time. And it gives more importance to cycle counts than to time in ns.

Just as it's true that more MHz means more wasted cycles, more MHz also means more cycles.

Take this graph as an example:

[Image: ScienceMark benchmark results]
 

Fafalada

Fafracer forever
blu said:
But to answer Thraktor's perfectly valid question about how Ontario performs without the SIMD intrinsics - it finishes the test in 8.89305 s. IOW, the SIMD-fication of the code gives it a 1.963x boost.
Out of curiosity - did you happen to look at the ASM output for Broadway? I am curious if it even gets compiled to use fused MADDs - which would impact results too - I remember gcc in the past refused to use those on RISC platforms, but that was many years ago, so ;)

Slightly OT - but I'd love to see how the test fares if you iterate the test on matrix arrays instead (forcing it to hit memory). Not so much WiiU related, since memory latencies shouldn't be as impressive as they were on GC/Wii.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Out of curiosity - did you happen to look at the ASM output for Broadway?
It wouldn't even occur to me to post performance measurements without looking at the generated code first ; )

I am curious if it even gets compiled to use fused MADDs - which would impact results too - I remember gcc in the past refused to use those on RISC platforms, but that was many years ago, so ;)
Yes, both the ppc750 and the cortex-a8 (in the SIMD-fied case) targets end up using MADDs, exactly where you'd expect the algorithm to use them - for the "x += y * z" expression. The SSE3 code doesn't, for the apparent reason that the ISA lacks a MADD op. Actually, this test made me look up some trivia on Bobcat - it has two FPU pipelines, an ADD one and a MUL one, both capable of handling 2x ops/clock (which explains the 2x boost from the SIMD-fication), but that would make an FMADD pretty much impossible for it to carry out. Perhaps IBM did something right back in the day ; )

Slightly OT - but I'd love to see how the test fares if you iterate the test on matrix arrays instead (forcing it to hit memory). Not so much WiiU related, since memory latencies shouldn't be as impressive as they were on GC/Wii.
Well, I'd expect the timings to suffer precisely by the nominal RAM latency, as the test does the reading (and writing, for that matter) absolutely linearly, so without any prefetches you'd get a cache miss at the start of each new cache line while crawling up the arrays.

I think this article may be wrong. It doesn't take into account that more MHz means faster cycles; it just assumes the same cycle time. And it gives more importance to cycle counts than to time in ns.

Just as it's true that more MHz means more wasted cycles, more MHz also means more cycles.
You may want to re-read the article.
 
You may want to re-read the article again.

He just inflated cycles to make his math fit.

DDR3-1600 CL11 is below low end. The cheapest DDR3 you can buy right now is 1600 CL9.

Maths:

Non-sequential read latency in nanoseconds = 2000 × (CL / speed in MT/s).

DDR3-1600 CL11 = 13.75 ns, OK.

BUT

DDR3-1600 CL9 = 11.25 ns.

DDR3-2400 CL11 (not 17!) = 9.17 ns.
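Spelled out in code, same formula and same parts as above:

Code:
// CAS delay in ns = 2000 * CL / data rate (the memory clock is half the
// transfer rate). Same parts as listed above.
#include <cstdio>

int main()
{
    struct Part { const char* name; double rate_mts; int cl; };
    const Part parts[] = {
        {"DDR3-1600 CL11", 1600.0, 11},
        {"DDR3-1600 CL9 ", 1600.0,  9},
        {"DDR3-2400 CL11", 2400.0, 11},
    };
    for (const Part& p : parts)
        std::printf("%s -> %.2f ns\n", p.name, 2000.0 * p.cl / p.rate_mts);
    return 0;
}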

Of course, this is just part of the whole story concerning consecutive reads, writes, etc.
 

Thraktor

Member
No! I must be punished! I'm going to go through every post in the "awesome" fan art thread as penance.

If I don't survive, tell my wife "hello".

EDIT: AAAAAIIIIEEEEEEEEHHAAHHHAHHHHAAAAAAAIIEEEEEEEEEEEE

Well this thread has just taken a bizarre change in direction.

Anyway... after reading through the pdf I linked to earlier, as well as some other stuff on the R700 architecture and subsequent AMD architectures including GCN (and importantly getting some sleep), I think my brain has wrapped its way around the layout of these GPUs a bit better. Here's a block diagram of the RV770 for reference (the Wii U's GPU should be based on something like a scaled-down version of this, possibly heavily customised):

[Image: RV770 block diagram]


[Unfortunately I couldn't find a higher-res version, so it's a bit difficult to read, but for reference the blue things are texture units, the four units down at the bottom are ROPs (interfacing with the MCs, or memory controllers) and the little pale yellow things between the texture units and SIMD cores are "local data shares"]

Now, to answer my previous question, yes, all data coming from memory into the SIMD units seems to go through the texture units, at least in the R700 line (and, it appears, in subsequent AMD GPU lines, although my understanding of GCN is more hazy). I'm not exactly sure how vertex data gets in there, as according to the block diagram it seems to just exist on an infinite loop of SIMD -> Shader Export -> Vertex Assembler -> SIMD, etc., but there's probably something obvious I'm not seeing.

The SIMD units basically have a four level memory hierarchy. They have:
- register memory (directly alongside the SIMD unit)
- local data shares (shared between each SIMD "array", and accessed without the texture units)
- global data share (shared between all SIMD arrays, and accessed through the texture units)
- main memory (also accessed through the texture units, and through the L1 and L2 texture caches)

Now, the simplest explanation of how the GPU works in the Wii U is that instead of those memory controllers at the bottom there's just one memory controller, and it seamlessly controls access to the eDRAM and DDR3, to the point where as far as the CPU and GPU are concerned, they're just different locations in a memory map. Therefore, texture units and ROPs access memory through this memory controller, and can hence access both eDRAM and DDR3.

As this is the simplest explanation, it's probably the most likely one, but one thing does strike me as an issue. If you have the texture units reading seamlessly from both eDRAM and DDR3, then you have the texture caches seamlessly caching data from both eDRAM and DDR3. This means that a whole load of transistors are going to be used up caching data that is literally right beside it on the die, which strikes me as incredibly inefficient. To prevent this, you need two different data paths coming into the SIMD units; one from the DDR3 that goes through the caches and one from the eDRAM that doesn't. Now, what if you decide that the latter might as well bypass the texture units as well? Given that you're not going to be storing actual textures on the eDRAM (as opposed to buffer-objects-as-textures), it would free up the texture units and also reduce the latency between the SIMD units and eDRAM.

If you'll excuse the amateur 'shop, I've modified the above diagram to illustrate what I think might be going on with memory access in the Wii U GPU:

[Image: modified RV770 block diagram illustrating the proposed separate eDRAM load/store paths in the Wii U GPU]


(Ignore the number of cores and so forth)

Basically, my logic would be that each SIMD array has a texture unit (which reads from the DDR3 through the texture caches) and a load/store unit, which reads and writes directly through the eDRAM memory controller (illustrated by the little red rectangles I've added in). In your GPU code you then have two ways of bringing in data. You can either use a texture operation, which gets the texture unit to pull in the texture from the DDR3 and give you the texels you need into your register memory, or you have a load operation, which gets the load/store unit to pull the raw data in from eDRAM and into your register memory.

When porting transparency code from XBox360, then, you have code which is entirely based around texture operations, and this code works fine (for the most part) when you use it as-is on the Wii U. Just dump the buffer to DDR3 and read it in as a texture, exactly as on XBox360. It doesn't work perfectly, as alpha-heavy scenes cause frame drops as the DDR3 bus clogs up, but you're a small team doing a quick and cheap port of a launch game to new hardware, so fixing it (which would involve a large rewrite of the code involved) is fairly far down your list of priorities.

I also think that giving the SIMD arrays LSUs with direct access to the eDRAM would make sense from Nintendo's perspective, as they've evidently put a large emphasis on GPGPU functionality. The extremely low latency involved could potentially be very beneficial to certain compute workloads, such as the UE4 SVOGI example I gave previously.

There are also arguments as to why the memory controllers would be separate in the first place. Any unified memory controller would have to be a large and complex device, and dealing with the vast amount of memory operations in every direction would certainly increase the latency on both eDRAM and DDR3, something Nintendo would obviously want to avoid (and it seems they have avoided, going by dev comments). Furthermore, the eDRAM and DDR3 are likely operating at different frequencies (going by Matt's comments I'm guessing ~575MHz and 800MHz respectively), which would further complicate (and add latency to) any unified memory controller.

Of course, I'm working on broad outlines of the R700 architecture here, and the block diagram is only an abstraction which I might be inferring too much from. For all I know the actual underlying circuitry makes what I'm talking about impossible. This also seems to be very different to how AMD have done things with any of their architectures, but I suppose those are all focussed around one big pool of memory anyway. That all said, Nintendo and AMD have had 2+ years of R&D work to chop up the R700, so maybe something this radical is feasible.


He just inflated cycles to make his math fit.

DDR3-1600 CL11 is below low end. The cheapest DDR3 you can buy right now is 1600 CL9.

Maths:

Non-sequential read latency in nanoseconds = 2000 × (CL / speed in MT/s).

DDR3-1600 CL11 = 13.75 ns, OK.

BUT

DDR3-1600 CL9 = 11.25 ns.

DDR3-2400 CL11 (not 17!) = 9.17 ns.

Of course, this is just part of the whole story concerning consecutive reads, writes, etc.

He's talking about DDR3 chips, you're talking about DDR3 DIMMs, which are two different things. If you look at Samsung (or Micron or Hynix)'s database of DDR3 chips, you'll find a very standard set of rated CAS timings for each data rate. Off the top of my head, I think it's

1600MT/s - CL11
2133MT/s - CL14
2400MT/s - CL17
etc.

Now, on a DIMM you can adjust these timings by overvolting/overclocking the chips, which is what's happening in those 1600MT/s CL9 DIMMs you're looking at. This might benefit performance (it doesn't in a PC environment, it's just a selling point), but it does decrease the life of the chips, which is something you really want to avoid in an embedded setting (again, what the article's about).
 

Thraktor

Member
Please forgive me if I'm asking this in the wrong thread (maybe we need to start a new one?).

Can we discuss some of the technical problems with the system that many are experiencing?

-freeze during load
-freeze when changing controller order
-no disk read after swapping disks

Has there been a sensible explanation as to why so many people are having these problems? Mind you, speak "dumb" to me. All this impressive jargon exchanged in these conversations, while interesting, is flying so far over my head ;)

It's a platform launch, so there's an above-average amount of buggy/poorly optimised code. It doesn't really have anything to do with hardware, and Nintendo will (hopefully) deal with it with system updates over the next few months or so.

Edit: Re-reading over my above post, I've realised there was something I failed to take into account: if the eDRAM is taking the place of the 1T-SRAM for BC purposes, then the GPU has to be able to read textures from the eDRAM. In theory I suppose the texture units in the R700 might be so different to Flipper/Hollywood's units that bypassing them would still be a sensible choice, and they're just emulating Hollywood's texture functionality on the SIMD units instead, but it does strike a blow to my little theory.
 

Nice analysis, Thraktor! And you already caught the one thing I was going to point out. That 1 MB of texture cache on Flipper/Hollywood needs to be accounted for somehow. It's highly unlikely that they've added extra SRAM onto their GPU.

My understanding of these things is pretty rudimentary, but here's what I've had in my head: Basically, they would scrap the L2 cache and instead, put the eDRAM on the other side of that crossbar. In order to get bandwidth comparable to the SRAM in a stock R700, it would have to be a 4096-bit bus, which we know is possible, thanks to wsippel's breakdown of the UX8GD macros. Now, make that eDRAM L2 fully read/write as in a Southern Islands card. For Wii BC mode, I'm thinking there would be a way for the texture units to bypass L1 and access the eDRAM directly.

Your idea of an LSU where the SIMD cores could access the eDRAM directly is also quite intriguing. I often wonder what, if any, changes were made to the R700 architecture beyond the on-chip eDRAM for Iwata to warrant calling it a GPGPU. This would seemingly fit that bill.

Also, a couple of interesting posts on Beyond3D:

http://forum.beyond3d.com/showpost.php?p=1682228&postcount=3523
http://forum.beyond3d.com/showpost.php?p=1682422&postcount=3552

This poster seems to be implying that there is no retail card comparable to the Wii U GPU and that the ratio of components would also be different than in retail configs. Looking at how the R700 architecture scales, we have seen a 320:32:8 configuration. Why not take that lower-end setup to its logical extension at 400:40:8? The R700 ROPs are apparently twice as efficient as previous gens'. The extra shaders over the 360 would also account for rendering to a second screen and leave enough headroom for some GPGPU stuff. It also seems to fit with the die sizes we're looking at.
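Back-of-the-envelope, nothing confirmed: a 400-shader VLIW5 part at the ~575MHz clock being guessed at in this thread would come out around 460 GFLOPs, using the usual ALUs x 2 FLOPs (MADD) x clock rule of thumb.

Code:
def gflops(shader_alus, clock_mhz):
    # 2 FLOPs per ALU per cycle assumes a multiply-add issued every cycle
    return shader_alus * 2 * clock_mhz / 1000.0

print(gflops(320, 575))  # 368.0 (320:32:8 config at the same clock, for comparison)
print(gflops(400, 575))  # 460.0 (the speculative 400:40:8 config)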

Just food for thought. I'm sure we'll get some more info tomorrow that disproves everything I just wrote. haha
 

Thraktor

Member

The thing about a cache (and I'm assuming here that both "texture caches" are true caches, and not just local data stores) is that it automatically populates itself with copies of what's in memory, without the coder having to do anything. There's a whole load of underlying logic in caches which dynamically figures out what's going to be needed when (with surprising accuracy, for CPU caches at least). So, what a texture cache is doing is basically saying to itself "this section of this texture seems to be in demand a lot, so I'll hold it here in the cache", and then the texture units can access it at very low latency and very high bandwidth, rather than the high latency and low bandwidth of off-chip RAM.

As such, if in Wii mode the SIMD units were just reading all the texture data off the eDRAM, you wouldn't actually need the 1MB cache at all, because the eDRAM itself has such low latency and high bandwidth that it's equivalent to every texture being cached all the time. The benefits of caching go to zero as your memory hits the same performance as your cache.

Furthermore, you can't really replace the L2 texture cache with the eDRAM, because then you have to manually populate it, which would be a massive pain, and would take up eDRAM space that could be occupied with other, more important stuff. The L2 itself serves an important role in dynamically keeping hold of textures as and when they're needed, and not having that would likely leave the texture units rather starved of data, as they have to wait for it over the high latency, low bandwidth DDR3 bus.
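If it helps, here's a toy model of that self-populating behaviour (a deliberately simplified LRU cache, nothing to do with the actual R700 cache logic):

Code:
from collections import OrderedDict

class ToyTextureCache:
    # Simplified LRU cache: fills itself on misses, evicts the least recently used line.
    def __init__(self, capacity_lines, backing_memory):
        self.capacity = capacity_lines
        self.memory = backing_memory      # dict of address -> data, standing in for DDR3
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def read(self, address):
        if address in self.lines:         # hit: fast on-chip path
            self.lines.move_to_end(address)
            self.hits += 1
        else:                             # miss: fetch from "DDR3" and keep a copy
            self.misses += 1
            self.lines[address] = self.memory[address]
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)
        return self.lines[address]

# Sampling the same small region over and over mostly hits once the cache warms up.
ddr3 = {addr: f"texel_{addr}" for addr in range(1024)}
cache = ToyTextureCache(capacity_lines=64, backing_memory=ddr3)
for _ in range(10):
    for addr in range(32):
        cache.read(addr)
print(cache.hits, cache.misses)  # 288 32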

Yeah, the core configuration could be anything, it's really hard to say. At a guess we'd be looking at a relatively ROP-heavy design, due to the fact you're rendering to two screens. In the tiny off-chance that my crazy theory is actually right, then we might see a relatively low number of texture units, as they can be bypassed when necessary.

Your comment about eDRAM macros actually just reminded me about a couple of calculations I'd intended to do. I wrote a post with some estimated ranges of bandwidth for the eDRAM a few days ago, and now that Matt has clarified his previous comment about the GPU clock, I feel I should redo them for some more precise results.

Assuming a clock speed of 575MHz, there are three possible scenarios for the eDRAM bandwidth:

4x 64Mb macros with 256bit interfaces -- total 1024bit interface for 32MB -- 71.9GB/s
32x 8Mb macros with 128bit interfaces -- total 4096bit interface for 32MB -- 287.5GB/s
32x 8Mb macros with 256bit interfaces -- total 8192bit interface for 32MB -- 575GB/s

Barring small variations from a slightly different clock speed, these are the only three possible bandwidth outcomes for the eDRAM. It's around 72GB/s, or it's around 287GB/s or it's around 575GB/s (okay, maybe not the last one :p).
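For anyone who wants to check my working, each figure is just total interface width in bytes x clock, using the same MB-to-GB conversion (divide by 1024) throughout:

Code:
def edram_bandwidth_gbs(macros, bits_per_macro, clock_mhz=575):
    bytes_per_cycle = macros * bits_per_macro / 8   # total interface width in bytes
    mb_per_s = bytes_per_cycle * clock_mhz          # MHz x bytes/cycle -> MB/s
    return mb_per_s / 1024                          # -> GB/s

print(edram_bandwidth_gbs(4, 256))    # ~71.9 (1024-bit total)
print(edram_bandwidth_gbs(32, 128))   # 287.5 (4096-bit total)
print(edram_bandwidth_gbs(32, 256))   # 575.0 (8192-bit total)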
 

Jesu Cristos, even THAT would be a bit excessive...eh, maybe not since THAT'S the frame buffer and needs to swap out anywhere between 512MB and a GB of data 30-60 times PER SECOND.
 

pottuvoi

Banned
As such, if in Wii mode the SIMD units were just reading all the texture data off the eDRAM, you wouldn't actually need the 1MB cache at all, because the eDRAM itself has such low latency and high bandwidth that it's equivalent to every texture being cached all the time. The benefits of caching go to zero as your memory hits the same performance as your cache.
Problem with that approach is that latency would be higher, as would power usage, since data would have to be fetched from further away and from a bigger memory pool. (Future GPUs will most likely use 1KB L0 caches for this reason.)
Using eDRAM saves a lot in terms of power when compared to DDR3.
 