Iwilliams3 approached me with a question whose answer I think is worth sharing in this thread:
Where do those 192 threads (or ALUs) fit in in what we currently know about Latte?
An ATI SIMD unit from the VLIW era can handle as many threads per clock as it has VLIW units. For instance, an 80-SP unit from the R700 series consists of 16 VLIW5 units, and so can do 16 threads/clock (16 x VLIW5 = 80 SPs). Let's call this number of threads per clock Tc. Now, a wavefront is a group (or, in ATI terms, a 'vector') of 4 x Tc threads, which executes across 4 clocks on a SIMD - without going into too much detail, this grouping is due to the way the SIMD register file is accessed. Let's call this fundamental constant of 4 clocks Q.
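The arithmetic above can be sketched in a few lines (the function name and framing are mine, not ATI's):

```python
# A SIMD's wavefront size follows from its VLIW unit count and the
# 4-clock register-file access pattern: (SPs / VLIW width) * Q.

def wavefront_size(num_sps: int, vliw_width: int, q: int = 4) -> int:
    """Threads per wavefront on one SIMD unit."""
    tc = num_sps // vliw_width   # VLIW units = Tc = threads issued per clock
    return tc * q

print(wavefront_size(80, 5))   # R700-style 80-SP SIMD: Tc = 16, wavefront = 64
print(wavefront_size(40, 5))   # 8 VLIW5 units: Tc = 8, wavefront = 32
```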
A GPU comprising 32 VLIW units can handle 128 threads over the timespan Q; or, put another way, a 16-VLIW-unit SIMD has a wavefront of 64 threads over 4 clocks, just like an 8-VLIW-unit SIMD (like the one I'm typing this on) has a wavefront of 32 threads over 4 clocks, etc, etc. Now, there's a catch here - those threads are not just a random bunch of threads: they are identical across the entire SIMD, and the entire SIMD unit executes the same VLIW op from those identical threads over the span of a wavefront, i.e. Q. That quantity Tc x Q is fundamental for the SIMD of a GPU, for the entire design of that SIMD is made with a single goal - that Tc x Q threads could be in-flight (i.e. running or ready to be run with zero latency) at any given moment. Basically, a VLIW SIMD unit keeps the state (read: registers) of Tc x Q threads, while it runs Tc of those at any given clock. So far so good?
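That lockstep execution can be sketched as pseudocode (this is my illustration, not real hardware behavior): one SIMD with Tc = 16 runs the same VLIW op for all Tc x Q = 64 in-flight threads, Tc threads per clock, over Q = 4 clocks.

```python
TC, Q = 16, 4                        # 16 VLIW units, 4-clock wavefront
threads = list(range(TC * Q))        # the 64 threads whose registers the SIMD holds

def run_wavefront(op, threads):
    """Issue the same op for every in-flight thread, Tc per clock."""
    for clock in range(Q):
        batch = threads[clock * TC:(clock + 1) * TC]
        for t in batch:
            op(t)                    # same instruction, different thread

executed = []
run_wavefront(executed.append, threads)
print(len(executed))                 # 64 - every in-flight thread ran the op
```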
Now, there were guesses earlier in the thread that 192 could be the number of threads corresponding to the above Tc x Q quantity of Latte, plus some 'spares' - threads all loaded to a SIMD (or to both SIMDs) and ready to run or running at any time. Let's see how viable that is, but for that we need to introduce another fundamental paradigm of the VLIW era - the clauses.

A clause in our context is a sequence of ops which are of the same 'nature' - say, ALU, or fetch, or export. The AMD shader compiler groups the ops from a shader/kernel into such clauses while honoring the logical flow of the shader, and the thread scheduler (AKA 'sequencer', or SQ in ATI terms) schedules threads based on the readiness of those clauses to run. Once a clause is run on a given SIMD unit, it does not get preempted until it voluntarily yields, i.e. the multithreading on the SIMD is cooperative (yes, there are watchdog timers to take care of clauses hogging the SIMD).

Say, in a given shader, clause A does some tex fetches, followed by clause B which does computations with the results from those fetches. The sequencer would see the data dependence between the two (i.e. A->B), schedule clause A first, and mark clause B as ready-to-run once the results from A have been delivered. Now, this happens per kernel, and if you remember, a SIMD unit can only execute the same op from the kernel across all threads. So if our SIMD just finished with a 64-wide wavefront, and during that time the results for some of the threads waiting on clause B arrived, then perhaps it could schedule those ready-to-run threads for the next wavefront? Not really, because that would create a divergence in the SIMD - it would start executing ops from clause B, then, some wavefronts down the line, the rest of the threads waiting to proceed with clause B would unblock, but lo and behold, they cannot join their siblings already running B on the SIMD, because those are several instructions ahead!
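A toy sketch of that scheduling constraint (my own framing, not the actual SQ logic): since all threads in a wavefront advance through clauses in lockstep, the sequencer can mark clause B ready only when every thread's fetch result from clause A has arrived - a partial wavefront cannot be launched early.

```python
# Clause B is ready-to-run for a wavefront only when *all* of its threads
# have received their clause-A fetch results; any straggler blocks the lot.

def can_schedule_clause_b(fetch_done: list) -> bool:
    """True only when the whole wavefront can advance to clause B."""
    return all(fetch_done)

# 64-thread wavefront with fetches for a few threads still outstanding:
status = [True] * 60 + [False] * 4
print(can_schedule_clause_b(status))          # False - must keep waiting
print(can_schedule_clause_b([True] * 64))     # True  - whole wavefront runs B
```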
So the fact that we chose to run some threads (out of the pool of on-board threads) as early as possible actually proved disruptive! Or, to sum it up: of all the threads aboard (i.e. in-flight on) a SIMD, you want to either run all of them, or not run them at all! Now, how wide was our wavefront? It was 64 threads wide (or 128 threads across both SIMDs) - do you see now why keeping 96 threads per SIMD (192 across both SIMDs) in flight would not be justified?
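A quick check of those numbers (my arithmetic, under the same assumptions): a useful in-flight pool should be a whole multiple of the 64-thread wavefront, and 96 per SIMD is not.

```python
wavefront = 64
for pool_per_simd in (64, 96, 128):
    whole, leftover = divmod(pool_per_simd, wavefront)
    print(pool_per_simd, whole, leftover)
# 64  -> 1 full wavefront, 0 left over
# 96  -> 1 full wavefront, 32 stranded threads (a partial wavefront
#        that could not run without causing the divergence described above)
# 128 -> 2 full wavefronts, 0 left over
```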
Ok, I'm taking a break here - enough of a wall-of-text for now.