
WiiU technical discussion (serious discussions welcome)

z0m3le

Banned
We can, yes. If we know the architecture family, number of shader cores, and the clock speed, we can get pretty close to the theoretical GFLOP count based on other chips in that architecture. That's what I've been saying.

I don't even think we need to know the architecture family for that, all modern AMD GPUs are:

(core clock/1000) x shaders x 2 = GFLOPs. R700 through GCN all use that formula I believe.
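A quick sketch of that rule of thumb with placeholder numbers, just to show the arithmetic (the 550 MHz / 320-shader figures below are illustrative, not Latte's actual specs):

```python
# Rule of thumb quoted above: GFLOPs = (clock in MHz / 1000) * shaders * 2,
# where the "* 2" counts one multiply-add (two FLOPs) per shader per clock.
# The 550 MHz / 320-shader figures are placeholders, not Latte's real specs.

def theoretical_gflops(core_clock_mhz, shader_count):
    return (core_clock_mhz / 1000.0) * shader_count * 2

print(theoretical_gflops(550, 320))  # 352.0 GFLOPs for a hypothetical 320-shader part at 550 MHz
```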
 
There's very little mystery to Cell performance (including PPE and the SPEs), there are dozens of scientific papers from half a decade ago dedicated to the topic.
Amazing device... it's actually very bandwidth limited... imagine what kind of processing we could have got with more than 25GB/s.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
I don't even think we need to know the architecture family for that, all modern AMD GPUs are:

(core clock/1000) x shaders x 2 = GFLOPs. R700 through GCN all use that formula I believe.
How do you intend to acquire the bolded number if you know nothing about the die layout?
 

Thraktor

Member
While people are waiting on all the GPU stuff to happen, I have a few "real world" performance numbers on the CPU that may be worth discussing. Lostinblue posted some DMIPS numbers above, and it reminded me that I'd done a few calculations in that regard already. Dhrystone MIPS (DMIPS) is a CPU performance benchmark that's been around for a long time, and is commonly used enough that it's at least fairly easy to get Dhrystone values for a variety of different CPUs, so it'll do as a point of comparison for Espresso.

According to this website, the PPC750 derivative in the GameCube managed 1125 DMIPS, and ran at 485MHz, so that gives a DMIPS/MHz value of 2.32, which we can use as a loose measure of efficiency of the cores used in the GC, Wii and Wii U. There don't seem to be numbers on Jaguar DMIPS performance (as it isn't out yet), but I found this page with a benchmark of 5325 DMIPS for a dual-core Bobcat chip running at 1GHz. This comes to 2.68 DMIPS/MHz, so a bit more efficient per clock than the 750, but not massively so. Of course, the Jaguar chip should be more efficient (AMD claims 15% improvement per clock), so it might top out at around 3 DMIPS/MHz. We don't know what changes Nintendo and IBM have made to Espresso's cores over the Gekko/Broadway design, though, so it might be somewhat more efficient as well, but I wouldn't expect anything massive (big changes to the architecture would make BC tricky). Let's stick to a conservative 2.32 DMIPS/MHz for Espresso's cores for now, though.

So, we're looking at about 8,700 DMIPS for three 1.25GHz Broadway cores, versus about 38,400 DMIPS for the eight 1.6GHz Jaguar cores expected in PS4/XBox3. If MS are dedicating two of the Jaguar cores to the OS, then there's about 28,800 DMIPS of performance from the remaining 6.

How does this compare to the last generation? If lostinblue's numbers are correct, then the Cell PPE managed 1879.63 DMIPS at 3.2GHz, or 0.59 DMIPS/MHz. The XBox360's Xenon was effectively just 3 of these, so that comes to 5,639 DMIPS total (35% worse than the 1.25GHz Espresso). If anyone has numbers on Cell's SPEs, I'd be happy to add them in here.
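A minimal sketch reproducing that back-of-the-envelope comparison; the DMIPS/MHz values are the estimates from this post, not measured figures for the actual console chips:

```python
# Reproducing the back-of-the-envelope DMIPS totals from the post above.
# The DMIPS/MHz values (2.32 for 750-class cores, ~3.0 for Jaguar, ~0.59 for a
# Xenon/PPE-style core) are estimates, not measured console figures.

def total_dmips(dmips_per_mhz, clock_mhz, cores):
    return dmips_per_mhz * clock_mhz * cores

estimates = {
    "Espresso (3 x 1.25 GHz)":    total_dmips(2.32, 1250, 3),  # ~8,700
    "Jaguar (8 x 1.6 GHz)":       total_dmips(3.00, 1600, 8),  # ~38,400
    "Jaguar (6 cores for games)": total_dmips(3.00, 1600, 6),  # ~28,800
    "Xenon (3 x 3.2 GHz)":        total_dmips(0.59, 3200, 3),  # ~5,700
}

for name, value in estimates.items():
    print(f"{name}: ~{value:,.0f} DMIPS")
```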

Now, of course Dhrystone is only one benchmark, and only really tests certain things, so you're going to get different results with different kinds of code. I'm also using estimates here, so further customisation of Espresso's cores may boost its Dhrystone performance, for example (but don't expect much, I'd be very surprised if it were over 10,000 DMIPS). Nonetheless, hopefully this illustrates how clock speeds are far from the main factor when comparing CPUs' performance.
 

Thraktor

Member
With what someone in the know said about the shader count being odd, I half expect GCN's 64-shader CUs, with something unusual like a base 384 shader count but with another CU or two.

Given that Latte's using a VLIW5 architecture, a count of 384 is impossible. The clue's in the name, as they're grouped in arrays of 5 "shader cores", so the total must be a multiple of 5.
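A trivial check of that multiple-of-5 argument (the note about coarser real-world granularity is an assumption about how R700-era SIMD engines group their VLIW5 units):

```python
# Quick check of the multiple-of-5 argument: a VLIW5 shader count must divide evenly by 5,
# so 384 can't be right. (In practice R700-era parts come in even coarser steps, since each
# SIMD engine groups many VLIW5 units together; that granularity is an assumption here.)

for count in (320, 384, 400):
    ok = (count % 5 == 0)
    print(f"{count}: {'possible for VLIW5' if ok else 'not a multiple of 5, so impossible'}")
```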
 
And Fourth Storm was never heard from again..

Ha! I'm pretty sure you guys could hunt me down and feed me to the Pikmin if I tried pulling that. As Thraktor said, I'm waiting for the funds to clear at this point. Like Iwata in the recent ND, I must graciously ask for your patience in the meantime.

I also appreciate the gesture from all the would-be contributors. The short time it took to reach our goal was quite impressive - but I should have expected nothing less from Gaffers. Bravo, people!
 

Thraktor

Member
Ha! I'm pretty sure you guys could hunt me down and feed me to the Pikmin if I tried pulling that. As Thraktor said, I'm waiting for the funds to clear at this point. Like Iwata in the recent ND, I must graciously ask for your patience in the meantime.

I also appreciate the gesture from all the would-be contributors. The short time it took to reach our goal was quite impressive - but I should have expected nothing less from Gaffers. Bravo, people!

Also, anyone who gave you money over Paypal now has your address, so exacting some vengeance shouldn't be too difficult :p
 

Earendil

Member
If we decide to get the CPU shots, I should be able to pitch in a little. It depends on when my mortgage goes through; I have the bank going over my checking account with a fine-tooth comb right now.
 

z0m3le

Banned
How do you intend to acquire the bolded number if you know nothing about the die layout?

Well, I would take a guess that the most repeated structure on the die would be the shaders; counting them would give you a total. I'm obviously not the person to analyze a die layout, but I would assume that would be the approach, and hopefully I wouldn't confuse them with the eDRAM, because I expect the GPU to take up the majority of the area.

Also, it should resemble VLIW5 layouts, making this a bit easier (thanks to whoever pointed that out).
Thraktor, was the above confirmed? I thought VLIW5 was simply a guess based on the early leaked R700 GPU.
 

pixlexic

Banned
Just wanted to add that there isn't a major difference writing code for x86 vs PPC. Just about all of that is taken care of in the compilers.

The major factor would have been if the Orbis and Durango CPU hardware architecture were fundamentally different from the Wii U's, and it is not. They all work on the same principles.
 

pestul

Member
Ha! I'm pretty sure you guys could hunt me down and feed me to the Pikmin if I tried pulling that. As Thraktor said, I'm waiting for the funds to clear at this point. Like Iwata in the recent ND, I must graciously ask for your patience in the meantime.

I also appreciate the gesture from all the would-be contributors. The short time it took to reach our goal was quite impressive - but I should have expected nothing less from Gaffers. Bravo, people!
I know... was actually referring to the Nintendo ninjas having silenced you. :eek:
 
Just wanted to thank the major contributors in this thread! I am in the IT field, but not on the hardware side or in low-level software. Even though I am no expert, the discussion has been very entertaining; I have been following the thread and even learning some stuff, very educational. Can't wait for the info in the next couple of days.
 

OryoN

Member
...Now, of course Dhrystone is only one benchmark, and only really tests certain things, so you're going to get different results with different kinds of code. I'm also using estimates here, so further customisation of Espresso's cores may boost its Dhrystone performance, for example (but don't expect much, I'd be very surprised if it were over 10,000 DMIPS). Nonetheless, hopefully this illustrates how clock speeds are far from the main factor when comparing CPUs' performance.

Thanks for sharing. I was just reading that the 750CL (pretty much what's in the Wii, and what Espresso is likely to be based on) handles 4 instructions per clock, vs. 2 for the Jaguar cores. That should help close the performance gap a decent bit, no?

http://datasheets.chipdb.org/IBM/PowerPC/750/PowerPC-750CL.pdf
(*see "general information")

Also, what about a shorter pipeline? How does that come into play in helping Espresso's cores get a bit more performance? Is there a downside to a short pipeline other than lower clock speeds? Or is there a plus to having more stages, other than higher clock speeds?
 

pepone1234

Neo Member
Thanks for sharing. I was just reading that the 750CL (pretty much what's in the Wii, and what Espresso is likely to be based on) handles 4 instructions per clock, vs. 2 for the Jaguar cores. That should help close the performance gap a decent bit, no?

http://datasheets.chipdb.org/IBM/PowerPC/750/PowerPC-750CL.pdf
(*see "general information")

Also, what about a shorter pipeline? How does that come into play in helping Espresso's cores get a bit more performance? Is there a downside to a short pipeline other than lower clock speeds? Or is there a plus to having more stages, other than higher clock speeds?

Well, if I am not mistaken the PowerPC architecture is RISC and the x86 used by the Jaguar cores is CISC, so comparing instructions per clock may not be so straightforward.
 

Schnozberry

Member
Well, if I am not mistaken the PowerPC architecture is RISC and the x86 used by the Jaguar cores is CISC, so comparing instructions per clock may not be so straightforward.

RISC processors handle simple math functions much faster. Instructions on RISC processors are broken down into small opcodes that can be executed in one clock cycle. CISC processors handle complex math faster, as they can handle more data simultaneously. The IPC difference may be an advantage to Espresso with certain types of code, and a wash or perhaps even a detriment for others.

One thing that can be said for the Wii U CPU is that it is incredibly power efficient for the performance it is putting out. It certainly seems to punch above its weight, as it were. That's probably not much solace for those who are looking for a competitor to much more expensive gaming PCs, but I think it's great news for Nintendo in the long run because they'll be able to reach economies of scale much quicker with less expensive hardware, which will lead to lower prices and faster adoption by the mainstream. A lot of people forget that part of the Wii's early success was its price. $250 always seems to be the price at which the mass market starts biting on upgraded hardware.

If Nintendo unifies its SKUs at $299 and releases Kart and 3D Mario and maybe another surprise for the next holiday, Durango and Orbis will need to bring the thunder in terms of price and software to grab buyers outside the early adopter crowd.

EDIT: What I mean by unifying its SKUs is dropping the Basic set and blowing them out for $249, and lowering the Deluxe set to $299, leaving that as the only production SKU going forward.
 

tipoo

Banned
I don't even think we need to know the architecture family for that, all modern AMD GPUs are:

(core clock/1000) x shaders x 2 = GFLOPs. R700 through GCN all use that formula I believe.


You would see compute units on the die, not individual shaders. So you would need to know which architecture it is to know how many shaders they use per cluster. You could probably tell the architecture from the die photo as well.


Maybe with a super high res picture like this one you could see individual shaders, I don't know, but I've never seen that. Anyways, counting out a few and then doing simple multiplication sounds like more fun than counting out hundreds of shaders at any rate :p
 
Instructions on RISC processors are broken down into small opcodes that can be executed in one clock cycle.

Err.. no. You might be mixing this up with what Popstar said: Modern "CISC" CPUs (which are all RISC in the inside) divide the CISC instructions into several micro-ops. Native RISC CPUs don't need to do that.

One thing that can be said for the Wii U CPU is that it is incredibly power efficient for the performance it is putting out.

I wouldn't say that. More power efficient than the big x86 CPUs or Xenon, maybe. But certainly not more than Jaguar.
 
There's very little mystery to Cell performance (including PPE and the SPEs), there are dozens of scientific papers from half a decade ago dedicated to the topic.
Oh, c'mon.

There wasn't any mystery with the Gekko either, but we've got benchmarks like DMIPS performance and GFlops.

Plus, I reckon that lousy general purpose performance came as a surprise for a lot of people back then (in fact it still comes as a surprise to some today); and said numbers were never really published by Sony themselves, even if anyone in the know knew that said performance wouldn't be anything to write home about (and on that I agree: I knew, you certainly knew as well, and they didn't keep the real nature of the chip a secret, but they kinda tried to sidestep the disadvantages whilst singing the "supercomputer" praises); yet we couldn't really position it accurately without benchmarks.

Basically I'm saying such benches have their place.
 

Thraktor

Member
Thanks for sharing. I was just reading that the 750CL (pretty much what's in the Wii, and what Espresso is likely to be based on) handles 4 instructions per clock, vs. 2 for the Jaguar cores. That should help close the performance gap a decent bit, no?

http://datasheets.chipdb.org/IBM/PowerPC/750/PowerPC-750CL.pdf
(*see "general information")

Also, what about a shorter pipeline? How does that come into play in helping Espresso's cores get a bit more performance? Is there a downside to a short pipeline other than lower clock speeds? Or is there a plus to having more stages, other than higher clock speeds?

Well, it fetches up to four instructions per clock, dispatches up to two, executes anywhere up to six and completes (I would assume) a similar number. Now, it's not going to be doing all that every clock cycle, and performance is going to be limited by whichever bottleneck is affecting the particular code you're running on it. That's why benchmarks like Dhrystone are a simpler comparison, as they give a sort of "typical" performance figure without having to understand the micro-architectural peculiarities of every chip you're looking at, and having to figure out how exactly they affect code.

In terms of the shorter pipeline, the main advantages will be when it comes to code with a lot of unpredictable branching (AI would be the easiest example of this), as branch mispredict penalties are a lot cheaper. Of course, Jaguar doesn't really have an excessively long pipeline, and likely has better branch prediction hardware, so Espresso's advantage there wouldn't be huge. The downside to the shorter pipeline is simply the lower attainable clock speed.
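A toy illustration of that trade-off; the branch frequency, mispredict rates and flush penalties below are illustrative guesses, not measured Espresso or Jaguar numbers:

```python
# Toy model of why mispredict cost scales with pipeline depth. Extra cycles lost per
# instruction = branch frequency x mispredict rate x flush penalty (roughly the pipeline
# depth). All three numbers below are illustrative guesses, not Espresso/Jaguar figures.

def mispredict_cost(branch_fraction, mispredict_rate, penalty_cycles):
    return branch_fraction * mispredict_rate * penalty_cycles

short_pipe = mispredict_cost(0.20, 0.10, 4)    # short pipeline, simpler predictor
long_pipe  = mispredict_cost(0.20, 0.05, 14)   # deeper pipeline, better predictor

print(f"short pipeline:  {short_pipe:.3f} extra cycles per instruction")
print(f"deeper pipeline: {long_pipe:.3f} extra cycles per instruction")
```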

RISC processors handle simple math functions much faster. Instructions on RISC processors are broken down into small opcodes that can be executed in one clock cycle. CISC processors handle complex math faster, as they can handle more data simultaneously. The IPC difference may be an advantage to Espresso with certain types of code, and a wash or perhaps even a detriment for others.

This isn't entirely true these days. Back in the 80s, RISC CPUs genuinely had smaller instruction sets than comparable CISC CPUs, but now you have RISC chips like the Power7 with hugely complex instruction sets. The real difference between the two is that RISC CPUs have what is called a "load-store architecture". That is, RISC CPUs can't execute instructions directly on data in memory, they have to load it into a register first, then store it back into memory after. CISC CPUs, by comparison, could just operate directly on the data without having to load it into the register. The problem with doing this is that these instructions actually use up a little bit more electricity than instructions operating on registers, as the data is travelling a much farther distance. As a result, RISC CPUs are typically more energy efficient, which brings me to my next point:

Jaguar, like Atom, isn't strictly a CISC CPU. It runs the CISC x86-64 instruction set, but it does so by converting the instructions on-the-fly into what are called micro-ops, which are functionally a RISC instruction set. This is done in order to keep energy consumption as low as possible (remember that this is AMD's low-power CPU line). The RISC/CISC distinction, then, between Espresso and Jaguar is mostly an illusion, as the basic principles behind their operation are largely the same.
 
I wouldn't say that. More power efficient than the big x86 CPUs or Xenon, maybe. But certainly not more than Jaguar.
It actually might, the PPC 750 pipeline is really short, I reckon 7 stages or so. That makes it really efficient per clock for what it is. Energy consumption-wise also.

i3/i5/i7 are 15-17 stages I think (like Pentium 3 and Pentium M were), as is AMD technology I believe.

As for the Pentium 4 architecture, that reached 32 stages with the Prescotts. More stages in the pipeline enable higher frequencies but also reduce performance per clock, which is why early Northwood CPUs actually performed better than Prescott at the same speed (Prescott had an even further elongated pipeline in order to accommodate more cache). This whole hugely-staged pipeline thing was the reason it took a beating from much lower clocked AMD CPUs back then, if not from their variant or evolved technologies still designed around that "core"; in layman's terms, a larger pipeline just means information takes more time to get somewhere.

PS3's Cell and Microsoft's Xenon are also elongated pipeline designs (more than 30 stages), except they're stripped of some complexity (further killing them in general purpose work) and have no cache miss prediction (which happens 5% of the time if I reckon correctly); so it's like having a Pentium 4, but worse. Pentium 4s wasted a lot of energy too; the shorter pipeline designs of today are way more power efficient.

So yeah, the one thing we know for sure is that said CPU should be very power efficient; perhaps more than Jaguar.


The thing Jaguar has going for it is the APU architecture taking hold of the traditional FPU's role. In that regard it certainly destroys the Wii U; the rest is, as said, not incredibly more efficient per clock in operations per MHz, but higher clocked (because clocking that PPC750 architecture up is a bitch), and featuring more cores. Note that the PPC750 has been used in routers and other devices meant to waste as little energy as possible for well over a decade (well, and it kinda fell off the map when Freescale decided not to core shrink it further, but the point stands). At 45nm it should be very power efficient, to the point that I believe they should have added more cores just to be safe.

EDIT: I didn't realize we were talking pipeline stages already, nice!
 

Thraktor

Member
EDIT: I didn't realize we were talking pipeline stages already, nice!

Yeah, pipelines have come up now and again in the thread, as you might expect when answering the question "What's the deal with using 750-based cores?" Incidentally, the 750 series has the classic 4-stage PPC pipeline, with up to two extra stages for floating point ops.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
This isn't entirely true these days. Back in the 80s, RISC CPUs genuinely had smaller instruction sets than comparable CISC CPUs, but now you have RISC chips like the Power7 with hugely complex instruction sets.
RISCs to this day have simpler ISAs for the simple reason that their instruction encoding is fixed length, vs variable-length encoding for CISC ISAs. The two approaches have their pros and cons, the major among which are:

CISC
  • pro: variable instruction length in theory allows better code density - not all instructions have to be the same length, so more common instructions can be encoded in shorter sequences. As a result, better I$ utilisation is possible.
  • con: the most popular CISCs out there - the x86 family, do not actually encode their instructions in any sane way, because 'legacy' - you get 8086 single-byte ops nobody uses today, and quintessential modern mov ops spanning many bytes. I haven't bothered to check if the situation has radically improved in x86-64.

RISC
  • pro: fixed-length op encoding makes op decoding orders of magnitude easier (read: fewer transistors are spent on op decoding) - an op can be only n-bytes long (and not longer than a machine word), so op boundaries come at multiples-of-n addresses, you always know how many bytes n ops take so you never end up with fragments of instructions, etc.
  • con: fixed-length means you would have worse I$ utilisation than a good variable-length ISA. So RISCs get to cheat there - 64bit RISCs still have 32bit op length, and sometimes even 16bit op length. Sometimes both, like most ARMs which have a full-fledged 32bit-op-length ISA and a 16bit-op-length Thumb ISA - a reduced opcode set, and the code can switch between the two at the drop of a jump (please note that's op encoding length that is being discussed here, not 'ISA bitness' dictating register size, etc - Thumb is still a 32bit ISA).
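A toy sketch of the decode difference described in the two lists above; the 2-bit "length field" is invented purely for illustration and is not a real x86 or PPC encoding:

```python
# Fixed-length ops have boundaries at known offsets, while variable-length ops must be
# decoded one by one just to find where the next instruction starts.

FIXED_OP_SIZE = 4  # e.g. a 32-bit RISC encoding

def fixed_boundaries(code):
    return list(range(0, len(code), FIXED_OP_SIZE))

def variable_boundaries(code):
    boundaries, pc = [], 0
    while pc < len(code):
        boundaries.append(pc)
        pc += (code[pc] & 0b11) + 1  # must decode every op just to locate the next one
    return boundaries

code = bytes([0x03, 0xAA, 0xBB, 0xCC, 0x00, 0x01, 0x10])
print(fixed_boundaries(bytes(8)))   # [0, 4]: boundaries known without decoding anything
print(variable_boundaries(code))    # [0, 4, 5]: found only by walking the byte stream
```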

The real difference between the two is that RISC CPUs have what is called a "load-store architecture". That is, RISC CPUs can't execute instructions directly on data in memory, they have to load it into a register first, then store it back into memory after. CISC CPUs, by comparison, could just operate directly on the data without having to load it into the register. The problem with doing this is that these instructions actually use up a little bit more electricity than instructions operating on registers, as the data is travelling a much farther distance. As a result, RISC CPUs are typically more energy efficient..
Actually, the full picture is a bit more complicated than that.

RISC's load/store model originally allowed L/S categories of ops to have their own dedicated pipelines - the rest of the CPU pipelines do not have to stall on a L/S, unless data dependency is present, of course. With classic CISCs that don't have a load/store unit, an 'add x, [mem]' can stall a larger set of resources until that mem is delivered. Modern CISC CPUs don't make that mistake, of course, since, as you note, they're not really CISC on the inside.
 

tipoo

Banned
The PowerPC 750 was modified when it was used in the GameCube to add some basic SIMD capabilities (the Wii U processor is still just using paired singles, same as the GameCube and Wii [I know this from Hector Martin's (marcan's) twitter]); does anyone know if its weak FPU performance was changed in the GC or Wii too?

The 7xx family had its shortcomings, namely lack of SMP support and SIMD capabilities and a relatively weak FPU


http://en.wikipedia.org/wiki/PowerPC_7xx
http://en.wikipedia.org/wiki/PowerPC_7xx#Gekko

Shame it doesn't have broader SIMD units though, this guy went from 40 to 120FPS optimizing for SIMD
http://blog.wolfire.com/2010/09/SIMD-optimization
 
Yeah, pipelines have come up now and again in the thread, as you might expect when answering the question "What's the deal with using 750-based cores?" Incidentally, the 750 series has the classic 4-stage PPC pipeline, with up to two extra stages for floating point ops.
6 stages then; I knew I was walking on thin ice there.
The PowerPC 750 was modified when it was used in the GameCube to add some basic SIMD capabilities (the Wii U processor is still just using paired singles); does anyone know if its weak FPU performance was changed in the GC or Wii too?
Probably not; if it had, I'm pretty sure the SIMD implementation would have been upgraded to VMX, and it supposedly wasn't (claimed not to be the case, at least). It's a shame, because an implementation like VMX128 would really come in handy. As is, it seems like they just kept everything in place, core-shrunk it and clocked it higher.

The "50 SIMD instructions" of the Gekko were meant for 3D; they were the compression support for geometry and the like (Flipper didn't have vertex shaders, hence the CPU still had to do those tasks, so as to not take too much bandwidth bthey made it so that both CPU and GPU could handle compressed data between themselves. VMX128 a few years back was integrated onto the Xenon for the same purposes, I sure hope they at least increased the SIMD instructions this time around, but time will tell.

Still, it's most definitely mostly unchanged, and so no FPU miracles there. But for a general purpose CPU the FPU is not so bad; I reckon a Pentium 4 had 6.4 Gigaflops @ 3.2 GHz, while this part will be doing 4.9 GFlops per core @ 1.243 GHz if the increase is in line with the Gekko's 1.9 GFlops @ 485 MHz. It's not all that bad; a single core Cortex A9 does 2 Gigaflops @ 1 GHz.
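A quick check of that scaling, assuming peak FLOPs simply scale with clock if the FPU really is unchanged from Gekko's (the 1.9 GFLOPs @ 485 MHz and 1.243 GHz figures are the ones quoted in this thread, not official specs):

```python
# If Espresso's FPU is unchanged from Gekko's, peak FLOPs should scale linearly with clock.

gekko_gflops, gekko_mhz = 1.9, 485.0
espresso_mhz = 1243.0

per_core = gekko_gflops * (espresso_mhz / gekko_mhz)
print(f"Espresso per core: ~{per_core:.1f} GFLOPs")       # ~4.9
print(f"Espresso, 3 cores: ~{3 * per_core:.1f} GFLOPs")   # ~14.6
print("Pentium 4 @ 3.2 GHz: ~6.4 GFLOPs (for comparison)")
```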
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
The PowerPC 750 was modified when it was used in the GameCube to add some basic SIMD capabilities (the Wii U processor is still just using paired singles); does anyone know if its weak FPU performance was changed in the GC or Wii too?

http://en.wikipedia.org/wiki/PowerPC_7xx
http://en.wikipedia.org/wiki/PowerPC_7xx#Gekko
Gekko's FPU is not so weak. It does not have large SIMD register files or the ALU fields to back the former, but its scalar performance is excellent, and that includes a bunch of ops that the SSE bunch have not even heard of (like an actual MADD, for starters). If Gekko had 2x its current FPU resources it would've given a few SSE kids a good run for their money. Not unlike how AltiVec was making the early SSE look silly back in the day.
 
Jaguar, like Atom, isn't strictly a CISC CPU. It runs the CISC x86-64 instruction set, but it does so by converting the instructions on-the-fly into what are called micro-ops, which are functionally a RISC instruction set. This is done in order to keep energy consumption as low as possible (remember that this is AMD's low-power CPU line). The RISC/CISC distinction, then, between Espresso and Jaguar is mostly an illusion, as the basic principles behind their operation are largely the same.
Isn't this what every x86 CPU does these days?
 

Thraktor

Member
Thanks blu. I wasn't aware that fixed/variable length op-codes were a RISC/CISC thing, interesting to know, and I'm willing to bet that x86-64 instructions aren't neatly Huffman encoded (although I suppose embedded processors with custom instruction encoding for known workloads would be pretty interesting, that's largely off-topic). I meant to mention the pipeline stalls from loads/stores in CISC, but of course they're not really a thing any more anyway.


Gekko's FPU is not so weak. It does not have large SIMD register files or the ALU fields to back the former, but its scalar performance is excellent, and that includes a bunch of ops that the SSE bunch have not even heard of (like an actual MADD, for starters). If Gekko had 2x its current FPU resources it would've given a few SSE kids a good run for their money. Not unlike how AltiVec was making the early SSE look silly back in the day.

I had a thought while looking at a 750 die shot earlier; what if Nintendo and IBM added an extra FPU to Espresso's cores? The 750 series (along with most PPCs since) does this with integer units, with one integer unit for simple instructions and one for complex instructions. Given that Espresso's main limitation would likely be in the field of floating point performance, it might make sense to extend the concept to FPUs. The existing FPU (with paired singles) would be the complex floating point unit, and they could add a second smaller unit to handle basic single precision ops. The main benefit of such a design would be that it could improve floating point performance quite a bit, while maintaining full BC, and they wouldn't even have to touch the compiler. Nintendo also have a vast catalog of code from Wii and GameCube games that could be used to determine which instructions the simple FPU would be best used for.

It's just something that occurred to me, but seems to be in keeping with IBM's design philosophy and Nintendo's requirements, so it might be something to look for if we do end up taking a peek at the CPU.
 
It actually might, the PPC 750 pipeline is really short, I reckon 7 stages or so. That makes it really efficient per clock for what it is. Energy consumption-wise also.

There's much more to per clock efficiency (or IPC) than pipeline depth. For example, it says nothing about the degree of superscalarity, about OoOE, branch prediction or SIMD units (all areas where Jaguar is probably superior).
Intel's latest Sandy/Ivy Bridge architectures are monsters when it comes to IPC despite the relatively deep pipeline, and Haswell will go even further.
 
There's much more to per clock efficiency (or IPC) than pipeline depth. For example, it says nothing about the degree of superscalarity, about OoOE, branch prediction or SIMD units (all areas where Jaguar is probably superior).
Intel's latest Sandy/Ivy Bridge architectures are monsters when it comes to IPC despite the relatively deep pipeline, and Haswell will go even further.
I know; I'm not saying the PPC 750 is an up to date architecture, it isn't.

But it's still somewhat efficient for what it is. I'm putting the chance on the table that a Jaguar could take more energy even if it were a 3-core solution/implementation, especially seeing they aren't Intel (they don't have the Intel tri-gate technology and they aren't as far along with their low energy consumption roadmap either). As for performance, no contest; at most this chip will fall into the "not so bad" area per core, and even then it seems totally outnumbered. I fail to understand why they didn't go at least 4-core (and SMT support would be a blessing too).

Thankfully this console generation's CPUs have been horrible at the things the PPC750 is good at (general processing with OoOE support), so that will probably help it out a bit, as does the sound coprocessor taking that load off it.
 

tipoo

Banned
Still, it's most definitely mostly unchanged, and so no FPU miracles there. But for a general purpose CPU the FPU is not so bad; I reckon a Pentium 4 had 6.4 Gigaflops @ 3.2 GHz, while this part will be doing 4.9 GFlops per core @ 1.243 GHz if the increase is in line with the Gekko's 1.9 GFlops @ 485 MHz. It's not all that bad; a single core Cortex A9 does 2 Gigaflops @ 1 GHz.


Thanks for explaining.
It won't be up against low-power, generation-old, single-core phone CPUs though, of course; AMD's Jaguar has quite a few enhancements in this regard. The FPU units are now 128 bits wide, compared to 64 bits on Bobcat and Espresso (64-bit floating point, or 2 × 32-bit SIMD, aka paired singles, on Espresso). Jaguar supports 256-bit AVX by breaking the operations into a pair of 128-bit uops. I'm guessing it will still have quite a lead, and that's even before factoring in the sheer number of cores and the increased clock.
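A hedged peak single-precision comparison based on those FPU widths; the FLOPs-per-cycle figures (4 for a paired-single multiply-add on Espresso, 8 for a 4-wide SP add plus a 4-wide SP multiply on Jaguar) and the 1.6 GHz Jaguar clock are assumptions drawn from this thread's rumours, not confirmed specs:

```python
# Peak SP throughput = FLOPs per cycle * clock * cores. All inputs are assumptions.

def peak_sp_gflops(flops_per_cycle, clock_ghz, cores):
    return flops_per_cycle * clock_ghz * cores

espresso = peak_sp_gflops(4, 1.243, 3)  # ~14.9 GFLOPs
jaguar   = peak_sp_gflops(8, 1.600, 8)  # ~102.4 GFLOPs

print(f"Espresso (3 cores): ~{espresso:.1f} GFLOPs peak SP")
print(f"Jaguar (8 cores):   ~{jaguar:.1f} GFLOPs peak SP")
```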
 
Thanks for explaining.
It won't be up against low-power, generation-old, single-core phone CPUs though, of course; AMD's Jaguar has quite a few enhancements in this regard. The FPU units are now 128 bits wide, compared to 64 bits on Bobcat. The chip supports 256-bit AVX by breaking the operations into a pair of 128-bit uops. I'm guessing it will still have quite a lead, and then factor in the sheer number of cores and the increased clock.
Oh, make no mistake, it'll be destroyed by these. Not in any small, relative way; hugely so.

Before, you had two concepts: the FPU, which started out as an expansion math chip alongside the 286 and 386 and was integrated from the 486 onwards (486 chips with faulty FPUs were rebranded 486 SX), and later the GPU; the 3D-accelerated GPU was of course another type of specialized math/floating-point unit meant for graphics, and very efficient at that, just going by the GFLOP rating (and then looking at the MHz it's clocked at).

Now, AMD bought ATi and all those shenanigans and they did the Fusion/APU concept which is nuking the comparatively ineffective FPU unit and adding in an actual GPU part for that same task (along with others). I haven't really looked all that much into it, and I know it's more complex than this, but that's the base concept there.

So yeah, this chip is screwed to no end in that regard.
 

Donnie

Member
FWIW, I just tweeted to Hector Martin (marcan) about the FPU and he replied

Has he seen a die shot though? Maybe I'm being pedantic, but surely there can be optimisations to a core that couldn't be seen on the software side. I suppose you can benchmark and compare per MHz. But I doubt performance would be identical per MHz anyway due to the differing cache setup.

Certainly seems the cores are very similar anyway. The die size seems odd for 3x Broadway cores and some SMP circuitry (and slightly more memory transistors per core) on 45nm. But I suppose shrinking a CPU to a smaller process rarely works out close to how it should theoretically.
 
Wow, I'm gone for a couple of days and stuff happens...

Thanks FourthStorm and all the contributors.


I've been thinking, and mock me if this makes no damned sense... but...


I'm under the impression that Marcan's info is from hacking Wii mode, correct?

If so, wouldn't two of the cores be disabled for BC's sake?

IIRC, it's rumored that Espresso has asymmetrical cores. Is it possible that one of the cores (the one used for Wii mode) was kept as close to vanilla Broadway as possible for BC while the others house any changes that may have been added?

Like I said, mock if this is just stupid, but I'm curious if it's even possible.
 

ozfunghi

Member
Wow, I'm gone for a couple of days and stuff happens...

Thanks FourthStorm and all the contributors.


I've been thinking, and mock me if this makes no damned sense... but...


I'm under the impression that Marcan's info is from hacking Wii mode, correct?

If so, wouldn't two of the cores be disabled for BC's sake?

IIRC, it's rumored that Espresso has asymmetrical cores. Is it possible that one of the cores (the one used for Wii mode) was kept as close to vanilla Broadway as possible for BC while the others house any changes that may have been added?

Like I said, mock if this is just stupid, but I'm curious if it's even possible.


I'm sure someone more knowledgeable will jump in, but he was able to determine what the clock speed was... which certainly wasn't "Wii mode frequency" either. Or is it possible that the CPU will still run at the same frequency in Wii mode? Which would mean some Wii games that struggled might struggle less on Wii U?
 
Yeah, I've messaged him on Twitter and he is stuck on the idea that the CPU is just 3 Broadways stuck together, call it a day. If you read the quotes from the Iwata Asks on Wii U hardware it doesn't sound like that; they seem proud of the CPU "TEAM NINTENDO" made for the console. Yet Iwata has said it isn't a massively powerful CPU... but I believe it will be one that just does what it's supposed to do. Nothing special, just a nice CPU to pair with a very decent and modern GPU.

Hasn't he said before, though, that it could've had extras (specifically 750FX features)?
 
Yeah, he has said it's like a hybrid of the FX and GX... but he also has stated many times it's basically just 3 Broadways. IMO that's not the best verbiage to use, as it makes the CPU seem as weak as it can be. There have been many people posting on his Twitter that they can't believe Nintendo just put 3 Wii CPUs together and called it a day. He IMO has done more harm than good in this discussion, though I guess he is very respected as a hacker.

Can't say I've ever paid any attention to his (or any) hacking efforts, and this Wii U info is the only time I've heard his name, but from all this he sounds a bit of a dick.
 

tipoo

Banned
I'm under the impression that Marcan's info is from hacking Wii mode, correct?

If so, wouldn't two of the cores be disabled for BC's sake?

IIRC, it's rumored that Espresso has asymmetrical cores. Is it possible that one of the cores (the one used for Wii mode) was kept as close to vanilla Broadway as possible for BC while the others house any changes that may have been added?

Like I said, mock if this is just stupid, but I'm curious if it's even possible.

No, he's responded to that a bunch of times; the hack is for Wii U mode. The Wii U in Wii mode IS a Wii; it would show a single ~700MHz core in that mode. He's seeing the clock for regular mode.

I suppose there could be optimizations to the FPU that a hacker can't see, but he seems pretty confident that the core is pretty much the same as the Wii cores, and I'd tend to believe him. Guessing that there are some invisible optimizations is just that, a guess, and a stretch IMHO.

As for the work on the CPU Nintendo seemed so proud of, mentioned above, well, they did get an old architecture with no multicore support to run three cores, plus clocking it higher than it has ever run before, plus all the work needed to shrink it down to 45nm, plus getting it to use eDRAM, which it never has before, etc. But those are mostly uncore components.
 

Durante

Member
Yeah, he has said it's like a hybrid of the FX and GX... but he also has stated many times it's basically just 3 Broadways. IMO that's not the best verbiage to use, as it makes the CPU seem as weak as it can be. There have been many people posting on his Twitter that they can't believe Nintendo just put 3 Wii CPUs together and called it a day. He IMO has done more harm than good in this discussion, though I guess he is very respected as a hacker.
That seems a strange thing to say about the one person that has provided what is easily the most significant, useful contribution to our understanding of the Wii U hardware since we got the first teardowns and RAM markings.
 

tipoo

Banned
he "IMO" has done more harm than good in this discussion though i guess he is very respected as a hacker.

Well, he also explained to multiple people how the Wii U has higher IPC than the PS3 and 360, and he made a very significant contribution to what we know (the clock speed of the CPU and GPU, plus a more solid piece of evidence on what architecture the CPU is based on). Information is always good, and he's a well-respected hacker, so I don't think he's lying for fame; just because a few people might get misinformed isn't his or anyone else's fault, it's their misunderstanding of the technology.

And by the way, the "duct taped together" thing is a common Nintendo meme that people used to describe the Wii's power in relation to the GC. I think he meant that to be a joke, but it holds some water.
 

tipoo

Banned
I just kinda wish we all knew how he got those numbers. Like, it's just so easy to say "yeah, here is what the CPU and GPU are running at" and not give any more information. I'm not a hacker so I don't know... I just wish we knew more about those numbers he got.

I don't claim to know completely but he's staking his reputation on it, and he's pretty well known. He said the Wii U security core was very similar to the Wii one, and bypassing that and getting access to the CPU is how we now know the clock speed of the Wii since Nintendo never said that either. So I imagine it's the same case here.

And he knows it's binary compatible with the 750 due to the kinds of instructions it will accept

Besides, I wasn't expecting clocks any higher, the entire machine draws no more than 33 watts.
 