I asked about this because I'm wondering about the feasibility of a future Nintendo product that has the Wii U's chipset on the actual Gamepad... with the game disc-data streaming from the console or even over the internet. Call it 'Wii Cloud', if you will.
A question of mine: what is the little 3rd chip? A guy on a forum said it's "a helping GPU" that the original Wii also had? Hmm, I think he recalled a name like "Starlet"? Is that accurate, or is it something else?
It's flash memory. The Wii had it as well, but I believe Nintendo used an EEPROM instead of flash back then. No idea what's stored in there. Maybe license keys for eShop games or something.
Yeah, Anand from AnandTech said the Intel HD 2500 graphics were about half as powerful as the 360's, and the HD 4000 was more powerful. So with Haswell, even integrated graphics at low enough power for a tablet will likely be more powerful.
Haswell's GT3e, with embedded DRAM on the GPU (sound familiar?), should post even bigger leads.
The package size is the same, but that's no way to tell whether the die has shrunk or not (particularly as there are actually two dies in the package). The Wii Mini has both noticeably lower power consumption and a noticeably smaller cooling system, and a die shrink of just Broadway wouldn't be able to explain that.
Surely I'm not the only one noticing the column of capacitors & a bunch of resistors in the top right missing for Wii Mini.
How did Anand come to the conclusion that the HD 2500 is about half as powerful as the 360? Isn't that GPU about 53 GFLOPS when clocked at its maximum 1100MHz (for the mobile version)? With the HD 4000 being around 140 GFLOPS. Or am I missing something?
> The package size is the same, but that's no way to tell whether the die has shrunk or not (particularly as there are actually two dies in the package).
Pardon me for being stubborn, I know it's an MCM, but without seeing a size reduction I'm not convinced about a core shrink; on the surface it looks the same.
GAF + Chipworks = best pair. Can't wait till those articles are made.
HD 4000 = 2 ROPs, 25.6 GB/s memory bandwidth, 16 EUs at 1.15GHz, ~256 GFLOPS single precision IIRC? That would put it just above the 360 (ignoring the higher efficiency of the newer shaders completely).
Maybe you're thinking of the old HD 3000?
Think about this: even the GPU in the new iPad is hitting 77 GFLOPS; 53 for the 2500 sounds way too low. Half of 256 I would believe, if it scales perfectly with the EU count.
http://www.anandtech.com/show/6472/ipad-4-late-2012-review/4
I was going with:
EU * 4 [dual-issue x 2 SP] * 2 [multiply + accumulate] * clock speed
Which I'm pretty sure was correct for Gen 6 Intel graphics, Gen 7 must be different?
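That back-of-envelope formula is easy to sanity-check in code. A minimal sketch, assuming the Gen 6 rule quoted above and the commonly cited EU counts and clocks (6 EUs for the HD 2500, 16 for the HD 4000; those figures are assumptions, not from this thread):

```python
# Theoretical single-precision GFLOPS using the Gen 6 Intel graphics rule
# quoted above: EUs * 4 (dual-issue x 2 SP) * 2 (multiply + accumulate) * clock.
def gen6_gflops(eu_count: int, clock_ghz: float) -> float:
    return eu_count * 4 * 2 * clock_ghz

# Commonly cited configurations (assumed here, not stated in the thread):
print(gen6_gflops(6, 1.1))    # HD 2500, 6 EUs @ 1.1 GHz  -> ~52.8 (the "about 53" figure)
print(gen6_gflops(16, 1.15))  # HD 4000, 16 EUs @ 1.15 GHz -> ~147 (the "around 140" figure)
```

Gen 7 EUs are wider than this rule assumes, which is one way to arrive at the higher ~256 GFLOPS figure floated above; treat the formula as the Gen 6 approximation it is.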
> 1. Do we know how many instructions it executes per cycle?
It's capable of issuing three instructions per clock cycle into six independent execution units. But it's not that linear, never is.
Source: http://raid-faq.narod.ru/doc/750_ts.pdf
• Up to four instructions can be fetched from cache per clock cycle
• As many as six instructions can execute per clock (including two integer instructions)
• Single-clock-cycle execution for most instructions
• Maximum three instruction dispatch per clock cycle
• Processes one branch per cycle and can resolve two speculations
• Completion unit retires as many as two instructions per clock
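Taken together, those widths imply a hard ceiling on sustained throughput: in steady state the narrowest stage (completion, at 2 per clock) is the limit, no matter how wide fetch and dispatch are. A toy model of that, using only the numbers from the list above:

```python
# Per-clock widths from the 750 datasheet bullets above.
FETCH_PER_CLOCK = 4     # instructions fetched from cache per clock
DISPATCH_PER_CLOCK = 3  # maximum dispatch per clock
RETIRE_PER_CLOCK = 2    # completion-unit retirement per clock

def sustained_ipc_ceiling() -> int:
    # In steady state, instructions entering must equal instructions retiring,
    # so the narrowest per-clock width bounds sustained IPC.
    return min(FETCH_PER_CLOCK, DISPATCH_PER_CLOCK, RETIRE_PER_CLOCK)

print(sustained_ipc_ceiling())  # -> 2: retirement is the bottleneck
```

Bursts can go wider than this for a few clocks; the ceiling only applies to sustained code.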
> 2. When it executes them, say 3 per cycle for example, is that 3 on each individual core?
Yes.
> 3. What are the physics capabilities of this CPU? I ask because I remember people saying that the Wii U's CPU wouldn't be able to do physics very well, but I remember a launch Wii game that did them exceptionally well.
It depends on the software you're using and how it is optimized. You basically have two competing approaches: Havok doesn't overly rely on the GPU, although it does support it (more on that in a while), while PhysX by nature does rely on the GPU but can run on the CPU as well.
If we go back to pure "at a glance" performance increase, the CPU should be 6X stronger minimum.
> 4. What would the Wii U CPU "not" be able to do that the 360 CPU could, and vice versa?
GFLOP throughput is different, so it can basically do fewer calculations. Thankfully the GPU can do more of those calculations, and with less penalty than on X360.
> 5. Is it possible to up the clock on the CPU by voltage stepping? In other words, could Nintendo release a firmware update that ran the CPU at a higher clock?
Even if they could, don't hold your breath.
> 6. How much of a problem would the CPU be for AI? I recall Samurai Warriors 3 and Sengoku Basara 3 running extremely well on the Wii but Warriors Orochi Hyper running terribly, though it didn't appear to be doing much more than those 2 games.
Game code was most certainly a wreck, as they're not exactly tech gods whose games push anything (quite the opposite). And most likely, on top of sloppy, it was optimized for the 2-way SMT nature of PS3/X360; that, or optimized for floating-point performance, seeing as I believe PS3 is the lead platform for that one and it lacks general-purpose overhead in no small way. They wouldn't be the first ones stuffing AI into SPEs.
> I wonder just how much of the problems we've seen in Wii U ports are the result of specs and how much are the result of developer effort (or lack thereof).
In regards to current-generation porting, it's mostly a matter of: the more you optimize/write around the quirks of a certain architecture, the harder it is to port. For instance, Need for Speed: Most Wanted is better off being based on the PC version than on the PS3/X360 implementations, seeing as CPU-wise the architecture is more traditional; hence Criterion not going through the pains the Tekken and Warriors Orochi devs went through.
> I know the PS4 and supposedly Durango are x86 based, so porting between the two and PC is more straightforward. Do you think the Wii U is much more difficult?
Shouldn't be; the CPU is very straightforward, and that means it's very industry standard, which means everybody is coding primarily with it in mind.
On top of all that, it's very much what you see is what you get: with good coding, performance is very predictable, with no problems like chronic cache misses or an FPU lacking in precision (no code misbehaving, no need to check for coherency), and with the pipeline being so short, cycles are really fast, and in the event of a stall or pipeline crash it doesn't take much time to clear.
Other than good coding, the only thing really specific in there is paired singles, needed to achieve the best possible floating-point performance. On the other hand, it's missing some modern commonalities like 256-bit-wide AVX (SIMD) and the newfound focus on multithreading.
AMD Jaguar has both cluster-based multithreading and AVX. Multithreading can't be counted on as standard in PC code, because CPUs as recent as the Core 2 Duo (and most AMD CPUs) lack it, and AVX would really help Wii U floating-point performance (as it did for the Core i5/i7 and AMD Jaguar). These are closed platforms though, so specific optimization is viable.
That (floating-point performance) and the number of cores available (3 vs 8) really set these systems apart. Porting up should be easy; porting down depends. Code expecting floating-point performance is gonna have a hard time, and so is code meant for more than 3 cores.
Other than that it's good. There's talk of PS4/X720 reserving two CPU cores, so that brings it down to 3 vs 6. Of course, Wii U being x86 would also have helped; recompiling for another architecture is not an issue, but if somewhere down the road third-party support doesn't get off the ground and middleware PPC support gets deprecated, Wii U will forever be bundled with current-gen platforms and technology, despite the PPC part at hand having more in common with the next-gen x86 solutions than with PS3/X360's in-order execution + SMT. Developers doing middleware recompilations on their own is possible, of course, but most simply won't bother.
All this to say: not a problem if Wii U manages to keep third-party support (and, as the lowest common denominator, gets to be lead platform) throughout the generation. Otherwise the competition will probably top it by at least twice the general-purpose performance, making late/second-thought ports potentially hard to accommodate. Parity between the two other competitors also doesn't help, I guess.
That's documentation for a regular PPC750 part, so paired singles might introduce a difference by doing 2x32-bit SIMD and being able to either dispatch or retire more per clock; I dunno.
How it behaves with multithreaded code, though, is anyone's guess, and by that I mean the intricacies of doing so, because the CPU architecture originally wasn't meant for it. But it's probably pretty straightforward, otherwise we would have heard about it by now.
> I wonder when we are going to get a compiled write-up for Expresso.
Going by the name, it seems to be really fast.
> I just wanted to point out the bolded. Correct me if I'm wrong, but you can't say that Jaguar had 8 multi-threaded cores vs Espresso's 3 single-threaded cores, can you?
Hmm, that wasn't my point, but it can somewhat be claimed, and yet somewhat it shouldn't be.
> Jaguar doesn't have "real" multi-threading, it just pairs two cores together. So wouldn't Jaguar in PS4 really have 4 CBM cores (8 single-threaded physical cores conjoined into pairs of 2) vs Espresso's 3 single-threaded cores?
That seems true (I only don't say it is because I don't know what CBM stands for; cluster something? sorry for that), but other than the intricacy of the architecture being essentially dual-core pairs, paired into blocks, it still provides you with 8 cores at all times, albeit conjoined; and Espresso provides 3.
> I'm in no way implying that the Espresso is suddenly just as capable as the Jaguar 4-core CBM in total output, but not referencing this makes the gap look larger than it is. An 8-core CBM Jaguar would actually have 16 physical cores, right? That's a huge difference.
The Jaguar is still an 8-core solution at all times. 8 CPU blocks would mean 16 CPU logic units, though, albeit paired.
> I guess what I'm saying is that PS4 has either 4 multithreading cores or 8 single-thread cores, so if you're comparing advantages you can't list both at the same time. Then again, my knowledge of AMD chips is solely GAF based, lol.
No, no. 8 threads at all times, 16 with CMT enabled.
> "three instructions per clock cycle into six independent execution units." I do not understand this. 3x6? Does this mean 3 instructions into each of the 6 units, or does this mean that it can execute 6 times per cycle?
No, look at the diagram.
> Also, is this the documentation of what the Nintendo processor versions can do? I thought it had customizations made to it beyond the standard processor.
Yes, that would be the paired singles. I've looked into the PPC 750CL part (which has them) but saw no clear difference in specs.
> Hmm, that wasn't my point, but it can somewhat be claimed and yet somewhat it shouldn't.
Gotcha. I was under the impression that Jaguar in PS4 has four CPU clusters with 2 cores each, with the hyper-threading being per cluster. There are multiple threads per cluster, but only because there are two single-threaded cores in that cluster. Essentially just AMD's way of claiming hyper-threading without actually having it, so to speak.
It shouldn't be claimed because it's not multithreading as we know it. I doubt you can hope for a 30% boost, nor that there's that much headroom for 2-way to work with here and make the CPU architecture that much more efficient.
> It still provides you with 8 cores at all times, albeit conjoined; and Espresso provides 3.
But you're supposed to get double the threads for it, so it's either 8 threads or 16 real threads.
I still reckon CMT is an advantage in an 8-core configuration, but like everything it might not always be advantageous to use; even Intel Hyper-Threading can actually slow things down if you use the second thread to run full-load tasks on it. The second thread is meant for complementary overhead stuff.
This implementation is a mystery in that regard: is the second thread for complementary use, or is it like splitting the CPU load in half? Because with regular SMT you can at best get a 30% boost; if this is only dividing throughput, that means you get double the threads and not necessarily more overhead for it.
At least if the CMT implementation is the same as on the Opterons; still, it would never decrease CPU threads.
> Gotcha, I was under the impression that Jaguar in PS4 has four CPU clusters with 2 cores each, with the hyper-threading being per cluster.
Doesn't work that way.
CBM was my uneducated abbreviation of "cluster based multi threading"
Essentially, they can say they have cluster based multi threading because each cluster does have two threads... but each core only has one.
That was my impression, anyway.
Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/
The Opteron 6276 with CMT disabled has:
• 8 modules
• 8 threads
• 4 ALUs per module
• 2 ALUs per thread (the ALUs can not be shared between threads, so disabling CMT disables half the threads, and as a result also half the ALUs)
• 16 ALUs in total
With CMT enabled, this becomes:
• 8 modules
• 16 threads (double)
• 4 ALUs per module
• 2 ALUs per thread
• 32 ALUs in total (double)
So nothing happens, really. Since CMT doesn’t share the ALUs, it works exactly the same as the usual SMP approach. So you would expect the same scaling, since the execution units are dedicated per thread anyway. Enabling CMT just gives you more threads.
(...)
With single-threading, each thread has more ALUs with SMT than with CMT. With multithreading, each thread has less ALUs (effectively) than CMT.
And that’s why SMT works, and CMT doesn’t: AMD’s previous CPUs also had 3 ALUs per thread. But in order to reduce the size of the modules, AMD chose to use only 2 ALUs per thread now. It is a case of cutting off one’s nose to spite their face: CMT is struggling in single-threaded scenario’s, compared to both the previous-generation Opterons and the Xeons.
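The arithmetic in the quoted lists is easy to reproduce. A quick sketch of the Opteron 6276 figures above (module and ALU counts are the article's, not mine):

```python
# Opteron 6276 figures from the quoted article: 8 modules, 4 ALUs per module,
# statically partitioned 2 per thread (ALUs are not shared between threads).
MODULES = 8
ALUS_PER_THREAD = 2

def totals(cmt_enabled: bool) -> tuple:
    threads = MODULES * (2 if cmt_enabled else 1)
    usable_alus = threads * ALUS_PER_THREAD  # disabling CMT idles half the ALUs
    return threads, usable_alus

print(totals(False))  # -> (8, 16)
print(totals(True))   # -> (16, 32): double the threads, same per-thread width
```

Which is exactly the article's point: enabling CMT changes nothing about per-thread execution width, it only exposes more threads.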
> Doesn't work that way.
Awesome, thanks for the clarification. I had misunderstood. That is interesting.
This is very interesting though:
Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/
I highlighted the end because, like last gen, it seems developers might have to optimize for specific features that change the whole code layout in order to compensate for an architectural downfall on PS4/X720 (in this case, for the enabled-CMT setting/more threads). That's an advantage for unoptimized code on Espresso, but then again, having more CPU threads can be useful. It just apparently won't result in much of a performance boost in general terms, compared to if other design decisions had been taken; but as their code struggles (against previous solutions with more ALUs per thread), they pretty much have to go down that way.
Expresso doesn't need to do multi-threading because it has eDRAM.
> Neither the PS4 or Durango are expected to support SMT/"CMT". Espresso offers no advantage. In each system a single physical core only supports a single thread.
SMT is not CMT. I've committed that error before, thinking they should be similar; for marketing purposes they might be, but in the end not so.
No. PS4 has two 4-core modules for a total of 8 physical cores. It neither supports, nor claims, any kind of simultaneous multithreading/hyper-threading.
Neither the PS4 nor Durango is expected to support SMT/"CMT", so Espresso offers no advantage there. In each system a single physical core only supports a single thread. And in the case of the PS4, each Jaguar core is faster than an Espresso core, and there are more of them running at a higher frequency.
Multi-threading has nothing to do with the eDRAM.
Sorry if this has been asked before, but what's the average and maximum power draw (in wattage) that can be expected of this CPU?
> I've been wondering: just how many instructions did the PS3 and 360 CPUs do per cycle, "baseline" (i.e. not counting other hardware features)?
It's dual-issue. I don't know how many instructions it can possibly hope to retire per cycle, but it's either one or two; I'm leaning towards one. The whole point of the 2-way SMT in there is that when one pipeline/issue stalls, the other gets full priority; so unless they function concurrently (as in 50% overhead per thread), it's most likely one.
> I believe someone said 7 watts earlier in the thread.
Yes, certainly no more than 7W at full load.
Does the recent talk of no UE4 on the Wii U mean that it's not capable of running it, or that Epic considers it too costly to scale it to Espresso? How do the DirectX capabilities of the system translate to UE4?
I honestly think it's laziness; wasn't it a stated goal of UE4 that it would scale from smartphones to high-end computers?
Personally, I think it's stupid on Epic's part. Fine. Don't give it flagship UE4 games, but get your engine out on as many devices as possible.
Laziness is the absolute worst fucking argument/defense/insult in these kind of threads.
It's good business, simple as that. If the Wii-U ends up selling gangbusters you'll see the support.
These guys are not leaning back in their office chairs with their feet up saying "Well, I'm supposed to port the UE4 engine to Wii-U, but I'd rather take a nap."
This would be hilarious if that were true.
> It's dual-issue. I don't know how many instructions it can possibly hope to retire per cycle, but it's either one or two; I'm leaning towards one.
It seems to me your terminology is going astray here. A cycle is a CPU clock cycle, i.e. cycles = clocks. The relationship between clock and pipeline length is normally this: the higher the clock, the less work a pipeline stage can afford to do, ergo the greater the number of pipeline stages; the lower the clock, the more you can afford to do per stage, ergo the fewer stages. The big thing about the G3 in the Wii U is that it is among the shortest-pipeline architectures out there that reach this clock (1.25GHz). A short pipeline is generally better than a long one, at the same clock, at general purpose; i.e. Espresso should perform better at GP than most other similarly clocked CPUs of similar superscalar width. I don't know where you get the impression a 750 would outperform a 970 at GP, though; the latter is much more 'super'-scalar than the former, to put it metaphorically.
PPC750 issues 3, retires 2.
This is not the be-all and end-all, though; if you had a PPC750 clocked at the same speed as a PPC970 (G5), the PPC750 would actually perform better in general purpose, even though the PPC970 can issue 8 and retire 5. The time an instruction takes to complete varies: the 970's pipeline is longer, so instructions take more cycles to get through it, and that's why they opted to make it issue and retire more per cycle, to compensate.
It has to do with the number of pipeline stages (the fewer there are, the faster completion will be, assuming the same MHz), branch-prediction type/effectiveness, and cache misses. It's said cache misses amount to a whopping 5% occurrence on PS3/X360; that means the pipeline clogs and takes a penalty (cycles where it's not available again until it recovers from that pipeline crash; think of it as an accident on a highway). Anyway, everything helps, or rather, in the PS3 and X360's case, nothing is really helping: it doesn't help that their pipelines are ~30 stages long, for instance, so instructions take more time to complete and cache-miss "accidents" take more time to resolve too (and the fact that they happen 5% of the time is huge).
But all this to say: they're different beasts, so it's hard to compare; still, it's easy to see the PPC750 is hugely more efficient per clock.
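The cache-miss point can be made concrete with the standard effective-CPI formula. A sketch, where the 5% rate and ~30-cycle flush penalty are the rough figures from the post above, and the short-pipeline numbers are purely illustrative assumptions:

```python
# Effective cycles per instruction when a fraction of instructions trigger
# a pipeline flush: CPI_eff = CPI_base + miss_rate * penalty_cycles.
def effective_cpi(base_cpi: float, miss_rate: float, penalty_cycles: float) -> float:
    return base_cpi + miss_rate * penalty_cycles

# ~30-stage pipeline with the quoted 5% miss/flush occurrence:
print(effective_cpi(1.0, 0.05, 30))  # about 2.5, i.e. throughput more than halved
# Assumed short-pipeline case (1% misses, ~4-cycle refill), for contrast:
print(effective_cpi(1.0, 0.01, 4))   # about 1.04, barely affected
```

Under these assumptions a long pipeline with frequent flushes loses most of its clock-speed advantage, which is the per-clock-efficiency argument being made above.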
> It seems to me your terminology is going astray here. A cycle is a CPU clock cycle, i.e. cycles = clocks.
By "the clock" I meant cycles per second.
> I don't know where you get the impression a 750 would outperform a 970 at GP, though.
Benchmarks. At the same frequency the PPC970 loses slightly against both G3 and G4 CPUs in GP. Of course, the G5/970 was much more scalable; that was the whole point of it.
> Benchmarks. At the same frequency the PPC970 loses slightly against both G3 and G4 CPUs in GP.
The 970 was more performant per cycle than any of the previous PPC designs. It did not support L3 cache, though, which some of the G4s did (not all G4s had L3 in Apple's machines). That may have affected a benchmark or two.
It's one of the reasons Apple never used them on laptops even if 1.5 GHz 30W parts were available; the G4 @ 1.67 GHz more than matched it.