Brad Grenz
Member
What? Where the hell are you getting those numbers from? 1/8 as fast on SIMD? Citation needed.
15GFlops vs 102GFLOPs. It's actually more like 1/7th. My apologies.
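For what it's worth, the arithmetic behind that correction is easy to check, taking the two quoted peak figures at face value (these are the poster's numbers, not verified spec-sheet values):

```python
# Quoted peak figures from the post above (illustrative, not verified):
# ~15 GFLOPS for Espresso, ~102 GFLOPS for Xenon.
espresso_gflops = 15.0
xenon_gflops = 102.0

# 102 / 15 = 6.8, so "more like 1/7th" rather than 1/8.
ratio = xenon_gflops / espresso_gflops
print(f"Xenon/Espresso: {ratio:.1f}x, i.e. Espresso is ~1/{ratio:.1f} as fast")
```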
I know how FB operates, thank you very much (for the umpteenth time - check the discussion a few pages ago). The FLOPS workloads that the engine can handle (as in viable throughput), are one thing. The engine's requirements - an entirely different thing. The entire argument started from:
DICE guy: We tried FB2 on the WiiU and it did not perform well.
forumite: It must be the CPU!
me: how come their well-scalable-with-CPUs engine did not perform well? If they say that it did not perform well outside of the context of any game, then the engine's requirements alone (read: for doing any meaningful workloads) must be pegged at something akin to 3x PPEs (which was a sarcastic statement on my part, I do not honestly expect that to be the case). And since the only advantage Xenon has over Espresso is (SIMD) FLOPS, that would mean the engine eats a significant amount of FLOPS for breakfast.
I've been aware of what you're claiming since the beginning. Here's a hypothetical to perhaps help you understand what I've been asking you about single-threaded performance: do you expect FB to perform equally well compared to the 'baseline' 3x PPE if the setup contained 10 cores at ~1GHz each? How about 100 cores at 100MHz? If yes - why? If not, then how are you so sure FB would be fine on a bunch of Jaguars? We are still talking SIMD FP here, not even discussing GP code.
There's no reason why the Wii U shouldn't be getting these newfangled engines if developers were willing to take the time to butcher them.
I understand that this is a business, but how do you go from announcing an unprecedented partnership, mentioning EA Sports and the Battlefield series, to not even calling the system next gen and not releasing Wii U versions of your multi-million-selling franchises? EA wants to give the Wii U the Dreamcast treatment.
This thread is still going? So many salty tears from the Nintendo fans. Protip: buy a PS4 or Durango if Frostbite games are such a big deal. Or do you plan to stay Wii U only?
It's almost like they built the whole engine explicitly for dynamic objects, and a heavy emphasis on destructibility and material simulation requires intensive vector calculations! So weird that they'd spend this whole generation creating technology that plays to the strengths of the 2 major platforms.

One'd assume all those things would be subject to scalability, but hey, if you believe they actually pegged their engine at the 3x PPEs ballpark in terms of sheer engine requirements then I have nothing to tell you. We'll just agree to disagree.
Hell, the engine's requirements could be half what Xenon provides, but it would still be well out of reach for Espresso.

Wait a sec, saying Xenon beats Espresso in FLOPS is one thing, which hardly anybody would argue; saying it beats it by a factor of N implies you have actual figures. So feel free to provide sustained FLOPS throughput data.
Well, again, if you look back at the powerpoint it shows that something like 2-300 jobs are executed in any given frame. Seems like they chopped everything up into nice bite-size morsels, so, yeah, there's a good chance it would work on all those configurations, assuming the infrastructure exists to keep all the processors fed.

You misunderstood my question. I'm not interested in whether you believe FB could utilize 100 cores - for the sake of argument we can agree it can.
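As a rough illustration of the job model being described (the worker count and job bodies below are made up for the sketch, not DICE's actual scheduler), spreading a frame's worth of independent jobs over a pool looks like:

```python
# Toy sketch of a "many small jobs per frame" scheduler. Numbers are
# illustrative: ~300 jobs per frame, spread over a worker pool.
from concurrent.futures import ThreadPoolExecutor

def run_frame(jobs, num_workers):
    """Execute one frame's worth of independent jobs on a worker pool."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves job order in the returned results.
        return list(pool.map(lambda job: job(), jobs))

# 300 trivial, dependency-free jobs. With no dependencies, the same job
# set can be spread over 3 fast cores or 10 slower ones; total throughput
# is what the scheduler balances - which is the scalability argument here.
jobs = [(lambda i=i: i * i) for i in range(300)]
results = run_frame(jobs, num_workers=3)
print(len(results))  # 300
```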
But here's an idea, since you're the one making extraordinary claims (like Frostbite won't work on an 8 core Jaguar), why don't you tell us why that's the case? Or better yet, why don't you go back to the drawing board and figure out what point you're actually trying to make, because arguing that a system we know is getting Battlefield 4 can't run Battlefield 4 is the act of a crazy person.

Huh? I'm not saying FB would not work on 8 Jaguars. I'm questioning the (presumed) claim that FB (as in the engine itself) would not work well on Espresso. I've also been the one saying that if Xenon was such a hard base requirement for FB, then the latter could face issues on the Jaguars due to the large single-thread performance discrepancy in FLOPS. Since you've been arguing against that point, I've asked you to provide hard backing for the statement, nothing more, nothing less. To which you pointed me to the generation-old FB paper where DICE first discussed the benefits of data parallelism. Yes, everybody who's been remotely interested in engine tech this gen is well aware of DICE's early breakthrough in data parallelism, and that their engine can sustain excellent FLOPS throughput. Yes, that allowed them to have a good run on the PS3.
Why the fuck are you two arguing about make believe hardware requirements of the Frostbite engine? I am pretty sure if EA big bosses say we want our franchises on this platform they would figure out how.
I highly doubt it's a technical hurdle, but a financial one
It's a technical hurdle that becomes a financial problem. If they could adapt FB to the Wii U easily, money wouldn't have been a problem.

That's not this guy's claim though. He seems to be asserting that the engine would need to be massively stripped down (not just optimized) to work on Wii U. That's why this argument is going on.
Did you somehow entirely miss the context of the discussion? Espresso was being compared to Xenon, supposedly unfavorably, which was used as a justification for FB's absence from the WiiU. What marcan said about Espresso compared to the A9 has zilch to do with the discussion. He did say something about Xenon, though, which you somehow entirely discarded. I wonder why.

He said it beats it clock-per-clock on integer workloads. Which is nice and all, but the clock speed on Xenon is 2.6 times higher.

When a CPU is largely more efficient per clock, clock differences of the magnitude of Xenon/Espresso (which apropos is 2.58, not 3) will not help you much. Why do you think Intel dropped their 4GHz P4s in favor of ~3x slower-clocked designs?

That's a ridiculous statement. If a CPU A is 2 times as efficient per clock as another CPU B, then B will still outperform A if it works at 2.6 times A's clock rate.

Actually, the DICE quote re jobs is '15-200k C++ code each. 25k is common'. These are reasonably large routines. Are you starting to see now why I've been asking you about single-thread performance?

The amount of code lines is not a good indicator for anything, least of all the granularity of a parallel task. A 20 line task can take seconds, and a 100k LoC task can be done in a millisecond.
If it were only on PS4 and Durango, the weak CPU excuse would be accepted, but they are developing for PS3 and 360 too - certainly the userbase there was enough to overcome the hurdles.
We don't know the relative efficiencies, but I'm assuming if it wasn't at least a somewhat significant amount he wouldn't even mention it.

Are you saying anything less than a factor of 2.6 is not significant?
Yes, everybody who's been remotely interested in engine tech this gen is well aware of DICE's early breakthrough in data parallelism, and that their engine can sustain excellent FLOPS throughput. Yes, that allowed them to have a good run on the PS3.

So why are you so vehemently opposed to the further conclusion of this line of thought: "Yes, that's what makes their engine hard to port to such a CPU FLOPS-deficient architecture as the Wii U"?
The amount of code lines is not a good indicator for anything, least of all the granularity of a parallel task. A 20 line task can take seconds, and a 100k LoC task can be done in a millisecond.

Yes, a 2 line asm function can take eons. And yet, the size of the code is normally indicative of the granularity. Which we already got a quote for - 300 jobs per frame is very low granularity - even if they all occurred sequentially (which would void the whole idea of the engine, but just for the sake of argument), at 30fps that's north of 100us per task - if that's not long I don't know what is.
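The granularity arithmetic in that post checks out:

```python
# Back-of-envelope check of the claim above: 300 jobs per frame at
# 30 fps, treated as if they ran strictly sequentially.
frame_time_us = 1_000_000 / 30   # ~33,333 us per frame at 30 fps
jobs_per_frame = 300

avg_task_us = frame_time_us / jobs_per_frame
print(f"{avg_task_us:.0f} us per task")  # ~111 us, i.e. "north of 100us"
```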
So why are you so vehemently opposed to the further conclusion of this line of thought: "Yes, that's what makes their engine hard to port to such a CPU FLOPS-deficient architecture as the Wii U"?

Because FB is not a Xenon/Cell-exclusive engine. It also happens to run on dual-core x86 setups from years ago.
Marcan said a universal truth - the PPE's IPC is abysmal. I said 'if the IPC of the lower-clocked unit is largely higher, a 2.6x clock will not cut it'. You come and pick an arbitrary value for 'largely' and claim the statement is ridiculous. Pardon my French, but WTH?

You assigned an arbitrary value first (one higher than 2.6), because it's the only way your statement works. I merely illustrated that if you pick a different value it works out differently.
Yes, a 2 line asm function can take eons. And yet, the size of the code is normally indicative of the granularity. Which we already got a quote for - 300 jobs per frame is very low granularity - even if they all occurred sequentially (which would void the whole idea of the engine, but just for the sake of argument), at 30fps that's north of 100us per task - if that's not long I don't know what is.

I don't understand your whole argument about throughput and latency, or how it relates to the discussion at hand. As far as I can tell, you wanted to make a point that, if FB3 were designed for Xenon-level CPU FP performance, it would be hard to port to PS4/720. And your idea is that per-core performance could be insufficient (because it can't be overall performance, as that is equivalent in theory and clearly superior in practice on the AMD platform). However, if 3 cores are sufficient to process 300 jobs per frame, then -- unless there is a ridiculous load imbalance -- the workload per core, even on the core with the largest-grain tasks, should consist of at least 40 parallel tasks. Distributing those 40 tasks further on the 2 or a bit more cores required to match the single Xenon core would not appear to be an issue.
Because FB is not a Xenon/Cell-exclusive engine. It also happens to run on dual-core x86 setups from years ago.

BF3, as an example of one FB game they might be interested in porting to Wii U, does not run well at all on dual core chips -- even if said chips are clocked at 2.7 times the frequency of the Wii U CPU (and offer higher per-clock SIMD performance): http://www.pcgameshardware.de/Battl...ld-3-Multiplayer-Tipps-CPU-Benchmark-1039293/
You assigned an arbitrary value first (one higher than 2.6), because it's the only way your statement works. I merely illustrated that if you pick a different value it works out differently.

Yes, I did. You, on the other hand, assumed that that arbitrary value was so small that the statement would not hold. Can't you see the disconnect?
I don't understand your whole argument about throughput and latency, or how it relates to the discussion at hand. As far as I can tell, you wanted to make a point that, if FB3 were designed for Xenon-level CPU FP performance, it would be hard to port to PS4/720. And your idea is that per-core performance could be insufficient (because it can't be overall performance, as that is equivalent in theory and clearly superior in practice on the AMD platform). However, if 3 cores are sufficient to process 300 jobs per frame, then -- unless there is a ridiculous load imbalance -- the workload per core, even on the core with the largest-grain tasks, should consist of at least 40 parallel tasks. Distributing those 40 tasks further on the 2 or a bit more cores required to match the single Xenon core would not appear to be an issue.

First of all, those jobs are not all parallel - there's a job dependency graph in there. From there on, you cannot discard the possibility that there could be dependency paths of jobs where the overall path length becomes critical for single-core performance. The reason I brought in task granularity was that if you tried to further split such hot-path jobs into even smaller parallel ones (assuming you can), you'd at some stage hit the latency/throughput saturation.
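The dependency-path point can be sketched with a tiny made-up example (the job names, costs and edges below are invented for illustration): with a dependency chain in the graph, extra cores cannot push the frame time below the critical path, so per-core speed still matters.

```python
# Illustrative job DAG. Costs are in microseconds; all values invented.
def critical_path_us(costs, deps):
    """Length of the longest dependency chain through the job DAG."""
    memo = {}
    def finish(job):
        # Earliest finish time of a job = own cost + latest prerequisite.
        if job not in memo:
            memo[job] = costs[job] + max(
                (finish(d) for d in deps.get(job, [])), default=0)
        return memo[job]
    return max(finish(j) for j in costs)

costs = {"anim": 200, "physics": 400, "cull": 150, "submit": 100, "audio": 300}
deps = {"physics": ["anim"], "cull": ["physics"], "submit": ["cull", "audio"]}

# Total work is 1150 us, but the anim -> physics -> cull -> submit chain
# means no number of extra cores gets the frame below 850 us here.
print(critical_path_us(costs, deps))  # 850
```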
On the other hand, running tasks that come even close to utilizing Xenon's theoretical FP performance on the Wii U's CPU is impossible.

Of course. I never questioned that.
BF3, as an example of one FB game they might be interested in porting to Wii U, does not run well at all on dual core chips -- even if said chips are clocked at 2.7 times the frequency of the Wii U CPU (and offer higher per-clock SIMD performance): http://www.pcgameshardware.de/Battl...ld-3-Multiplayer-Tipps-CPU-Benchmark-1039293/

Well, if the topic of this discussion was 'DICE guy says they tested Espresso for BF3 prospects and decided the platform was not a good fit' I would not have joined the discussion in the first place. The premise of this discussion formed quite differently, though - 'FB2 does not perform well on Espresso'.
Well, if the topic of this discussion was 'DICE guy says they tested Espresso for BF3 prospects and decided the platform was not a good fit' I would not have joined the discussion in the first place. The premise of this discussion formed quite differently, though - 'FB2 does not perform well on Espresso'.

I think that distinction is meaningless. If games using FB2 (for which BF3 is the most recent and relevant example) fail to perform well on Espresso, then for all intents and purposes FB2 fails to perform well on Espresso. Or would you expect them to port the engine but not the games using it?
Erm, the timing snapshot is for a particular game (BC2@ps3). What the engine requires in terms of vector capabilities is not something you could conclude from that snapshot.
me: how come their well-scalable-with-CPUs engine did not perform well? If they say that it did not perform well outside of the context of any game, then the engine's requirements alone (read: for doing any meaningful workloads) must be pegged at something akin to 3x PPEs (which was a sarcastic statement on my part, I do not honestly expect that to be the case). And since the only advantage Xenon has over Espresso is (SIMD) FLOPS, that would mean the engine eats a significant amount of FLOPS for breakfast. Like how the software occlusion culling should eat FLOPS on the PS3, but that does not have to be the case on other platforms where the trisetup is on average twice as fast as on the RSX.
Are you familiar with the concept of throughput versus latency? The smaller the packet, the larger the packet latency overhead per the packet payload. '300 jobs per frame' is ultra-coarse work granularity - games can do thousands of traditional fine-granularity computational routine calls per frame. So, a coarse granularity packet suggests high performance of the individual core. Actually, the DICE quote re jobs is '15-200k C++ code each. 25k is common'. These are reasonably large routines. Are you starting to see now why I've been asking you about single-thread performance?
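The throughput-versus-latency point can be put in numbers with a toy model (the 5us per-job overhead figure below is an arbitrary assumption, not a measured value):

```python
# Toy model: each dispatched job carries a fixed scheduling overhead,
# so the fraction of useful work depends on payload size.
def efficiency(payload_us, overhead_us=5.0):
    """Fraction of one job's wall time spent on useful work."""
    return payload_us / (payload_us + overhead_us)

# Coarse jobs (~100 us, as in '300 jobs per frame') amortize the
# overhead; ultra-fine jobs (~1 us) drown in it.
print(f"coarse: {efficiency(100):.2f}")  # ~0.95
print(f"fine:   {efficiency(1):.2f}")    # ~0.17
```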
Let's be honest, people. We all know why EA doesn't want to waste money getting their go-to engine for next gen running on Wii U: the return on investment just isn't there.
EA is looking to cut costs and make money right now, and wasting money on the Wii U is pointless when the user base is tiny and it's pretty obvious that Nintendo didn't catch lightning in a bottle again.
PS360 can still get ports because the ROI is still there. The user base is massive and they already had previous versions of their engines up and running for years. With the Wii U, it's different. They'd have to spend time and money to get their engine running on it. They tried it and didn't like what they saw. I believe they saw that they would need to spend a decent amount of time to fine tune the engine to run on the system and for what? What game will they release on the Wii U that would give them a decent ROI? Madden? Fifa? Battlefield? Are we honestly going to sit here and realistically say that the people buying a Wii U are doing it for these titles? C'mon man.
If the ROI isn't there, your company shouldn't be there. That's the name of the game and EA is making the right decision. I believe they can get the engine running on it, but it's not worth it plain and simple.
It's almost as if you didn't read anything I posted.

Not at all. I'm just still waiting for your response on the subject of the PPE's non-SIMD prowess ; ) (You put me on ignore, didn't you?)
Let me ask you this: so is your argument that EA and DICE should ship Frostbite on WiiU?

No, my argument is that EA can ship whatever they like on whatever platform they fancy. Nor am I questioning their motives. I'm just questioning the face-value interpretation of the DICE employee statement.
Honest question. Do you really think that you will see games distributing jobs in ultra fine grained payloads?

Why, thank you for supporting my side of the argument : ) Even though you somehow got the impression I was arguing the opposite ; )
You do realize that any job manager has a certain amount of overhead, right? Even on a platform that doesn't have to DMA local store in and out of its coprocessors. The more often you switch jobs, the more often you incur that overhead, so it makes sense to chunk the work out into reasonably sized payloads to try to maximize throughput.
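That amortization argument can be sketched numerically (fixed total work, purely illustrative overhead and work figures):

```python
# Sketch of the chunking trade-off described above: splitting a fixed
# amount of work into more jobs adds one scheduling overhead per job.
def frame_cost_us(total_work_us, num_jobs, overhead_us=5.0):
    """Total time to dispatch and run the work as num_jobs jobs."""
    return total_work_us + num_jobs * overhead_us

work = 30_000  # 30 ms of useful work per frame (made-up figure)
for n in (300, 3_000, 30_000):
    # More, smaller jobs -> the same work costs more wall time overall.
    print(n, frame_cost_us(work, n))
```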
Both Bobcat and Bulldozer are pseudo dual core chips, but the implementation is indeed different.

Wtf, no! Bobcat is in no way a dual-core, pseudo-dual-core or anything else beyond single-core. It doesn't even have SMT.
Wtf, no! Bobcat is in no way a dual-core, pseudo-dual-core or anything else beyond single-core. It doesn't even have SMT.

Yeah, I noticed a few hours ago. Doesn't matter though because it doesn't change the CoreMark results - which is why I didn't edit my post.
Is this how you "win" arguments on the internet these days? Sheesh.
I'll play along, though. Which part did I not respond to?

The part about Xenon's computational superiority over Espresso, SIMD notwithstanding. Just give me a bone here, anything.
No, I had no idea what the hell you are talking about as it seemed pretty much irrelevant to the discussion, so I was trying to understand why you were bringing it up. So, what was your point again?

My point was that one cannot say, 'Given that a pool of cores yields a certain amount of FLOPS (x * y = z; x: num_cores, y: single_core_flops), another bigger pool of slower cores yielding the same total amount of FLOPS (a * b = z; a > x, b < y) can be deemed performance-equivalent to the original pool'.
Wtf, no! Bobcat is in no way a dual-core, pseudo-dual-core or anything else beyond single-core. It doesn't even have SMT.

I'm not entirely sure what you're saying, but if it's what it seems to be, here's what a bobcat looks like:
$ grep -A 4 processor /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 20
model : 2
model name : AMD C-60 APU with Radeon(tm) HD Graphics
--
processor : 1
vendor_id : AuthenticAMD
cpu family : 20
model : 2
model name : AMD C-60 APU with Radeon(tm) HD Graphics
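Counting the `processor` entries in that listing mechanically (the string below is just the listing above embedded verbatim; on a real system you would read /proc/cpuinfo directly):

```python
# Reproduction of the /proc/cpuinfo excerpt from the post above,
# trimmed to the fields that matter for counting logical processors.
cpuinfo = """\
processor : 0
model name : AMD C-60 APU with Radeon(tm) HD Graphics
processor : 1
model name : AMD C-60 APU with Radeon(tm) HD Graphics
"""

# One 'processor' stanza per logical CPU.
cores = [line for line in cpuinfo.splitlines() if line.startswith("processor")]
print(len(cores))  # 2 - a dual-core part built from two single-thread Bobcat cores
```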
I'm not entirely sure what you're saying, but if it's what it seems to be, here's what a bobcat looks like:

I assumed Bobcat was some sort of "CMT lite" design - which it isn't.
I assumed Bobcat was some sort of "CMT lite" design - which it isn't.

No, it's literally two Brazos cores + a GPU on a die.
The part about Xenon's computational superiority over Espresso, SIMD notwithstanding. Just give me a bone here, anything.
Was it not clear enough from my previous post? It's sheer brute force.
Let's say processing power = clock speed * average work done per clock.
You'll notice that I never disagreed that the Xenon is much less efficient (less work done per clock) compared to the Espresso. But at the same time, it's clocked nearly 3x higher. So basically, in the end, it seems that this is the case:
(3.2GHz * lower Xenon work per clock) > (1.25GHz * better Espresso work per clock)
Simple as that.
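That model is easy to play with (the IPC figures below are placeholders, not measurements; what matters is only how the per-clock ratio compares against the 3.2/1.25 ≈ 2.56 clock gap):

```python
# The poster's model: processing power = clock speed * work per clock.
xenon_clock_ghz = 3.2
espresso_clock_ghz = 1.25  # clock ratio = 2.56

def outperforms(clock_a, ipc_a, clock_b, ipc_b):
    """True if A's clock * work-per-clock exceeds B's."""
    return clock_a * ipc_a > clock_b * ipc_b

# If Espresso's per-clock advantage is smaller than the 2.56x clock gap,
# Xenon wins; if it is larger, Espresso wins - which is the whole dispute.
print(outperforms(xenon_clock_ghz, 1.0, espresso_clock_ghz, 2.0))  # True:  2x IPC is not enough
print(outperforms(xenon_clock_ghz, 1.0, espresso_clock_ghz, 3.0))  # False: 3x IPC flips it
```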
I'm not entirely sure what you're saying, but if it's what it seems to be, here's what a bobcat looks like:

I wasn't making myself very clear. Yes, there are dual-core configurations. But a Bobcat core has one instruction pointer, and runs one hardware thread. Completely unlike Bulldozer, where two hardware threads are built into the core architecture (1 "module" runs two threads, and there is no such thing as a single-core/single-thread Bulldozer), and also unlike SMT/"HyperThreading" architectures where IP and register state are duplicated, but all execution hardware is shared between two (or more) threads.
I wasn't making myself very clear. Yes, there are dual-core configurations. But a Bobcat core has one instruction pointer, and runs one hardware thread. Completely unlike Bulldozer, where two hardware threads are built into the core architecture (1 "module" runs two threads, and there is no such thing as a single-core/single-thread Bulldozer), and also unlike SMT/"HyperThreading" architectures where IP and register state are duplicated, but all execution hardware is shared between two (or more) threads.

Ok, I suspected that was a misunderstanding. Yes, Bobcat is as vanilla an x86_64 core as they come.
I think the "dual decoder" thing can easily be misinterpreted as meaning two threads can run simultaneously. But it's really just an increase of hardware resources that all go toward the execution of a single thread per core.
Seriously?

I saw that post. It does not really answer the question though; it just explains how it could be faster. I'm expecting a 'Xenon is faster because of such and such characteristics of its pipeline, under such and such conditions'.
I do not necessarily have a bone to pick with that, but honestly in my experience most of the jobs that are going to emphasize floating point performance tend towards tasks that have few dependencies against each other (so it is relatively simple to throw more cores at the problem). I think most of the jobs in Frostbite are engineered this way - you can see for yourself by the similarly colored blocks on the timing view screenshot.

I thought I was bringing up an obvious point to Brad Grenz re how 8 Jaguars are not necessarily immune to not being able to perform on par with 3 PPEs in some intense SIMD scenarios. Little did I know it would get out of hand.
I'm still confused by the premise; it seemed like you are the one who is suddenly talking about PS4/Durango, so I'm not sure why you are discussing this in a thread about the WiiU. Do you really think they will port Frostbite from the current generation consoles to the next gen consoles instead of starting from the DX11 PC version with compute shaders?
Are you aware that there are feature set differences between Frostbite on current gen consoles and PC?

Yes.
I have an honest question for you. Do you work on games?

Used to do game engine R&D and maintenance for a living. Not anymore.
Was it not clear enough from my previous post? It's sheer brute force.

The problem is that we don't know the multiplier for Espresso, but if we use CoreMark, and compare a PPE to the 1998 entry-level PowerPC 405 (only 32kB L1, no L2, only one integer unit, no dynamic branch prediction), we get this:
(Before I get flamed by a bunch of Nintendo fans - this doesn't mean the WiiU will never get games worth playing or anything. Lots of games have PS2-level complexity gameplay code and will run great on the WiiU - that's not a diss on those games at all! No snarkiness here - all you WiiU fans should probably be rooting for stuff to get ported to Vita because if it can make it there intact, it should be able to make it to the WiiU intact. As an example, look at all the fighting games on Vita, IMHO in general these types of games have low simulation complexity so they should be great on WiiU!)
And apparently armchair gaffers know the Frostbite engine better than the technical director himself.

I don't think anybody claimed that. Nice straw man.
The WiiU struggles with heavy action during single player CoD.
That alone would worry me with BF games.
Why is the Wii U looking out into the rain? It's stuck inside the house?
edit: Wait, I guess the pad would be reversed if that were the case. But then why is the controller semi-transparent/turning invisible?
The PPE benchmark apparently didn't take SMT into account, so the score should be somewhat higher - probably around 20-30%. Still, considering what the 405 is, it already comes surprisingly close at Espresso clock speeds. Of course, if we look at SIMD performance, there's no contest - Xenon and CELL mop the floor with Espresso.
I hope you're joking. Your paragraph that allegedly will prevent a bunch of Nintendo fans from flaming you is the part of your post that will likely incite the most flaming from Nintendo fans.
Vita ports? PS2 level complexity? I hope you're joking/trolling. Because if you're serious you have no business even taking part in this discussion. I don't think even Brad Grenz would claim Wii U needs things dumbed down to PS2 levels to run properly.
Perhaps I'm mistaken, but don't Cell and Xenon PPE cores mop the floor with a Jaguar as well, in terms of peak FLOPS?

No, an 8 core Jaguar at 1.8GHz is pretty much on par with Xenon in that metric.
This thread seems to have a lot of baseless claims though.
Repi said test results were not promising, and unless someone here has extensive knowledge of the FB engine, he can't be proven wrong.

Yeah, and why would he be wrong? Everything (their statements, presentations, and the performance of FB games on PS3 and various PC CPUs) points to FB games making great use of parallel CPU SIMD FLOPS. Which the Wii U lacks in. Greatly.