
EuroGamer: More details on the BALANCE of XB1

artist

Banned
May 7, 2006
Thanks to Sporran.

On the ESRAM:
Leadbetter said:
"If you're only doing a read you're capped at 109GB/s, if you're only doing a write you're capped at 109GB/s," he says. "To get over that you need to have a mix of the reads and the writes but when you are going to look at the things that are typically in the ESRAM, such as your render targets and your depth buffers, intrinsically they have a lot of read-modified writes going on in the blends and the depth buffer updates. Those are the natural things to stick in the ESRAM and the natural things to take advantage of the concurrent read/writes."
On ESRAM bandwidth number:
The same discussion with ESRAM as well - the 204GB/s number that was presented at Hot Chips is taking known limitations of the logic around the ESRAM into account. You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... one out of every eight cycles is a bubble so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM. And then if you say what can you achieve out of an application - we've measured about 140-150GB/s for ESRAM
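For what it's worth, the bubble explanation does reproduce the headline number. A quick sketch of the arithmetic (the 128-byte-per-direction bus width is an assumption based on the 1024-bit figure discussed later in the thread; the 109GB/s cap, the one-bubble-in-eight write behaviour and the ~204GB/s peak are from the interview):

```python
# Reconstructing the 204GB/s ESRAM peak from the quoted figures.
# Assumption: a 1024-bit (128-byte) path per direction at the 853MHz GPU clock.
CLOCK_HZ = 853e6
BUS_BYTES = 128

one_way = CLOCK_HZ * BUS_BYTES / 1e9   # ~109.2 GB/s read-only or write-only cap
write_duty = 7 / 8                     # one bubble (dead cycle) every eight write cycles
peak_combined = one_way + one_way * write_duty

print(f"one-way cap:   {one_way:.1f} GB/s")        # ~109.2
print(f"combined peak: {peak_combined:.1f} GB/s")  # ~204.7, quoted as 204
```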
On balancing the GPU:
Leadbetter said:
Every one of the Xbox One dev kits actually has 14 CUs on the silicon. Two of those CUs are reserved for redundancy in manufacturing, but we could go and do the experiment - if we were actually at 14 CUs what kind of performance benefit would we get versus 12? And if we raised the GPU clock what sort of performance advantage would we get? And we actually saw on the launch titles - we looked at a lot of titles in a lot of depth - we found that going to 14 CUs wasn't as effective as the 6.6 per cent clock upgrade that we did."
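The raw arithmetic behind that trade-off, as a rough sketch. The CU counts and clocks are from the article; the 64-lane, 2-ops-per-cycle FLOPS convention is the standard GCN figure, not something the interview states:

```python
# Comparing the two options Microsoft describes: enable 2 extra CUs at the
# original 800MHz clock, or keep 12 CUs and raise the clock to 853MHz.
cu_gain = 14 / 12 - 1        # ~16.7% more ALU from the extra CUs
clock_gain = 853 / 800 - 1   # ~6.6% from the upclock

# Peak single-precision FLOPS (64 lanes x 2 ops per FMA, per CU per cycle):
tflops_12_fast = 12 * 64 * 2 * 853e6 / 1e12   # ~1.31 TFLOPS (shipped config)
tflops_14_slow = 14 * 64 * 2 * 800e6 / 1e12   # ~1.43 TFLOPS (rejected config)
```

On paper the rejected 14-CU configuration is the faster one, which is exactly why the measured result (the upclock winning) points at a bottleneck somewhere other than the CUs.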
On GPGPU:
Leadbetter said:
Microsoft's approach to asynchronous GPU compute is somewhat different to Sony's - something we'll track back on at a later date.
http://www.eurogamer.net/articles/digitalfoundry-vs-the-xbox-one-architects

If you guys find something more interesting, throw it up and I'll add to the OP, obviously keeping the word count of quoted text to a minimum.
 

IN&OUT

Banned
Jun 7, 2013
MS engineers referenced the VGLeaks articles about a 14+4 CU split on PS4, which was later debunked by Cerny in an interview.

Goosen also believes that leaked Sony documents on VGLeaks bear out Microsoft's argument:

"Sony was actually agreeing with us. They said that their system was balanced for 14 CUs. They used that term: balance. Balance is so important in terms of your actual efficient design.

hilarious. lol

I will try to find the Cerny interview where he was asked about this.
 

benny_a

extra source of jiggaflops
Apr 25, 2009
Good thing they sort of explained the eSRAM bandwidth math that people have been questioning ever since Leadbetter did the other article where he just repeated it uncritically.

I would have liked to know exactly how they got to the number that was repeated over and over again, because they are talking about real world scenarios where they get 140-150GB/s and not the (then) 192GB/s and more recently the 204GB/s number.

Edit: I missed the explanation in the side-bar. The bubble!
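Taking the quoted numbers at face value, the achieved-vs-peak ratio works out to roughly 70%. A back-of-envelope check, not a figure from the article:

```python
# Measured ESRAM throughput vs. the theoretical peak, per the interview.
measured = (140 + 150) / 2   # GB/s, midpoint of the quoted 140-150 range
peak = 204                   # GB/s, the stated raw peak
efficiency = measured / peak
print(f"~{efficiency:.0%} of peak achieved in real applications")  # ~71%
```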
 

Nozem

Member
May 24, 2013
The article is a good read, and quite insightful about the choices Microsoft made. But it doesn't address how they are going to close the performance gap, for obvious reasons.
 

KMS

Member
Jun 22, 2013
GPU compute is probably the equivalent of last gen's programmable shaders, and once people start finding clever ways to use it the console war is over. Which is a good thing, as 7+ years is way too long for me to go on the same hardware.
 

artist

Banned
May 7, 2006
The article is a good read, and quite insightful about the choices Microsoft made. But it doesn't address how they are going to close the performance gap, for obvious reasons.
The conclusion to the article doesn't make sense; it looks like some sort of bridge or a (really critical) wall of text was skipped.
 
The article is a good read, and quite insightful about the choices Microsoft made. But it doesn't address how they are going to close the performance gap, for obvious reasons.
The performance gap is already set in stone. I don't think there is any way they can just close it. The article does indicate that they want to show it in their games though.

"Firstly though, we don't have any games out. You can't see the games. When you see the games you'll be saying, 'what is the performance difference between them'. The games are the benchmarks. We've had the opportunity with the Xbox One to go and check a lot of our balance. The balance is really key to making good performance on a games console. You don't want one of your bottlenecks being the main bottleneck that slows you down."
anyways.. dat BALANCE
 

benny_a

extra source of jiggaflops
Apr 25, 2009
This confirms that every Xbox One actually has 14 CUs, because as we know per Albert Penello there is no hardware difference between a consumer Xbox One and a developer Xbox One.

In the future we will see Microsoft's own version of "unlocking the last SPU" debate with this. ;-)
 

2San

Member
Dec 11, 2009
Ignoring the RAM, the XB1 does seem like a more balanced system; however, the power gap is pretty big and the PS4 is also cheaper. Is the XB1 selling at a profit or something?
 

artist

Banned
May 7, 2006
Ignoring the RAM, the XB1 does seem like a more balanced system; however, the power gap is pretty big and the PS4 is also cheaper. Is the XB1 selling at a profit or something?
I think they were pretty close to making a profit off the bat or something along those lines. Definitely unlike the ~$60 loss per PS4 at launch.
 

astraycat

Member
Apr 3, 2013
Re-post from the other thread, since it's really much more appropriate here:

Now that I've actually finished reading it, there's a lot of interesting tidbits in there. It still leaves me with a bunch of questions though.

ESRAM is fully integrated into our page tables and so you can kind of mix and match the ESRAM and the DDR memory as you go
This confirms that the ESRAM is absolutely not a cache but a developer-controlled scratchpad, which renders the Intel 32MiB EDRAM cache size arguments irrelevant.

"If you're only doing a read you're capped at 109GB/s, if you're only doing a write you're capped at 109GB/s," he says. "To get over that you need to have a mix of the reads and the writes but when you are going to look at the things that are typically in the ESRAM, such as your render targets and your depth buffers, intrinsically they have a lot of read-modified writes going on in the blends and the depth buffer updates. Those are the natural things to stick in the ESRAM and the natural things to take advantage of the concurrent read/writes."
This still doesn't really explain it to me. Unless there's some special-purpose hardware that can do a read-modify-write on the ESRAM side of the 1024-bit bus, there's just no way to send more than 1024 bits per cycle, which tops out at 109GB/s at 853MHz.

"Everybody knows from the internet that going to 14 CUs should have given us almost 17 per cent more performance," he says, "but in terms of actual measured games - what actually, ultimately counts - is that it was a better engineering decision to raise the clock. There are various bottlenecks you have in the pipeline that can cause you not to get the performance you want if your design is out of balance."
This is really telling. This is an admission that the CUs can (and do) starve due to deficiencies elsewhere in the GPU pipeline. It's too bad that the bottlenecks aren't actually elaborated.

"But we also increase the performance in areas surrounding bottlenecks like the drawcalls flowing through the pipeline, the performance of reading GPRs out of the GPR pool, etc. GPUs are giantly complex. There's gazillions of areas in the pipeline that can be your bottleneck in addition to just ALU and fetch performance."
The GPR comment is pretty weird. GPR allocation for shaders is done during compilation of the shader, and the only real rule is to minimize them without having to resort to spilling to some sort of scratch buffer. What he could mean, instead of allocation, is just filling them in the first place (loading uniforms and the like). That's a sort of difficult problem, but one I would hope they have a handle on.

You can use the Move Engines to move these things asynchronously in concert with the GPU so the GPU isn't spending any time on the move. You've got the DMA engine doing it.
"From a power/efficiency standpoint as well, fixed functions are more power-friendly on fixed function units," adds Nick Baker. "We put data compression on there as well, so we have LZ compression/decompression and also motion JPEG decode which helps with Kinect. So there's a lot more to the Data Move Engines than moving from one block of memory to another."
The GCN DMA engines are part of what they're calling a Data Move Engine? Is the Data Move Engine really just a bunch of other bits of fixed function that they've grouped together even though they're actually separate?

Microsoft's approach to asynchronous GPU compute is somewhat different to Sony's - something we'll track back on at a later date. But essentially, rather than concentrate extensively on raw compute power, their philosophy is that both CPU and GPU need lower latency access to the same memory. Goosen points to the Exemplar skeletal tracking system on Kinect on Xbox 360 as an example for why they took that direction.

"Exemplar ironically doesn't need much ALU. It's much more about the latency you have in terms of memory fetch, so this is kind of a natural evolution for us," he says. "It's like, OK, it's the memory system which is more important for some particular GPGPU workloads."
Here comes latency. Are they talking about GDDR5 vs. DDR3 latency, ESRAM vs. main memory, or something else entirely?
 

Chobel

Member
Mar 26, 2013
Ignoring the RAM, the XB1 does seem like a more balanced system; however, the power gap is pretty big and the PS4 is also cheaper. Is the XB1 selling at a profit or something?
What does balanced system even mean?
 

gofreak

GAF's Bob Woodward
Jun 8, 2004
Every one of the Xbox One dev kits actually has 14 CUs on the silicon. Two of those CUs are reserved for redundancy in manufacturing, but we could go and do the experiment - if we were actually at 14 CUs what kind of performance benefit would we get versus 12? And if we raised the GPU clock what sort of performance advantage would we get? And we actually saw on the launch titles - we looked at a lot of titles in a lot of depth - we found that going to 14 CUs wasn't as effective as the 6.6 per cent clock upgrade that we did."

Basically confirms how ROPs limited a lot of games are on X1.

If it had a decent amount of ROPs more CUs would have been better than more clock.

This is basically saying 'the upclock was a better choice because one part of the pipeline in our GPU is rather weak and forming a bottleneck - a part we're conveniently not discussing at all in this article'
 

benny_a

extra source of jiggaflops
Apr 25, 2009
Why he said it was wrong can be explained by the fact that he isn't down in the nitty-gritty details.

But why his office told him the 204GB/s figure was wrong, when this article reveals it's their maximum bandwidth in a best-case scenario, is strange.

Anyway, this new information is better and explains how they get there which is cool.
 

2San

Member
Dec 11, 2009
I have no idea what MS's strategy is.
I think they were pretty close to making a profit off the bat or something along those lines. Definitely unlike the ~$60 loss per PS4 at launch.
Sony has the right mentality. They already have the pre-release mind share, but still have aggressive pricing. It's going to be another long generation.

I disagree. PS4 was built for max GPGPU performance. The GPU is powerful on purpose.
You wouldn't know it from reading discussions on NeoGAF. GPGPU can only take over certain types of CPU calculations, and depending on the type of game the weaker CPU can get in the way. Don't get me wrong, the PS4 will be better for all games by a significant margin, especially if you factor in the RAM that really suits gaming purposes.

What does balanced system even mean?
That you don't have overpowered parts that get bottlenecked by other parts. It's pointless, for example, to have an amazingly good CPU if your GPU isn't up to snuff.
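A toy model of that bottleneck argument (all numbers made up for illustration): frame time is set by the slowest stage, so speeding up a stage that isn't the bottleneck buys nothing, while a clock bump lifts every stage including the bottleneck.

```python
# Pipeline throughput is limited by the slowest stage.
def frame_ms(stage_ms):
    return max(stage_ms.values())

base = {"alu": 10.0, "rop": 14.0, "bandwidth": 12.0}    # hypothetical stage times (ms)
more_alu = {**base, "alu": base["alu"] / 1.167}         # +16.7% ALU only
faster_all = {k: v / 1.066 for k, v in base.items()}    # +6.6% clock on everything

print(frame_ms(base))        # 14.0 - ROP-bound
print(frame_ms(more_alu))    # 14.0 - extra ALU didn't help
print(frame_ms(faster_all))  # ~13.1 - the upclock lifted the bottleneck too
```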
 

lord pie

Member
Jun 10, 2007
And if we raised the GPU clock what sort of performance advantage would we get? And we actually saw on the launch titles - we looked at a lot of titles in a lot of depth - we found that going to 14 CUs wasn't as effective as the 6.6 per cent clock upgrade that we did.
See, to me, this suggests that the GPU is probably fillrate limited in launch titles. A 16% compute increase was beaten by a 6.6% clock increase. So in other words, performance is bottlenecked somewhere other than compute units - for example the 16 ROPs.
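The fillrate numbers behind this reading, as a sketch. The ROP counts and clocks are the commonly reported specs for the two consoles, not figures from this article, and one pixel per ROP per cycle is an assumption:

```python
# Peak pixel fillrate: ROPs x clock (assuming one pixel per ROP per cycle).
xb1_fill = 16 * 853e6 / 1e9   # ~13.6 Gpixels/s
ps4_fill = 32 * 800e6 / 1e9   # 25.6 Gpixels/s
ratio = ps4_fill / xb1_fill
print(f"PS4/XB1 fillrate ratio: {ratio:.2f}x")  # ~1.88x
```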
 

artist

Banned
May 7, 2006
Basically confirms how ROPs limited a lot of games are on X1.

If it had a decent amount of ROPs more CUs would have been better than more clock.

This is basically saying 'the upclock was a better choice because one part of the pipeline in our GPU is rather weak and forming a bottleneck - a part we're conveniently not discussing at all in this article'
Disagree; the 7790 has 16 ROPs and scales well with the CUs. I'll try to re-check benchmarks for the 7770 and 7790 to confirm.
 

IN&OUT

Banned
Jun 7, 2013
Ignoring the RAM, the XB1 does seem like a more balanced system; however, the power gap is pretty big and the PS4 is also cheaper. Is the XB1 selling at a profit or something?
OK, let's see:

PS4 has unified memory with 18 CUs and a tight APU configuration

vs

X1 with fewer CUs, a tiny ESRAM and slow DDR3 RAM

and we have one comment believing MS's "balance" spin.

Yep, MS has succeeded.
 

nib95

Banned
Feb 26, 2007
Leadbetter pushing the official Microsoft angle again, I see. I wonder if this "technical fellow" was actually their previous source all along? Having skimmed through the article, this one seems much more balanced (no pun intended) than his last few. It's not as misleading. One particular comment by Baker did genuinely make me laugh though...

"Yeah, I think that's right. In terms of getting the best possible combination of performance, memory size, power, the GDDR5 takes you into a little bit of an uncomfortable place"
Lol.

I would argue that the PS4 is not only a more capable and powerful machine, it's better "balanced" and more efficient too. This article doesn't do much to answer the questions and concerns people had regarding the XO's hardware, especially in comparison to the PS4 and how it bridges the performance gulf. In terms of GPGPU capability, bandwidth, efficiency, creative bandwidth management, unified RAM etc., the PS4 seems ahead of the curve. It even has that secondary chip for additional processing, whilst the XO's audio chip is really mainly for audio use, or more specifically to alleviate the greater resource demands of Kinect functionality, something the PS4 doesn't even have to consider to begin with.
 

2San

Member
Dec 11, 2009
OK, let's see:

PS4 has unified memory with 18 CUs and a tight APU configuration

vs

X1 with fewer CUs, a tiny ESRAM and slow DDR3 RAM

and we have one comment believing MS's "balance" spin.

Yep, MS has succeeded.
Did you even read the comment you replied to?
 

benny_a

extra source of jiggaflops
Apr 25, 2009
Did you even read the comment you replied to?
The post he replied to is dumb though. How can you talk about balance if you take away a very important part of what makes you claim it's balanced?

Xbox One's APU without its on-chip memory (ESRAM) and the memory access that supplies the iGPU is what, exactly?
 
Microsoft's approach to asynchronous GPU compute is somewhat different to Sony's - something we'll track back on at a later date. But essentially, rather than concentrate extensively on raw compute power, their philosophy is that both CPU and GPU need lower latency access to the same memory. Goosen points to the Exemplar skeletal tracking system on Kinect on Xbox 360 as an example for why they took that direction.
Kinect is part of the reason for the slower specs. What an unfortunate day it was when they decided to build their system around that.

It has led to lower power and a higher price.
 

brain_stew

Member
Feb 20, 2007
Basically confirms how ROPs limited a lot of games are on X1.

If it had a decent amount of ROPs more CUs would have been better than more clock.

This is basically saying 'the upclock was a better choice because one part of the pipeline in our GPU is rather weak and forming a bottleneck - a part we're conveniently not discussing at all in this article'
I'd say this is one of the best snippets from the article and it makes very poor reading for Microsoft.

Fillrate is the area where Sony have their biggest advantage (they were straight up twice as fast before the recent clock bumps), so if Microsoft are bottlenecked here then it's a worst-case scenario for them. It's also an area that many people discounted based purely on current-generation workloads (something I always felt was very naive at the time), and here we are with launch software painting a very different picture based on Microsoft's own internal tests.
 

gofreak

GAF's Bob Woodward
Jun 8, 2004
Disagree, the 7790 has 16 ROPs and scales well with the CUs. I'll try to relook benches for the 7770 and 7790 to confirm.
This is obviously game dependent, but if the games they looked at scaled better with a 6.6% upclock than with a 16.7% increase in ALU, then it suggests they were being limited more by other parts of the pipeline. The ROPs are the easy one to finger.
 

Chobel

Member
Mar 26, 2013
That all the components work optimally with each other, and there are no bottlenecks.
That you don't have overpowered parts that get bottlenecked by other parts.
Does that mean MS thinks that PS4 will have more bottlenecks?

It's pointless to have an amazingly good CPU, if your GPU isn't up to snuff.
For CPU maybe, but having a better GPU is always better.
for most of the console games
 

nib95

Banned
Feb 26, 2007
Basically confirms how ROPs limited a lot of games are on X1.

If it had a decent amount of ROPs more CUs would have been better than more clock.

This is basically saying 'the upclock was a better choice because one part of the pipeline in our GPU is rather weak and forming a bottleneck - a part we're conveniently not discussing at all in this article'
Interesting pick up. I know you're very well versed in this field so it's always interesting to get your take or deconstruction on such things.
 

2San

Member
Dec 11, 2009
The post he replied to is dumb though. How can you talk about balance if you take away a very important part of what makes you claim it's balanced?

Xbox One's APU without its on-chip memory (ESRAM) and the memory access that supplies the iGPU is what, exactly?
Well, this isn't really an interesting discussion if we factor in everything. The post's main point is actually that I'm kinda worried about the underpowered CPU. The CPU seems like a product of wanting a low-powered system rather than having the parts work in unison.

Does that mean MS thinks that PS4 will have more bottlenecks?
That's what they think.
For CPU maybe, but having a better GPU is always better.
for most of the console games
I wouldn't say that's the case, but we'll see. I'm not really that well versed in tech, but looking at past consoles and what counts as a balanced PC, the CPU just seems out of place even factoring in GPGPU.
 

gofreak

GAF's Bob Woodward
Jun 8, 2004
I'd say this is one of the best snippets from the article and it makes very poor reading for Microsoft.

Fillrate is the area where Sony have their biggest advantage (they were straight up twice as fast before the recent clock bumps), so if Microsoft are bottlenecked here then it's a worst-case scenario for them. It's also an area that many people discounted based purely on current-generation workloads (something I always felt was very naive at the time), and here we are with launch software painting a very different picture based on Microsoft's own internal tests.
Yeah.

This isn't to say a PS4 would be 90% or whatever better on these workloads - the bottleneck would move up the pipe to somewhere else before you got to that in most cases - but it does also reveal something about the ALU talk.

12 CUs might be 'balanced' - for a GPU with that ROP configuration. Maybe in typical (early-gen) shader workloads more ALUs would not help framerate with that output capacity.

But if you had 32 ROPs, and your bottleneck moved up the pipeline, you could certainly do with more CUs to hold on to performance gains. A discussion about CUs and balance without touching on these contexts is an incomplete one to say the least.
 

artist

Banned
May 7, 2006
Yup, looking at other resolutions as well.

This is obviously game dependent, but if the games they looked at scaled better with a 6.6% upclock than with a 16.7% increase in ALU, then it suggests they were being limited more by other parts of the pipeline. The ROPs are the easy one to finger.
Absolutely, but launch titles are hardly representative of future workloads.

You can't use current generation software to judge this.
Correct. I'm also looking at it from the perspective of GPGPU; those extra CUs would be quite handy.