
NV vs AMD teraflops (TF) demystified

psorcerer

Member
May 1, 2012
1,528
2,039
920
Calm down, I know people will try to lynch me for this one.
But let's do it.
TL;DR you can compare NV and AMD cards by TF just fine, if you know what you're doing.

Let's start with some history.
In the year 2016 NV released a new architecture: NV Pascal.
The top card using that arch at that time was GF 1080.
So, let's deep dive into the whitepaper.
We are interested in the blocks that actually do the computations.
NV calls it Streaming Multiprocessors (SMs).
SMs are organized in groups with some other logic, and these groups are called GPCs (Graphics Processing Clusters).
Our 1080 has 4 GPCs, with 5 SMs each (total 20 SMs)

What's doing the job inside the SM?
Each SM contains 128 CUDA cores, and those are our computational blocks.
Each CUDA core can do one FP32 FMA per clock, which counts as 2 FP32 operations (a multiply and an add).

What is FMA? FP32?
FMA = fused multiply add. It's an operation on two operands: A * B + C.
But there are three (A, B, C)???
Yup. The "C" is called the "accumulator", and that is what is "returned" from the operation.
Let's do an example:
A = 2, B = 3 -> FMA -> 6 (why? because C was not set, i.e. C = 0 before it started, 2*3 + 0 = 6)
Now A = 3, B = 4 -> FMA -> 18 (because C is already 6 from the previous one, so 3*4 + 6 = 18)
Etc. etc.
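
The accumulator example above can be sketched in a few lines of Python. The `fma` helper is a made-up name for illustration only. Real hardware does the multiply and the add as one fused step with a single rounding, which plain Python arithmetic does not model.

```python
def fma(a, b, c):
    """Return a * b + c (illustrative only, not a truly fused operation)."""
    return a * b + c

acc = 0                  # C starts out as 0
acc = fma(2, 3, acc)     # 2*3 + 0
first = acc
acc = fma(3, 4, acc)     # 3*4 + 6, reusing the previous result as C
second = acc

print(first, second)     # 6 18
```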
Why do we need these FMA though? What's the point?
FMAs are the most common operation in graphics, like unbelievably common.
That's why the performance of GPUs is heavily optimized for FMA.
Essentially it's like paint: A is the color, B is the transparency/intensity of that color, and C (the accumulator) is the result of applying layers of color on top of each other.
GPU is essentially applying hundreds of paint layers (per pixel) to get to the final result.
FP32 is floating point with a 32-bit length. It's just a measure of accuracy. Most operations are done in FP32 or FP16, but FP32 is the main number everybody talks about: when saying "flops", "teraflops" etc., we usually mean FP32 accuracy.

What's the performance of Pascal then?
Easy.
We need 3 numbers: how many computing units (CUDA cores) * how many FP32 per clock * how many clocks per second = TF number
For 1080 we have 128*20 = 2560 CUDA cores. But the clocks, clocks are tricky...
If we look at the whitepaper we would see two clocks: base and "boost". For 1080 it's 1.607GHz and 1.733GHz
So the performance may vary. But the whitepaper says: 8.873TF which is = 1.733*2560*2 i.e. it's based on the "boost" clock.
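
As a rough sketch, the formula above fits in one small Python function. The function and variable names are my own, not from the whitepaper. The inputs are the 1080 numbers quoted in this post.

```python
def teraflops(cores, flops_per_core_per_clock, clock_ghz):
    """Peak FP32 throughput in TFLOPS: cores * FLOPs per clock * GHz / 1000."""
    return cores * flops_per_core_per_clock * clock_ghz / 1000.0

cuda_cores = 128 * 20                       # 128 CUDA cores per SM, 20 SMs
gtx1080_boost_tf = teraflops(cuda_cores, 2, 1.733)
gtx1080_base_tf = teraflops(cuda_cores, 2, 1.607)

print(round(gtx1080_boost_tf, 3))           # 8.873 (matches the whitepaper figure)
print(round(gtx1080_base_tf, 3))            # 8.228
```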

Gotcha! But GPU is never in the "boost" clock, that's what our Xbox fanboys say! How so?
Nope, it's wrong.
GPU is in the "boost" clock most of the time, in fact it goes even over that boost clock pretty frequently.
Let's open a clock graph from here.

What do we see? The "average" they measured is 1857MHz.
Pretty high! Even higher than the "boost" clock.
What's going on here?
"There are lies, big lies, and then there's statistics!"
What we see here are "samples" taken at undetermined amounts of time with no duration attached.
Without knowing how frequently the clock speed changes and how long the card runs on these clocks the clock profile is useless.
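
To show why the durations matter, here is a small Python sketch with a made-up clock log (the numbers are hypothetical, picked only to illustrate the point). A plain average of the samples looks very different from an average weighted by how long the card actually sat at each clock.

```python
# Hypothetical clock log: (clock in MHz, seconds spent at that clock).
# These values are invented for illustration, not measured data.
samples = [(1900, 1), (1850, 2), (1733, 27)]

# Plain average: every sample counts the same, however short it was.
naive_avg = sum(clock for clock, _ in samples) / len(samples)

# Time-weighted average: each clock counts in proportion to its duration.
total_time = sum(duration for _, duration in samples)
weighted_avg = sum(clock * duration for clock, duration in samples) / total_time

print(round(naive_avg), round(weighted_avg))   # 1828 1746
```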

Then, can we somehow deduce the real numbers?
Yep, we can.
To do that we need to load the GPU ALUs constantly to the maximum FLOPS possible and do it for an extended period of time.
Then we can really know what "sustained" load it can handle.
I was too lazy to do it myself, but I found somebody who did a pretty similar job (for a different purpose).

What do we see here?
On the top you see two NV GPUs with passive cooling, they throttle pretty hard.
On the bottom you will see all the other GPUs with normal cooling; they always run at the max speed.
And the max speed is the boost clock.
So although you get spikes to higher than the boost clock, the real sustained speed of the GPU is the boost clock.
More than that, we don't really know if the higher numbers run for any significant amount of time.

So now we know that 8.873TF is the real number for 1080?
Yep. That's what NV data says.
There are a lot of cards from different manufacturers with different clocking profiles.
But this number is the only official number we have.

What about 1080Ti?
It's a wider Pascal card, it has 3584 cores * 2 FP32 * 1.582GHz boost = 11.34TF
So now we have at least two data points for the historical data: 8.87 and 11.34

Can we get to AMD now?
Yep. But for AMD I will choose a card that's not as historical: 5700XT.
Why? You'll see.
Let's open the RDNA(1) whitepaper.
We have the "dual compute units" (DC) that are groups of cores that actually calculate things.
And we have the "shader engines" (SE) that group DCs together.
For 5700XT we have 2 SEs with 10 DCs each (total 20DCs).
Each DC comes with 2*64 "vector ALU" which are our computation engines.
In the end we get to 20*2*64 = 2560 ALUs.
Each ALU can pump one FP32 FMA per clock, i.e. 2 FP32 operations.

Isn't that exactly the same as NV 1080?
Yup. You guessed right.
It's very much the same.
The compute units are different, the scheduling logic is different, and there are quite a lot of other differences.
But it's still pretty similar.

But what about 5700XT TF?
Oh, we have something else here.
The clocks are higher.
5700XT has the following: base clock 1.605GHz, "game": 1.755GHz, "boost": 1.905GHz
Unfortunately nobody profiled the AMD GPU with sustained loads (at least I couldn't find anything).
So we need to guess here. I would assume though that using "boost" clock is pretty safe.
Because of our "statistics" guys here. The power profile for AMD looks much simpler: use max voltage all the time.
That's why it probably indeed runs at "boost" mode all the time again.
Let's count the TF: 2560*2*1.905 = 9.75TF
Because 5700XT runs at higher clock it gets to higher TF numbers than 1080.
The difference is +9.9% on the 5700XT side.
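
The same arithmetic as a quick Python sketch, using only the numbers from this post (the helper name is mine). It assumes the "boost" clock for both cards, as argued above.

```python
def teraflops(alus, flops_per_alu_per_clock, clock_ghz):
    """Peak FP32 throughput in TFLOPS."""
    return alus * flops_per_alu_per_clock * clock_ghz / 1000.0

rx5700xt_alus = 20 * 2 * 64                         # 20 DCs, 2*64 vector ALUs each
rx5700xt_tf = teraflops(rx5700xt_alus, 2, 1.905)    # "boost" clock
gtx1080_tf = teraflops(2560, 2, 1.733)              # "boost" clock

advantage_pct = (rx5700xt_tf / gtx1080_tf - 1) * 100
print(round(rx5700xt_tf, 2), round(advantage_pct, 1))   # 9.75 9.9
```

Since both cards have 2560 ALUs, the whole difference comes from the clock ratio.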

But what about games?
Games are hard to measure, you need a lot of samples over a lot of games.
I would say that our statistics guys measured a lot of games in a lot of resolutions and came to the following.
If 1080 is 100% then 5700XT is 119%.
So, with ~10% TF advantage our new RDNA is ~20% faster than NV Pascal.
But why?
Because they use average over all the resolutions and tests.
Statistics, again.
For next-gen we are interested in 4K only.
If we check these 4K numbers we can see that it's ~10-11% there.

So it means that in 4K the TF difference translates to performance difference 1:1?
Yup. Pretty much.
Mainly because for current cards it's the ultimate test. A lot of ALU blocks are used: 4K has 4x the pixels of 1080p and 2.25x the pixels of 1440p.
I would say 4k and 8K are the only measures for the real TF perf right now.
It will change in the future, as more games embrace more complex calculations like RT and complex shaders. We can get to the max TF difference even in smaller resolutions.
But right now, in the older games that's what we should measure.
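
For reference, the pixel counts behind the resolution comparison above work out like this in Python. Treat pixel count only as a rough proxy for ALU load; that part is an assumption, not something measured here.

```python
# Standard 16:9 resolutions (width, height).
resolutions = {"1080p": (1920, 1080), "1440p": (2560, 1440), "4K": (3840, 2160)}
pixels = {name: w * h for name, (w, h) in resolutions.items()}

ratio_vs_1080p = pixels["4K"] / pixels["1080p"]
ratio_vs_1440p = pixels["4K"] / pixels["1440p"]

print(ratio_vs_1080p, ratio_vs_1440p)   # 4.0 2.25
```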

Does it mean that RDNA is better than Pascal when it's not really max TF?
Yes again.
RDNA is a newer arch and thus performs much better overall.
Only when you push the computational load to the max you will see it fall to the TF difference.

Hahaha! But we have NV Turing now which is much more effective even in 4K! What about it?
You know.
You will need to sit down for this.
Because all of the above was just a history lesson.
We will talk about Turing, and you better sit tight.

Let's open the Turing whitepaper.
We have SMs (Streaming Multiprocessors) which are assembled into GPCs again.
The full chip (TU102) has 6 GPCs with 12 SMs each, which gets us to 72 SMs max.
Each SM has 64 CUDA cores, and the cores themselves didn't change much from Pascal.
Overall we have 4608 CUDA cores max.
But the 2080 uses the smaller TU104 chip with 46 of its 48 SMs enabled, so we have only 2944 CUDA cores.

What about 2080 clocks then?
Good question!
They are lower now: 1.515GHz base / 1.71GHz boost
So we have +15% in cores over 5700XT, but -10% on the clock speed.
Overall the TF number is: 10.07TF vs 9.75TF
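
A short Python sketch of how the core and clock differences combine, using the figures from this post (names are mine). The TF ratio is simply the core ratio times the clock ratio, because both cards do 2 FP32 operations per core per clock.

```python
def teraflops(cores, flops_per_core_per_clock, clock_ghz):
    """Peak FP32 throughput in TFLOPS."""
    return cores * flops_per_core_per_clock * clock_ghz / 1000.0

rtx2080_cores = 46 * 64                      # 46 SMs enabled, 64 CUDA cores each
rtx2080_tf = teraflops(rtx2080_cores, 2, 1.71)
rx5700xt_tf = teraflops(2560, 2, 1.905)

core_diff_pct = (rtx2080_cores / 2560 - 1) * 100    # more cores on the 2080
clock_diff_pct = (1.71 / 1.905 - 1) * 100           # lower clock on the 2080
tf_diff_pct = (rtx2080_tf / rx5700xt_tf - 1) * 100

print(round(rtx2080_tf, 2), round(rx5700xt_tf, 2))  # 10.07 9.75
print(round(core_diff_pct, 1), round(clock_diff_pct, 1), round(tf_diff_pct, 1))  # 15.0 -10.2 3.2
```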

Pretty similar though! But 2080 is 15% faster overall and up to 30% faster in 4K! How's that?
We are getting there.
Let's open the Turing whitepaper again.
And enter the INT cores.
"In previous shader architectures, the floating-point math datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating point math. "
Yes you heard that right, in reality each Turing CUDA core can execute not 2 but 4 instructions in parallel.
Our 10.07TF is in fact 20.14TF!


Not so fast, darling! These are not FP32 operations, you cannot call it TF! And we do not know how many of these are executed! Explain yourself?
Yup. You're right.
But we do know how many of these are executed, and how much performance increase is expected!
Right from the horse's mouth in the same whitepaper.
"Moving these instructions to a separate pipe translates to an effective 36% additional throughput possible for floating point."
So the new calculation becomes: 10.07TF*1.36 = 13.70TF
That's the performance of the 2080 as tested by NV.
But we all know that vendors do not like to use tests that go against their narrative. I would suspect that this +36% increase is inflated.
I would think that usually it's lower.
Why? Because that's what the tests in games show! Remember that "up to +30%" in 4K figure?
Let's dissect it: +3% TF (10.07 vs 9.75) vs +30% in game = ~+26% from the INT cores (1.30/1.03 = 1.26)
And that's the best case, so NV's +36% looks inflated (which is easy to do when curating tests).

So, in the end Turing architecture is worse than RDNA??
Yup. Guys, no way to deny it.
2x cores leads to at most +26% perf increase....
I'm kidding, it's not that bad.
INT operations are much simpler and the cores are much smaller, ~1/3 of "normal" ones.
Therefore it's not that bad: +30% in ALU size -> up to +26% in performance.
But it's an old process, when NV gets to a new one everything will be ok.
For now though, Turing is kind of meh.

Questions?
 

-Arcadia-

Gold Member
Aug 20, 2019
3,940
12,156
645
You need a TL;DR for this one. That’s a rough read for those of us not overly familiar with spec talk, or just people browsing casually. I’d like to be able to see your simplified conclusion of how the Teraflops differ at the bottom of the post, then scroll back up and see the details of how you arrived at it.
 

psorcerer

Member
May 1, 2012
1,528
2,039
920
You need a TL;DR for this one. That’s a rough read for those of us not overly familiar with spec talk, or just people browsing casually. I’d like to be able to see your simplified conclusion of how the Teraflops differ at the bottom of the post, then scroll back up and see the details of how you arrived at it.
There is a TL;DR at the top.
You can compare 1:1
 

Armorian

Member
Jan 17, 2018
1,471
1,621
430
There is a TL;DR at the top.
You can compare 1:1
That's what I thought. I said a few times that XSX will be comparable to 2070S or 2080 and people laughed at me. And this is the truth :messenger_sunglasses:

 

hyperbertha

Member
Nov 24, 2018
646
1,491
385
That's what I thought. I said a few times that XSX will be comparable to 2070S or 2080 and people laughed at me. And this is the truth :messenger_sunglasses:

This post uses RDNA 1 as the basis, not RDNA 2. And that Gears 5 thing isn't even relevant. Unoptimized port vs fully optimized. Let's see how it performs at full optimization.
 

psorcerer

Member
May 1, 2012
1,528
2,039
920
That's what I thought. I said a few times that XSX will be comparable to 2070S or 2080 and people laughed at me. And this is the truth :messenger_sunglasses:
Yep. XSeX performance should be close to 2080. And its feature set is pretty close too.
 
Oct 23, 2016
220
176
320
26
Argentina
No.

Cerny:



As Eurogamer said:

"a smaller GPU can be a more nimble, more agile GPU, the inference being that PS5's graphics core should be able to deliver performance higher than you may expect from a TFLOPs number that doesn't accurately encompass the capabilities of all parts of the GPU."

And in another article:

"it's important to remember that performance from an RDNA compute unit far outstrips a PS4 or PS4 Pro equivalent, based on an older architecture"



You can't compare TFLOPS between different architectures.
 

psorcerer

Member
May 1, 2012
1,528
2,039
920
"it's important to remember that performance from an RDNA compute unit far outstrips a PS4 or PS4 Pro equivalent, based on an older architecture"
That's correct. GCN has a pretty different shader block organization.

The rest is meh, just a spin.
 
Dec 14, 2008
33,132
1,240
1,240
This OP has a lot of words and charts and yet all we had to do is compare using the standard measurement of Gamecubes.

 