But let's do it.

**TL;DR you can compare NV and AMD cards by TF just fine, if you know what you're doing.**

Let's start with some history.

In 2016 NV released a new architecture: **Pascal**.

The top card on that architecture at the time was the GTX 1080.

So, let's deep dive into the whitepaper.

We are interested in the blocks that actually do the computations.

NV calls them **Streaming Multiprocessors (SMs)**.

SMs are organized into groups, together with some other logic, called **GPCs** (Graphics Processing Clusters).

Our **1080** has 4 GPCs with 5 SMs each (**20 SMs** total).

What's doing the job inside the SM?

Each SM contains **128 CUDA cores**, and those are our computational blocks.

Each CUDA core can do one FP32 FMA per clock, which counts as **2 FP32 operations** per clock.

What is **FMA**? FP32?

FMA = fused multiply-add. It's an operation on two operands: **A * B + C**.

But there are three (A, B, C)???

Yup. The "C" is called "accumulator" and that is what "returned" from the operation.

Let's do an example:

A = 2, B = 3 -> FMA -> 6 (why? because C was not set, i.e. C = 0 before we started, so 2*3 + 0 = 6)

Now A = 3, B = 4 -> FMA -> 18 (because C is already 6 from the previous one, so 3*4 + 6 = 18)

Etc. etc.
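
Here is the same example as a minimal Python sketch, just to illustrate the A * B + C chaining (a real hardware FMA also rounds only once, which this toy version doesn't model):

```python
def fma(a, b, c):
    """Fused multiply-add: A * B + C, with C acting as the accumulator."""
    return a * b + c

acc = 0.0                # C starts at 0
acc = fma(2, 3, acc)     # 2*3 + 0 = 6
acc = fma(3, 4, acc)     # 3*4 + 6 = 18
print(acc)               # 18.0
```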

Why do we need these FMAs though? What's the point?

FMAs are the most common operation in graphics, like unbelievably common.

That's why GPU performance is heavily **optimized for FMA**.

Essentially it's like paint: **A** is the color, **B** is the transparency/intensity of that color, and **C**, the accumulator, is the result of applying layers of color on top of each other.

GPU is essentially applying hundreds of paint layers (per pixel) to get to the final result.
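
As a toy sketch of that analogy (the layer values below are made up for illustration; this is not real blending math):

```python
# Each "layer" is one FMA: color * intensity + whatever is already on the pixel.
pixel = 0.0
layers = [(0.8, 0.5), (0.2, 0.3), (1.0, 0.1)]  # (color, intensity) pairs, invented for the example
for color, intensity in layers:
    pixel = color * intensity + pixel          # A * B + C
print(pixel)                                    # accumulated result of three layers
```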

FP32 is floating point with a 32-bit length. It's just a measure of precision. Most operations are done in FP32 or FP16, but FP32 is the main number everybody talks about: when people say "flops", "teraflops", etc., they usually mean FP32 precision.

What's the performance of Pascal then?

Easy.

We need 3 numbers:

**how many compute units (CUDA cores) * how many FP32 ops per clock * how many clocks per second = TF number**

For the **1080** we have 128 * 20 = 2560 CUDA cores. But the clocks... the clocks are tricky.

If we look at the whitepaper, we see two clocks: base and "boost". For the **1080** they are 1.607GHz and 1.733GHz.

So the performance may vary. But the whitepaper says **8.873TF**, which is 2560 * 2 * 1.733, i.e. it's based on the "boost" clock.
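
To make the formula concrete, here is a minimal Python sketch of the same arithmetic (the `tflops` helper name is mine, purely for illustration):

```python
# cores * FLOPs per clock * clock (GHz) gives GFLOPS; divide by 1000 for TFLOPS.
def tflops(cores, flops_per_clock, clock_ghz):
    return cores * flops_per_clock * clock_ghz / 1000.0

# GTX 1080: 2560 CUDA cores, 2 FP32 ops per clock, 1.733 GHz boost clock.
print(round(tflops(2560, 2, 1.733), 3))  # 8.873 -> matches the whitepaper's 8.873TF
```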

Gotcha! But the GPU is never at the "boost" clock, that's what our Xbox fanboys say! How so?

Nope, that's wrong.

The GPU is at the "boost" clock most of the time; in fact, it goes above the boost clock pretty frequently.

Let's open a clock graph from here.

What do we see? The "average" they measured is 1.857MHz.

Pretty high! Even **higher than the "boost" clock**.

What's going on here?

"There are lies, big lies, and then there's statistics!"

What we see here are "samples" taken at undetermined points in time, with no duration attached.

Without knowing how frequently the clock speed changes and how long the card runs at those clocks, the clock profile is useless.

Then, can we somehow deduce the real numbers?

Yep, we can.

To do that we need to load the GPU ALUs **constantly**, at the **maximum FLOPS possible**, and do it for an **extended period of time**.

Then we can really know what "sustained" performance it can deliver.

I was too lazy to do it myself, but I found somebody who did a pretty similar job (for a different purpose).

What do we see here?

On the top you see two NV GPUs with **passive cooling**; they throttle pretty hard.

On the bottom you see all the other GPUs with normal cooling; they **always run at max speed**.

And the max speed is the boost clock.

So although you get spikes above the boost clock, the real sustained speed of the GPU is the boost clock.

More than that, we don't really know if those higher clocks are held for any significant amount of time.

So now we know that **8.873TF** is the real number for the **1080**?

Yep. That's what NV data says.

There are a lot of cards from different manufacturers with different clocking profiles.

But this number is the only official number we have.

What about the **1080Ti**?

It's a wider Pascal card: 3584 cores * 2 FP32 ops * 1.582GHz boost = **11.34TF**.
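
Plugging the 1080 Ti numbers into the same formula (plain arithmetic, nothing assumed beyond the figures above):

```python
# 1080 Ti: 3584 cores * 2 FP32 ops per clock * 1.582 GHz boost clock.
print(round(3584 * 2 * 1.582 / 1000, 2))  # 11.34 TF
```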

So now we have at least two historical data points: **8.87** and **11.34**.

Can we **get to AMD** now?

Yep. But for AMD I will choose a card that's not as historical: the **5700XT**.

Why? You'll see.

Let's open the **RDNA** (1) whitepaper.

We have the "dual compute units" (DC) that are groups of cores that actually calculate things.

And we have the "shader engines" (SE) that group DCs together.

For the **5700XT** we have 2 SEs with 10 DCs each (**20 DCs** total).

Each DC comes with 2 * 64 "vector ALUs", which are our computation engines.

In the end we get 20 * 2 * 64 = **2560** ALUs.

Each ALU can pump one FP32 FMA per clock, i.e. **2 FP32 ops** per clock.

Isn't that **exactly the same** as the NV **1080**?

Yup. You guessed right.

It's very much the same.

The compute units are different, the scheduling logic is different, and there are quite a lot of other differences.

But it's still pretty similar.

But what about the **5700XT**'s TF?

Oh, we have something else here.

The clocks are higher.

The **5700XT** has the following: base clock 1.605GHz, "game" clock 1.755GHz, "boost" clock 1.905GHz.

Unfortunately nobody has profiled the AMD GPU under sustained loads (at least I couldn't find it).

So we need to guess here. I would assume, though, that using the "boost" clock is pretty safe.

Because of our "statistics" guys here. The power profile for AMD looks much simpler: use max voltage all the time.

That's why it probably indeed runs at "boost" mode all the time again.

Let's count the TF: 2560 * 2 * 1.905 = **9.75TF**.

Because the 5700XT runs at a higher clock, it gets to a higher TF number than the 1080.

**The difference is +9.9% on the 5700XT side.**
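
Same arithmetic for the 5700XT, plus the ratio against the 1080 (numbers straight from the text above):

```python
rx_5700xt = 2560 * 2 * 1.905 / 1000   # ~9.75 TF
gtx_1080  = 2560 * 2 * 1.733 / 1000   # ~8.87 TF
print(round(rx_5700xt, 2))                         # 9.75
print(round((rx_5700xt / gtx_1080 - 1) * 100, 1))  # ~9.9 (% advantage for the 5700XT)
```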

But what about games?

Games are hard to measure, you need a lot of samples over a lot of games.

Our statistics guys measured a lot of games at a lot of resolutions and came to the following.

**If 1080 is 100% then 5700XT is 119%.**

So, with a **~10% TF** advantage, our new RDNA card is **~20% faster** than NV Pascal.

But why?

Because they use average over all the resolutions and tests.

Statistics, again.

For next-gen we are interested **in 4K only**.

If we check those 4K numbers, we can see that the gap is **~10-11%** there.

So that means that **in 4K** the **TF difference** translates to the performance difference **1:1**?

Yup. Pretty much.

Mainly because for current cards it's the ultimate test: a lot of ALU work, 4x more pixels than 1080p and 2.25x more than 1440p.
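
The resolution arithmetic behind that claim (pure pixel counts, nothing GPU-specific):

```python
pixels_1080p = 1920 * 1080
pixels_1440p = 2560 * 1440
pixels_4k    = 3840 * 2160
print(pixels_4k / pixels_1080p)  # 4.0  -> 4x the pixels of 1080p
print(pixels_4k / pixels_1440p)  # 2.25 -> 2.25x the pixels of 1440p
```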

I would say 4K and 8K are the only real measures of TF performance right now.

This will change in the future, as more games embrace more complex work like RT and heavier shaders; then we'll see the full TF difference even at smaller resolutions.

But right now, with today's games, that's how we should measure.

Does that mean that **RDNA is better than Pascal** when the load isn't really max TF?

Yes again.

RDNA is a newer arch and thus performs much better overall.

Only when you push the **computational load** to the max will you see it fall back to the TF difference.

Hahaha! But we have **NV Turing** now, which is much **more effective** even **in 4K**! What about it?

You know.

You will need to sit down for this.

Because all of the above was just a history lesson.

We will talk about Turing, and you better sit tight.

Let's open the **Turing** whitepaper.

We have SMs (streaming multiprocessors), which are assembled into GPCs again.

We have 6 GPCs with 12 SMs each, which gets us to 72 SMs max.

Each SM has 64 CUDA cores; the cores themselves didn't change much from **Pascal**.

Overall we have **4608 CUDA cores max**.

But in the **2080** only **46 SMs** are enabled, so we have only **2944** CUDA cores.

What about the **2080**'s clocks then?

Good question!

They are lower now: 1.515GHz base / 1.71GHz boost.

So we have **+15%** in cores over the **5700XT**, but **-10%** on the clock speed.

Overall the TF numbers are: **10.07TF** vs **9.75TF**.
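
And the same arithmetic for the 2080 (2944 cores at the 1.71 GHz boost clock, as above):

```python
# RTX 2080: 2944 CUDA cores * 2 FP32 ops per clock * 1.71 GHz boost clock.
rtx_2080 = 2944 * 2 * 1.71 / 1000
print(round(rtx_2080, 2))  # 10.07 TF, vs ~9.75 TF for the 5700XT
```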

Pretty similar though! But the **2080** is **15% faster** overall and up to **30% faster in 4K**! How's that?

We are getting there.

Let's open the Turing whitepaper again.

And enter the **INT cores**.

"

*In previous shader architectures, the*"

**floating-point math**datapath sits idle whenever one of these**non-FP-math**instructions runs. Turing**adds a second parallel execution unit next to every CUDA core**that executes these instructions in parallel with floating point math.Yes you heard that right, in reality

**each**Turing CUDA core can execute

**not 2 but 4 instructions in parallel.**

Our 10.07TF is in fact 21.4TF!

Not so fast, darling! These are **not FP32** operations, you cannot call them **TF**! And we don't know how many of them are actually executed! Explain yourself!

Yup. You're right.

But we do know how many of them are executed, and how much **performance increase** to expect!

Straight from the horse's mouth, in the same whitepaper:

"

*Moving these instructions to a separate pipe*

**translates to an effective 36% additional throughput**possible**"**

*for floating point.*So the new calculation becomes: 10.7TF*1.36 =

**14.55TF**
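
As a sanity check of that multiplication (using the ~10.7TF figure from the text; the +36% is NVIDIA's claim, not an independent measurement):

```python
print(round(10.7 * 1.36, 2))  # 14.55 -> the "effective" TF with NVIDIA's claimed INT uplift
```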

That's the actual NV-tested performance of the **2080**.

But we all know that vendors don't like to use tests that go against their narrative. I suspect that this **+36% increase is pretty inflated**.

I would think that usually it's probably **20%** tops.

Why? Because that's what the tests in games show! Remember that +30% in 4K figure?

Let's dissect it: **+10% TF** (10.7 vs 9.75) vs **+30% in game** = **+18%** from the **INT cores**.

Now it seems about right (NV inflated it 2x, which is easy to do when curating tests).
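
The same dissection as quick arithmetic (the +30% in-game figure is the 4K benchmark number quoted above):

```python
tf_ratio   = 10.7 / 9.75   # ~1.10 -> the raw +10% FP32 TF advantage
game_ratio = 1.30          # ~+30% observed in 4K game benchmarks
int_gain   = game_ratio / tf_ratio - 1
print(round(int_gain * 100))  # ~18 (% attributable to the INT cores)
```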

So, in the end, the **Turing** architecture is worse than **RDNA**??

Yup. Guys, there's no way to deny it.

2x the cores leads to a +18% perf increase...

I'm kidding, it's not that bad.

INT operations are much simpler and the cores are much smaller, ~1/3 the size of the "normal" ones.

Therefore it's not that bad: **+30% in ALU size -> +18% in performance**.

But it's on an old process node; when NV gets to a new one, everything will be OK.

For now though, Turing is kind of meh.

Questions?