TL;DR: 1 Ampere TF = 0.72 Turing TF, or 30 TF (Ampere) = 21.6 TF (Turing)
Welp, at least you aren't claiming it's .5 Turing TF like you did in the other threads...but you're still wrong. Here's why.
Reddit Q&A
To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. **As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.**
Read that bolded part carefully.
So, a Turing GPU can execute 64 INT32 + 64 FP32 ops per clock per SM.
An Ampere GPU can execute either 64 INT32 + 64 FP32 or 128 FP32 ops per clock per SM.
Good. Addition, yay! So let's see where you went wrong...
Which means if a game executes 0 (zero) INT32 instructions then Ampere = 2x Turing
And if a game executes a 50/50 mix of INT32 and FP32 then Ampere = Turing exactly.
And there it is. You can't only account for raw numerical throughput; process-node efficiency gains have to be taken into account as well. So even if the raw numbers between the two architectures average out to the same, Ampere still sees IPC gains simply by being on a newer process, not to mention having other hardware to offload certain workloads more efficiently than Turing does, such as RT, DLSS, and AI via the Tensor cores. Equivalent performance in those areas on Turing would've required more raw GPU resources to cover the gap.
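(As an aside, for anyone following along: the headline TF figures both of us keep using fall straight out of those per-SM rates. A minimal sketch, assuming the published specs of 68 SMs at ~1.71 GHz boost for the 3080 and 46 SMs at ~1.71 GHz for the 2080:)

```python
# Peak TF = FP32 lanes x 2 ops/clock (FMA) x clock speed.
for name, sms, fp32_per_sm, ghz in [("3080 (Ampere)", 68, 128, 1.71),
                                    ("2080 (Turing)", 46, 64, 1.71)]:
    lanes = sms * fp32_per_sm
    tf = lanes * 2 * ghz / 1000          # lanes x 2 x GHz -> GFLOPS -> TF
    print(f"{name}: {lanes} FP32 lanes -> {tf:.2f} TF")
# -> roughly 29.77 TF (the marketed ~30) and 10.07 TF
```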
Some math: 36 / (100+36) ≈ 26%, i.e. in an average game instruction stream ~26% of instructions are INT32
I don't see why you're doing the math this way. Nvidia says they see 36 ADDITIONAL INT32 OPs for every 100 FP32 OPs. Just previously you listed Turing as 64 INT32 + 64 FP32, and one of Ampere's modes as the same. So wouldn't this division be worthless at that point? In both cases you get 128 OPs per SM per cycle; the two Ampere numbers are clearly an either/or, i.e. the SMs can operate in either full-FP32 or mixed FP32/INT32 mode on a given cycle.
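(Just to spell out the number he's using: 36 additional INT32 per 100 FP32 means the INT32 share of the combined stream is 36/136, e.g.:)

```python
int_per_100_fp = 36                       # Nvidia's "36 additional INT32 OPs"
share = int_per_100_fp / (100 + int_per_100_fp)
print(f"INT32 share of the mixed stream: {share:.1%}")   # ~26.5%
```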
So we can now calculate what happens to both Ampere and Turing when a 26% INT32 + 74% FP32 instruction stream is used.
I have written a simple program to do that. But you can calculate an analytical upper bound easily: 74%/50% = 1.48, or +48%.
My program shows a slightly smaller number, +44% (that's because of edge cases where you cannot distribute the last INT32 ops in a batch equally, as only one pipeline per block of 16 cores can issue INT32).
So the theoretical absolute max is +48%; in practice the achievable max is +44%.
Thus every 2 TF of Ampere has only 1.44 TF of Turing performance.
None of your calculations make sense in this context. By the description at the top of your post, both Turing and Ampere are capable of the same number of INT32 OPs per clock cycle. Which means your numbers here should apply to both Ampere AND Turing, which ultimately means the performance delta between an Ampere TF and a Turing TF stays the same, i.e. 2 TF Ampere would = 2 TF Turing (before factoring in node gains and improvements on earlier tech, API algorithms etc. present in Turing and continued in Ampere, which would actually increase Ampere's performance over Turing, not decrease it).
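To make the disagreement concrete, here's my reading of the dual-issue model he seems to be using: a rough, idealized sketch (it ignores his per-16-core issue restriction, which is where his +44% vs +48% gap comes from), not anything Nvidia has published:

```python
# Toy per-SM issue model, per my reading of his post:
#   Turing: 64 FP32 + 64 INT32 per clock, on separate datapaths.
#   Ampere: each clock is either 128 FP32, or 64 FP32 + 64 INT32.

def turing_cycles(fp_ops, int_ops):
    # Separate pipes run concurrently, so the longer stream dominates.
    return max(fp_ops / 64, int_ops / 64)

def ampere_cycles(fp_ops, int_ops):
    # Stay in mixed mode until the INT32 work drains, then pure FP32.
    mixed = int_ops / 64                  # cycles of 64 FP32 + 64 INT32
    fp_left = max(fp_ops - 64 * mixed, 0.0)
    return mixed + fp_left / 128

fp, i32 = 74, 26                          # his 26% INT32 / 74% FP32 mix
t, a = turing_cycles(fp, i32), ampere_cycles(fp, i32)
print(f"Turing {t:.3f} vs Ampere {a:.3f} cycles -> {t / a:.2f}x")  # ~1.48x
```

Under that idealized model his +48% ceiling does fall out; my objection is to the premise that only Ampere's TF should be discounted against this mix, not to the arithmetic.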
Let's check the actual data Nvidia gave us:
3080 = 30 TF (Ampere) = 21.6 TF (Turing) = 2.14x 2080 (10.07 TF Turing)
Wrong; you seem to have forgotten that even with Ampere's new pipeline architecture it is capable of the same INT32 OPs per clock cycle as Turing. Your numbers only applied the conditional to Ampere while ignoring doing the same for Turing (even though you claimed you were going to, one sentence before doing these calculations).
Nvidia is even more conservative than that and gives us: 3080 = 2x2080
Do you have a source where they specifically phrased 3080 performance in this manner?
3070 = 20.4 TF (Ampere) = 14.7 TF (Turing) = 1.86x 2070 (7.88 TF Turing)
Same as above two.
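(For clarity on what he's computing in both of those lines: it's just his 0.72 factor applied to the Ampere card's peak TF, divided by the Turing card's peak TF. The arithmetic itself checks out; it's the factor I'm disputing:)

```python
factor = 0.72   # his claimed Turing TF per Ampere TF
for name, ampere_tf, turing_tf in [("3080 vs 2080", 30.0, 10.07),
                                   ("3070 vs 2070", 20.4, 7.88)]:
    eq = ampere_tf * factor
    print(f"{name}: {eq:.1f} Turing-equivalent TF -> {eq / turing_tf:.2f}x")
# -> 21.6 TF / 2.14x and 14.7 TF / 1.86x, matching his figures
```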
Nvidia is massively more conservative here giving us: 3070 = 1.6x2070
Again, where is a source quoting official Nvidia reps making this exact comparison? You can't claim they said this or that if you can't source it yourself.
Actually, if we average the two max numbers that Nvidia gives us (they explicitly say "up to") we get an even lower theoretical max of 1 Ampere TF = 0.65 Turing TF (0.72 × 2/2.14 ≈ 0.67 and 0.72 × 1.6/1.86 ≈ 0.62, averaging to ≈ 0.65).
Yes, "up to", as in, depending on what the game itself requires to be performed for calculations. Actually let's go back for a bit because I think you misread this following quote:
First, the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing.
So reading this again, it really does look like you got wonky with your calculations, because it's Turing that would be hindered by running INT32 instructions on a clock cycle, not Ampere, since FP32 instructions would have to wait their turn until the INT32 instructions completed.
Which suggests that maybe these new FP32/INT32 mixed pipelines cannot execute FP32 at full speed (or cannot execute all the instructions).
Don't see where you're getting this from, especially considering I looked at your calculations and they seem dubious at best IMHO.
We do know that Turing had reduced register file access for INT32 (64 vs 256 for FP32). If it's the same here (and everything suggests Ampere is just a Turing facelift), then obviously not all FP32 instruction sequences can run on these pipelines.
Interesting speculation, but in light of what you've posted before, I don't know if the foundation of this speculation is necessarily sound.
Anyway a TF table:
| GPU | Ampere TF | Turing TF (me) | Turing TF (NV) |
|---|---|---|---|
| 3080 (Ampere) | 30 | 21.6 | 19.5 |
| 3070 (Ampere) | 20.4 | 14.7 | 13.3 |
| 2080 Ti (Turing) | 18.75 (me) or 20.7 (NV) | 13.5 | 13.5 |
| 2080 (Turing) | 14 (me) or 15.5 (NV) | 10.1 | 10.1 |
| 2070 (Turing) | 10.4 (me) or 11.5 (NV) | 7.5 | 7.5 |
So this is basically a recap of your calculations that I already touched on above, no need to repeat myself.
Needless to say, I think the context and conclusions of your calculations are inaccurate, because I don't think you set up the conditions for those calculations correctly.
Bonus round: RDNA1 TF
RDNA1 has no separate INT32 pipeline; all INT32 instructions are handled in the main stream. Thus it's essentially almost exactly the same as Ampere, but with no skew on the last instruction, so the +48% theoretical max applies here (+2.3% over Ampere's achievable max).
| GPU | Ampere TF | Turing TF (me) | Turing TF (NV) |
|---|---|---|---|
| 5700XT (RDNA1) | 10.01 | 7.2 | ? |
Amusingly enough, the 5700XT's actual performance is pretty similar to the 2070's, and these adjusted TF numbers show exactly that (10 TF vs 10-11 TF).
I'm not interested in discussing RDNA1 here, as the crux of the discussion is your (IMHO) flawed/inaccurate Ampere/Turing calculations, but needless to say, I wouldn't be completely confident in these stated numbers either.
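(That said, for anyone who wants to check the 7.2 figure anyway: under the same toy model, with no separate INT32 pipe the whole 26% INT32 slice steals FP32 issue slots. Taking the 5700XT at its ~9.75 TF boost-clock spec:)

```python
fp, i32 = 74, 26                  # same 26% INT32 / 74% FP32 mix as before

turing = max(fp / 64, i32 / 64)   # separate INT32 pipe: FP32 stream dominates
rdna1 = (fp + i32) / 64           # shared pipe: INT32 takes FP32 slots

factor = turing / rdna1           # ~0.74 Turing TF per RDNA1 TF
print(f"5700XT: {9.75 * factor:.2f} Turing-equivalent TF")   # ~7.2
```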
I don't get why everyone is jumping on psorcerer. You should be happy that he's pointing out that the TFLOPS advertised by nVidia require some additional awareness.
Personally, I've no problem with anyone who wants to deep-dive into the numbers these companies provide us. However, after looking over the OP's conditions for their calculations, I don't think they're accurate. Not the calculations themselves, but the foundation for initiating them and the context, because there are parts of the details NV provided that he either ignored or didn't catch, and he then applied conditionals to only one side rather than both, as he said at the outset he would.
If the conditions and context for the calculations look suspect, I think that's worth questioning, as long as it's respectful. FWIW, there's been a rather strong push by some to downplay Nvidia's announcements, especially on the I/O front, following their presentation. If you look a little deeper you can infer why some people are doing it, too, but I'll leave that for another time; I don't think it's really necessary to get into why at this point.