The bolded isn't true. In fact, pretty much the opposite is true: in a tightly thermally constrained environment (e.g. a handheld), the marginal benefit of increased parallelism (i.e. more SMs) can be quite large.
To demonstrate, let's look at a power curve for Pascal I've put together. Unlike my previous power curves for A72 and A53 CPU clusters (which are based on solid real-world data from Anandtech and should be considered reasonably accurate), this is a much rougher approximation based on just four data points:
- TSMC's claims of "40% higher speed" and "60% power saving" over 20nm, each applied separately to the TX1's GPU drawing 1.5W at 500MHz (divided by 2 for 750mW per SM).
- Power draw readings from the GTX1080 before and after overclocking (full board power readings, minus GDDR5X, divided by number of SMs).
Obviously I'm extrapolating a lot from fairly poor data, but hopefully it should be in the right ballpark, and enough for our discussion in any case. (I should also note that this isn't strictly a measure of power draw for the SMs themselves, but rather a measure of the draw of an entire Pascal GPU "per SM", so it includes other components like ROPs, TMUs, etc., on the assumption that they're always in roughly the same proportion to SMs.) In any case, here's the power curve:
The important thing to note is that, like virtually all IC power curves, it's not linear: a given percentage increase in clock speed requires a much larger percentage increase in power consumption. What this means is that, within a fixed power budget, you'll get better performance from more SMs at a lower clock speed than from fewer SMs at a higher clock speed.
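To put a rough number on that non-linearity, here's a minimal sketch that assumes power scales as clock speed raised to some exponent k. It solves for k from two of the per-SM figures quoted further down in this post (1000 mW at 780 MHz and 2000 mW at 1025 MHz). The power-law form and the function names are my own assumptions for illustration, not the method used to build the curve.

```python
import math

# Two per-SM points taken from the 1x SM figures in this post (mW, MHz).
P1, F1 = 1000, 780
P2, F2 = 2000, 1025

# Assumption: power ~ clock**k over this range. Solve for k from the two points.
k = math.log(P2 / P1) / math.log(F2 / F1)

def clock_mhz(power_mw_per_sm):
    """Clock reachable at a given per-SM power under the assumed power law."""
    return F1 * (power_mw_per_sm / P1) ** (1 / k)

def relative_throughput(n_sms, total_power_mw):
    """SM count times clock: throughput in arbitrary units for a shared budget."""
    return n_sms * clock_mhz(total_power_mw / n_sms)

print(f"fitted exponent k = {k:.2f}")
for n in (1, 2, 3):
    print(n, "SM(s) at 1000 mW total:", round(relative_throughput(n, 1000)))
```

With k coming out at roughly 2.5, doubling the power only buys about a third more clock, so splitting the same budget across more SMs comes out ahead every time.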
Let's look at the clock speed (and raw floating point performance) that could be achieved with different numbers of SMs within the power constraints we might expect for a handheld GPU:
1x SM:
1000 mW - 780 MHz - 200 Gflops FP32 - 400 Gflops FP16
1500 mW - 915 MHz - 234 Gflops FP32 - 468 Gflops FP16
2000 mW - 1025 MHz - 262 Gflops FP32 - 525 Gflops FP16
2x SM:
1000 mW - 595 MHz - 305 Gflops FP32 - 609 Gflops FP16
1500 mW - 700 MHz - 358 Gflops FP32 - 717 Gflops FP16
2000 mW - 780 MHz - 400 Gflops FP32 - 800 Gflops FP16
3x SM:
1000 mW - 510 MHz - 392 Gflops FP32 - 783 Gflops FP16
1500 mW - 600 MHz - 461 Gflops FP32 - 922 Gflops FP16
2000 mW - 670 MHz - 515 Gflops FP32 - 1030 Gflops FP16
As you can see, a 3x SM configuration can achieve nearly the same performance with 1000mW that a 2x SM configuration can with twice that, and a full 50% more than a 1x SM config can manage with 2000mW at hand.
This isn't to say that I expect a 3x SM GPU in the NX, but there would certainly be a sizeable performance jump over 2x SMs if they decided to do so.
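For anyone wanting to check the Gflops figures above, here's a small sketch of the arithmetic. It assumes 128 FP32 cores per SM, two floating point operations per core per clock (one fused multiply-add), and FP16 running at double the FP32 rate. Those three constants are assumptions on my part, but they reproduce the numbers in the list to within rounding.

```python
CORES_PER_SM = 128             # assumption: FP32 cores per SM
FLOPS_PER_CORE_PER_CLOCK = 2   # assumption: one FMA counted as two operations
FP16_RATE = 2                  # assumption: FP16 at twice the FP32 rate

def gflops_fp32(n_sms, clock_mhz):
    """Peak FP32 throughput in Gflops for n_sms SMs at clock_mhz."""
    return n_sms * CORES_PER_SM * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1000

def gflops_fp16(n_sms, clock_mhz):
    """Peak FP16 throughput in Gflops under the double-rate assumption."""
    return FP16_RATE * gflops_fp32(n_sms, clock_mhz)

# The 1000 mW rows from the list above: (SM count, clock in MHz).
for n_sms, clock in [(1, 780), (2, 595), (3, 510)]:
    print(f"{n_sms}x SM @ {clock} MHz: "
          f"{gflops_fp32(n_sms, clock):.0f} FP32, {gflops_fp16(n_sms, clock):.0f} FP16")
```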