While looking into Nvidia's upcoming Parker chip, I figured it would be worth discussing the theoretically possible, but still very, very unlikely scenario that Nintendo could use the same SoC die across both handheld and home versions of NX with simply a substantial difference in clock speeds between the two. I should emphasise that I don't consider Parker suitable to be this chip (for reasons I'll go into momentarily), but it's worth looking into Parker as a guide for where Nvidia's plans are with Tegra:
- Manufactured on TSMC's 16FF+ process
- Hex-core CPU: 2x Denver, 4x A57
- "Pascal" GPU, likely Maxwell-style 128 ALU SMs, probably 3-4 SMs (ie 384 or 512 ALUs)
- GPU clocks should exceed 1GHz, possibly around 1.2GHz
- LPDDR4 memory, 128-bit, 51GB/s
As for the reasons I wouldn't expect this to be suitable for, well, either device: the first is those Denver cores. Without going into too much detail, Denver isn't actually a "true" ARMv8 CPU. It uses an internal VLIW-based instruction set and dynamically recompiles ARM code to that instruction set. This gets in the way of two things that are vitally important to a console/handheld CPU: predictable performance and straightforward optimisation. Denver has shown itself to have fairly erratic performance in its debut in the Nexus 9, performing well in certain situations and poorly in others, depending on how well suited the workload is to its peculiar architecture. I wouldn't be all that confident in its ability to run, for example, pathfinding routines with any degree of efficiency. On the optimisation front, trying to write ARM code which is optimised for Denver would be like trying to write x86 code which is going to be emulated on Itanium, i.e. something which would send even the best coders into the depths of insanity. Something to be avoided at all costs when you want to make porting to your platform as quick and painless as possible.
Secondly, while 51GB/s is plenty of memory bandwidth for a handheld, it would be completely insufficient for the home console. Think of it like the XBO not being able to use its eSRAM at all and having to run everything off its main DDR3 pool, but with even less bandwidth than that.
That all being said, it may be worth considering what a hypothetical Tegra chip for both home console and handheld might look like. We'll call it the TN1:
- Manufactured on TSMC 16FF+
- CPU: 8x A72 (2GHz+ on home console, a lot less on handheld)
- GPU: 4 SMs, 512 ALU (1.2GHz+ on home console, ~300MHz on handheld, or 3 SMs at ~400MHz for yields)
- RAM: 4x 64-bit LPDDR4 (full 256-bit bus used on home console for ~120GB/s, 64-bit bus on handheld for ~30GB/s)
Before we get into the inherent craziness of Nintendo releasing a handheld with an SoC like this, let's look at the advantages Nintendo would get for using a single SoC across both devices:
- Reduced R&D cost: you only have to pay Nvidia to design a single die, and you only have to go through one tape-out and validation process.
- Simpler procurement: you only need to deal with a single order for a single piece of inventory, and you reduce inventory risk: if, for example, the home console doesn't sell as well as you expect, you can use those chips for the handheld instead.
- Binning: You can bin dies for the different products, which usually isn't possible with semi-custom console chips. For example, you can test the dies to see which ones run better at lower voltages and use those for the handheld. Alternatively, you can enable only 3 SMs in the handheld, allowing you to use dies which would otherwise be considered faulty.
- Perfect scaling: You want precisely five times the GPU performance in the home console versus the handheld? How about the exact same chip running at five times the clock?
- Handheld energy efficiency: Using a large, low-clocked GPU in the handheld will give better performance per Watt than a smaller, higher-clocked GPU would. (ie 512 ALUs at 300MHz will consume less power than 256 ALUs at 600MHz would for the same performance)
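If you want a feel for why the wide-and-slow approach wins, here's a quick Python sketch. Dynamic power scales roughly with ALU count × frequency × voltage², and the lower clock lets you drop the supply voltage; the voltages below are placeholders I've made up purely for illustration, not measured figures.
# Toy model: dynamic power ~ ALU count * frequency * voltage^2.
# The voltages are hypothetical, chosen only to illustrate the trend.
def relative_dynamic_power(alus, freq_ghz, voltage):
    return alus * freq_ghz * voltage ** 2

wide_slow = relative_dynamic_power(512, 0.3, 0.70)    # assumed ~0.70V needed at 300MHz
narrow_fast = relative_dynamic_power(256, 0.6, 0.85)  # assumed ~0.85V needed at 600MHz

# Same theoretical throughput (ALUs * clock), but the wide, slow GPU uses ~30% less power here.
print(f"512 ALUs @ 300MHz: {wide_slow:.0f}, 256 ALUs @ 600MHz: {narrow_fast:.0f}")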
And the disadvantages:
- Chip cost: As most chips will end up in handhelds, you end up using a much bigger (and hence more expensive) die in the handheld than you need to for a given performance level, substantially increasing your costs.
- Limit to voltage binning: With let's say 75% of dies going into handhelds, there wouldn't be a huge gain from binning the most energy-efficient chips for the handheld. You'd get to design your handheld for the 25th percentile performance, which is better than the 0th percentile, but not by a whole lot.
- CPU choice: The optimal CPU cores for a handheld chip running at ~2W and a home console running an order of magnitude higher are going to be quite different. You either end up limiting the peak performance of the home console (say with A53s) or forcing the mobile CPU to run at extremely low frequencies (say with A57s/A72s).
CPU
The thing that really interests me from the above is the CPU choice. Unlike GPU performance, memory quantity and memory bandwidth, CPU requirements don't scale down when you go from a 1080p console to a 540p handheld. Game logic doesn't vary with resolution, and if you want demanding games to run across both devices you'll want to squeeze as much CPU performance out of the handheld as possible.
As I've argued before, if you're designing an SoC for a handheld today, A53s make the most sense, as they provide the best performance at the kind of thermal limit (ie <1W) that's going to be allocated to a handheld CPU. The fact that they're so small also means you can squeeze eight or more of them on a small, cheap SoC and still get a fairly good amount of performance out of them. In this situation, though, they're probably not going to give the kind of performance you'd want from a home console CPU. They should clock to over 2GHz on 16FF+, and by blu's matrix mult benchmark they would actually comfortably outperform PS4's and XBO's CPUs at that clock, but in other circumstances their performance may be found a bit wanting for CPU-intensive multi-platform games. That pretty much leaves us with A72s.
Fortunately, there happens to be a very good resource on the performance and power consumption of A53 and A72 cores on 16FF+ in the form of Anandtech's review of the Huawei Mate 8, which uses a 16FF+ Kirin 950 SoC with said cores. From this, we can estimate the kind of clock speed at which we might expect to be able to run eight A72 cores in a handheld on a 16FF+ chip. We'll assume that the combination of 16nm and very low clocks has been extremely successful in bringing down the GPU's power consumption, to the point where it consumes under 1W in operation and there's a full 1W left over for the CPU (i.e. pretty much the best case scenario).
One challenge to estimating the achievable clocks, even with the data from the Anandtech article, is that the Kirin 950 applies a minimum supply voltage of 775mV to each A72 cluster at 1.5GHz and below, a conservative decision by Huawei on the basis that they're using early 16FF+ silicon and the A72 clusters spend most of their time north of 1.5GHz in any case. This won't be Nintendo's strategy: they'll use a fixed clock, and will want the supply voltage as low as it can go while still sustaining that clock, to keep power consumption down. The manufacturing process would also be about a year more mature by then, so it should be able to do so more reliably.
What this means is that, below 1.5GHz, the power consumption figures for the A72 clusters in the Kirin 950 wouldn't reflect the power consumption expected from A72 clusters in a Nintendo handheld. Fortunately, Anandtech does give us a few graphs and data points which we can use to estimate the actual clocks we may be looking at.
Working from the data available in the article, I've come to an estimated 800MHz clock speed for two quad-core clusters of A72s on 16FF+ within a 1W combined TDP. This is actually higher than I'd expected, but obviously it's a lot lower than they'd be clocked in a console environment (perhaps by a factor of 3). While there aren't many benchmarks I can find covering both the A72 and Jaguar, the most suitable one from a gaming point of view would be the Geekbench single-core floating point test (as multi-core would include the A53s on big.LITTLE ARM SoCs). Taking this as a guide, the octo-core A72 at 800MHz would actually only perform about 20% worse than the 1.6GHz octo-core Jaguar used in the PS4. This is far closer than I would have thought for such a TDP-constrained CPU, and it would actually put ports of more CPU-intensive PS4 and XBO games within the ballpark of possibility. That being said, this assumes a full 1W is available for the CPU (it could be half that) and that Geekbench's floating point test is a reasonable analogue for game performance (it may not be), so take the comparison with a grain of salt.
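To make the scaling behind that comparison explicit, here's a quick back-of-the-envelope sketch in Python. It uses the clocks and the ~20% figure above rather than actual Geekbench scores, so treat it purely as a sanity check on what per-clock advantage the A72 would need over Jaguar for the comparison to hold.
# Assumptions from the text: 8x A72 @ 0.8GHz lands ~20% behind 8x Jaguar @ 1.6GHz.
a72_cores, a72_clock_ghz = 8, 0.8
jaguar_cores, jaguar_clock_ghz = 8, 1.6
relative_performance = 0.8  # i.e. 20% worse

# If performance ~ cores * clock * per-clock throughput, solve for the implied ratio.
per_clock_ratio = (relative_performance * jaguar_cores * jaguar_clock_ghz) / (a72_cores * a72_clock_ghz)
print(f"Implied A72 per-clock FP advantage over Jaguar: {per_clock_ratio:.1f}x")  # ~1.6x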
Cost
Aside from CPU performance, the cost of such a chip is something we could also look at, although with much less rigour and a much larger margin of error. Die cost is pretty much just a function of die size and manufacturing node, so first we'll try to estimate the die size of our TN1. The CPU is the easy part, as ARM have told us that a quad-core cluster of A72s on 16FF+ with 2MB cache is around 8mm², giving us 16mm² for our CPU. The GPU is harder to estimate without any Pascal die photos to measure off, but using the absurd oversimplification that this GPU has 1/5th the SMs of the 314mm² GP104 and therefore must be 1/5th the size, we've got a value of 63mm². The 128-bit LPDDR4 interface on the 16nm A9X takes up around 24mm² of space, so a 256-bit interface would need around 48mm². Then, add about 25% for the remaining blocks (audio, crypto, codecs, etc.) and we come to 159mm², which is, to put it mildly, a giant fucking die to try to squeeze into a handheld. The A9X is 147mm², though, so let's just roll with it as a hypothetical.
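Laid out as a quick Python sketch (same figures as above, nothing new, just so the arithmetic is easy to check or tweak):
# Die area estimate for the hypothetical TN1, using the figures above.
cpu_mm2 = 2 * 8              # two quad-A72 clusters at ~8mm^2 each (ARM's figure)
gpu_mm2 = 63                 # crude 1/5th-of-GP104 (314mm^2) scaling
mem_if_mm2 = 2 * 24          # 256-bit LPDDR4 PHY, scaled from the A9X's 128-bit interface
subtotal = cpu_mm2 + gpu_mm2 + mem_if_mm2
total_mm2 = subtotal * 1.25  # +25% for audio, crypto, codecs, I/O and other blocks
print(f"Estimated TN1 die size: {total_mm2:.0f}mm^2")  # ~159mm^2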
Said A9X is our best comparison point for the cost of the TN1, but we don't actually have any direct information on the A9X's cost. There's a blog post which attempts to estimate the price of the A9X, and comes to a value of $37.30 including packaging, but it's worth keeping in mind that even the author admits that "there is room for error" in the estimate, so it could be north or south of that. I can't say I'd do any better, though, so let's take $37.30 as the cost of an A9X. There are a few aspects of the hypothetical TN1 which would push its price up or down relative to the A9X. The first, and most obvious, is that it's appearing about 18 months later, meaning a more mature manufacturing process with higher yields and likely lower wafer costs, bringing down the price.
On the wafer cost side, this paper (PDF) estimated a 5.5% reduction in unyielded wafer costs for 16nm FinFET from Q4/2015 to Q4/2016, and if we extrapolate that out to 18 months we'd see a roughly 8.25% wafer cost decrease since the A9X, which means we can assume a $7,686 cost per 300mm 16nm wafer from TSMC for the TN1 (going by the blog's $8,400 wafer estimate).
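That extrapolation is simple enough to write out (the inputs are the ones above; it lands a touch over the $7,686 I'm using, with the difference just down to rounding in the original estimate):
# Extrapolate the paper's ~5.5%/year decline in unyielded 16nm wafer cost out to ~18 months,
# applied to the blog's $8,400/wafer estimate from the A9X era.
a9x_wafer_cost = 8400
annual_decline = 0.055
tn1_wafer_cost = a9x_wafer_cost * (1 - annual_decline * 1.5)
print(f"~${tn1_wafer_cost:.0f} per 16FF+ wafer")  # ~$7,700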
Yield improvements are much more difficult to estimate. The A9X calculation worked on an estimate of 65% yield for the 147mm² die, which works out to a fault probability of about 0.3% per mm². We would expect this fault probability to drop over time (increasing yields), although the increased die size will have the opposite effect. We do actually have a useful data point on this, which is the existence of the 16nm GP104 die in consumer products about half-way between the launch of the A9X and the TN1. Nvidia reportedly sees about 60% gross margin from its high-end desktop GPU sales, and we would assume they wouldn't release the GTX 1080 unless it gave them similar margins to the product it's replacing, so we should be able to assume that the price at which Nvidia sells the GTX 1080 chip to EVGA, Asus, etc. gives them about a 60% gross margin.
Taking the $599 price point of the GTX 1080, let's strip away 25% of that for retailer margins, leaving $449.25 going back to EVGA. Let's assume EVGA themselves work on around a 15% margin and 10% goes on logistics, leaving $336.94 for the full graphics card, of which the major costs will be the GPU chip and the GDDR5X memory. The GDDR5X is obviously more expensive than GDDR5, perhaps significantly so given Nvidia's choice not to use it in the GTX 1070, but it's difficult to estimate. Regular GDDR5 prices have likely come down quite a bit since Sony was reportedly paying $88 for 8GB of 5.5GT/s memory on a 256-bit bus, but the bump to 10GT/s GDDR5X may be equal and opposite, so let's just assume an $88 cost for the GDDR5X today, as it's the only data point we have. That leaves $248.94 for the GPU chip and other components, of which we'll assume somewhere around $200 is the GPU. This puts the cost to Nvidia at around $80 if they're to retain their 60% gross margin. With an estimated $8,053 wafer cost, this would indicate a fault rate of pretty close to 0.2% per mm² for 16FF+ at the moment.
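Here's that chain of assumptions as a Python sketch, together with a simple dies-per-wafer and per-mm² fault model to check the implied fault rate. The exact dies-per-wafer treatment isn't spelled out anywhere above, so the formula here is my own approximation; the point is just that somewhere around 0.2% per mm² gives an ~$80 GP104.
import math

def good_dies_per_wafer(die_area_mm2, fault_prob_per_mm2, wafer_diameter_mm=300):
    # Rough approximation: gross dies minus an edge-loss term, times a per-mm^2 yield model.
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    gross = wafer_area / die_area_mm2 - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    yield_fraction = (1 - fault_prob_per_mm2) ** die_area_mm2
    return gross * yield_fraction

# Strip the GTX 1080's $599 down to an implied chip cost (all assumptions from the text above).
board_revenue = 599 * 0.75         # minus ~25% retail margin -> $449.25
bom_budget = board_revenue * 0.75  # minus ~15% AIB margin and ~10% logistics -> ~$336.94
chip_and_rest = bom_budget - 88    # minus assumed $88 of GDDR5X -> ~$248.94
nvidia_cost = 200 * 0.4            # ~$200 GPU sold at a 60% gross margin -> ~$80 to manufacture
print(f"~${chip_and_rest:.2f} left for GPU + other parts, implied GP104 cost ~${nvidia_cost:.0f}")

# What fault rate makes a 314mm^2 GP104 cost ~$80 on an ~$8,053 wafer?
for fault_rate in (0.0015, 0.002, 0.0025):
    cost = 8053 / good_dies_per_wafer(314, fault_rate)
    print(f"{fault_rate:.2%} per mm^2 -> ~${cost:.0f} per GP104 die")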
Now, the reduction in fault rate won't be linear over time, and should be expected to follow something closer to an exponential curve, with fairly rapid reduction early on, followed by much slower reduction in faults as the process matures. With only two data points, it's hard to estimate where we are in that curve, but we're probably past the biggest reductions if a chip like GP104 is even viable. Hence, I'm going to estimate that the fault rate for our March 2017 launch TN1 will be on the order of 0.15% per mm², not as big a jump as from the A9X to the GP104, but a reasonable enough improvement for a still-maturing node.
Given a $7,686 wafer cost, a 0.15% fault probability per mm² and a 159mm² die, my calculations give me a "raw" die cost of $25.86. Add about $5 for packaging to give $30.86, and then a 15% gross margin for Nvidia (which is roughly what AMD are getting for their semi-custom chips) to give a final cost to Nintendo of $36.31 per TN1.
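Plugging those numbers into the same simple die-cost model gives much the same answer (it comes out within a dollar or so of the $25.86 above, with the difference down to exactly how you treat edge dies):
import math

# TN1 die cost under the same crude dies-per-wafer and per-mm^2 fault model as above.
wafer_cost, die_mm2, fault_per_mm2 = 7686, 159, 0.0015
wafer_area = math.pi * 150 ** 2
gross_dies = wafer_area / die_mm2 - math.pi * 300 / math.sqrt(2 * die_mm2)
yield_fraction = (1 - fault_per_mm2) ** die_mm2
raw_die_cost = wafer_cost / (gross_dies * yield_fraction)
price_to_nintendo = (raw_die_cost + 5) / (1 - 0.15)  # +$5 packaging, ~15% Nvidia gross margin
print(f"Raw die ~${raw_die_cost:.2f}, price to Nintendo ~${price_to_nintendo:.2f}")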
Is that feasible for a Nintendo handheld? Well, IHS estimated the cost of the 3DS's SoC at around $10, so it's a hell of a lot more than they've spent before. On the other hand, Nintendo reportedly spent $33.80 on the 3DS's 3D screen, so perhaps they're not entirely unwilling to spend that kind of money on a handheld component.
Handheld BoM
Let's assume for a second that, apart from the TN1 and RAM, Nintendo is keeping every other component in the device as close to the cheapest end of the spectrum as possible to get this to work. Then, aside from SoC and RAM, you'd be looking at a BoM similar to cheap $70 4.5" 480p smartphones like the Huawei Y560 or Honor Bee. Nintendo would be adding physical controls, but the modem wouldn't be needed, and the screen, battery, etc. would all be very similar. These are obviously sold at extremely thin margins, similar to a console, so if they're selling for about $70 we're probably looking at a BoM of $35-$40, or closer to $30 once you remove the SoC and RAM. We'll need to add the TN1 to that, but also RAM. For RAM, given the performance of the device, you'd probably be looking at 3GB of LPDDR4.
IHS's Galaxy S7 teardown estimates a price of $25 for a 4GB chip of LPDDR4 in early 2016 on a PoP package. Nintendo would be looking for 3GB, a year later, and without the expensive PoP packaging, so the price would certainly be lower, but it's still going to be relatively high-end RAM by then, so let's assume $17 for the 3GB. For the remaining changes, let's also assume that Nintendo will include more flash memory than these cheap phones, and the physical controls and Amiibo NFC chip add a bit of cost, bringing the entire BoM up by about $15.
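Adding it all up (all of these are the estimates from above, so adjust to taste):
# Rough handheld BoM from the component estimates above.
base_phone_bom = 30.00   # cheap 480p phone BoM, minus its own SoC and RAM
tn1_soc = 36.31          # estimated price to Nintendo from the die-cost section
lpddr4_3gb = 17.00       # assumed 3GB of LPDDR4, non-PoP, early 2017 pricing
extras = 15.00           # more flash, physical controls, Amiibo NFC
total = base_phone_bom + tn1_soc + lpddr4_3gb + extras
print(f"Estimated handheld BoM: ${total:.2f}")  # ~$98.31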
So, after taking all this into account, our estimated BoM for a TN1-powered handheld with a 480p screen is $98.31. It should be emphasised that there are plenty of sources of error in this estimate, with many guesses along the way, but that does put us in the ballpark necessary to sell for $199 retail while breaking even. Which is kind of crazy for a handheld which could in theory handle PS4 ports, but such is technological progress.
Home console BoM
As the TN1 would also be used in an NX home console in this scenario, it's also worth looking at the cost implications there. A $36.31 SoC would be substantially cheaper than would usually be expected in a home console, particularly compared to the ~$100 chips used in the Wii U and PS4, but it would give them scope to spend more on other components while keeping the price reasonable. The first of these would be RAM, where they could use the same 3GB LPDDR4 parts as the handheld for 12GB overall at $68. This may seem like a lot, but the total SoC+RAM cost would still be just $104, compared to $188 for the PS4 and $170 for the XBO at launch. If they drop the optical drive and hard drive (estimated at $28 and $37 respectively at PS4 launch, although both have come down since), they could include a sizeable pool of flash (ie 256GB+) while still at least breaking even at a sub-$300 launch price. At a 2GHz CPU clock and a 1.5GHz GPU clock we'd be looking at a TDP of around 40W, so you'd be getting a fairly compact, power-efficient console for your money.
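For completeness, the same sort of sketch for the home console's SoC+RAM figure (same assumptions as before; the PS4/XBO numbers are launch-era teardown estimates):
# Home console SoC + RAM, using the same per-part estimates as the handheld.
tn1_soc = 36.31
ram_12gb = 4 * 17.00  # four of the assumed $17 3GB LPDDR4 parts
nx_home_soc_ram = tn1_soc + ram_12gb
print(f"NX home SoC+RAM: ~${nx_home_soc_ram:.0f} vs ~$188 (PS4) and ~$170 (XBO) at launch")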
TL;DR
In theory, it seems that it would be possible for Nintendo to use a single 16nm chip in both the home NX and the handheld NX (clocked substantially lower in the latter), where the home console would have roughly 1.5 NV Tflops of GPU performance and would be capable of running PS4/XBO games, and the handheld would have roughly 300 NV Gflops of GPU performance and would be capable of running such games at 480p. That all being said, I absolutely don't expect it to happen, as even if Nintendo wanted these kinds of performance levels, they could achieve them at much lower cost by using a different, smaller chip in the handheld.