
Possible hint at AMD's next-gen APU (codename: Gonzalo) - 8 cores, 3.2GHz clock, Navi 10-based GPU

onQ123

Member
As @LordOfChaos pointed out: off-the-shelf Navi won't feature an MCM design. Hopefully Navi is post-GCN and has broken out of its shell, surpassing its limits.

I wasn't aware Nvidia hit a core limit, or is this speculation?


In the paper they talk about needing to use multi-GPU modules to go above 128 SMs & that 256 SMs is 4.5X more SMs than the largest Nvidia single cards out now.

Also, Nvidia moved from 128 CUDA cores per SM to 64 CUDA cores.





3.2 MCM-GPU and GPM Architecture

As discussed in Sections 1 and 2, moving forward beyond 128 SM counts will almost certainly require at least two GPMs in a GPU. Since smaller GPMs are significantly more cost-effective [31], in this paper we evaluate building a 256 SM GPU out of four GPMs of 64 SMs each. This way each GPM is configured very similarly to today’s biggest GPUs. Area-wise each GPM is expected to be 40% - 60% smaller than today’s biggest GPU assuming the process node shrinks to 10nm or 7nm. Each GPM consists of multiple SMs along with their private L1 caches. SMs are connected through the GPM-Xbar to a GPM memory subsystem comprising a local memory-side L2 cache and DRAM partition. The GPM-Xbar also provides connectivity to adjacent GPMs via on-package GRS [45].
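To make the layout concrete, here's a rough sketch of the topology the excerpt describes (the class layout is my reading of the paper, not the paper's own code; the per-GPM bandwidth figure just splits the 3 TB/s total from Section 2.1 four ways):

```python
# Rough model of the paper's 256-SM MCM-GPU: four GPMs of 64 SMs each, every
# GPM pairing its SMs (with private L1s) to a local memory-side L2 slice and
# DRAM partition via the GPM-Xbar, with GRS links to adjacent GPMs on-package.
from dataclasses import dataclass

@dataclass
class GPM:
    sms: int = 64                          # per module, like today's biggest GPUs
    dram_partition_gbs: float = 3072 / 4   # assumed equal split of the 3 TB/s total

package = [GPM() for _ in range(4)]
print(sum(g.sms for g in package))  # 256 SMs, vs the assumed 128-SM monolithic limit
```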



1 INTRODUCTION

GPU-based compute acceleration is the main vehicle propelling the performance of high performance computing (HPC) systems [12, 17, 29], machine learning and data analytics applications in large-scale cloud installations, and personal computing devices [15, 17, 35, 47]. In such devices, each computing node or computing device typically consists of a CPU with one or more GPU accelerators. The path forward in any of these domains, either to exascale performance in HPC, or to human-level artificial intelligence using deep convolutional neural networks, relies on the ability to continuously scale GPU performance [29, 47]. As a result, in such systems, each GPU has the maximum possible transistor count at the most advanced technology node, and uses state-of-the-art memory technology [17]. Until recently, transistor scaling improved single GPU performance by increasing the Streaming Multiprocessor (SM) count between GPU generations. However, transistor scaling has dramatically slowed down and is expected to eventually come to an end [7, 8]. Furthermore, optic and manufacturing limitations constrain the reticle size which in turn constrains the maximum die size (e.g. ≈ 800mm2 [18, 48]). Moreover, very large dies have extremely low yield due to large numbers of irreparable manufacturing faults [31]. This increases the cost of large monolithic GPUs to undesirable levels. Consequently, these trends limit future scaling of single GPU performance and potentially bring it to a halt. An alternate approach to scaling performance without exceeding the maximum chip size relies on multiple GPUs connected on a PCB, such as the Tesla K10 and K80 [10]. However, as we show in this paper, it is hard to scale GPU workloads on such “multi-GPU” systems, even if they scale very well on a single GPU. This is due…


2.1 GPU Application Scalability

To understand the benefits of increasing the number of GPU SMs, Figure 2 shows performance as a function of the number of SMs on a GPU. The L2 cache and DRAM bandwidth capacities are scaled up proportionally with the SM count, i.e., 384 GB/s for a 32-SM GPU and 3 TB/s for a 256-SM GPU. The figure shows two different performance behaviors with increasing SM counts. First is the trend of applications with limited parallelism whose performance plateaus with increasing SM count (Limited Parallelism Apps). These applications exhibit poor performance scalability (15 of the total 48 applications evaluated) due to the lack of available parallelism (i.e. number of threads) to fully utilize larger number of SMs. On the other hand, we find that 33 of the 48 applications exhibit a high degree of parallelism and fully utilize a 256-SM GPU. Note that such a GPU is substantially larger (4.5×) than GPUs available today. For these High-Parallelism Apps, 87.8% of the linearly-scaled theoretical performance improvement can potentially be achieved if such a large GPU could be manufactured. Unfortunately, despite the application performance scalability with the increasing number of SMs, the observed performance gains are unrealizable with a monolithic single-die GPU design. This is because the slowdown in transistor scaling [8] eventually limits the number of SMs that can be integrated onto a given die area. Additionally, conventional photolithography technology limits the maximum possible reticle size and hence the maximum possible…
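The bandwidth figures there are simple proportional scaling; a quick check (my arithmetic, not the paper's):

```python
# 384 GB/s for 32 SMs works out to 12 GB/s per SM; held constant up to 256 SMs:
per_sm = 384 / 32
print(per_sm * 256)  # 3072 GB/s, i.e. the 3 TB/s quoted for the 256-SM GPU
```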


Our optimized MCM-GPU architecture achieves a 45.5% speedup over the largest possible monolithic GPU (assumed to be a 128 SM GPU), and comes within 10% of the performance of an unbuildable, similarly sized monolithic GPU.


https://www.pcgamer.com/rtx-2080-everything-you-need-to-know/


Nvidia has reworked the SMs (streaming multiprocessors) and trimmed things down from 128 CUDA cores per SM to 64 CUDA cores. The Pascal GP100 and Volta GV100 also use 64 CUDA cores per SM, so Nvidia has standardized on a new ratio of CUDA cores per SM. Each Turing SM also includes eight Tensor cores and one RT core, plus four texturing units. The SM is the fundamental building block for Turing, and can be replicated as needed.



For traditional games, the CUDA cores are the heart of the Turing architecture. Nvidia has made at least one big change relative to Pascal, with each SM able to simultaneously issue both floating-point (FP) and integer (INT) operations—and Tensor and RT operations as well. Nvidia says this makes the new CUDA cores "1.5 times faster" than the previous generation, at least in theory.


All Turing GPUs announced so far will be manufactured using TSMC's 12nm FinFET process. The TU104 used in the GeForce RTX 2080 has a maximum of 48 SMs and a 256-bit interface, with 13.6 billion transistors and a die size measuring 545mm2. That's a huge chip, larger even than the GP102 used in the 1080 Ti (471mm2 and 11.8 billion transistors), which likely explains part of the higher pricing. The GeForce RTX 2080 disables two SMs but keeps the full 256-bit GDDR6 configuration.
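The article's SM arithmetic checks out; a quick sanity check of the shipping configuration:

```python
# TU104: 48 SMs x 64 CUDA cores; the RTX 2080 ships with two SMs disabled.
sms, cores_per_sm = 48, 64
print(sms * cores_per_sm)          # 3072 cores in the full TU104
print((sms - 2) * cores_per_sm)    # 2944 cores in the RTX 2080 as sold
```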
 

Pimpbaa

Member
Is there a use case for this where an HDD wouldn't be able to replicate the same thing, just with longer load times?
With plenty of RAM, streaming assets becomes more of a bottleneck (as speedy as SSDs are, they are no match for RAM, let alone VRAM). Streaming was used heavily on PS3/360.

Streaming has been used heavily since the PS2 (the GTA games would have been impossible without it). Increasing the ability to stream more data from your storage medium will always be beneficial. GTAV on last-gen consoles needed to combine both their hard drives and their optical drives for the game to push the visuals it did (the digital version has severe streaming issues). Making an SSD standard on a next-gen console, combined with a large increase in CPU power, will mean much more data will be able to be streamed. Sure, RAM can be a bottleneck, but throwing in a 5400rpm hard drive (or even a 7200rpm one) will be a far bigger one on what we expect next-gen hardware to be. It would impact both load times and texture variety (and possibly the resolution of those textures).
 

SonGoku

Member
SonGoku at best GT6 looks about as good as Forza 4, with all fully modeled cars on the track. Then Forza runs at a locked 60fps. But actually I think environment detail is more inconsistent as well, and the cars can sound like lawnmowers. That's it though, I'm done shitting on GT now lol.

Not true, even Need for Speed Shift on 360 is clearly more detailed than GT6 at its best or Forza 4, and it's 30fps.
GT6 is unbalanced, I'll give you that, but it's doing so much more than other racers from that gen, it's even pulling some current-gen tricks (did you read the DF article?). I think Yamauchi's ambition and unwillingness to compromise for the sake of perfection is what made GT6 run below 60fps. If he had been willing to make sacrifices, GT6 would be a better, more balanced game, but his ambition wants a sim.
Hopefully with PS5 he can realize his vision.

Shift is 30fps btw, and GT6 is still doing some stuff Shift isn't
Streaming has been used heavily since the PS2 (the GTA games would have been impossible without it). Increasing the ability to stream more data from your storage medium will always be beneficial. GTAV on last-gen consoles needed to combine both their hard drives and their optical drives for the game to push the visuals it did (the digital version has severe streaming issues). Making an SSD standard on a next-gen console, combined with a large increase in CPU power, will mean much more data will be able to be streamed. Sure, RAM can be a bottleneck, but throwing in a 5400rpm hard drive (or even a 7200rpm one) will be a far bigger one on what we expect next-gen hardware to be. It would impact both load times and texture variety (and possibly the resolution of those textures).
That's just it though, streaming was done to make up for RAM deficiencies, and we are way past that. It would be more limiting with today's and tomorrow's hardware.
PS3/360 were severely limited in memory and developers had to pull every trick in the book to make up for it.

If the choice is between having an SSD and 16GB VRAM vs an HDD and 24GB VRAM, the latter combo wins in everything other than loading times. Having more RAM will always produce better results.
In the paper they talk about needing to use multi-GPU modules to go above 128 SMs & that 256 SMs is 4.5X more SMs than the largest Nvidia single cards out now. Also, Nvidia moved from 128 CUDA cores per SM to 64 CUDA cores.
Thanks for the info, so in other words post-GCN IS MCM. Whoa, I hope this makes it to consoles, otherwise we'll be stuck with a puny 10TF card.
Btw, where is the indication of Nvidia hitting a core limit? Rearranging for efficiency ain't evidence of that, and ATI/AMD historically had higher core counts, so that doesn't prove anything either.
 
Last edited:

Pimpbaa

Member
That's just it though, streaming was done to make up for RAM deficiencies, and we are way past that. It would be more limiting with today's and tomorrow's hardware.
PS3/360 were severely limited in memory and developers had to pull every trick in the book to make up for it.

If the choice is between having an SSD and 16GB VRAM vs an HDD and 24GB VRAM, the latter combo wins in everything other than loading times. Having more RAM will always produce better results.

We are not way past that. Streaming is used in most games these days and issues with streaming still show up (textures not loading fast enough), and problems would increase if the storage medium is still slow-ass hard drives next gen. What we are past is loading levels all into memory at once. No one does that anymore except for those making retro games. Having more RAM and not having a storage medium that can keep up with the demand will leave a lot of RAM wasted, as it takes too long to load, or textures won't load fast enough during gameplay. 16GB with an SSD would absolutely destroy 24GB with a hard drive. The former would have far less streaming issues, able to keep its RAM filled with graphics data, providing far more diverse visuals. The latter would choke trying to load next-gen assets during gameplay, especially if a game was designed around the read performance of an SSD.
 
Last edited:

SonGoku

Member
16GB with an SSD would absolutely destroy 24GB with a hard drive. The former would have far less streaming issues, able to keep its RAM filled with graphics data, providing far more diverse visuals. The latter would choke trying to load next-gen assets during gameplay, especially if a game was designed around the read performance of an SSD.
The latter wouldn't need to stream as much... it could just preload it, and that alone puts it on a different level.
Especially a game designed for 24GB of RAM; it would destroy the 16GB and SSD combo, albeit with longer initial load times.
 
Last edited:

Pimpbaa

Member
The latter wouldn't need to stream as much... it could just preload it, and that alone puts it on a different level.
Especially a game designed for 24GB of RAM; it would destroy the 16GB and SSD combo, albeit with longer initial load times.

If you are relying on keeping stuff in memory due to the hard drive being too slow to keep up, it would mean less diverse visuals and lower-resolution textures. And the bottleneck would be the hard drive, and not the GPU like it should be. Have you even tried an SSD in a game that is mostly hard-drive bound on PC? Even an old game like WoW was a massive improvement (especially when going into a highly populated area). It utterly chokes going to a city in the game, even with a 7200rpm HD, trying to load all the textures for all the players in the area. More RAM didn't help load that shit in, but an SSD that is able to handle many requests for data at the same time loads every player before a HD could even load a few. Now imagine a game designed around an SSD's read speeds and its ability to handle so many requests for data; it would without question be a better experience. No amount of pre-loading can make up for that. Especially with such a small jump from 16 to 24GB.
 
Well, this topic is getting pretty diffuse now. I might also update my predictions list from September.

And as everybody locks down their predictions, I might as well repeat what I said over the last year or so:

PS5 spec prediction:
GPU: 7nm Navi with at least 12TF based on boost clocks
CPU: 7nm Ryzen 3000 (Zen 2), 8 cores/16 threads, boost clocked to at least 3.2GHz, probably higher
RAM: 16GB of GDDR6 (slim chance of it being HBM3)
Bonus Prediction: GPU and CPU will be DIFFERENT dies on the same substrate connected via Infinity Fabric, as that is more economical than a big APU

PS5 spec prediction:
GPU: 7nm Navi with at least 12TF based on boost clocks
CPU: 7nm Ryzen 3000 (Zen 2), 8 cores/16 threads, boost clocked to at least 3.2GHz, probably higher
RAM: 16+GB of GDDR6 (slim chance of it being HBM3)
Bonus Prediction: GPU and CPU will be DIFFERENT dies on the same substrate connected via Infinity Fabric, as that is more economical than a big APU
Bonus Prediction 2: will have some sort of (maybe customizable) power management system to let developers allocate power budget from GPU to CPU or vice versa

Wishlist Prediction: Hardware feature set for accelerating realtime GI (like voxel cone tracing) and ray tracing


- I think the likelihood for bonus prediction 1 is now at about 90% after we've seen Zen 2.
- After we've seen Radeon VII / Vega 2 pricing and know more about the cost of HBM2, I think the chance of that being implemented dropped to nil. We will see GDDR6 with a wide bus (up to 384-bit, but that would necessarily give you at least 18GB or 24GB total; see the quick math below).
- Bonus prediction 2 was kinda implied in the original list. If you wouldn't do that, why bother with non-boost clocks at all (aside from BC)?
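The quick math behind the 18GB/24GB claim (presumably assuming 12Gb or 16Gb GDDR6 chips; the density assumption is mine, not the poster's):

```python
# A 384-bit bus means twelve 32-bit GDDR6 chips; capacity follows chip density.
chips = 384 // 32                 # 12 chips
for gb_per_chip in (1.5, 2.0):    # 12Gb and 16Gb parts, respectively
    print(chips * gb_per_chip)    # 18.0 GB, then 24.0 GB
```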
 
Last edited:

SonGoku

Member
If you are relying on keeping stuff in memory due to the hard drive being too slow to keep up, it would mean less diverse visuals and lower-resolution textures. And the bottleneck would be the hard drive, and not the GPU like it should be. Have you even tried an SSD in a game that is mostly hard-drive bound on PC? Even an old game like WoW was a massive improvement (especially when going into a highly populated area). It utterly chokes going to a city in the game, even with a 7200rpm HD, trying to load all the textures for all the players in the area. More RAM didn't help load that shit in, but an SSD that is able to handle many requests for data at the same time loads every player before a HD could even load a few. Now imagine a game designed around an SSD's read speeds and its ability to handle so many requests for data; it would without question be a better experience. No amount of pre-loading can make up for that. Especially with such a small jump from 16 to 24GB.
This is what you are not getting: streaming is done to overcome memory limitations. What you are saying would apply if you were comparing systems with the same amount of memory, but you are not...
The 24GB system has an 8GB buffer advantage at all times while the other system is busy streaming. The 24GB system would be two steps ahead at all times. 8GB won't be enough lol? That's more than what current consoles have to work with (5GB).

The problem with your example is that that game is designed for low-memory configurations and relies heavily on streaming because of it; of course more RAM won't help if the game is not designed to use it. This would not be an issue in games designed to run with ample amounts of memory. An SSD just isn't worth the sacrifice in memory or GPU, which is why it will never make it to consoles until it actually replaces HDDs. Best-case scenario we get 100 to 250GB of high-speed flash memory built into the motherboard to act as a cache, and even that would add cost I'm not sure console manufacturers will be willing to take.
 
Last edited:
SonGoku yeah, I said Shift is 30fps. You said GT6 could pass for a 30fps racer that gen in terms of detail... It can't. PGR4 is noticeably more detailed than GT6 at its best or Forza 4. GT6 has a few really nice car models but I'm talking overall.

I agree GT6 would be best if they locked it at 30 and ditched all premium models without cockpits and PS2 cars, then ramped up environment detail; then it could be really special visually.

I'm not saying Shift necessarily *looks* better than Forza 4, it's got kind of a hazy bloom look to it, but the tracks are noticeably more detailed and it has 4x MSAA and good texture filtering on 360. It does look quite nice, best thing Slightly Mad Studios ever did at least.
 
Last edited:
We are not way past that. Streaming is used in most games these days and issues with streaming still show up (textures not loading fast enough), and problems would increase if the storage medium is still slow-ass hard drives next gen. What we are past is loading levels all into memory at once. No one does that anymore except for those making retro games. Having more RAM and not having a storage medium that can keep up with the demand will leave a lot of RAM wasted, as it takes too long to load, or textures won't load fast enough during gameplay. 16GB with an SSD would absolutely destroy 24GB with a hard drive. The former would have far less streaming issues, able to keep its RAM filled with graphics data, providing far more diverse visuals. The latter would choke trying to load next-gen assets during gameplay, especially if a game was designed around the read performance of an SSD.

I completely agree; as long as none of that 16GB is used for the OS, 16GB + SSD would be great.

This would have the added benefit of devs really starting to optimize their memory usage, as games are quite bloated right now. For chrissakes, Rare made Conker and BT on N64 without the Expansion Pak... 4 megs of RAM!
 
Last edited:

onQ123

Member
Thanks for the info, so in other words post-GCN IS MCM. Whoa, I hope this makes it to consoles, otherwise we'll be stuck with a puny 10TF card.
Btw, where is the indication of Nvidia hitting a core limit? Rearranging for efficiency ain't evidence of that, and ATI/AMD historically had higher core counts, so that doesn't prove anything either.


MCM isn't the only way around it; 3D stacking, redesigns & so on will happen also.



Someone from Nvidia is talking about needing multi-GPU modules to get past 128 SMs


we find that 33 of the 48 applications exhibit a high degree of parallelism and fully utilize a 256-SM GPU. Note that such a GPU is substantially larger (4.5×) than GPUs available today. For these High-Parallelism Apps, 87.8% of the linearly-scaled theoretical performance improvement can potentially be achieved if such a large GPU could be manufactured. Unfortunately, despite the application performance scalability with the increasing number of SMs, the observed performance gains are unrealizable with a monolithic single-die GPU design. This is because the slowdown in transistor scaling [8] eventually limits the number of SMs that can be integrated onto a given die area. Additionally, conventional photolithography technology limits the maximum possible reticle size and hence the maximum possible die size. For example, ≈ 800mm2 is expected to be the maximum possible die size that can be manufactured [18, 48]. For the purpose of this paper we assume that GPUs with greater than 128 SMs are not manufacturable on a monolithic die. We illustrate the performance of such an unmanufacturable GPU with dotted lines in Figure 2.


8 CONCLUSIONS

Many of today’s important GPU applications scale well with GPU compute capabilities and future progress in many fields such as exascale computing and artificial intelligence will depend on continued GPU performance growth. The greatest challenge towards building more powerful GPUs comes from reaching the end of transistor density scaling, combined with the inability to further grow the area of a single monolithic GPU die. In this paper we propose MCM-GPU, a novel GPU architecture that extends GPU performance scaling at a package level, beyond what is possible today. We do this by partitioning the GPU into easily manufacturable basic building blocks (GPMs), and by taking advantage of the advances in signaling technologies developed by the circuits community to connect GPMs on-package in an energy efficient manner. We discuss the details of the MCM-GPU architecture and show that our MCM-GPU design naturally lends itself to many of the historical observations that have been made in NUMA systems. We explore the interplay of hardware caches, CTA scheduling, and data placement in MCM-GPUs to optimize this architecture. We show that with these optimizations, a 256 SMs MCM-GPU achieves 45.5% speedup over the largest possible monolithic GPU with 128 SMs. Furthermore, it performs 26.8% better than an equally equipped discrete multi-GPU, and its performance is within 10% of that of a hypothetical monolithic GPU that cannot be built based on today’s technology roadmap.
https://research.nvidia.com/sites/default/files/publications/ISCA_2017_MCMGPU.pdf
 
Last edited:

SonGoku

Member
SonGoku yeah, I said Shift is 30fps. You said GT6 could pass for a 30fps racer that gen in terms of detail... It can't. PGR4 is noticeably more detailed than GT6 at its best or Forza 4.
I said it could pass for a 30fps racer, not the best looking :p Obviously sacrifices are made to hit "60fps"
I also think Yamauchi could have better realized his vision on PS3 if he had designed GT6 around a locked 30fps; it would look impressive and feel better than an unstable "60fps"
I'm not saying Shift necessarily *looks* better than Forza 4
You also said GT6 looks on par with FM4 when using all premiums
GT6 has a few really nice car models but I'm talking overall.
premium models without cockpits and PS2 cars, then ramped up environment detail; then it could be really special visually.
This has to do with time limitations, not hardware
I completely agree, as long as none of that 16gb is used for the OS - 16gb + ssd would be great.
An SSD is not gonna happen next gen, and if the choice is between more RAM or an SSD, more RAM always wins.
I can see them using a 100GB cache built into the motherboard though
MCM isn't the only way around it 3D stacking ,redesigns & so on will happen also.
Someone from Nvidia is talking about needing multi GPU modules to get past 128 SM
https://research.nvidia.com/sites/default/files/publications/ISCA_2017_MCMGPU.pdf
Oh good! I thought you were talking about AMD earlier, so there is hope for a post-GCN arch that surpasses the limits.
Btw, is he talking about a specific process node related to the 128 SM limit? It's not very clear.

edit: Can't make sense of the math.
Assuming linear scaling, 256 SMs is 100% faster (or 2x if you will) than 128 SMs.
Then how come he comes to the conclusion that a 45.5% speedup is within 10% of a hypothetical 256 SM monolithic GPU?
 
Last edited:
SonGoku To rephrase my comparison, I would say GT6 had the edge in car models in its absolute best-case scenario, when everything is modeled properly. Quite the uncommon scenario, and then GT6 has noticeably more pop-in than Forza 4 and a worse environment. It's about a draw in terms of what they push in that scenario, but Forza 4 is a locked 60. Maybe GT6 has a tad better AA solution with its MLAA, I guess. They could have had a decade to work on GT6 and they still couldn't improve the detail beyond what they had in the best-case scenario. The PS3 was not some supercomputer like we were told.
 
Last edited:

SonGoku

Member
SonGoku To rephrase my comparison, I would say GT6 had the edge in car models in its absolute best-case scenario, when everything is modeled properly. Quite the uncommon scenario, and then GT6 has noticeably more pop-in than Forza 4 and a worse environment. It's about a draw in terms of what they push in that scenario, but Forza 4 is a locked 60. Maybe GT6 has a tad better AA solution with its MLAA, I guess. They could have had a decade to work on GT6 and they still couldn't improve the detail beyond what they had in the best-case scenario. The PS3 was not some supercomputer like we were told.
Of course it wasn't, but it's worth noting GT6 has so much going on, it's pushing some current-gen tricks and effects.
If Yamauchi had been willing to compromise on his vision and make sacrifices in detail that we aren't even able to tell in motion, it would have been much better looking overall, more balanced.

Btw, what I mean by time limitations is the use of non-premium tracks and cars as filler
 
Last edited:

onQ123

Member
I said it could pass for a 30fps racer, not the best looking :p Obviously sacrifices are made to hit "60fps"
I also think Yamauchi could have better realized his vision on PS3 if he had designed GT6 around a locked 30fps; it would look impressive and feel better than an unstable "60fps"

You also said GT6 looks on par with FM4 when using all premiums

This has to do with time limitations, not hardware

An SSD is not gonna happen next gen, and if the choice is between more RAM or an SSD, more RAM always wins.
I can see them using a 100GB cache built into the motherboard though

Oh good! I thought you were talking about AMD earlier, so there is hope for a post-GCN arch that surpasses the limits.
Btw, is he talking about a specific process node related to the 128 SM limit? It's not very clear.

edit: Can't make sense of the math.
Assuming linear scaling, 256 SMs is 100% faster (or 2x if you will) than 128 SMs.
Then how come he comes to the conclusion that a 45.5% speedup is within 10% of a hypothetical 256 SM monolithic GPU?


Because if a 256 SM monolithic GPU could be made, it wouldn't need to be connected through the same methods that multi-GPU modules have to be connected, therefore no performance would be lost to the limited bandwidth of the connection. But their finding is that, with the method they will be using, they will be within 10% of this imaginary 256 SM monolithic GPU even with the limited bandwidth of the interconnect (XBAR).
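Another way to square the numbers (my arithmetic, back-solved from the paper's two published figures): even the hypothetical monolithic 256 SM GPU doesn't scale linearly to 2x, so "within 10%" is measured against roughly 1.6x, not 2.0x.

```python
# Paper: the 256-SM MCM-GPU is 1.455x a 128-SM GPU, and within 10% of the
# hypothetical 256-SM monolithic chip. Back-solving for that chip's speedup:
mcm = 1.455
monolithic = mcm / 0.9   # ~1.62x over the 128-SM baseline
print(monolithic)        # well short of the ideal 2.0x linear scaling
```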

 

ethomaz

Banned
So you guys have an idea, here is the famous 3D stacked GPU found in the Xbox One, to understand what onQ123 is saying.



People need to stop dreaming about something that won't happen.
 
Last edited:

onQ123

Member
So you guys have an idea, here is the famous 3D stacked GPU found in the Xbox One, to understand what onQ123 is saying.



People need to stop dreaming about something that won't happen.


This isn't about 3D stacking. 3D stacking would be better than this, because the chips would be directly connected.


TSMC will be mass producing 10nm 3D SoIC chips in 2021


https://www.engineering.com/Hardwar...ss-to-Empower-NVIDIA-and-AMD-GPU-Designs.aspx

TSMC’s New Wafer-on-Wafer Process to Empower NVIDIA and AMD GPU Designs

Transistor density scaling is slowing down, and with it Moore's law. This fact, coupled with the inability to increase the area of a single GPU die, is a crucial point of concern for graphics card manufacturers. And these manufacturers are powering advancements in GPU-based, high-performance computing applications such as artificial intelligence. They are expecting companies like Taiwan Semiconductor Manufacturing Company (TSMC) to increase their rate of innovation to keep up with the rate of improvement that Moore's law made an industry expectation.

What is a Multi-Chip Module (MCM) GPU Design?


To increase performance and artificially ensure that the rate of Moore's law stays constant, package-level integration of multiple GPU modules that leverage high-bandwidth signaling technologies is necessary to increase power efficiency. To partition GPUs into modules (known as GPMs), chip designers have to optimize the architecture to decrease latency across the links that bind the stacked GPMs together. They also have to improve the efficacy of GPM data locality. These are just two primary factors motivating designers who work toward making these Multi-Chip Module GPUs a convincing insurance policy against the slowing tide of Moore's law.

Since manufacturers like TSMC (Taiwan Semiconductor Manufacturing) have a vested interest as a silicon foundry and chipmaker, they have to ensure that their manufacturing capabilities can continue to scale up GPU performance for big clients like NVIDIA and AMD, regardless of industry-wide transistor density scaling slowing down. TSMC recently showed off a promising solution in Wafer-on-Wafer (WoW) technology, which addresses latency between the different GPU clusters that make up an MCM-based GPU.

To understand TSMC's novel WoW approach, the current approach must first be understood. An MCM is generally manufactured using a custom interconnect and an interposer. The interconnect is a bottleneck for the latency between modules in an MCM. Part of the reason for this is that the wafers in an MCM are positioned and connected laterally by the interconnect. And the interconnect is what primarily causes latency between the wafers.




To get around this, TSMC’s new proposal involves the use of Through Silicon Vias (TSVs). These 10-micron holes allow the two silicon wafers to touch. This approach from TSMC is meant to demonstrate that stacking dies on top of one another can improve power efficiency and decrease latency lost between GPMs. (Image courtesy of TSMC.)​
What is an Interconnect?

The interconnect is one of the most crucial components to consider when planning, designing, engineering and manufacturing MCMs and integrated circuits like a System on a Chip (SoC) or an ASIC (Application Specific Integrated Circuit).

Historically, interconnects were built using a wiring approach called subtractive aluminum—where blanket films of aluminum are deposited, patterned then etched—leaving isolated and exposed wires which are then coated in dielectric material. The vias are tiny etching holes, and they interconnect the two wafers to each other through the insulating material. Dynamic Random-Access Memory (DRAM) chips are built this way.

The interconnect layer wiring material has changed in recent years from aluminum to copper. Since the number of transistors interconnected on modern microprocessors has increased exponentially, timing delays in the wiring of interconnect levels increased as well, prompting the switch to copper. Copper has 40 percent less capacitive resistance than aluminum, yielding a 15 percent faster processor speed overall.

Even after switching to the damascene copper manufacturing process, miniaturization of the wires of interconnects produces resistance-capacitance delay issues. Due to the length of the wires increasing while the width decreases, it becomes harder and harder to push electrons through them at the same rate of speed.

The combination of a plateauing rate of transistor scaling and increasing latency between GPMs is part of the reason why TSMC is using Through-Silicon Vias (TSVs) for 3D integration.

What Are the Benefits of TSMC’s Wafer on Wafer Tech for GPUs?

GPUs designed by NVIDIA and AMD and manufactured using TSMC’s wafer-on-wafer technology could become more powerful without increasing their physical size. Layers are stacked vertically rather than horizontally along the printed circuit board (PCB) like solid-state drives (SSDs).

Currently, NVIDIA and AMD GPUs are built from a single wafer, which is why TSMC's research and development teams completed a goal to stack and bond two wafers, one above the other, in a single package. The package is cube-shaped, and its two stacked wafers are connected by an electrical interface known as an interposer, which routes each connection to the other wafer.

Switching wafer scaling from horizontal to vertical may not seem like the most innovative engineering move of all time, but it is no simple task. The reason it's going to have an impact on the industry is that TSMC can now offer NVIDIA and AMD the ability to add two GPUs to one graphics card as a "new" or refreshed product offering, without having to develop a new GPU architecture to fit more cores. This announcement and plan from TSMC are designed to reduce anxiety about future GPU performance as transistor scaling becomes harder to sustain at the exponential rate of years past.

GPU Gains from Wafer-on-Wafer Tech

The operating system would detect the twin-wafer GPU stack as one chip instead of a multi-GPU configuration, increasing capacity while using the exact same amount of room as a single card.

Like with 3D NAND and DRAM, TSMC should be able to offer NVIDIA and AMD the ability to add more capacity, but stacking processor wafers will likely prove to be a high-cost endeavor. Costs could accrue during inspection because, if one wafer out of the two fails, both must be discarded.
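The inspection-cost point in numbers (illustrative yields, not TSMC figures): bonding before test means a stack is only good if both wafers' dies are.

```python
# If each die is good with independent probability y, a two-high stack is good
# with probability y*y, so losses compound quickly as yield drops.
for y in (0.9, 0.7, 0.5):
    print(y, round(y * y, 2))   # 0.81, 0.49, 0.25
```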




The WoW approach from TSMC is similar to the way the dies are stacked like DRAM and NAND, which allows for faster interfacing and significantly more GPU cores to work with. (Image courtesy of TSMC.)​
TSMC’s customers include both NVIDIA and AMD, so this wafer-on-wafer stacking process will supposedly ensure that the core count can continue to increase as the technology for transistor scaling slows down industry-wide. It's too early to tell for sure if all the engineering bottlenecks have been addressed.

Won’t Wafer-on-Wafer Be Too Hot?

Mounting the wafers with TSVs leaves more of an air gap between the two wafers. But both components generate heat, which is problematic. The bottom chip's heat will warm up the top wafer, even though the top wafer will be cooled more than the bottom one. So the heat sink would need to transfer heat away from the small area around the bottom wafer, which is costly to do.

Since WoW is costly, TSMC is likely to use it on high-yield production nodes to reduce loss due to waste. But as far as addressing cooling or heating issues, TSMC is pretty mum. The prospect for AMD and NVIDIA to stack dies on top of each other and, presto, double the GPU core count for chip refreshes without having to develop new GPU architectures means that GPU-based, high-performance computing applications such as artificial intelligence can continue to rely on historical rates of growth in GPU processing power.

https://en.ctimes.com.tw/DispNews.asp?O=HK2AN94TZR6SAA00NZ

TAIPEI, Taiwan - Recently, Taiwan Semiconductor Manufacturing Company Limited (TSMC) has mentioned a new technology multiple times – “System-on-Integrated-Chips (SoIC)” – and at its Q3 earnings conference, they gave a more specific timetable for mass production. TSMC estimates that in 2021 their SoIC technology will go into mass production.

What exactly is SoIC? According to TSMC’s previous descriptions at technical forums, SoIC is a type of innovative multi-chip stacking technology which can be used to carry out wafer bonding in the manufacture of chips at 10nm and below. The technology has a bonding structure without protrusions, which gives it higher performance.

Therefore, from the description, it is a type of wafer-on-wafer bonding technology. Currently TSMC is collaborating on this with EDA tool vendors to introduce design and verification tools for manufacturing technology.

More specifically, it may be a type of 3D IC manufacturing technology, and it is possible that it will enable TSMC to directly produce 3D ICs for their customers. This technology would not only maintain Moore’s Law, but also could be expected to bring about further breakthroughs in the performance of single chips.

The key to developing this technology is achieving a joint structure without protrusions; therefore, it is very likely that through-silicon via (TSV) technology is being utilized to communicate directly with multiple chip layers through very small pores.

However, even more amazingly, TSMC’s SoIC technology can be used in manufacturing at 10nm and below, meaning that future chips can stay close to the same volume while more than doubling their performance capabilities. As a result, even TSMC themselves are very optimistic about this manufacturing technology.
 

vpance

Member
^ 2021 but I imagine it'll be limited to mobile tier chips to start. PS5 is gonna be another boring APU.
 
Last edited:
DeepEnigma I swear not to derail this thread further after this - but playing PGR3, I didn't think it was any crisper than 4; I could immediately tell it's less detailed and more jagged, BUT it definitely has a somewhat more colorful aesthetic and glossier cars. You could say PGR4 is a bit more muted. They both have the same motion blur.

PGR3 looks a bit more like an arcade title I guess, nice looking game overall still!
 

DeepEnigma

Gold Member
DeepEnigma I swear not to derail this thread further after this - but playing PGR3, I didn't think it was any crisper than 4; I could immediately tell it's less detailed and more jagged, BUT it definitely has a somewhat more colorful aesthetic and glossier cars. You could say PGR4 is a bit more muted. They both have the same motion blur.

PGR3 looks a bit more like an arcade title I guess, nice looking game overall still!

I don't know what it was then. I was playing on a Dell 1920 x 1200 monitor in 1080p mode at the time, so maybe 3 scaled better on it?

I felt it looked softer, but it might be what you said with the color muting and possibly the sharpness due to more jaggies.

I miss those games. The closest to them now is Driveclub for me.
 
I don't know what it was then. I was playing on a Dell 1920 x 1200 monitor in 1080p mode at the time, so maybe 3 scaled better on it?

I felt it looked softer, but it might be what you said with the color muting and possibly the sharpness due to more jaggies.

I miss those games. The closest to them now is Driveclub for me.

Ah... That's a real possibility, monitors never did scale well, and PGR3's vertical resolution is exactly half of 1200p.

Maybe I should pick up Driveclub, I have 0 racers for PS4 lol. GT Sport looks super nice graphics-wise, but I'm waiting for GT7 with a proper SP mode. Did they ever make a Driveclub complete version?
 

DeepEnigma

Gold Member
Ah... That's a real possibility, monitors never did scale well, and PGR3's vertical resolution is exactly half of 1200p.

Maybe I should pick up Driveclub, I have 0 racers for PS4 lol. GT Sport looks super nice graphics-wise, but I'm waiting for GT7 with a proper SP mode. Did they ever make a Driveclub complete version?

I believe so. When you get Driveclub I think you can get all the content; they also have the bikes expansion too.

I have the physical version, but when they put Driveclub on sale one time for under $10 I snagged the digital up with the bikes.

A wealth of content, and still a gorgeous game. The weather effects are phenomenal.
 

SonGoku

Member
but their finding is that, with the method they will be using, they will be within 10% of this imaginary 256 SM monolithic GPU even with the limited bandwidth of the interconnect (XBAR).
That's why I'm confused.
How is a 45.5% speedup within 10% of a 100% speedup?
Gonzalo? Will it make it past the border?
Yes, but it has to go back ;P
but I'm waiting for GT7 with a proper SP mode
lol true, hopefully it's cross-gen
^ 2021 but I imagine it'll be limited to mobile tier chips to start. PS5 is gonna be another boring APU.
If they can pull a PS4 and have 32GB of HBM at 1TB/s (pipe dream), it would be plenty exciting coupled with a 12TF+ GPU
 
Last edited:

onQ123

Member
That's why I'm confused.
How is a 45.5% speedup within 10% of a 100% speedup?

Yes, but it has to go back ;P

lol true, hopefully it's cross-gen

If they can pull a PS4 and have 32GB of HBM at 1TB/s (pipe dream), it would be plenty exciting coupled with a 12TF+ GPU


Probably has something to do with the fact that you can feed 2X the memory bandwidth to a GPU made up of the 4 GPMs.
 

SonGoku

Member
Probably has something to do with the fact that you can feed 2X the memory bandwidth to a GPU made up of the 4 GPMs.
I'm curious, why 4? I noticed you mentioned it earlier; what makes 4 special? Isn't that a bit on the weak side?
 

onQ123

Member
I'm curious, why 4? I noticed you mentioned it earlier; what makes 4 special? Isn't that a bit on the weak side?


The GPMs still need to fit on a package & it all still has to make sense.

The 1st one might be 4 x 64 SM, then the next might be 8 x 64 SM.
 

onQ123

Member
I meant your PS5 speculation of 4x18CU

It's just what makes sense to me, because it would scale from PS4, PS4 Pro, PS5, PS Now servers that could stream PS4 & PS5 games, & also make its way into a wearable PS4 for PSVR down the road when the chip gets even smaller.


Another benefit is that PS5 would have a wider bus to GDDR5/GDDR6 without having to use HBM. Also, if the ROPs are tied to the GPMs, PS5 would have 128 ROPs for 120-240fps VR & second-screen 4K.
 

onQ123

Member
Oh, & I forgot about Remote Play: you should be able to serve 4 PS4 games at a time, & if PSVR becomes wireless, 4 people could play PSVR games.
 

Pimpbaa

Member
This is what you are not getting: streaming is done to overcome memory limitations. What you are saying would apply if you were comparing systems with the same amount of memory, but you are not...
The 24GB system has an 8GB buffer advantage at all times while the other system is busy streaming. The 24GB system would be two steps ahead at all times. 8GB won't be enough lol? That's more than what current consoles have to work with (5GB).

The problem with your example is that that game is designed for low-memory configurations and relies heavily on streaming because of it; of course more RAM won't help if the game is not designed to use it. This would not be an issue in games designed to run with ample amounts of memory. An SSD just isn't worth the sacrifice in memory or GPU, which is why it will never make it to consoles until it actually replaces HDDs. Best-case scenario we get 100 to 250GB of high-speed flash memory built into the motherboard to act as a cache, and even that would add cost I'm not sure console manufacturers will be willing to take.

Do you not know the parallelism that SSDs are capable of? While a hard drive is trying to load in 1 piece of data, the SSD would be loading many, and at a much faster rate to boot. If you cannot see how that would be better than 8GB more RAM with dated technology (hard drives, which are already impacting current consoles), then I don't know what to say. Especially when there are examples of how hard drives impact games right now (like WoW) on PC. Hell, I see it with my hybrid drive on my PS4 Pro, textures taking their sweet-ass time to load in some games. With next-gen games most likely ramping up the size of assets by a great deal, next-gen consoles would truly be fucked with a regular hard drive regardless of how much RAM it has.
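For a rough sense of the gap being described here (ballpark 2019-era figures, not from the thread): random access, not sequential speed, is where the two diverge most.

```python
# A 7200rpm HDD manages ~150 random reads/s (seek-bound); a SATA SSD with deep
# command queues manages tens of thousands. Fetching 1,000 scattered 64KB
# texture chunks mid-game:
chunks = 1000
hdd_iops, ssd_iops = 150, 50_000   # rough, typical figures
print(chunks / hdd_iops)           # ~6.7 s on the HDD
print(chunks / ssd_iops)           # ~0.02 s on the SSD
```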
 

SonGoku

Member
It's just what makes sense to me, because it would scale from PS4, PS4 Pro, PS5, PS Now servers that could stream PS4 & PS5 games, & also make its way into a wearable PS4 for PSVR down the road when the chip gets even smaller.


Another benefit is that PS5 would have a wider bus to GDDR5/GDDR6 without having to use HBM. Also, if the ROPs are tied to the GPMs, PS5 would have 128 ROPs for 120-240fps VR & second-screen 4K.
But 4x18CU is really on the weak side; you don't need MCM to reach <8TF.
You won't even get 60fps, let alone 120fps, with a <8TF system.
Do you not know the parallelism that SSDs are capable of? While a hard drive is trying to load in 1 piece of data, the SSD would be loading many, and at a much faster rate to boot. If you cannot see how that would be better than 8GB more RAM with dated technology (hard drives, which are already impacting current consoles),
Regular RAM runs circles around SSDs, let alone VRAM. The 8GB buffer advantage is massive, and that system wouldn't need to stream as much as the system with less memory; that's what you are missing entirely.
To give you some perspective: what you are suggesting is worse than saying a big pool of low-speed DDR4 is better for a GPU than a smaller but much faster pool of VRAM.
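Ballpark bandwidth numbers behind "RAM runs circles around SSDs" (illustrative figures for scale, not from the thread):

```python
# Sequential bandwidth in GB/s: SATA SSD vs NVMe SSD vs 256-bit GDDR6 at 14Gbps.
sata_ssd, nvme_ssd, gddr6_256bit = 0.55, 3.5, 448
print(gddr6_256bit / sata_ssd)   # ~815x a SATA SSD
print(gddr6_256bit / nvme_ssd)   # ~128x even a fast NVMe drive
```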
Especially when there are examples of how hard drives impact games right now (like WoW) on PC.
Please provide said examples! I already covered WoW and why it doesn't apply. Games developed around scarce memory won't be able to take advantage of extra memory, so of course streaming has a bigger impact once you meet the low memory requirements.
Hell, I see it with my hybrid drive on my PS4 Pro, textures taking their sweet-ass time to load in some games.
PS4 Pro is memory starved; also, good thing you have the option to switch drives.
With next-gen games most likely ramping up the size of assets by a great deal, next-gen consoles would truly be fucked with a regular hard drive regardless of how much RAM it has.
I'm sure developers and hardware designers know what's best for the cost-to-performance ratio. More RAM will always win; they won't sacrifice RAM or GPU for an SSD.
Mark my words: SSDs won't make it to next-gen consoles; best-case scenario they ship with a 100-200GB onboard cache.
 
Last edited:

Dontero

Banned
It's just what makes sense to me, because it would scale from PS4, PS4 Pro, PS5, PS Now servers that could stream PS4 & PS5 games, & also make its way into a wearable PS4 for PSVR down the road when the chip gets even smaller.

If it had 4 GPU dies, then you are looking at AT MINIMUM 6 dies on one package.
That is more than their EPYC 64-core servers.

No way that is going to happen.
At best you are looking at:

1 I/O die
1 CPU die
2 GPU dies

1 GPU die could be the normal version, 2 GPU dies could be the Pro.
 

Three

Member
SonGoku GT6 is still in the low 50s/high 40s in 720p mode. Some of those "premium" cars didn't even have any detail in the cockpit, dude. And a race full of premium cars will run worse than a race full of PS2 leftovers.

PD became straight trash after GT4 on PS2. I'd actually argue they were only good on PS1.
No, the premium cars had cockpit views, except the fantasy cars. What you are suggesting has nothing to do with the hardware power to run native res and is just about asset creation. GT Sport is all premium now.
 
Last edited:
No, the premium cars had cockpit views, except the fantasy cars. What you are suggesting has nothing to do with the hardware power to run native res and is just about asset creation. GT Sport is all premium now.
No they didn't. I've seen PS3-modeled cars, without the telltale PS2 textures and jaggies, that had no cockpit. Who gives a shit whether it was the hardware or not? The point is PD makes an inconsistent, unfinished product.

And, stay with me on this one Kay - if you've got a bunch of PS2-quality cars on a track vs all premiums, guess which race you'll get better frame rates on? The one where much more geometry and higher-res textures are present, perhaps?

GT Sport looks really nice, but it is also an unfinished online game. I've no patience for PD apologists.
 
Last edited:

Three

Member
No they didn't. I've seen PS3-modeled cars, without the telltale PS2 textures and jaggies, that had no cockpit. Who gives a shit whether it was the hardware or not? The point is PD makes an inconsistent, unfinished product.

And, stay with me on this one Kay - if you've got a bunch of PS2-quality cars on a track vs all premiums, guess which race you'll get better frame rates on?

GT Sport looks really nice, but it is also an unfinished online game. I've no patience for PD apologists.

They upgraded some of the standard-model textures in GT6 on PS3, but all premium models that weren't fantasy cars had cockpits. GT Sport is also not an unfinished online game; it has a single-player mode with more events, and it's very polished. Nothing unfinished about it. The consistency of GT Sport's car models is actually higher than other racing games' right now. All cars are premium cars.

Who gives a shit if it's relevant to the hardware or not? I don't know, the context of the discussion maybe?
Or maybe you thought this was just an opportunity to shit on GT, in which case I have no time for GT haters either.
 
Last edited:
You can like GT and still be aware of its faults. 30+ patches later, GT Sport is not a finished product.

SonGoku was the one saying GT6's inconsistency wasn't about the hardware - I never said it was, sunshine.

Only corporate slaves get this defensive - blocked.
 
Last edited:

ethomaz

Banned
You can like GT and still be aware of its faults. 30+ patches later, GT Sport is not a finished product.

Only corporate slaves get this defensive - blocked.
GT Sport has been a finished game since day one.

The patches only added new content and balance fixes... well, there is the boring GT League mode added in December 2017 that adds nothing to the game.
 
Last edited:
GT Sport has been a finished game since day one.

The patches only added new content and balance fixes... well, there is the boring GT League mode added in December 2017 that adds nothing to the game.

Did GT4 need balance fixes and new cars and tracks? No, because they finished games back then.

Kids, it's not finished when you continue to work on it!
 
Last edited:

ethomaz

Banned
Did GT4 need balance fixes and new cars and tracks? No, because they finished games back then.

Kids, it's not finished when you continue to work on it!
No game had that before internet downloads became a thing lol.

It is a finished product that receives new content every month... what age do you live in?
 
Last edited:

ethomaz

Banned
A better age where I wasn't conditioned to accept incomplete games. I like my age more.
Thank God we have few incomplete games this generation...

99% of games can be played from beginning to end without lacking anything... rare exceptions happened, like MGSV or FFXV.
 
Last edited:
Thank God we have few incomplete games this generation...

99% of games can be played from beginning to end without lacking anything... rare exceptions happened, like MGSV or FFXV.
I bought Titanfall 2 and it kept crashing on startup before I patched it, and Doom ran at 50fps with massive tearing pre-patch. With the patch it ran like silk. I can go on.

Ain't shit complete this gen. Keep sucking corporate dick though if you want.
 
Last edited:

onQ123

Member
onQ123, any thoughts on the possible inclusion of NVMe and how that would fit into a super-efficient memory system?

My thoughts are that it will help with machine learning & game streaming.


For example, with all the 3D scanning of real-world objects, there should be a pretty big database of all these objects by now. So if that database of how to create all these objects was basically saved in your console's brain, games could be a lot smaller, which would make sense for game streaming.


That's just my forward thinking of what can be done, but here is a blog post from Western Digital:

https://blog.westerndigital.com/powering-gaming-experience-nvme/



The world of gaming is changing dramatically. New technologies are bringing ultra-high resolutions, higher frame rates, computed virtual field of view and ultimately, incredible immersive experiences! But fueling these enhancements requires strong hardware components that can support large graphics, 4K/8K video, and VR/AR content. This week at PAX East in Boston, MA, Western Digital launched its first 3D NAND NVMe SSD. Here’s how the Western Digital® WD Black™ NVMe™ SSD is ready to take on the highest-intensity gaming applications.
Over fifteen years after the first SATA drive was introduced, the world is ready for the next evolution in storage and computing.
While SATA made the personal computing experience more productive and efficient, it was never designed for flash technology. With a maximum theoretical transfer speed of only 600MB/s, high-intensity software is limited by the SATA ceiling. The NVMe standard, on the other hand, was designed specifically to take advantage of the speed offered by SSDs. With a maximum transfer speed of 985MB/s per lane, the PCIe Gen3 interface can support up to 16 lanes, giving us a maximum transfer speed of up to 15.76GB/s! However, most systems today are using PCIe Gen3 with 4 lanes, giving you potential speeds of up to 3.94GB/s, which is still over 6X faster than what SATA III offers (potentially up to 600MB/s).
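The blog's lane math, verified (my arithmetic, using only the figures it quotes):

```python
# PCIe Gen3 is ~985 MB/s per lane after encoding overhead.
lane = 985
print(lane * 16 / 1000)   # 15.76 GB/s for x16, as quoted
print(lane * 4 / 1000)    # 3.94 GB/s for the common x4 link
print(lane * 4 / 600)     # ~6.6x the 600 MB/s SATA III ceiling
```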
What does NVMe mean for me?
In a nutshell, it means that whatever your car looks like on the outside, you’re now driving around town with a race car engine. Practically speaking, would you need a race car engine if you’re stuck in traffic for 90 minutes a day on your way to work? Probably not. But if you’re on the Autobahn where certain sections have no speed restrictions, wouldn’t you want something that could push your car’s potential to its limits?
And that’s where NVMe comes into play. For everyday PC or corporate users that rely heavily on productivity tools, emailing, or light content creation, a SATA SSD system might just be what you’re looking for. NVMe on the other hand is optimized for high-intensive applications, such as gaming, video editing, designing, CAD modeling, or image rendering. It’s also enabling businesses and users in the area of machine learning, IoT edge workloads, and enterprise databases to deliver a richer, more responsive user experience.
Below are some tests we’ve performed to show how NVMe pushes the boundaries of performance and helps you get in the game faster:
Tested using a 1TB Western Digital WD Black NVMe SSD vs. a 1TB WD Blue SATA SSD on an Intel Core i7, 8GB RAM, NVIDIA GeForce GTX 850M.
We also noticed consistent, sustained performance when copying large video files onto the WD Black NVMe SSD.
File copying using 1TB Western Digital WD Black NVMe SSD.
In comparison, when copying large files to the WD Blue SATA SSD, transfer speeds peak at the onset and then taper off.
File copying using 1TB WD Blue SATA SSD.
This is ideal for video editing professionals that have large 4K video files or content creators that render massive files within their system.
The Western Digital WD Black™ NVMe™ SSD is offered in three different capacities: 250GB, 500GB, and 1TB, so whether you plan on using a lower capacity drive as a boot up device, or you just want that extra storage for long term use, this is the perfect solution for your gaming needs.
Here are some under the hood specs:
Beyond the ultra-fast responsiveness that you’ll be getting out of your system, the high endurance means you can confidently write large amounts of data to your drive.
If you’re in the Boston area this week, make sure you stop by PAX East booth #15001 at the Boston Convention and Exhibition Center to see our gaming system showcase featuring iBUYPOWER and our RAID system build featuring AMD Ryzen™ Threadripper™ with six WD Black NVMe SSDs in RAID0.
Western Digital’s breadth of expertise and level of vertical innovation give us an unmatched ability to deliver carefully calibrated NVMe SSDs, storage platforms and fully featured flash storage systems to unleash every type and use of data. The WD Black™ NVMe™ SSD is a great example of how we’re helping users get ahead of the game!

By Andrew Vo
 