
Zathalus · Apr 26, 2023

Loxus said:
It was never about performance.
As far as I know, the AI Matrix Accelerator is not utilized as yet.

It was about you claiming Nvidia has dedicated block, while AMD does not.
I told you AMD ML and RT is just as dedicated as Nvidia's.

I have proven you wrong.
You can literally see it labeled below: the processing block contains the FP/INT/Tensor units that share the same things.

You are a Nvidia fan, of course you are set in your ways.

Of course it's about performance. You can't seriously claim AMD and Nvidia are doing it the same way or that each is as dedicated as the other when AMD gets absolutely hammered by Nvidia on both ML and Path Tracing, with AMD literally being several generations behind.

winjer · Apr 26, 2023

Loxus said:
I think you are mixing up some terms.

In this document a SIMD refers to the Vector ALU unit that processes instructions for a single wave.

No confusion here. A SIMD just means Single Instruction Multiple Data. It refers to how the operands and operators are related.
And a Vector is a one-dimensional array of numbers. And of course, a vector processor is a unit that processes vectors.
And if you understand what a matrix is, you can understand its relation to vectors.
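To make the SIMD / vector / matrix relationship concrete, here is a minimal sketch in plain Python. The function names are made up for illustration and nothing here models real GPU hardware: one operation is applied across all lanes of a vector, and a matrix is treated as a stack of row vectors.

```python
def simd_add(a, b):
    """Single instruction (add), multiple data: applied to every lane at once."""
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(matrix, vec):
    """A matrix is a stack of row vectors; each output element is one dot product."""
    return [dot(row, vec) for row in matrix]


lanes_a = [1, 2, 3, 4]
lanes_b = [10, 20, 30, 40]
summed = simd_add(lanes_a, lanes_b)

matrix = [
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [1, 1, 1, 1],
]
product = mat_vec(matrix, lanes_a)

print(summed)   # lane-wise sums
print(product)  # one dot product per matrix row
```

So a matrix operation is really a batch of vector operations, which is why a vector unit can run it at all, and why a unit built specifically for matrices can run it faster.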

K2D · Apr 26, 2023

Cost effectiveness and SoC advantages are more important for console manufacturers than machine learning.

My guess is that they're willing to adopt an amd equivalent rather than start from scratch with Nvidia.

supernova8 · Apr 26, 2023

I just don't see any reason for Sony and Microsoft to ditch AMD. They've proven to be willing to work with both parties to design fairly customized chips and have been pretty reliable in delivering said chips to go into consoles. The only issue I recall with the PS4 was the dodgy (insufficient and thus loud) coolers but that's more of a Sony issue. Besides, if the rumors are true, AMD has some unbelievably powerful APUs coming in late 2024 (capable of competing with RTX 4060/70 laptop in raster).

AMD already has FSR and even then Sony came up with its own checkerboard rendering thing so they would make do either way. As for ray tracing/path tracing, I'd imagine AMD could build custom chips for Sony/MS that allocate more resources to that side of things if they really want it.

In comparison, Nvidia has no x86 APU offering (unless they somehow team up with Intel, which would be a pain compared to just working with AMD and its APUs), and I don't see Microsoft and Sony suddenly jumping over to an ARM-based solution. Even if they did want a customized chip, there's no guarantee that Nvidia would even agree to doing that at a price competitive with what AMD is offering.

TL;DR: highly unlikely even if Nvidia is better in RT/PT.

Loxus · Apr 26, 2023

Zathalus said:
Of course it's about performance. You can't seriously claim AMD and Nvidia are doing it the same way or that each is as dedicated as the other when AMD gets absolutely hammered by Nvidia on both ML and Path Tracing, with AMD literally being several generations behind.

Well performance is solid in my opinion.
Remember the 7900 XTX is the 4080 competitor.

4080 - 23.504
7900 XTX - 19.297
6950 XT - 4.382
Stable Diffusion Benchmarked: Which GPU Runs AI Fastest

Anyone denying how good the 7900 XTX performance is here has to be tripping.
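For anyone who wants the gaps as ratios, a quick sketch using only the three scores posted above. It assumes higher is better, which is how the posts treat them; the variable names are just for illustration.

```python
# Scores exactly as posted in the thread (higher is assumed to be better).
scores = {
    "RTX 4080": 23.504,
    "RX 7900 XTX": 19.297,
    "RX 6950 XT": 4.382,
}


def ratio(card, baseline):
    """How many times the baseline's score the given card reaches."""
    return scores[card] / scores[baseline]


xtx_vs_4080 = ratio("RX 7900 XTX", "RTX 4080")
xtx_vs_6950 = ratio("RX 7900 XTX", "RX 6950 XT")

print(f"7900 XTX vs 4080:    {xtx_vs_4080:.2f}x")
print(f"7900 XTX vs 6950 XT: {xtx_vs_6950:.2f}x")
```

That puts the 7900 XTX at roughly 82% of the 4080 and about 4.4x the 6950 XT in this one benchmark.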

Zathalus · Apr 26, 2023

Loxus said:
Well performance is solid in my opinion.
Remember the 7900 XTX is the 4080 competitor.

4080 - 23.504
7900 XTX - 19.297
6950 XT - 4.382
Stable Diffusion Benchmarked: Which GPU Runs AI Fastest

Anyone denying how good the 7900 XTX performance is here has to be tripping.

It's matching the 3090 Ti, which is Nvidia's last flagship, a card that the 7900 XTX is generally over 20% faster than for regular workloads. So yes, RDNA3 is matching Turing in this specific benchmark; as I said, several generations behind.

Roni · Apr 26, 2023

StereoVsn said:
This thread is crazy. The chance of MS and Sony dropping a straightforward x86 SoC solution with great CPU performance and good enough GPU performance for an ARM/Nvidia option is basically 0.

Between price, performance, backward compatibility options, and results being "good enough", they would be crazy to jump to Nvidia, especially after Nvidia's previous shenanigans.

Sometimes all you need is a cornered market. Unless NVIDIA hits a brick wall or steps on the brakes now, the difference will only increase. If they don't do it, I see PCs becoming what they were in the '90s and early '00s: a place to play with many bells and whistles. It wasn't like that during the late '00s and the '10s.

Loxus · Apr 27, 2023

Zathalus said:
It's matching the 3090 Ti, which is Nvidia's last flagship, a card that the 7900 XTX is generally over 20% faster than for regular workloads. So yes, RDNA3 is matching Turing in this specific benchmark; as I said, several generations behind.

What do you mean by several generations behind?
Are you saying the 3090ti is several generations old?

ChorizoPicozo · Apr 27, 2023

i dont think so....and i hope not.

Akuji · Apr 27, 2023

I can only imagine the very limited understanding of tech that is needed to come to such a conclusion.
With a fixed hardware set it's way, way easier to build these functions and make them run smoothly with your own solutions than it is on an ever-changing platform like a PC. Consoles show how powerful they are for their price, and that's the reason why. Every major player (Sony, Microsoft and even Nintendo) would have the funds and everything else needed to build software solutions for their hardware, no matter if it comes from Nvidia or AMD. Also, Nvidia doesn't even have a single card at the price point that would be needed for a console, so we don't even remotely know how yields, performance, power draw, etc. would be for their solution. It's easy to say a 4090 is a beast, which, yes, it is. But that doesn't mean that if you give a 100€/$ budget for a chip, with marketing and opportunity cost (you could sell the chip to an AIB, etc.), the Nvidia option is the better option.

Also, consoles are unlikely to be based on a current architecture. More likely the next one, with some feature set and advances from the one after that. And we don't have these cards yet, so maybe AMD slashes Nvidia left and right by that time. Is that likely? No, duh. Is it impossible? Also no.

Alexios · Apr 27, 2023

Loads of different companies, from Intel to Facebook, have their own solutions that don't require tensor cores (also other hardware has/is getting its own specialised bits for such AI and other processing loads, for all we know there will be non mobile focused snapdragons coming by the next gen). Whether you think it's not as good because some comparison showed more artifacts in some games when zoomed in 500% or not doesn't mean they'll go Nvidia over just that, they need to offer the better deal overall on top, in every field, not just what may be equaled within 3D engines/software.

lestar · Apr 27, 2023

Imagine a world where a Switch 3 supports path tracing and DLSS 4.

Loxus · Apr 27, 2023

Akuji said:
I can only imagine the very limited understanding of tech that is needed to come to such a conclusion.
With a fixed hardware set it's way, way easier to build these functions and make them run smoothly with your own solutions than it is on an ever-changing platform like a PC. Consoles show how powerful they are for their price, and that's the reason why. Every major player (Sony, Microsoft and even Nintendo) would have the funds and everything else needed to build software solutions for their hardware, no matter if it comes from Nvidia or AMD. Also, Nvidia doesn't even have a single card at the price point that would be needed for a console, so we don't even remotely know how yields, performance, power draw, etc. would be for their solution. It's easy to say a 4090 is a beast, which, yes, it is. But that doesn't mean that if you give a 100€/$ budget for a chip, with marketing and opportunity cost (you could sell the chip to an AIB, etc.), the Nvidia option is the better option.

Also, consoles are unlikely to be based on a current architecture. More likely the next one, with some feature set and advances from the one after that. And we don't have these cards yet, so maybe AMD slashes Nvidia left and right by that time. Is that likely? No, duh. Is it impossible? Also no.

Nvidia will most likely always be one gen ahead of AMD. Even though AMD will be one gen behind, its tech will always be good enough to be put in consoles and give good results.

This article gives some insight on AMD future hardware and in turn, what's to expect with the next consoles.

AMD plans to harness the power of AI to transform gaming with its next-gen GPUs
AMD executives David Wang and Rick Bergman have confirmed that we’ll be seeing a lot more AI in the next generation of graphics cards from the tech giant, which will be built on AMD’s RDNA 4 architecture.

In a recent interview with the Japanese gaming website 4gamer, the AMD execs detailed some of what we can expect from RDNA 4. Naturally, front and center was confirmation that we’ll be seeing the second iteration of Team Red’s AI Accelerator cores (similar to Nvidia’s Tensor cores), which were first introduced in the current-gen RDNA 3 GPUs - such as the excellent Radeon RX 7900 XTX, currently the best AMD graphics card on the market.

Nvidia’s tech is still lightyears ahead of AMD when it comes to AI processes - just look at the RTX 4090 - but these second-gen AI cores should offer a serious step up. Beyond the Accelerator cores, the pair also discussed some other nifty new features, most importantly a new self-contained GPU pipeline that allows for rendering and texture processes to be generated exclusively on the GPU without needing to communicate with the CPU.

This has massive potential to boost the processing speed of RDNA 4 GPUs, since it won’t need to rely on the CPU and system RAM to carry out some of its workloads, effectively cutting out two potential system bottlenecks. According to Wang and Bergman, we can expect a massive 2.2x performance boost over the current RDNA 3 cards.

diunxx · Apr 27, 2023

FSR is good enough.

Buggy Loop · Apr 27, 2023

Loxus said:
It was never about performance.

Loxus said:
As far as I know, the AI Matrix Accelerator is not utilized as yet.

Right...

The AMD magic sauce coming soon^TM

Loxus said:
It was about you claiming Nvidia has dedicated block, while AMD does not.

Are you still like those people from the early RDNA 2 rumours who saw an RT block and said "AMD has dedicated RT!" too? There's a world of difference between having silicon dedicated to a task and having them all work concurrently with their own pathways to not stall the pipeline.

Loxus said:
I told you AMD ML and RT is just as dedicated as Nvidia's.

You're looking at blocks with no pathway details. You sure told me!

https://www.techspot.com/article/2570-gpu-architectures-nvidia-intel-amd/

"Another significant new feature is the appearance of what AMD calls AI Matrix Accelerators.

Unlike Intel's and Nvidia's architecture, which we'll see shortly, these don't act as separate units – all matrix operations utilize the SIMD units and any such calculations (called Wave Matrix Multiply Accumulate, WMMA) will use the full bank of 64 ALUs.

Intel also chose to provide the processor with dedicated units for matrix operations, one for each Vector Engine. Having this many units means a significant portion of the die is dedicated to handling matrix math.

Where AMD uses the DCU's SIMD units to do this and Nvidia has four relatively large tensor/matrix units per SM, Intel's approach seems a little excessive, given that they have a separate architecture, called Xe-HP, for compute applications."
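For reference, the "multiply accumulate" in WMMA is just D = A x B + C on small tiles. Below is a minimal plain-Python sketch of that operation. The 16x16 tile size and the `wmma` function name are assumptions for illustration only; this says nothing about how any vendor schedules the work in hardware.

```python
def wmma(a, b, c):
    """Matrix multiply-accumulate: returns D where D = A x B + C."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


TILE = 16  # illustrative tile size (assumption)

identity = [[1 if i == j else 0 for j in range(TILE)] for i in range(TILE)]
ones = [[1] * TILE for _ in range(TILE)]

# identity x ones gives ones; adding the accumulator (ones) gives all twos.
d = wmma(identity, ones, ones)
print(d[0][:4])
```

The point of the quote above is where this runs: on the existing SIMD lanes for AMD, versus separate matrix units for Intel and Nvidia. The math itself is the same either way.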

"Dedicated"

The AI accelerators only double BF16 execution over RDNA 2. Nvidia's tensor cores have been accelerating matrix operations for a long time, back to Volta even.

So outstanding that the whitepaper for AMD's CDNA matrix cores (from which RDNA 3 copied the WMMA matrix multiplication) made its business case for outstanding performance by ...

Calculations conducted by AMD Performance Labs as of Sep 18, 2020 for the AMD Instinct™ MI100 (32GB HBM2 PCIe® card) accelerator at 1,502 MHz peak boost engine clock resulted in 11.54 TFLOPS peak double precision (FP64), 46.1 TFLOPS peak single precision matrix (FP32), 23.1 TFLOPS peak single precision (FP32), 184.6 TFLOPS peak half precision (FP16) peak theoretical, floating-point performance. Published results on the NVidia Ampere A100 (40GB) GPU accelerator resulted in 9.7 TFLOPS peak double precision (FP64). 19.5 TFLOPS peak single precision (FP32), 78 TFLOPS peak half precision (FP16) theoretical, floating-point performance. Server manufacturers may vary configuration offerings yielding different results. MI100-03

For FP16, they only reference the "78 TFLOPS peak half precision (FP16) theoretical" value from Nvidia's whitepaper, which is the non-tensor-core value. With tensor cores it's 312 TFLOPS (624 for sparse matrix ops).
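To put numbers on how much the choice of baseline matters, here is a quick sketch using only the peak FP16 figures quoted above. The variable names are illustrative, and these are theoretical peaks, not measured results.

```python
# Peak theoretical FP16 TFLOPS, as quoted in the post above.
mi100_fp16 = 184.6
a100_fp16_no_tensor = 78.0
a100_fp16_tensor = 312.0
a100_fp16_sparse = 624.0

# MI100 against the figure AMD's footnote cites.
ratio_quoted = mi100_fp16 / a100_fp16_no_tensor
# MI100 against the A100 tensor-core figures.
ratio_tensor = mi100_fp16 / a100_fp16_tensor
ratio_sparse = mi100_fp16 / a100_fp16_sparse

print(f"vs non-tensor A100: {ratio_quoted:.2f}x")
print(f"vs tensor A100:     {ratio_tensor:.2f}x")
print(f"vs sparse tensor:   {ratio_sparse:.2f}x")
```

Against the non-tensor figure the MI100 looks roughly 2.4x ahead; against the tensor-core figure it is at roughly 0.6x.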

Everyone is clearly jumping on AMD ML

"Dedicated" RT as in having the ray accelerator fighting for the compute unit, which is either picking the texture OR ray ops/clk and are not concurrent. Shader unit is idle when the pipeline is busy doing pure RT instructions.

The very definition of it in AMD's patent is that it's a hybrid setup. It is not concurrent.

And then you have people coming in saying "Look! NOT that far behind!" when the 7900 XTX falls to 3090 level in RT, while just sweeping under the rug that there's a monumental difference between them in raster performance baseline.

Indirect RT / path tracing maps better to Nvidia and Intel when going to a dedicated RT core, as they can group the same shaders into warps, etc. AMD has intersection instructions in the TMU; they aren't going to regroup threads. So AMD's compiler has been seen compiling indirect into an ubershader... which brings us to the next point:

Portal RTX is tailored to Nvidia? Nvidia gimping poor AMD?

They looked into the game and how AMD handled the AGNOSTIC API calls and it has nothing to do with RT. AMD's compiler thought it best to peg the shader at 256 VGPRs which is the maximum use per wavefront and is spilling to cache. AMD compiler basically makes one GIGANTIC Ubershader that takes 99ms of frametime by itself.
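A rough sketch of why pegging a shader at the maximum VGPR count hurts: the more registers each wavefront needs, the fewer wavefronts fit on a SIMD at once to hide latency. The register file size and the wave limit below are hypothetical placeholders, not RDNA 3's actual figures, and `max_waves` is a made-up helper.

```python
REGFILE_VGPRS = 1024   # hypothetical VGPR budget per SIMD (assumption)
HW_WAVE_LIMIT = 16     # hypothetical cap on resident wavefronts (assumption)


def max_waves(vgprs_per_wave, regfile=REGFILE_VGPRS, limit=HW_WAVE_LIMIT):
    """Resident wavefronts: limited by registers or by the hardware cap."""
    return min(limit, regfile // vgprs_per_wave)


for vgprs in (32, 64, 128, 256):
    print(f"{vgprs:>3} VGPRs per wave -> {max_waves(vgprs)} waves in flight")
```

Under these made-up numbers a 256-VGPR shader leaves only a handful of waves in flight, and anything it needs beyond 256 registers has to spill to memory, which fits the 99 ms ubershader described above.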

Loxus said:
I have proven you wrong.
You can literally see it labeled below: the processing block contains the FP/INT/Tensor units that share the same things.

With the simplified architecture overview?

D E D I C A T E D
A N D
I S O L A T E D
P A T H W A Y S

Those are not in simplified block diagrams

Also tidbits we can't ignore

Benefits of shared memory usage
In section 7, we excluded the experiments for data movement from global memory to shared memory and assume the data are ready in shared memory for ldmatrix instructions which can only fetch data from shared memory to registers. We explained in section 2 with two reasons:
1. Using shared memory as the buffer to reduce global memory traffic and increase data reuse.
2. The novel asynchronous memory copy introduced in Ampere Architecture facilitates the software data pipeline by using shared memory as the staging storage to overlap the data movement with computation.

The first point - using shared memory to increase data reuse has been widely used and been a fundamental optimization technique. A detailed study of this technique can be found in textbook [16] so we exclude further discussions here and emphasize the second point - new asynchronous memory copy.

Asynchronous memory copy acceleration was introduced in Ampere Architecture [30]. It allows asynchronous data movement from off-chip global memory to on-chip shared memory. Compared to the old synchronous copy fashion, asynchronous copy can be leveraged to hide the data copy latency by overlapping it with computation.
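The latency-hiding argument in that excerpt can be sketched as a simple timing model. This is a toy calculation, not CUDA code: the per-tile copy and compute times are hypothetical units, and it assumes ideal double buffering with no other stalls.

```python
def synchronous_time(tiles, copy, compute):
    """Each tile is copied, then computed, with no overlap."""
    return tiles * (copy + compute)


def pipelined_time(tiles, copy, compute):
    """Copy of tile i+1 overlaps compute of tile i (ideal double buffering)."""
    if tiles == 0:
        return 0
    # First copy cannot be hidden, last compute cannot be hidden,
    # and every step in between costs whichever of the two is slower.
    return copy + (tiles - 1) * max(copy, compute) + compute


TILES, COPY, COMPUTE = 8, 2, 3  # hypothetical time units

sync_total = synchronous_time(TILES, COPY, COMPUTE)
async_total = pipelined_time(TILES, COPY, COMPUTE)
print(sync_total, async_total)
```

With these made-up numbers the overlapped version finishes in 26 units instead of 40. That is the kind of gain you only get when the data path and the compute units can really run at the same time.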

Loxus said:
You are a Nvidia fan, of course you are set in your ways.

Edit:
Latency is not an issue for RDNA 3.
Microbenchmarking AMD’s RDNA 3 Graphics Architecture

LDS specifically? really?

The one that had 5.6% utilization for Cyberpunk inline RT micro-bench? You need low latency because the RT will bog it down as soon as you go indirect RT or... heaven forbid, path tracing.

Having 4 level cache vs 2 doesn't add latency too? It's just magic? Seems like there's something more going on outside the LDS..

But yes, that AMD magic is coming to save this. Somehow nobody is leveraging AMD hardware features. Intel comes in all bruised and late to the party on its first iteration and with shit drivers, but somehow they're pretty well placed.

After years of promises, AMD will one day be good at Blender.

Blender-Cycles-GPU-Rendering-Performance-Secret-Deer-1.jpg

RDNA 3 is so good that there's no feedback to give AMD at all from now on. They nailed it. AMD fans are certainly AMD's worst enemies. Who needs to worry about Nvidia when your fanbase is like that?

AMD is currently a good option for gaming. Games have not strayed that much into ML at all nor RT except for a few specific cases. That’s fine! But AMD for dedicated ML or RT workloads? They’re not there.

Zathalus · Apr 27, 2023

Loxus said:
What do you mean by several generations behind?
Are you saying the 3090ti is several generations old?

No, I'm saying Turing is. The 7900 XTX is only matching the 3090ti in this specific benchmark despite being the faster card, thus the ML capabilities do not match Ampere and are worse. So generations behind in this one benchmark. Same goes for RT really, in any demanding RT benchmark RDNA3 falls behind even Ampere.

rnlval · Apr 27, 2023

SABRE220 said:
It's not a realistic possibility. Nvidia has already showcased that they have little desire or ability to nurture and maintain sustainable long-term relationships with partners. They have time and time again screwed over their business partners and are cutthroat and opportunistic to the extreme. It's sad how even EVGA got pushed to ending things, and Microsoft and Sony have already been burned in the past... That being said, unless AMD gets off their asses they might be forced to research better alternatives, as it would become increasingly difficult to justify a meaningful difference between PS5 and PS6 unless AMD actually remembers their past heritage in innovation and pushing tech from before the PS4 days... But sadly I don't think there are any options, though Intel could be a dark horse.

To their credit, Nvidia's greedy and ruthless strategy has worked for them; they have virtually no competition in the GPU market... AMD is almost 1.5 generations behind in tech and features, and Nvidia doesn't need to dabble in the console market to maintain their financial goals. I hate Nvidia's greed and predatory tactics over the last decade... I remember the 8800 GT, and now look at the 4070, and it's not pretty. The sad thing is that even outside of path tracing, AMD's flagship offering is inferior to even a 3090 in RT workloads... its FSR is even inferior to DLSS 2... and honestly when you look at the 4xxx series it's at least a generation of difference... I mean, in path tracing scenarios it's a complete massacre, to the point it's not even an option. Honestly, as an AMD fan I have stopped hoping after following the typical hype/bust cycle... I only see Nvidia pulling further ahead unless Intel manages to save us with a Hail Mary... which says a lot about AMD when someone like Intel manages to make better RT rendering and (arguably) a more comprehensive image reconstruction tech in their first attempt.

RX 7900 XTX has about 25% fewer texture units than the RTX 4090 (384 vs. 512), hence the reason why the RX 7900 XTX lands at about RTX 4080 level.

The main difference between RDNA 2's DCU (128 SP for wave32 and wave64) and RDNA 3's DCU (128 SP for wave32 and wave64, plus a new 128 SP for wave32) is that AMD doubled the stream processor units for the wave32 instruction set without scaling the texture units. RDNA 3 DCU's extra SP units are geared towards geometry processing with RT traversal.

RX 7900 XTX's raster is similar to RTX 4090 level.

On RTX 4090, NVIDIA's path tracing for Cyberpunk 2077 needs DLSS's fake pixels since it's too bloated.

rnlval · Apr 27, 2023

Buggy Loop said:
Right...

The AMD magic sauce coming soon^TM

Are you still like those people from the early RDNA 2 rumours who saw an RT block and said "AMD has dedicated RT!" too? There's a world of difference between having silicon dedicated to a task and having them all work concurrently with their own pathways to not stall the pipeline.

You're looking at blocks with no pathway details. You sure told me!

http://...icle/2570-gpu-architectures-nvidia-intel-amd/

"Another significant new feature is the appearance of what AMD calls AI Matrix Accelerators.

Unlike Intel's and Nvidia's architecture, which we'll see shortly, these don't act as separate units – all matrix operations utilize the SIMD units and any such calculations (called Wave Matrix Multiply Accumulate, WMMA) will use the full bank of 64 ALUs.

Intel also chose to provide the processor with dedicated units for matrix operations, one for each Vector Engine. Having this many units means a significant portion of the die is dedicated to handling matrix math.

Where AMD uses the DCU's SIMD units to do this and Nvidia has four relatively large tensor/matrix units per SM, Intel's approach seems a little excessive, given that they have a separate architecture, called Xe-HP, for compute applications."

"Dedicated"

The AI accelerators only double BF16 execution over RDNA 2. Nvidia's tensor cores have been accelerating matrix operations for a long time, back to Volta even.

So outstanding that the whitepaper for AMD's CDNA matrix cores (from which RDNA 3 copied the WMMA matrix multiplication) made its business case for outstanding performance by ...

For FP16, they only reference the "78 TFLOPS peak half precision (FP16) theoretical" value from Nvidia's whitepaper, which is the non-tensor-core value. With tensor cores it's 312 TFLOPS (624 for sparse matrix ops).

Everyone is clearly jumping on AMD ML

"Dedicated" RT as in having the ray accelerator fighting for the compute unit, which is either picking the texture OR ray ops/clk and are not concurrent. Shader unit is idle when the pipeline is busy doing pure RT instructions.

The very definition of it in AMD's patent is that it's a hybrid setup. It is not concurrent.

And then you have people coming in saying "Look! NOT that far behind!" when the 7900 XTX falls to 3090 level in RT, while just sweeping under the rug that there's a monumental difference between them in raster performance baseline.

Indirect RT / path tracing maps better to Nvidia and Intel when going to a dedicated RT core, as they can group the same shaders into warps, etc. AMD has intersection instructions in the TMU; they aren't going to regroup threads. So AMD's compiler has been seen compiling indirect into an ubershader... which brings us to the next point:

Portal RTX is tailored to Nvidia? Nvidia gimping poor AMD?

They looked into the game and how AMD handled the AGNOSTIC API calls and it has nothing to do with RT. AMD's compiler thought it best to peg the shader at 256 VGPRs which is the maximum use per wavefront and is spilling to cache. AMD compiler basically makes one GIGANTIC Ubershader that takes 99ms of frametime by itself.

With the simplified architecture overview?

D E D I C A T E D
A N D
I S O L A T E D
P A T H W A Y S

Those are not in simplified block diagrams

Also tidbits we can't ignore

Benefits of shared memory usage
In section 7, we excluded the experiments for data movement from global memory to shared memory and assume the data are ready in shared memory for ldmatrix instructions which can only fetch data from shared memory to registers. We explained in section 2 with two reasons:
1. Using shared memory as the buffer to reduce global memory traffic and increase data reuse.
2. The novel asynchronous memory copy introduced in Ampere Architecture facilitates the software data pipeline by using shared memory as the staging storage to overlap the data movement with computation.

The first point - using shared memory to increase data reuse has been widely used and been a fundamental optimization technique. A detailed study of this technique can be found in textbook [16] so we exclude further discussions here and emphasize the second point - new asynchronous memory copy.

Asynchronous memory copy acceleration was introduced in Ampere Architecture [30]. It allows asynchronous data movement from off-chip global memory to on-chip shared memory. Compared to the old synchronous copy fashion, asynchronous copy can be leveraged to hide the data copy latency by overlapping it with computation.

LDS specifically? really?

The one that had 5.6% utilization for Cyberpunk inline RT micro-bench? You need low latency because the RT will bog it down as soon as you go indirect RT or... heaven forbid, path tracing.

Having 4 level cache vs 2 doesn't add latency too? It's just magic? Seems like there's something more going on outside the LDS..

But yes, that AMD magic is coming to save this. Somehow nobody is leveraging AMD hardware features. Intel comes in all bruised and late to the party on its first iteration and with shit drivers, but somehow they're pretty well placed.

After years of promises, AMD will one day be good at Blender.

RDNA 3 is so good that there's no feedback to give AMD at all from now on. They nailed it. AMD fans are certainly AMD's worst enemies. Who needs to worry about Nvidia when your fanbase is like that?

AMD is currently a good option for gaming. Games have not strayed that much into ML at all nor RT except for a few specific cases. That’s fine! But AMD for dedicated ML or RT workloads? They’re not there.

FYI, RDNA 3 DCU's new 128 SPs do NOT execute the legacy wave64 instruction set, hence only half of RDNA 3's DCU stream processors can execute both the legacy wave64 and wave32 instruction sets. HIP needs to be updated for wave32-only operations. HIP was designed to work with CDNA / CDNA 2, which use the GCN Vega-based wave64 instruction set.

SABRE220 · Apr 27, 2023

rnlval said:
RX 7900 XTX has about 25% fewer texture units than the RTX 4090 (384 vs. 512), hence the reason why the RX 7900 XTX lands at about RTX 4080 level.

The main difference between RDNA 2's DCU (128 SP for wave32 and wave64) and RDNA 3's DCU (128 SP for wave32 and wave64, plus a new 128 SP for wave32) is that AMD doubled the stream processor units for the wave32 instruction set without scaling the texture units. RDNA 3 DCU's extra SP units are geared towards geometry processing with RT traversal.

RX 7900 XTX's raster is similar to RTX 4090 level.

On RTX 4090, NVIDIA's path tracing for Cyberpunk 2077 needs DLSS's fake pixels since it's too bloated.

You can compare it to a 4080 and my argument still stands. In modern next-gen rendering pipelines that use a comprehensive ray tracing solution, AMD's flagship falls behind even a 3090 at times. When path tracing becomes involved it's a massacre, and it falls behind even a damn 3080 Ti; that is literally more than a generation's worth of difference.

People are not buying these enthusiast cards to play games with graphics settings turned off, so please stop making the argument for rasterization only as if it's a solid argument. The reason PC gamers buy $1000+ GPUs is because they want to use the best graphics tech and want to see the tech pushed forward. Like it or not, moving forward ML and RT workloads are going to become more and more prevalent and comprehensive. AMD and their stans can keep putting their heads in the sand shouting rasterization, or they can actually try to compete and make their products competitive.

I have been an AMD fan for years. I have seen them deliver amazing products that pushed tech and efficiency, sometimes even beating Nvidia. Even by the PS4 launch their GPUs were amazing, but their lax attitude and fall in standards have led them to basically being a distant second choice, where they have abandoned any real effort in actually competing and have become content feeding off the scraps Nvidia leaves in their product line. Their compute cores are not even at the level of the Nvidia 3000 series, and in machine learning workloads even Intel, who just launched their first dedicated GPU, leapfrogged them in this aspect. The same is the case with their RT tech, where even the A770 line is more impressive in terms of RT tech. Their FSR solution is still lacking any deep learning capabilities and they will inevitably have to invest in it eventually to compete... their current offering is serviceable but is again inferior to the DLSS that was available even on the 2080...

The truth is AMD has basically avoided the hard work and risk associated with developing their tech with a long-term vision. They basically focused on only developing the tech their architecture was comfortable with while ignoring its shortcomings, which has left them so far behind that Nvidia basically has no fear. They are now forced to do what they should have done from the start and are investing in dedicated ML cores, neural learning, RT cores, etc., and are unfortunately breaking into this tech when Nvidia has advanced it for two gens.

Say what you want, but when Intel with their first-gen product made a more ambitious architecture than AMD's recent offerings in terms of tech, you know things are bad. RDNA 3 has honestly been the most unambitious and disappointing launch from AMD in a while.

rnlval · Apr 27, 2023

Buggy Loop said:
After years of promises, AMD will one day be good at Blender.

RDNA 3 is so good that there's no feedback to give AMD at all from now on. They nailed it. AMD fans are certainly AMD's worst enemies. Who needs to worry about Nvidia when your fanbase is like that?

AMD is currently a good option for gaming. Games have not strayed that much into ML at all nor RT except for a few specific cases. That’s fine! But AMD for dedicated ML or RT workloads? They’re not there.

https://www.phoronix.com/review/rx7900-blender-opencl/2

rnlval · Apr 27, 2023

SABRE220 said:
You can compare it to a 4080 and my argument still stands. In modern next-gen rendering pipelines which utilize a comprehensive ray tracing solution amds flagship falls behind even a 3090 at times. When path tracing becomes involved its a massacre and it falls behind even a damn 3080ti that is literally more than generations worth of difference.

People are not buying these enthusiast cards to play games with graphics settings turned off, please stop making the arguments for rasterization only as if its a solid argument. The reason PC gamers buy 1000+ dollar gpus is because they want to utilize the best graphics tech and want to see the tech pushed forward. Like it or not moving forward ML and RT workloads are going to become more and more prevalent and comprehensive. Amd and their stans can keep putting their head in the sand shouting raserization or they can actually try to compete and make their products competitive.

I was and have been a amd fan for years, I have seen them deliver amazing products that pushed tech and efficiency sometimes even beating nvidia. Even by the ps4 launch their gpus were amazing but their lax attitude and fall in standards have led them to basically being a distant second choice where they have abandoned any real effort in actually competing and have become content feeding off the scraps nvidia leaves in their product line. Their compute cores are not even at the level of the nvidia 3000 series and in machine learning workoads even intel who just launched their first dedicated gpu leapfrogged them in this aspect, the same is the case with their Rt tech where even a a770 line is more impressive in terms of rt tech. Their frs solution is still lacking any deep learning capabilities and they will inevitabely have to invest in it eventually to compete...their current offering is servicable but is agains inferior to the dlss that was available even on the 2080....

The truth is AMD has basically avoided the hard work and risk associated with developing their tech with a long-term vision. They focused only on developing the tech their architecture was comfortable with while ignoring its shortcomings, which has left them so far behind that Nvidia basically has no fear. They are now forced to do what they should have done from the start and are investing in dedicated ML cores, neural learning, RT cores etc., and are unfortunately breaking into this tech when Nvidia has already advanced it for two gens.

Say what you want, but when Intel with their first-gen product made a more ambitious architecture than AMD's recent offerings in terms of tech, you know things are bad. RDNA 3 has honestly been the most unambitious and disappointing launch from AMD in a while.

The game's core workloads are textures and raster, these cover the game's primary artwork and NVIDIA failed with 8 GB VRAM RTX 3070 / 3070 Ti BS. LOL

What's the massive ML server use case for Sony's single-player story bias games?

For the record, I have ASUS TUF RTX 4090 24 GB.

and Gigabyte RTX 4080 16 GB Gaming OC.

Loxus · Apr 27, 2023

Buggy Loop said:
Right...

The AMD magic sauce coming soon^TM

Are you still like those people from the early RDNA 2 rumours who saw an RT block and said "AMD has dedicated RT!" too? There's a world of difference between having silicon dedicated to a task and having them all work concurrently with their own pathways to not stall the pipeline.

You're looking at blocks with no pathway details. You sure told me!

htt...icle/2570-gpu-architectures-nvidia-intel-amd/

"Another significant new feature is the appearance of what AMD calls AI Matrix Accelerators.

Unlike Intel's and Nvidia's architecture, which we'll see shortly, these don't act as separate units – all matrix operations utilize the SIMD units and any such calculations (called Wave Matrix Multiply Accumulate, WMMA) will use the full bank of 64 ALUs.

Intel also chose to provide the processor with dedicated units for matrix operations, one for each Vector Engine. Having this many units means a significant portion of the die is dedicated to handling matrix math.

Where AMD uses the DCU's SIMD units to do this and Nvidia has four relatively large tensor/matrix units per SM, Intel's approach seems a little excessive, given that they have a separate architecture, called Xe-HP, for compute applications."

"Dedicated"

The AI accelerators only double BF16 execution over RDNA 2. Nvidia's tensor cores have been doing this for a long time, back to Volta even.

So outstanding that the whitepaper for AMD's CDNA matrix cores (from which RDNA 3 copied the WMMA matrix multiplication) made its business case of having outstanding performance by ...

For FP16, they only reference the "78 TFLOPS peak half precision (FP16) theoretical" value from Nvidia's whitepaper, which is the no-tensor-core value. With tensor cores it's 312 TFLOPS (624 for sparse matrix ops).

Everyone is clearly jumping on AMD ML

"Dedicated" RT as in having the ray accelerator fighting for the compute unit, which is either picking the texture OR ray ops/clk and are not concurrent. Shader unit is idle when the pipeline is busy doing pure RT instructions.

The very definition of it in AMD's patent is that it's a hybrid setup. It is not concurrent.

And then you have people coming in saying "Look! NOT that far behind!" when the 7900 XTX falls to 3090 level in RT, while just sweeping under the rug that there's a monumental difference between them in baseline raster performance.

Indirect RT / path tracing maps better to Nvidia and Intel when going to a dedicated RT core, as they can group the same shaders into warps, etc. AMD has intersection instructions in the TMU; they aren't going to regroup threads. So AMD's compiler has been seen compiling indirect RT into an ubershader... which brings us to the next point:

Portal RTX is tailored to Nvidia? Nvidia gimping poor AMD?

They looked into the game and how AMD handled the AGNOSTIC API calls and it has nothing to do with RT. AMD's compiler thought it best to peg the shader at 256 VGPRs which is the maximum use per wavefront and is spilling to cache. AMD compiler basically makes one GIGANTIC Ubershader that takes 99ms of frametime by itself.

With the simplified architecture overview?

D E D I C A T E D
A N D
I S O L A T E D
P A T H W A Y S

Those are not in simplified block diagrams

Also tidbits we can't ignore

Benefits of shared memory usage
In section 7, we excluded the experiments for data movement from global memory to shared memory and assume the data are ready in shared memory for ldmatrix instructions, which can only fetch data from shared memory to registers. We explained in section 2 with two reasons:
1. Using shared memory as the buffer to reduce global memory traffic and increase data reuse.
2. The novel asynchronous memory copy introduced in Ampere Architecture facilitates the software data pipeline by using shared memory as the staging storage to overlap the data movement with computation.
The first point - using shared memory to increase data reuse has been widely used and been a fundamental optimization technique. A detailed study of this technique can be found in textbook [16] so we exclude further discussions here and emphasize the second point - new asynchronous memory copy.
Asynchronous memory copy acceleration was introduced in Ampere Architecture [30]. It allows asynchronous data movement from off-chip global memory to on-chip shared memory. Compared to the old synchronous copy fashion, asynchronous copy can be leveraged to hide the data copy latency by overlapping it with computation.
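
The latency-hiding idea in that excerpt can be shown with a toy timeline model. This is only a rough sketch in Python, not anything from the paper: the function names and the tile, copy and compute numbers are made up for illustration, and it assumes every tile costs the same to copy and to compute.

```python
def synchronous_time(tiles: int, copy_time: float, compute_time: float) -> float:
    """Each tile is copied, then computed, with no overlap at all."""
    return tiles * (copy_time + compute_time)


def overlapped_time(tiles: int, copy_time: float, compute_time: float) -> float:
    """Copy of tile i+1 runs while tile i is being computed (double buffering).

    Only the first copy and the last compute are fully exposed; every step in
    between costs whichever of the two is slower.
    """
    if tiles == 0:
        return 0.0
    return copy_time + (tiles - 1) * max(copy_time, compute_time) + compute_time


# Hypothetical numbers: 8 tiles, 2 time units to copy, 3 to compute.
sync_total = synchronous_time(8, 2.0, 3.0)
async_total = overlapped_time(8, 2.0, 3.0)
print(sync_total, async_total)
```

When compute is the slower stage the copies disappear almost entirely behind it. When the copy is slower, the overlap only hides the compute, so the memory side still sets the pace.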

LDS specifically? really?

The one that had 5.6% utilization for Cyberpunk inline RT micro-bench? You need low latency because the RT will bog it down soon as you go indirect RT or... heaven forbid, path tracing.

Having 4 level cache vs 2 doesn't add latency too? It's just magic? Seems like there's something more going on outside the LDS..

But yes, that AMD magic is coming to save this. Somehow nobody is leveraging AMD hardware features. Intel comes in all bruised and late to the party, on its first iteration and with shit drivers, but somehow they're pretty well placed.

After years of promises, AMD will one day be good at Blender.

RDNA 3 is so good that there's no feedback to give AMD at all from now on. They nailed it. AMD fans are certainly AMD's worst enemies. Who needs to worry about Nvidia when your fanbase is like that?

AMD is currently a good option for gaming. Games have not strayed that much into ML at all nor RT except for a few specific cases. That’s fine! But AMD for dedicated ML or RT workloads? They’re not there.

This should hopefully put an end to you not believing that AMD's ML is just as dedicated as Nvidia's.
I suggest you read both these articles in full.

Zenji Nishikawa's 3DGE: Primitive Shader vs. Mesh Shader Truth. Inside the Geometry Pipeline War and AMD's GPU Strategy for Gamers
AMD has incorporated an inference accelerator "AI Accelerator" into the GPU in the RDNA 3 architecture. This is equivalent to "Tensor Core" in NVIDIA's GeForce RTX series. Intel has already installed the same kind of "Xe Matrix Engine" (XMX) in the "Intel Arc" series of standalone GPUs, so AMD was the last to install an inference accelerator on the GPU.

　
Zenji Nishikawa's 3DGE: What has changed in the Radeon RX 7900 XTX/XT? Explore the secrets of the Navi 31 generation, which has achieved significant performance improvements
　Another hot topic at CU Pair is "AI Accelerator". This is equivalent to the inference accelerator "Tensor Core" installed in NVIDIA's GeForce RTX series. Intel also has the same kind of "Xe Matrix Engine" (XMX) in the "Intel Arc" series of standalone GPUs, so AMD was a little behind in installing an inference accelerator on the GPU.

　The AI Accelerator installed in Navi 31 has a configuration of 2 units per CU, and each AI Accelerator is equipped with 64 units of "Wave Matrix Multiply Accumulate" (WMMA), which is a 32-bit SIMD multiply-accumulator. WMMA, a 32-bit arithmetic unit, is actually a matrix arithmetic unit specialized for AI-related processing, and the numerical formats it can handle are limited to the following. FP32 is not supported.

16bit floating point (FP16)
BF16 (bfloat16: sign 1bit, exponent 8bit, mantissa 7bit)
8bit integer (INT8)
4-bit integer (INT4)
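
For anyone unsure what the BF16 line means in practice: the 1/8/7 bit split is simply the top half of a regular FP32 value. Here is a small Python sketch of that layout. The helper names are mine, and it truncates instead of rounding, which is an assumption for simplicity and not necessarily what the hardware does.

```python
import struct


def float_to_bf16_bits(value: float) -> int:
    """Return the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16


def split_bf16(bits16: int) -> tuple:
    """Split 16 bits into (sign, exponent, mantissa) = (1, 8, 7) bits."""
    sign = bits16 >> 15
    exponent = (bits16 >> 7) & 0xFF
    mantissa = bits16 & 0x7F
    return sign, exponent, mantissa


def bf16_bits_to_float(bits16: int) -> float:
    """Expand bfloat16 bits back to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


bits = float_to_bf16_bits(-2.0)
print(hex(bits), split_bf16(bits), bf16_bits_to_float(bits))
```

Because the exponent keeps all 8 bits, BF16 covers the same range as FP32 and only gives up precision, which is why it is popular for ML work.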

　Also, when handling INT8, the SIMD parallelism is the same as FP16 and BF16. INT4 finally increases parallelism.

　Now let's find the theoretical performance value of the AI Accelerator. Consider FP16, which is the most common example in the AI processing field.
　There are 64 AI Accelerator WMMAs, each of which supports 2-element product-sum calculation (2 FLOPS) in FP16, so the throughput per clock is as follows.

64 WMMA x 2 elements x 2 FLOPS = 256 FLOPS

Diagram of executing matrix product (vector inner product) with AI Accelerator

　
Two AI Accelerators are installed per CU, and Navi 31 has 96 CUs, so the FP16 theoretical performance values per clock are as follows.

256 FLOPS x 2 AI Accelerator x 96 CUs = 49152 FLOPS

　WMMA is driven by the GPU core clock, so if you apply 2.5 GHz of Radeon RX 7900 XTX, the FP16 theoretical performance value of AI Accelerator for the entire GPU will come out.

49152 FLOPS x 2.5GHz = 122.88 TFLOPS
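
The three steps of that estimate fit in a few lines of Python. This is just a sketch of the article's peak-rate arithmetic: the variable names are mine, every input is taken from the quoted text, and it says nothing about sustained performance.

```python
# Inputs as stated in the quoted article (names are illustrative only).
WMMA_UNITS_PER_ACCELERATOR = 64
ELEMENTS_PER_UNIT = 2        # 2-element product-sum per WMMA unit in FP16
FLOPS_PER_ELEMENT = 2        # one multiply plus one add
ACCELERATORS_PER_CU = 2
COMPUTE_UNITS = 96           # Navi 31
CLOCK_GHZ = 2.5              # clock the article applies for the RX 7900 XTX

# Step 1: FLOPS per clock for one AI Accelerator.
flops_per_accelerator = (
    WMMA_UNITS_PER_ACCELERATOR * ELEMENTS_PER_UNIT * FLOPS_PER_ELEMENT
)

# Step 2: FLOPS per clock for the whole GPU.
flops_per_clock = flops_per_accelerator * ACCELERATORS_PER_CU * COMPUTE_UNITS

# Step 3: multiply by the clock. FLOPS/clock x GHz gives GFLOPS, so divide
# by 1000 to express the result in TFLOPS.
peak_fp16_tflops = flops_per_clock * CLOCK_GHZ / 1000

print(flops_per_accelerator, flops_per_clock, peak_fp16_tflops)
```

Changing CLOCK_GHZ or COMPUTE_UNITS shows how directly the headline number follows from those two inputs.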

As for this.

This is just calculating the max theoretical performance using the method above.
It doesn't take into account the 2.7x throughput.

122.88 x 2.7 = 331.776 theoretical max
3090 Ti - 320.0 theoretical max

Which lines up with the actual performance.
7900 XTX - 19.296
3090 Ti - 19.238
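
To make the comparison explicit, here are the same numbers as ratios in Python. It is a sketch that takes the figures in this post at face value: the 2.7x factor and both benchmark scores are simply the values quoted above, and the variable names are mine.

```python
article_peak_tflops = 122.88     # result of the article's calculation
claimed_uplift = 2.7             # throughput factor used in this post
rtx_3090_ti_peak_tflops = 320.0  # theoretical max quoted in this post

scaled_7900_xtx_tflops = article_peak_tflops * claimed_uplift
theoretical_ratio = scaled_7900_xtx_tflops / rtx_3090_ti_peak_tflops

# Benchmark scores as quoted in this post.
score_7900_xtx = 19.296
score_3090_ti = 19.238
measured_ratio = score_7900_xtx / score_3090_ti

print(f"theoretical: {theoretical_ratio:.3f}x, measured: {measured_ratio:.3f}x")
```

Both ratios come out within a few percent of 1.0, which is what "lines up" means here.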

This is the same thing with teraflops.
RDNA 3 10 TF performs better than RDNA 1 10 TF.
Calculation and numbers are the same but performance is different.

A good example is above, just look at the 3090 Ti vs 4070 Ti. Same max theoretical but different performance.

I would suggest you read this article as well. AMD basically confirmed that the AI Accelerators in RDNA 3 are equivalent to Nvidia's Tensor cores and that there will be 2nd-gen AI Accelerators in RDNA 4.

AMD plans to harness the power of AI to transform gaming with its next-gen GPUs
AMD executives David Wang and Rick Bergman have confirmed that we’ll be seeing a lot more AI in the next generation of graphics cards from the tech giant, which will be built on AMD’s RDNA 4 architecture.

In a recent interview with the Japanese gaming website 4gamer, the AMD execs detailed some of what we can expect from RDNA 4. Naturally, front and center was confirmation that we’ll be seeing the second iteration of Team Red’s AI Accelerator cores (similar to Nvidia’s Tensor cores), which were first introduced in the current-gen RDNA 3 GPUs - such as the excellent Radeon RX 7900 XTX, currently the best AMD graphics card on the market.

Nvidia’s tech is still lightyears ahead of AMD when it comes to AI processes - just look at the RTX 4090 - but these second-gen AI cores should offer a serious step up. Beyond the Accelerator cores, the pair also discussed some other nifty new features, most importantly a new self-contained GPU pipeline that allows for rendering and texture processes to be generated exclusively on the GPU without needing to communicate with the CPU.

This has massive potential to boost the processing speed of RDNA 4 GPUs, since it won’t need to rely on the CPU and system RAM to carry out some of its workloads, effectively cutting out two potential system bottlenecks. According to Wang and Bergman, we can expect a massive 2.2x performance boost over the current RDNA 3 cards.

Considering past performance, AMD matching the 3090 Ti in both ML and RT shows just how much potential AMD has for the next consoles. There is no reason to switch to Nvidia.

Del_X · May 1, 2023

No - I don’t think so. Maybe after die shrinks or other cost cutting measures the next gen consoles can go from $599 or $699 to $499 - but not at launch.

I’d wager Microsoft, if they make another Xbox, launches at $699 a year ahead of PlayStation and then cuts the price to parity later. That’s really the only strategy they have to be competitive in the console space, like in the 360 generation.


DLSS and Path Tracing could force console makers to go with NVIDIA in the future

Could sufficient advances in DLSS and Path Tracing support bring Sony to NVIDIA?

No, the competition eventually catches up and offers a similar current value.

No, the console market just doesn't care enough to afford the price.

Yes, they corner the market by subsidising the chip and outvalue the competition.

Yes, the difference will become even larger and consumers will pay for it in the end.
