
Oxide: Nvidia GPUs do not support DX12 Asynchronous Compute/Shaders.

Locuza

Member
Are you sure it supports FP16?
Like icecold1983 said, yes, but without double throughput.
But at least FP16 saves registers, so it can still improve performance.
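For reference, here's a rough sketch (my own, not from the thread) of how an app can at least ask D3D12 whether the driver exposes 16-bit minimum precision; it assumes you already have a valid ID3D12Device named device:

```cpp
// Hypothetical helper: query 16-bit min-precision support via D3D12.
// Assumes `device` is a valid ID3D12Device* created elsewhere.
#include <d3d12.h>

bool SupportsFp16MinPrecision(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                           &options, sizeof(options))))
        return false;

    // 16-bit min precision mostly helps the compiler pack values into fewer
    // registers; the ALUs may still run the math at full FP32 rate.
    return (options.MinPrecisionSupport &
            D3D12_SHADER_MIN_PRECISION_SUPPORT_16_BIT) != 0;
}
```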

it does not have double rate fp16, i believe pascal will be the first desktop chip to support it
Broadwell and Skylake should both support double rate FP16 throughput.
If we are talking about discrete GPUs, then maybe Pascal, maybe GCN Gen 4.
Whoever comes first.
 

tuxfool

Banned
it does not have double rate fp16, i believe pascal will be the first desktop chip to support it

I was asking about GCN 1.2.

However, I was under the impression that Maxwell 2 already supported double rate fp16, but it turns out it is only the variant found in the Tegra X1.
 

Fractal

Banned
True or not, wouldn't worry about this at all... right now DX12 is pretty far from being fully adopted by the industry. By the time it becomes relevant, it's a sure thing Nvidia will fix any compatibility issues... or start losing money and market share to AMD, and I doubt that's what they're aiming for.

My 780 Ti is already starting to show its age on more demanding titles at 1440p, won't be keeping it around by the time DX12 is properly established.
 

dr_rus

Member
What we need here is some official information from NVIDIA on this.

I kinda don't believe these guys at all. They have an AMD logo on their website (that screenshot from PCARS, which runs like shit on AMD's cards, kinda proves the point here), and the engine was created and showcased alongside Mantle (yeah, sure, NV may have spent more time trying to fix it in the two months prior to launch, but it clearly isn't fixed, so either they rejected whatever NV proposed or it simply wasn't possible in such a short timeframe).

There are also several rather funny sentences in this blogpost - like calling AMD's h/w "Tier 3" and NV's "Tier 2", which means they're using extended features of GCN above FL12_0 but nothing additional from Maxwell 2's FL12_1, or saying that we should be thankful to AMD for not blocking them from working with NV (what? no marketing deal can block someone from testing their code on any hardware, with or without the IHV's support).

I can believe that async compute will lead to performance loss on Kepler and Fermi but from what we know of it right now Maxwell 2 should be ok with it.

Also a lot of people here don't really understand what async compute even is. There is nothing stopping NV from supporting it and in fact they have supported it somewhat earlier than AMD has - but for compute tasks only, not for graphics+compute.
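To make that distinction concrete, here's a minimal sketch (my own, not dr_rus's) of what async compute looks like at the D3D12 API level: a DIRECT (graphics) queue plus a separate COMPUTE queue. Whether they actually run concurrently is up to the hardware and driver; the code just assumes an existing ID3D12Device:

```cpp
// Sketch: create a graphics queue and a separate compute queue in D3D12.
// Assumes `device` is a valid ID3D12Device*.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}
```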

As for features being exposed in the drivers but not really supported by the h/w - AMD has been doing this since DX9 SM3 (no texture fetch in vertex shaders, which basically made their R5x0 line non-SM3-compatible, but they exposed the absent feature in their drivers anyway) and throughout the whole DX11 era (do their drivers even support DX11 deferred contexts now? aka "DX11 multithreading"? this is a required feature of DX11 which they didn't support last time I checked). So lol at anyone trying to paint AMD as some noble forward looking company fighting with evil NV and Intel overlords. They all lie.
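For what it's worth, here's a small sketch (mine, assuming an existing ID3D11Device) of how that "DX11 multithreading" support is actually queried - the runtime always emulates deferred contexts, and this flag tells you whether the driver handles command lists natively:

```cpp
// Sketch: ask the D3D11 driver whether it natively supports command lists
// (i.e. "real" deferred contexts). Assumes `device` is a valid ID3D11Device*.
#include <d3d11.h>

bool DriverSupportsCommandLists(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING threading = {};
    if (FAILED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                           &threading, sizeof(threading))))
        return false;

    // FALSE here means deferred contexts are emulated by the runtime,
    // which is what the "not really supported" complaints are about.
    return threading.DriverCommandLists == TRUE;
}
```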
 

KingBroly

Banned
True or not, wouldn't worry about this at all... right now DX12 is pretty far from being fully adopted by the industry. By the time it becomes relevant, it's a sure thing Nvidia will fix any compatibility issues... or start losing money and market share to AMD, and I doubt that's what they're aiming for.

My 780 Ti is already starting to show its age on more demanding titles at 1440p, won't be keeping it around by the time DX12 is properly established.

Honestly? By the time Nvidia starts making strides towards DX12, it will be at the point where their marketshare really can't get any higher without AMD dropping out.
 

Fractal

Banned
Honestly? By the time Nvidia starts making strides towards DX12, it will be at the point where their marketshare really can't get any higher without AMD dropping out.
That's possible as well... but what I'm trying to say is, by the time full DX12 compliance is expected, Nvidia will have it... even if AMD goes down, the market itself will demand it, so they'll be pushed into it either way.

Honestly, I think this entire debate is pointless. By the time DX12 is a relevant factor, I'm pretty sure our current cards will be notably outdated and most of us will replace them. As soon as Pascal shows up, my 780 Ti will face a quick retirement.
 

tuxfool

Banned
(do their drivers even support DX11 deferred contexts now? aka "DX11 multithreading"? this is a required feature of DX11 which they didn't support last time I checked).

No, they don't. I don't think they ever will. Nvidia is the only IHV to support it.
 
Like icecold1983 said, yes, but without double throughput.
But at least FP16 saves registers, so it can still improve performance.


Broadwell and Skylake should both support double rate FP16 throughput.
If we are talking about discrete GPUs, then maybe Pascal, maybe GCN Gen 4.
Whoever comes first.

oh, intel gpus dont even factor into my thoughts tbh lol. i dont have a clue on their perf/features currently.

I was asking about GCN 1.2.

However, I was under the impression that Maxwell 2 already supported double rate fp16, but it turns out it is only the variant found in the Tegra X1.

yeah, no currently available gcn implementation has it
 

Locuza

Member
I can believe that async compute will lead to performance loss on Kepler and Fermi but from what we know of it right now Maxwell 2 should be ok with it.
As far as I see things, Maxwell 2 needs to preempt workloads if it wants to execute something asynchronously.
It now has the ability to schedule up to 31 compute queues, but not asynchronously and only in one big package.

(do their drivers even support DX11 deferred contexts now? aka "DX11 multithreading"? this is a required feature of DX11 which they didn't support last time I checked). So lol at anyone trying to paint AMD as some noble forward looking company fighting with evil NV and Intel overlords. They all lie.
No, Intel also doesn't support deferred contexts, because they said it was not worth it.
 

tuxfool

Banned
As far as I see things, Maxwell 2 needs to preempt workloads if it wants to execute something asynchronously.
It now has the ability to schedule up to 31 compute queues, but not asynchronously and only in one big package.

According to Anandtech, Maxwell 2 should have the ability to execute on the graphics command queue and a compute queue simultaneously. It shouldn't need to preempt like Kepler and Fermi.
 

frontieruk

Member
[Image: Anandtech chart of asynchronous compute queue/engine counts per GPU architecture]


From Anandtech; looks like Nvidia should have a big advantage under Maxwell 2
 

tuxfool

Banned
[Image: Anandtech chart of asynchronous compute queue/engine counts per GPU architecture]


From Anandtech; looks like Nvidia should have a big advantage under Maxwell 2

That chart is a bit misleading.

Maxwell 2 has 32 queues (31+1). However, what it lists there for the GCN architectures are ACEs, each of which has 8 queues, so e.g. GCN 1.1 has 8x8=64 queues. They would only be equivalent if all 32 Maxwell queues could be scheduled independently (is this the case?), which is what each ACE does.
 

Arkanius

Member
Where did you read that / have that information?

It was in another forum apparently

A GTX 980 Ti can handle both compute and graphic commands in parallel. What they cannot handle is Asynchronous compute. That's to say the ability for independent units (ACEs in GCN and AWSs in Maxwell/2) to function out of order while handling error correction.

It's quite simple if you look at the block diagrams of both architectures. The ACEs reside outside of the Shader Engines. They have access to the Global Data Share cache, the L2 R/W cache pools in front of each quad of CUs, as well as the HBM/GDDR5 memory, in order to fetch commands, send commands, perform error checking or synchronize for dependencies.

The AWSs, in Maxwell/2, reside within their respective SMMs. They may have the ability to issue commands to the CUDA cores residing within their respective SMMs, but communicating or issuing commands outside of their respective SMMs would demand sharing a single L2 cache pool. This caching pool has neither the space (sizing) nor the bandwidth to function in this manner.

Therefore enabling Async Shading results in a noticeable drop in performance, so noticeable that Oxide disabled the feature and worked with NVIDIA to get the most out of Maxwell/2 through shader optimizations.

It's architectural. Maxwell/2 will NEVER have this capability.

It's also in their documentation, trying to find it right now
 

backstep

Neo Member
...there are several rather funny sentences in this blogpost - like calling AMD's h/w "Tier 3" and NV's "Tier 2", which means they're using extended features of GCN above FL12_0 but nothing additional from Maxwell 2's FL12_1, or saying that we should be thankful to AMD for not blocking them from working with NV (what? no marketing deal can block someone from testing their code on any hardware, with or without the IHV's support).

The tier 3 on GCN vs tier 2 on maxwell thing is probably referring to D3D12's resource binding tiers. The most impactful difference is with the max bound CBVs per stage, but even then there's very little overhead in setting a new root CBV or a descriptor table.
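For the curious, a quick sketch (not from backstep's post; assumes an existing ID3D12Device) of how an engine reads that binding tier back from D3D12:

```cpp
// Sketch: query the resource binding tier behind the "Tier 2 vs Tier 3"
// wording. Assumes `device` is a valid ID3D12Device*.
#include <d3d12.h>

D3D12_RESOURCE_BINDING_TIER QueryBindingTier(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                &options, sizeof(options));
    // Tier 3 (GCN) lifts most limits on bound resources per stage;
    // Tier 2 (Maxwell 2) caps bound CBVs, hence the extra root-CBV shuffling.
    return options.ResourceBindingTier;
}
```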

On the wider discussion about what the lack of async compute means for DX12 performance: it'll make a difference, but it's hard to say how much because it depends on how a game is using compute (e.g. you'd get a decent gain if substantial compute work can be done early in the frame during low shader utilisation, such as running it async with shadow buffer rendering).
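To illustrate that pattern, here's a rough sketch (my own, with hypothetical pre-recorded command lists and a fence created elsewhere) of overlapping compute work with the shadow pass on separate D3D12 queues:

```cpp
// Sketch: submit a shadow pass and an async compute job at the same time,
// then make the graphics queue wait on a GPU fence before the main pass
// that consumes the compute results. All objects are assumed to exist.
#include <d3d12.h>

void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* shadowPassCl,
                 ID3D12GraphicsCommandList* asyncComputeCl,
                 ID3D12GraphicsCommandList* mainPassCl,
                 ID3D12Fence* fence, UINT64& fenceValue)
{
    // Both queues get work at once; hardware that can run graphics and
    // compute concurrently overlaps them, otherwise the driver serialises.
    ID3D12CommandList* shadow[] = { shadowPassCl };
    graphicsQueue->ExecuteCommandLists(1, shadow);

    ID3D12CommandList* compute[] = { asyncComputeCl };
    computeQueue->ExecuteCommandLists(1, compute);
    computeQueue->Signal(fence, ++fenceValue);

    // GPU-side wait only - the CPU is not stalled here.
    graphicsQueue->Wait(fence, fenceValue);
    ID3D12CommandList* main[] = { mainPassCl };
    graphicsQueue->ExecuteCommandLists(1, main);
}
```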

I wouldn't judge future performance from this single benchmark - you'd hardly choose a graphics card based on a review that only benchmarked a single DX11 game. Let alone a game that was originally built to run on a single vendor's hardware to demonstrate their own API (Oxide's engine began with GCN using Mantle). Compute aside, there are plenty of other decisions that can benefit one vendor's architecture but ruin performance on another vendor's. There's a presentation for DX11 where the issues for each vendor are highlighted in red and green respectively; they are numerous, and they're linked to the underlying hardware architectures.
 
Maybe i should sell my 980 ti and get a 390x

It seems the AMD GPUs were built with a bit more "future proofing" than Nvidia's, but between today and the day most games are released on DX12/Vulkan, Nvidia will have launched a new GPU series with this hardware feature that will be even better than the 390X. And I think a 980 Ti is powerful enough to hold you over in the meantime.
 

Locuza

Member
oh, intel gpus dont even factor into my thoughts tbh lol. i dont have a clue on their perf/features currently.
Then look at this :)

[Image: table of DX12 feature support for the Intel Skylake Gen 9 iGPU]


The driver is not 100% finished (neither is AMD's, and maybe not even Nvidia's), but FP16 and standard swizzle are supported.
With that, the Skylake Gen 9 iGPU is nearly 100% DX12 feature complete.

For GT4e you can expect over one TeraFLOP.

According to Anandtech, Maxwell 2 should have the ability to execute on the graphics command queue and a compute queue simultaneously. It shouldn't need to preempt like Kepler and Fermi.
I think this is the complicated part, where it's not fully clear how Maxwell 2 executes things.
It may dispatch queues simultaneously, but not statelessly and not out of order.

Looking at Nvidia's VR presentation, they use GPU preemption for the asynchronous time-warps, whereas AMD uses their ACEs without preemption.

Page 23:
https://developer.nvidia.com/sites/default/files/akamai/gameworks/vr/GameWorks_VR_2015_Final_handouts.pdf

[Image: Anandtech chart of asynchronous compute queue/engine counts per GPU architecture]


From Anandtech; looks like Nvidia should have a big advantage under Maxwell 2
The table is misleading.

GCN 1.0 can only dispatch one compute queue per ACE, whereas GCN 1.1 can dispatch up to 8.
Looking at the table you would think there is no difference between GCN 1.0 (7000/200 series) and GCN 1.1 (260), since both times you are looking at "2 Compute".
In reality it is 2 compute queues vs. 16.

The same goes for the other GCN 1.1 and 1.2 parts, which can schedule up to 64 compute queues with their 8 ACEs.
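Put as back-of-the-envelope arithmetic (using the numbers as given in this thread, so treat them as the thread's claims rather than gospel):

```cpp
// Queue counts = compute engines x queues per engine, per this thread.
#include <cstdio>

int main()
{
    struct Gpu { const char* name; int engines; int queuesPerEngine; };
    const Gpu gpus[] = {
        { "GCN 1.0 (HD 7000 / 200 series)", 2, 1 },  // 1 queue per ACE
        { "GCN 1.1 (R7 260)",               2, 8 },  // 8 queues per ACE
        { "GCN 1.1/1.2 (290/Fury etc.)",    8, 8 },  // 8 ACEs x 8 queues
        { "Maxwell 2 (900 series)",         1, 31 }, // 31 compute + 1 gfx
    };
    for (const Gpu& g : gpus)
        std::printf("%-32s %d x %2d = %2d compute queues\n", g.name,
                    g.engines, g.queuesPerEngine,
                    g.engines * g.queuesPerEngine);
    return 0;
}
```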
 

frontieruk

Member
It was in another forum apparently
A GTX 980 Ti can handle both compute and graphic commands in parallel. What they cannot handle is Asynchronous compute. That's to say the ability for independent units (ACEs in GCN and AWSs in Maxwell/2) to function out of order while handling error correction.

It's quite simple if you look at the block diagrams of both architectures. The ACEs reside outside of the Shader Engines. They have access to the Global Data Share cache, the L2 R/W cache pools in front of each quad of CUs, as well as the HBM/GDDR5 memory, in order to fetch commands, send commands, perform error checking or synchronize for dependencies.

The AWSs, in Maxwell/2, reside within their respective SMMs. They may have the ability to issue commands to the CUDA cores residing within their respective SMMs, but communicating or issuing commands outside of their respective SMMs would demand sharing a single L2 cache pool. This caching pool has neither the space (sizing) nor the bandwidth to function in this manner.

Therefore enabling Async Shading results in a noticeable drop in performance, so noticeable that Oxide disabled the feature and worked with NVIDIA to get the most out of Maxwell/2 through shader optimizations.

It's architectural. Maxwell/2 will NEVER have this capability.


It's also in their documentation, trying to find it right now

Interesting...

I came across this article

First off, nVidia is posting their true DirectX 12 performance figures in these tests. Ashes of the Singularity is all about parallelism, and although Maxwell 2 does better in that department than previous nVIDIA architectures, it is still inferior compared to the likes of AMD's GCN 1.1/1.2 architectures. Here's why...


Maxwell's Asynchronous Thread Warp can queue up 31 compute tasks and 1 graphics task. Now compare this with AMD GCN 1.1/1.2, which is composed of 8 Asynchronous Compute Engines, each able to queue 8 compute tasks, for a total of 64, coupled with 1 graphics task by the Graphics Command Processor.

This means that AMD's GCN 1.1/1.2 is best adapted to handling the increase in draw calls now being made by the multi-core CPU under DirectX 12.

Therefore, in game titles which rely heavily on parallelism, likely most DirectX 12 titles, AMD GCN 1.1/1.2 should do very well, provided they do not hit a geometry or rasterizer bottleneck before nVIDIA hits their draw call/parallelism bottleneck.

As for the folks claiming a conspiracy theory: not in the least. The reason AMD's DX11 performance is so poor under Ashes of the Singularity is because AMD literally did zero optimizations for that path. AMD is clearly looking to sell Asynchronous Shading as a feature to developers because their architecture is well suited for the task. It doesn't hurt that it also costs less in terms of research and development of drivers. Asynchronous Shading allows GCN to hit near full efficiency without requiring any driver work whatsoever.


nVIDIA, on the other hand, does much better at serial scheduling of workloads (consider that anything prior to Maxwell 2 is limited to serial scheduling rather than parallel scheduling). DirectX 11 is suited to serial scheduling, therefore nVIDIA naturally has an advantage under DirectX 11.
The developers programmed for thread parallelism in Ashes of the Singularity in order to better draw all those objects on the screen. Therefore what we're seeing with the Nvidia numbers is the Nvidia draw call bottleneck showing up under DX12. Nvidia works around this with its own optimizations in DX11 by prioritizing workloads and replacing shaders. Yes, the nVIDIA driver contains a compiler which re-compiles and replaces shaders that are not fine-tuned for their architecture, on a per-game basis. nVIDIA's driver is also multi-threaded, making use of the idling CPU cores to recompile/replace shaders. The work nVIDIA does in software under DX11 is the work AMD does in hardware under DX12, with their Asynchronous Compute Engines.

More at Source
 

ZOONAMI

Junior Member
It seems the AMD GPUs were built with a bit more "future proofing" than Nvidia's, but between today and the day most games are released on DX12/Vulkan, Nvidia will have launched a new GPU series with this hardware feature that will be even better than the 390X. And I think a 980 Ti is powerful enough to hold you over in the meantime.

Yeah, but I'm always happy to pocket $300.
 

tuxfool

Banned
I think this is the complicated part, where it's not fully clear how Maxwell 2 executes things.
It may dispatch queues simultaneously, but not statelessly and not out of order.

Looking at Nvidia's VR presentation, they use GPU preemption for the asynchronous time-warps, whereas AMD uses their ACEs without preemption.

Page 23:
https://developer.nvidia.com/sites/default/files/akamai/gameworks/vr/GameWorks_VR_2015_Final_handouts.pdf

Wow. That is really interesting. They also mention that they're pre-empting with a graphics context. If I'm reading it correctly it suggests that they're doing the operation in the graphics pipe, instead of the compute pipe(s). Seeing as there is only one graphics command processor, they have no choice but to pre-empt other operations.
 
Wait, so Maxwell is fully DX12 compliant but does not have async compute like AMD cards have? Does this mean that PS4 is almost DX13 levels then due to having this feature as well as hUMA and a supercharged PC architecture which DX12 does not have? If so I can easily see PS4 competing with the next gen Xbox which will assumedly be based on DX13 further delaying the need for Sony to launch a successor. Woah. If this is true I can easily see PS4 lasting a full ten years. Highly interesting development, I can't wait to see what Naughty Dog and co do with this new found power.

Amazing
 

Diablos

Member
Whatever, I'm just glad my 6300 is apparently on par with a higher end dual core Haswell for single-threaded IPC in DX12 games.

By the time this parallel noise really matters my 660 will be so obsolete I'll need a new GPU (and likely CPU) anyway.
 

wachie

Member
Wait, so Maxwell is fully DX12 compliant but does not have async compute like AMD cards have? Does this mean that PS4 is almost DX13 levels then due to having this feature as well as hUMA and a supercharged PC architecture which DX12 does not have? If so I can easily see PS4 competing with the next gen Xbox which will assumedly be based on DX13 further delaying the need for Sony to launch a successor. Woah. If this is true I can easily see PS4 lasting a full ten years. Highly interesting development, I can't wait to see what Naughty Dog and co do with this new found power.
There was no need for shitposting. The thread was going fine and the discussion was actually quite good until your post.
 
Man I want a new GPU so bad, but there's no way I'm jumping in now. The only card that looks appetizing right now is a sub $300 390.
 

Locuza

Member
Wow. That is really interesting. They also mention that they're pre-empting with a graphics context. If I'm reading it correctly it suggests that they're doing the operation in the graphics pipe, instead of the compute pipe(s). Seeing as there is only one graphics command processor, they have no choice but to pre-empt other operations.
I guess that's the thing.
If Maxwell v2 were compute-stateless and could handle the workload asynchronously, Nvidia would never have implemented async timewarp through a graphics context switch.
 

virtualS

Member
Well there you go. AMD have been innovating on the hardware side for years now but have been held back by Microsoft's API. Nvidia have been throwing more of the same old at consumers and tailoring things to suit Microsoft's old API. This has proved a winning strategy, I mean why would people think differently? They see the green team win on benchmarks and conclude what they conclude. Think deeply though and you begin to realise that the innovation AMD have implemented on the console side when finally unlocked through software on the PC side will show huge performance gains for AMD. My dual r9 290s should hold me over quite well for the remainder of this console generation.

Watch Nvidia aggressively demand all DX12 AMD innovations be switched off in Nvidia-sponsored games though. This is bound to happen.
 
Well there you go. AMD have been innovating on the hardware side for years now but have been held back by Microsoft's API. Nvidia have been throwing more of the same old at consumers and tailoring things to suit Microsoft's old API. This has proved a winning strategy, I mean why would people think differently? They see the green team win on benchmarks and conclude what they conclude. Think deeply though and you begin to realise that the innovation AMD have implemented on the console side when finally unlocked through software on the PC side will show huge performance gains for AMD. My dual r9 290s should hold me over quite well for the remainder of this console generation.

Watch Nvidia aggressively demand all DX12 AMD innovations be switched off in Nvidia-sponsored games though. This is bound to happen.

Surely that is why NV GPUs have future hardware features that AMD cards don't even have? Because they were building cards that only work well under the DX11 API, under whose tyranny AMD cards suffered?

I think if you take a second to not read too much into things, you realize you are overemphasizing one data point and extrapolating LOTS out of it.
 
Surely that is why NV GPUs have future hardware features that AMD cards don't even have? Because they were building cards that only work well under the DX11 API, under whose tyranny AMD cards suffered?

I think if you take a second to not read too much into things, you realize you are overemphasizing one data point and extrapolating LOTS out of it.

we dont know if those 12_1 features are even fast enough to be usable tho.
 

ZOONAMI

Junior Member
Man I want a new GPU so bad, but there's no way I'm jumping in now. The only card that looks appetizing right now is a sub $300 390.

I'm kind of thinking about selling both my 980 Ti and 970 and popping in a couple of Gigabyte 390s from Newegg for $299 each.

Seriously, Nvidia - DX12 games are starting to come out and the just-released 980 Ti is looking like a 2-year-old 290X.
 

KKRT00

Member
Seems like a non-issue for now; in the future it could give a slight boost to AMD cards relative to Nvidia, but like 10% at most.

I'm more bitter that we won't have full OIT till next gen in multiplatform games, thanks to AMD's GCN in the current-gen consoles :(
 

Locuza

Member
we dont know if those 12_1 features are even fast enough to be usable tho.
All of these features should speed things up; the question is simply how complex something can be before the hardware can't keep up.

ROVs on Intel are fast.
I'm curious how fast they are on Nvidia, but the chances are high that even with a relatively bad hardware implementation, ROVs are faster than the alternatives we use today.
Conservative Rasterization is also usable on Maxwell and Intel's Skylake.
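If you want to check those two features on your own card, a quick sketch (mine; assumes an existing ID3D12Device) using the D3D12 caps query:

```cpp
// Sketch: report ROV and conservative rasterization support via D3D12.
// Assumes `device` is a valid ID3D12Device*.
#include <cstdio>
#include <d3d12.h>

void PrintFL12_1Features(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                &options, sizeof(options));

    std::printf("ROVs supported: %s\n",
                options.ROVsSupported ? "yes" : "no");
    // Conservative rasterization is tiered (NOT_SUPPORTED, TIER_1..TIER_3);
    // how fast either feature is in practice is a separate question.
    std::printf("Conservative rasterization tier: %d\n",
                static_cast<int>(options.ConservativeRasterizationTier));
}
```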

I'm more bitter that we won't have full OIT till next gen in multiplatform games, thanks to AMD's GCN in the current-gen consoles :(
If you wanted ROVs in consoles, the consoles would have had to launch 2 years later.
Actually, we can be quite happy to have GCN Gen 2 in there, instead of Kepler or any other architecture of the time.
 
All of these features should speed things up; the question is simply how complex something can be before the hardware can't keep up.

ROVs on Intel are fast.
I'm curious how fast they are on Nvidia, but the chances are high that even with a relatively bad hardware implementation, ROVs are faster than the alternatives we use today.
Conservative Rasterization is also usable on Maxwell and Intel's Skylake.


If you wanted ROVs in consoles, the consoles would have had to launch 2 years later.
Actually, we can be quite happy to have GCN Gen 2 in there, instead of Kepler or any other architecture of the time.

there have been lots of hardware features that were supposed to speed things up, but due to poor implementations were graphics decelerators.

dynamic branching
geometry shaders
deferred contexts
plenty more

the twitter post i linked earlier is by a reputable guy according to dictator. pretty sure it was ROV he was saying was slow on nvidia
 
Reaching here, but... AMD has apparently got priority on HBM2 manufacturing next year. We know that they are also more experienced with the new memory type, as they had a year's head start on Nvidia. Add to that, they've apparently got some kind of architectural advantage in DX12 games.

What's that I see in the distance, AMD rising from the ashes next-gen? :p
 