
Oxide: Nvidia GPUs do not support DX12 Asynchronous Compute/Shaders.

KKRT00

Member
Again, this isn't a performance test. It merely tests for the presence of fine-grained compute.
Sure, compute performance is irrelevant, the behavior of GCN 1.1 vs 1.2 is irrelevant, drivers are irrelevant, and setting the test up to use Maxwell properly is also irrelevant. Jesus.
It's really not a great benchmark; it's more like a test of standard DX12 drivers with a common shader.

You still don't seem to get it. Yes, there are plenty of tricks you can employ to achieve similar performance gains, but you forget/ignore those same tricks are still available to the fine-grained systems.
Sure I don't. The other parts of my post of course don't matter; you only had to mention the one about GPU time in the context of algorithm optimization ...

I've checked your post history; it is all focused on Sony-oriented threads or "console wars comparison" threads, so I'm out of this discussion with you. It's a waste of time when you look at tech-related things only through one platform.
 
Nvm. Still trying to figure this out.
That's kinda what the thread is about. The new DX12 features intended to improve performance instead degraded it. When asked about it, NV told the dev to test if it was an NV card and, if so, treat it like DX11.

What the dev and I both find kinda odd is they're not checking for a specific NV family. In effect, they're saying if it's NV, just don't bother.


Sure, compute performance is irrelevant, the behavior of GCN 1.1 vs 1.2 is irrelevant, drivers are irrelevant, and setting the test up to use Maxwell properly is also irrelevant. Jesus.
I never said that.

It's really not a great benchmark; it's more like a test of standard DX12 drivers with a common shader.
Again, it's testing for the presence of fine-grained compute.

Sure I don't. The other parts of my post of course don't matter; you only had to mention the one about GPU time in the context of algorithm optimization ...
The other parts of your post were irrelevant or simply wrong.

I've checked your post history; it is all focused on Sony-oriented threads or "console wars comparison" threads, so I'm out of this discussion with you. It's a waste of time when you look at tech-related things only through one platform.
Signing off with an ad hom? I accept your surrender.
 
I didn't think he was that funny. I thought the poor guy was as confused as you are. I was gonna type a long post to try to explain this stuff to him.

This, pretty much, summarizes all of your contributions and should disqualify you in the eyes of the few people that could still take you seriously.

You are not worth the time.
 

DonasaurusRex

Online Ho Champ
Hmm, seems like AMD chose wisely in its implementation of DX12 support; happens every gen. Hopefully the presence of ACEs helps older-gen GPUs squeeze some new perf out of DX12. Either way, moving forward, 2016 is going to be exciting; wonder what Khronos will bring to the table.
 
This, pretty much, summarizes all of your contributions and should disqualify you in the eyes of the few people that could still take you seriously.

You are not worth the time.
I should be ignored because I try to be helpful and try to give people the benefit of the doubt?

You make very strange arguments. =/
 

Alej

Banned
This, pretty much, summarizes all of your contributions and should disqualify you in the eyes of the few people that could still take you seriously.

You are not worth the time.

Personal attacks then? Personal attacks.
*popcorn*

You weren't taken seriously from the get-go, BTW.
 

Locuza

Member
Sebastian Aaltonen from RedLynx about the little async compute test:
If the compiler is good, it could either skip the vector units completely by emitting pure scalar unit code (saving power) or emit both scalar + vector in an interleaved "dual issue" way (the CU can issue both at the same cycle, doubling the throughput).

Benchmarking thread groups that are under 256 threads on GCN is not going to lead to any meaningful results, as you would (almost) never use smaller thread groups in real (optimized) applications. I would suspect a performance bug if a kernel thread count doesn't belong to {256, 384, 512}. Single-lane thread groups result in less than 1% of meaningful work on GCN. Why would you run code like this on a GPU (instead of using the CPU)? Not a good test case at all. No GPU is optimized for this case.

Also I question the need to run test cases with tens of (or hundreds of) compute queues. Biggest gains can be had with one or two additional queues (running work that hits different bottlenecks each). More queues will just cause problems (cache thrashing, etc).
https://forum.beyond3d.com/posts/1869700/
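
For anyone who wants to see what "one or two additional queues" means in code, here's a minimal D3D12 sketch along those lines: one ordinary direct (graphics) queue plus a single extra compute-only queue, rather than dozens of queues. This isn't from the benchmark or anyone's engine; it just assumes you already have a valid ID3D12Device and shows the queue-creation calls.

```cpp
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Create the usual direct queue plus one extra compute-only queue.
// Error handling omitted for brevity; `device` is assumed valid.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    // The direct queue accepts graphics, compute, and copy work.
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&graphicsQueue));

    // A single additional compute-only queue; work submitted here *may*
    // overlap with graphics work if the hardware/driver can do so.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
```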
 

Belmire

Member
Excuse my ignorance, I am not as familiar with this subject material as many of the posters here.

One question: let's assume that Maxwell 2 can't do async the same way AMD can, or not at all, or even less of it. AMD currently does not support CR or ROVs. However, they were quoted as saying that they can achieve the same results using other methods and that they have already demonstrated this in Dirt Rally.

Does this mean that Nvidia can take the same approach to solving its assumed async deficiency?
 
Excuse my ignorance, I am not as familiar with this subject material as many of the posters here.

One question: let's assume that Maxwell 2 can't do async the same way AMD can, or not at all, or even less of it. AMD currently does not support CR or ROVs. However, they were quoted as saying that they can achieve the same results using other methods and that they have already demonstrated this in Dirt Rally.

Does this mean that Nvidia can take the same approach to solving its assumed async deficiency?

Nope. Async's purpose is to keep as much of the GPU saturated with work as possible.
 

Locuza

Member
One question: let's assume that Maxwell 2 can't do async the same way AMD can, or not at all, or even less of it. AMD currently does not support CR or ROVs. However, they were quoted as saying that they can achieve the same results using other methods and that they have already demonstrated this in Dirt Rally.

Does this mean that Nvidia can take the same approach to solving its assumed async deficiency?
You can achieve similar things without the need for CR or ROVs, but CR & ROVs exist because they directly target common problems and inefficiencies.
So alternatives will have drawbacks.
If Nvidia can't process async compute efficiently, then developers have to look at how to handle this problem.

Either finding a use case which isn't hurting performance, using different processing paths per vendor, or not using it at all.
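
The "different processing paths per vendor" option is basically a PCI vendor-ID check at startup. A rough, hypothetical sketch of what that could look like (the enum and function names here are made up for illustration; the vendor IDs are the standard PCI ones):

```cpp
#include <dxgi.h>

enum class ComputePath { AsyncQueues, SingleQueue };

// Pick a compute submission path based on the adapter's PCI vendor ID.
ComputePath PickComputePath(IDXGIAdapter1* adapter)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);

    constexpr UINT kVendorAMD    = 0x1002;
    constexpr UINT kVendorNvidia = 0x10DE;

    if (desc.VendorId == kVendorAMD)
        return ComputePath::AsyncQueues;  // submit compute on extra queues
    if (desc.VendorId == kVendorNvidia)
        return ComputePath::SingleQueue;  // keep everything on the direct queue

    return ComputePath::SingleQueue;      // safe default for unknown vendors
}
```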
 

Renekton

Member
Given Nvidia's overwhelming marketshare, it's safe to say most devs won't support it regardless of use case benefits. Epic might drag its feet as well.
 

Locuza

Member
On the other side, consoles dictate more or less how games are designed and which features they use.
Since many games will use it, the question is how developers will solve this on PC.
Globally deactivate it for every vendor? Make specific paths for each vendor?
Not care about the other vendors?

Maybe we will see all three options later on.
 

Belmire

Member
You can achieve similar things without the need for CR or ROVs, but CR & ROVs exist because they directly target common problems and inefficiencies.
So alternatives will have drawbacks.
If Nvidia can't process async compute efficiently, then developers have to look at how to handle this problem.

Either finding a use case which isn't hurting performance, using different processing paths per vendor, or not using it at all.

Thanks.
 

Vinland

Banned

Look at it from the perspective of cellphone apps. If Oracle Java has a method in its API that does an array sort using modern Intel/AMD/SPARC/PowerPC hardware-supported extensions in their math processors, and the latest ARM processor does not have any equivalent feature, then Google with Android, and Oracle themselves with embedded Java, have to find a way to get it done. They can try all sorts of ways, and if they come close, no matter what the methodology, no one cares how it was done. If it is a common method call and it is super slow on ARM, then some devs may decide to do a platform check and call another implementation altogether that forgoes the feature in favor of the faster path. In many cases no one will even notice, as they don't have a point of reference to contrast against. If you put the desktop version of the app side by side, you may notice some differences.

A lot of times the compiler makes these decisions for you, and sometimes you need to actively defend against it. That is why profilers and debuggers are really handy in the development studio of whatever SDK you are running.
 
Given Nvidia's overwhelming marketshare, it's safe to say most devs won't support it regardless of use case benefits. Epic might drag its feet as well.

Epic will drag its feet on enabling a feature that current consoles can already take advantage of, while trying to sell the engine to multi-platform developers.

The world has run out of rationality.
 
DX12-High.png


This is the only bench I can find of the Fury X vs the 980Ti. So why is everyone saying the 980Ti got outperformed?
 

frontieruk

Member
DX12-High.png


This is the only bench I can find of the Fury X vs the 980Ti. So why is everyone saying the 980Ti got outperformed?

The Fury X performs the same as the 290X due to having the same architecture for async compute; the 7950 also puts up a good show.

The 290X released almost 2 years ago, which is why the fuss: it's trading blows with a card that destroys it in DX11.
 
The Fury X performs the same as the 290X due to having the same architecture for async compute; the 7950 also puts up a good show.

The 290X released almost 2 years ago, which is why the fuss: it's trading blows with a card that destroys it in DX11.
But it's not like people with 980Tis would swap for a 290x and have much worse DX11 performance over the next couple of years just so they could have very efficient DX12 in some future games, maybe. Or that the Fury X is now a better buy than the 980TI.

These new findings basically just mean, great DX11 performance is expensive!
 

Naminator

Banned
The Fury X performs the same as the 290X due to having the same architecture for async compute; the 7950 also puts up a good show.

The 290X released almost 2 years ago, which is why the fuss: it's trading blows with a card that destroys it in DX11.

Fury X is Fury X.

Unless AMD has something better out right now I don't see the point of trying to call it a 290X and pretend like it came out 2 years ago.
 

frontieruk

Member
Fury X is Fury X.

Unless AMD has something better out right now I don't see the point of trying to call it a 290X and pretend like it came out 2 years ago.

Except no one in here has actually been talking about the Fury X; it's all come about because the 290X outperforms the 980 Ti under DX12.

I think you're being a bit disingenuous by saying I pretended a Fury X was a 290X. I pointed out that, as the underlying architecture is the same, it's the older card that has caused the ruckus, not the new card. If you'd actually read the thread you'd see I've been one of the more level-headed commenters here, not actually taking a side, but go ahead and pull that fanboy shit out your ass. I've got nothing to hide; as I said, I've even advised keeping recently bought NV cards as it doesn't mean shit yet.

But it's not like people with 980Tis would swap for a 290x and have much worse DX11 performance over the next couple of years just so they could have very efficient DX12 in some future games, maybe. Or that the Fury X is now a better buy than the 980TI.

These new findings basically just mean, great DX11 performance is expensive!

Which has been pretty much the opinion of the level-headed commenters here; you'll even see me advise keeping recently purchased NV cards, as at the moment, apart from giving some geeks something to speculate about, it doesn't mean shit. Most games for the next year are going to support DX11.
 
Except no one in here has actually been talking about the Fury X; it's all come about because the 290X outperforms the 980 Ti under DX12.

And the Fury X?

faster of course

Except some people actually did.

And then they keep trying to force AC as a key part of DX12 when it isn't even on the feature set of the API.

This is like saying Tessellation was a key feature of DX10 because many cards of the era were able to support it, but it wasn't a requisite of the DX feature set until DX11.

Who knows, maybe Horse Armour is right and it becomes a requisite for DX13.
 

W!CK!D

Banned
So why is everyone saying the 980Ti got outperformed?

Fury X is different: devs have been working with GDDR5 for years and the code is optimized accordingly. HBM is a completely new approach to memory that'll most likely need different memory access patterns to unlock its full potential.
 

frontieruk

Member
Except some people actually did.

And then they keep trying to force AC as a key part of DX12 when it isn't even on the feature set of the API.

This is like saying Tessellation was a key feature of DX10 because many cards of the era were able to support it, but it wasn't a requisite of the DX feature set until DX11.

Who knows, maybe Horse Armour is right and it becomes a requisite for DX13.

Did I say it, though? He said I pretended a Fury X was a 290X; I pointed out that it's the two-year-old 290X causing the raised eyebrows.


As a side note, the graph is also the one that doesn't show the results where all the fuss started.

YunnYNT.jpg


When 4xMSAA was enabled the Fury pulled ahead, which led to the whole NV-saying-the-code-was-broken yadda yadda ya.

Leading us to this thread.
 

W!CK!D

Banned
Interesting - so maybe the XBox's 2 ACE units are fine after all, if I've understood correctly.

It's too early to judge the value of Sony's additional resources. They didn't pack 8 ACEs and 64 queues for no reason.

You have to consider that things like allocated resources and priorities for async shaders, as well as communication between ACEs, happen at the driver level. It's impossible to predict what console devs will squeeze out of 8 ACEs manually.
 

Arkanius

Member
Beyond3D has gotten more results, and it seems async compute in Maxwell 2 is "supported" through the driver offloading the compute work to the CPU and back, hence the huge delay added, and why it was faster for Oxide to disable it altogether for Nvidia.

https://forum.beyond3d.com/threads/dx12-async-compute-latency-thread.57188/page-21#post-1869774

Anyhow, sebbi says this:

This is not a performance (maximum throughput) benchmark. However it seems that less technically inclined people believe it is, because this thread is called "DX12 performance thread". This thread doesn't in any way imply that "asynchronous compute is broken in Maxwell 2", or that "Fiji (Fury X) is super slow compared to NVIDIA in DX12 compute". This benchmark is not directly relevant for DirectX 12 games. As some wise guy said in SIGGRAPH: graphics rendering is the killer-app for compute shaders. DX12 async compute will be mainly used by graphics rendering, and for this use case the CPU->GPU->CPU latency has zero relevance. All that matters is the total throughput with realistic shaders. Like hyperthreading, async compute throughput gains are highly dependent on the shaders you use. Test shaders that are not ALU / TMU / BW bound are not a good way to measure the performance (yes I know, this is not even supposed to be a performance benchmark, but it seems that some people think it is).

This benchmark has relevance for mixed tightly interleaved CPU<->GPU workloads. However it is important to realize that the current benchmark does not just measure async compute, it measures the whole GPU pipeline latency. The GPUs are good at hiding this latency internally, but are not designed to hide it to external observers (such as the CPU).
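
To make the throughput-vs-latency distinction concrete: a tool like this is effectively timing, on the CPU, how long it passes from submitting a command list until a fence signals completion. A minimal sketch of that kind of round-trip measurement in D3D12 terms (assuming the queue, command list, and fence already exist; this is not the benchmark's actual code) looks roughly like this:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <chrono>

// Time a full CPU -> GPU -> CPU round trip: submit work, signal a fence,
// and block the CPU until the fence event fires. The result includes the
// whole submission/pipeline latency, not just shader execution time.
double MeasureRoundTripMs(ID3D12CommandQueue* queue,
                          ID3D12CommandList* cmdList,
                          ID3D12Fence* fence,
                          UINT64 fenceValue)
{
    HANDLE event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    auto t0 = std::chrono::high_resolution_clock::now();

    ID3D12CommandList* lists[] = { cmdList };
    queue->ExecuteCommandLists(1, lists);   // GPU (eventually) runs the work
    queue->Signal(fence, fenceValue);       // fence completes after the work
    fence->SetEventOnCompletion(fenceValue, event);
    WaitForSingleObject(event, INFINITE);   // CPU blocks until the GPU is done

    auto t1 = std::chrono::high_resolution_clock::now();
    CloseHandle(event);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```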
 
While still early, it's looking like AMD played the long game beautifully on this one. It's exciting that the GPU space is interesting for the first time in years, and I hope this brings AMD back in a big way and lights a fire under Nvidia's ass.

Everybody wins. I know it's a foreign concept for a lot of the internet, but it's possible.
 

Alexlf

Member
Ouch! Nvidia better release some driver updates to allow async compute in-card or this is going to look reeeally bad. Not that it doesn't already.
 

dr_rus

Member
So this graph (from the same B3D thread) kinda shows that Maxwell 2 does support async compute, but its implementation is far from ideal, as there are a lot of cases where running things serially may actually be faster:

vevF50L.png


Note the section with 2-31(ish) threads though - async compute is always faster than serial there. Couple this with what we know of the best example of async compute currently (and with some general knowledge of how this stuff works) and I would say that Maxwell 2 will handle async more or less fine in the first generation of DX12 titles (and it's not clear that we'll get the second one while this console gen is going).

There's also this graph:

3qX42h4.png


Which kinda illustrates what I've said earlier about async making latencies less predictable and possibly leading to hitches in the graphics thread. Note as well that this is even worse on GCN 1.2 Fiji.

This post is good at explaining some stuff as well:

Again, this "async compute" is not an API feature - it's not an optional capability that can be exposed to the API programmer. This is a WDDM driver/DXGK feature which can improve performance in GPU-bound scenarios. Developers would just use compute shaders for lighting and global illumination, and in AMD implementation there are 2 to 8 ACE (asynchronous compute engine) blocks which are dedicated command processors that completely bypass the rasterization/setup engine for compute-only tasks. In theory this means additional compute performance without stalling the main graphics pipeline.

Parallel execution is actually a built-in feature in the Direct3D 12 - it's called "synchronization and multi-engine". There are three sets of functions for copy, compute and rendering, and these tasks can be parallelized by runtime and driver when you have the right hardware. You just need to submit your compute shaders to the Direct3D runtime using the usual API calls, and on high-end AMD hardware with additional ACE blocks, you may use larger and more complex shaders and/or create additional command queues using multiple CPU threads. This will saturate the compute pipeline and you would still get fair performance gains comparing to the traditional rendering path.

So when Oxide said they had to query hardware IDs for Nvidia cards then disable some features in the rendering path, it makes sense. When they talk about console developers getting 30% gains by using "async compute" - i.e. using compute shaders to accelerate lighting calculations in parallel to the main rendering stack - it makes sense as well.

But when Oxide says that the 900-series (Maxwell-2) don't have the required hardware but the Nvidia driver still exposes "async compute" capability, I don't think they can really tell this for sure, because this feature would be exposed through DXGK (DirectX Graphics Kernel) driver capability bits, and these are driver-level interfaces which are only visible to the DXGI and the Direct3D runtime, but not the API programmer (and the MSDN hardware developer documentation for WDDM 2.0 and DXGI 1.4 does not exist yet).

They are probably wrong on hardware support too, since Nvidia asserted to AnandTech that the 900-series have 32 scheduling blocks, of which 31 can be used for compute tasks.

So if Nvidia really asked Oxide to disable the parallel rendering path in their in-game benchmark, that has to be some driver problem rather than missing hardware support. The Nvidia driver probably doesn't expose the "async" capabilities yet, so the Direct3D runtime cannot parallelize the compute tasks, or the driver is not fully optimized yet... not really sure, but it would take me quite enormous effort to investigate even if I had full access to the source code.
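
For reference, the "synchronization and multi-engine" pattern the quoted post describes boils down to submitting compute work on a compute queue and fencing the graphics queue against it. A hedged sketch (all objects assumed created elsewhere; this is not Oxide's or anyone else's actual code):

```cpp
#include <d3d12.h>

// Submit compute work (e.g. lighting/GI shaders) on a compute queue, then
// make the graphics queue wait on the GPU for that work before rendering.
void SubmitFrame(ID3D12CommandQueue* computeQueue,
                 ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandList* computeCmds,
                 ID3D12CommandList* graphicsCmds,
                 ID3D12Fence* fence,
                 UINT64 frameFence)
{
    ID3D12CommandList* compute[] = { computeCmds };
    computeQueue->ExecuteCommandLists(1, compute);
    computeQueue->Signal(fence, frameFence);

    // The wait happens on the GPU, so the CPU never stalls; the runtime and
    // driver may overlap the two queues on hardware that supports it.
    graphicsQueue->Wait(fence, frameFence);
    ID3D12CommandList* graphics[] = { graphicsCmds };
    graphicsQueue->ExecuteCommandLists(1, graphics);
}
```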

I'm saying that referring to a non-granular system as merely having a "granularity difference" is generous and misleading. You're implying that NV's approach is somewhat granular, but it really isn't.
You can say whatever you want, but that won't make the granularity difference into something else. Having an on/off granularity (i.e. serial execution only) is still a granularity choice which can be compared as a coarser granularity, and Maxwell 2 to my knowledge has a finer granularity than that (i.e. it does support running compute threads in parallel to the graphics thread).

I refer to it as "broken" because NV refer to it as "fully compliant." Yes, it doesn't crash in response to the command, but the operations intended to improve performance instead degrade it. So I assume it's actually intended to deliver the claimed functionality, and generously refer to it as broken, yes. But you may be right too; maybe it was never intended to work correctly, and they were just misleading us when they said it would.
There are no implementation requirements for async compute in either WDDM 2.0 or DX12. You can support it in a serialized fashion, as a coarse-grained async pipeline, or as a finer-grained one. Note that GCN is the only h/w on the market right now which actually does support it in a fine-grained fashion.

Then I imagine you won't have any trouble providing us with some links.
Or, sure, if I stumble upon one next time I'll post it here, no problem.

Completely untrue. There are always unused resources, because not every processor is needed in every phase of the rendering pipeline. Try to keep up.
The amount of idle resources in a GPU is totally dependent on the workload this GPU is doing at the moment. Saying that there are always unused resources in a GPU is a plain lie.
What's more important to the question at hand is that the amount of idle resources in a GPU is completely dependent on the said GPU's architecture. NV GPUs are known for their ability to achieve higher performance with smaller FLOPs / SPs / die sizes than their GCN counterparts. They are able to do this because their architecture is made specifically to minimize the amount of idle blocks per time slice, and to achieve that they try to extract more ILP per clock than their GCN counterparts.
This could mean that the reason NV didn't go for the same level of TLP in Maxwell as AMD has in GCN is that they simply don't have as many idling resources in their GPUs, and going for more efficient TLP would be a waste of effort, as they wouldn't be able to run compute threads in parallel to the graphics one simply because utilization of the available resources is already peaked.
Are you keeping up?

So you claim they admitted to not getting a lot of performance out of the feature, despite his actual statement being that he got a noticeable improvement with only a modest amount of effort. When I call you out on completely misrepresenting what he said, your defense is, "No, he's the liar!!" ><
My defense is well stated above, but you seem to be unable to comprehend it, so I won't bother repeating myself.

It's a useful technique on any architecture that implements it correctly.
There are no "correct" implementation of TLP. Even the need for TLP is completely task dependent. It may well be that a "correct" implementation would be to not implement it at all.
How are you doing on keeping up with me?

This benchmark isn't designed to test actual performance; the GCN cards are dispatching jobs half-filled. This benchmark merely tests for the presence of fine-grained compute. The AMD cards pass that test, while the NV cards fail. We can't compare fine-grained performance because the current NV cards aren't capable of doing it at all.

Did that clear things up for you?
Things were rather clear for me from the start - we're discussing alpha software running on alpha drivers in a game made on AMD money to promote Mantle. And MDolenc's synthetic benchmark is actually showing a lot more stuff than you pretend it's showing. Performance on this particular task is a result as much as anything else; don't try to diminish it.

Beyond3D has gotten more results, and it seems async compute in Maxwell 2 is "supported" through the driver offloading the compute work to the CPU and back, hence the huge delay added, and why it was faster for Oxide to disable it altogether for Nvidia.

https://forum.beyond3d.com/threads/dx12-async-compute-latency-thread.57188/page-21#post-1869774

Anyhow, sebbi says this:
Running any GPU workload on the CPU for "emulation" (emulation of what, and why would they even emulate this?) is a completely stupid idea all the time. I don't believe in this for a second. The CPU load is likely related to WDDM hitting timeouts on Maxwell more than anything else.
 

AP90

Member
OK... I attempted to read and understand the numerous bits of tech lingo iterated above on the PC/desktop end. I think I have a slight understanding now.

So what does this mean for current AMD laptop GPUs (the R9 M200 series) over the next 2 years? Obviously the R9 M300 series, when released, will be a leap.

Secondly, does the ACE feature in the CPU/GPU setup for consoles (Sony, MS, and Nintendo) potentially provide a boost in performance, aka giving this gen a long stride like last gen?
 