
Oxide: Nvidia GPUs do not support DX12 Asynchronous Compute/Shaders.

dr_rus

Member
My take on it ATM: keep your 980 Tis. Async is probably going to be used for improved lighting and AA, which aren't going to overload draw calls, so you'll see Nvidia ahead in most games, as AMD will suffer from the latency of getting workloads into the queues where NV is quicker at low levels of async compute. If only we could get benches from the Fable beta :( to confirm this.
Compute is mostly used for post processing and animation simulation (hair, physics, etc.). It doesn't matter what it's used for, though, as much as how many compute jobs are launching in parallel to the graphics one. If MDolenc's benchmark gives an accurate comparative picture, then Maxwell 2 is fine at handling 31-63 compute jobs and starts to fall behind GCN when this number goes higher. Still, all this is empty talk outside of real world examples, so I hear you when you say that we should wait for Fable benchmarks. Tomorrow's Children is using only 3 compute queues in addition to the graphics one, for example, so Maxwell 2 may be just fine running D3D12 console ports as opposed to PC-exclusive stress tests like AoS.

I'm pointing out that those wanting to switch from a 980ti to a 290 aren't even looking at the best GCN GPUs.
GCN 1.2 has minimal changes compared to 1.1 though - framebuffer color compression enhancements and FP16 precision support with minimal performance gains - so from a user perspective a 290 isn't that far from being the "best GCN" right now.

That's closer to my understanding of how this stuff works. Basically, you've got turnstile-style access to a fixed pool of resources: the various math units on the GPU, each with its own specialty. So think of it like loading a roller coaster. Every cycle, the system hangs the next rendering job on the GPU, occupying some or all of those specialized units. These jobs are the people who paid for VIP passes. Then the system looks at the math units that haven't been assigned jobs, compares that to the 64 jobs waiting at their respective turnstiles - all managed by eight line attendants - and lets in whatever punters best fill the remaining seats before dispatching the train.

It sounds like NV does something similar, but instead of filling empty seats every cycle with jobs from the 31 compute queues, they actually alternate job types, pulling a job from the render queue on even cycles and a job (or more?) from the compute queues on the odd cycles. Then they're saying, "Well, at the end of the day, everybody gets to ride." While it's true they're seamlessly pulling jobs from both queue types, because they can't pull from multiple queue types simultaneously, they're not actually doing much to increase utilization. Any math unit not used in a given render operation remains idle; it just gets used on the following cycle. I'm assuming they'd at least be able to pull from all 31 queues on the compute cycle to attempt to fully saturate the math units, but they'd still have a lot of idle units on the render cycle.
NV's async granularity is lower than that of GCN but that's as much as we know right now.
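
To make the roller-coaster analogy above a bit more concrete, here's a toy simulation - a minimal sketch with invented unit counts and job sizes, not a description of how either architecture actually schedules work - comparing "fill the empty seats every cycle" against "alternate graphics cycles and compute cycles":

```python
# Toy model of the analogy above. Everything here (unit counts, job sizes) is
# invented purely for illustration; real GPUs do not schedule work this way.
import random

UNITS = 64        # pretend pool of math units available each cycle
CYCLES = 10_000

random.seed(1)

def graphics_work():
    return random.randint(20, UNITS)    # units one render job occupies

def compute_work():
    return random.randint(1, 8)         # units one small compute job occupies

def fill_compute(busy):
    """Keep adding compute jobs until no more fit in the remaining units."""
    while True:
        job = compute_work()
        if busy + job > UNITS:
            return busy
        busy += job

def utilization_fill_idle():
    """'GCN-style' in the analogy: top up leftover seats with compute riders."""
    used = sum(fill_compute(graphics_work()) for _ in range(CYCLES))
    return used / (UNITS * CYCLES)

def utilization_alternate():
    """'Maxwell-style' in the analogy: graphics on even cycles, compute on odd."""
    used = 0
    for cycle in range(CYCLES):
        used += graphics_work() if cycle % 2 == 0 else fill_compute(0)
    return used / (UNITS * CYCLES)

print(f"fill idle units every cycle: {utilization_fill_idle():.0%}")
print(f"alternate graphics/compute:  {utilization_alternate():.0%}")
```

In this toy, the alternating scheduler still does work every cycle, but whatever a render job doesn't use on a graphics-only cycle sits idle, which is essentially the point being made in the post above.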

Is this a trick question because adding async to the mix "just" increases your peak utilization? It will have empty spaces in its rendering pipeline that need filling, just like any other GPU, if that's what you're asking.
No, it's not a trick question, as it's pretty obvious that async shaders can actually lead to _worse_ utilization than serial execution when done the wrong way - this is especially true for architectures which aren't built for fast context switching and are built for maximum throughput inside one context - which coincidentally is what Maxwell 2 is. If you want an example of how this may happen, look no further than HT lowering your CPU performance in some benchmarks on PC - the same thing may easily happen with async compute on GPUs.
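
As a back-of-the-envelope sketch of that point - all numbers made up, nothing here measured from real hardware - consider a frame where compute either hides behind graphics well, or hides poorly while paying for every context switch:

```python
# Made-up numbers; this only illustrates the arithmetic of the argument above,
# not the behaviour of any actual GPU or driver.

def serial_ms(graphics_ms, compute_ms):
    return graphics_ms + compute_ms

def async_ms(graphics_ms, compute_ms, hidden_fraction, switch_cost_ms, switches):
    # Only part of the compute work actually hides behind graphics; the rest
    # still runs afterwards, and every context switch adds a fixed penalty.
    hidden = compute_ms * hidden_fraction
    return graphics_ms + (compute_ms - hidden) + switch_cost_ms * switches

g, c = 16.0, 8.0   # hypothetical per-frame graphics and compute costs
print(serial_ms(g, c))                                                          # 24.0 ms
print(async_ms(g, c, hidden_fraction=0.9, switch_cost_ms=0.05, switches=20))    # 17.8 ms - a win
print(async_ms(g, c, hidden_fraction=0.2, switch_cost_ms=0.5,  switches=20))    # 32.4 ms - worse than serial
```

Whether the overlap or the switching cost dominates is exactly the architecture- and workload-dependent part.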

Async shaders are hardly a magic pill which will make everything faster everywhere; saying they will is just stupid. There's a lot of talk in Q's presentation on how they've tweaked the wavefronts specifically for PS4's GPU to get the maximum out of async. There's that tidbit from UE4's Fable async compute code submission which says that it should be used with caution as the results may actually be worse than without it. There's also the OP's statement on them not getting a lot of performance out of the feature at all. So it's not clear cut whether a game should even use it on PC, as it's highly dependent on the workloads in question.

How likely is it that Pascal will have Asynchronous Compute?
Even Kepler (GK110) has async compute; it just can't run it alongside a graphics job. And to my knowledge Maxwell 2 supports async compute alongside graphics just fine, though obviously the architectural choices are different from GCN or any other architecture out there, and the benefits from running stuff asynchronously may be way less than on GCN.

In general, and aside from moving from 28nm to 16nm FF+, having HBM2 and double rate FP16, shouldn't Pascal have more architectural changes over Maxwell than Maxwell had over Kepler?
Unlikely, as Pascal appeared on the roadmap between Maxwell and Volta with the 3D memory feature moved into it. I'm thinking that Pascal is basically a tweaked Maxwell with an HBM bus, and the next big architecture update from NV will be Volta. But who knows? This stuff changes every month.

The performance of a video game is defined by more than just frames per second or frametimes. Asynchronous compute allows for higher throughput at lower latencies which easily makes it one of the most important features for VR gaming. Remember the beginning of this gen when Mark Cerny explained again and again the importance of async compute for the future of video games? That was before Morpheus was announced. Two years later it all makes sense.
Asynchronous compute makes latencies somewhat unpredictable, so in the end it may be a bad idea for code which is latency critical. You seem to be mixing the specific VR timewarp case with async compute in general. (I'm also pretty sure that this gen of VR won't be nearly as big as some of you think, but that's just me.)
 
We, IMO, still really have a hard time understanding what is happening. Has that benchmark gotten better yet?

This explains it well:

nVidia does async compute differently (and less efficiently) than AMD, but it still does it, using context switching. Guess what? AMD uses context switching too, but they have 8 engines with 8 queues each and nVidia has 1 engine with 32 queues - AMD can context switch faster, nVidia can fill queues faster. In a large draw-call situation, nVidia has even pointed out in their whitepapers that their context switching will take a hit (guess what Ashes likely has in it? Huge draw calls).

A game dev posts "As far as I know, Maxwell doesn’t really have Async Compute" (where in the previous sentence he said it was functional, but didn't work well with their code) and now Maxwell is crippled, doesn't support DX12, and doesn't do async compute? Don't get me wrong - GCN's architecture is much better suited to async compute, it was built for it. nVidia, however, still supports it - async timewarp, which relies on async compute, is a huge part of GameworksVR and what allows them to get frame times down to sub 2ms. https://www.reddit.com/r/oculus/comments/3gwnsm/nvidia_gameworks_vr_sli_test/

I'm still just boggled by how quickly this misinformation was picked up and ran with, I bet the PR folks at nVidia are going bonkers.

Horrible oversimplification:

AMD GCN: 8 engines × 8 queues each = up to 64 queues
Nvidia Maxwell: 1 engine × 32 queues = up to 32 queues
 

dogen

Member
This explains it well, from the horse's mouth, so to speak (Nvidia guy):



Horrible oversimplification:

AMD GCN: 8 engines × 8 queues each = up to 64 queues
Nvidia Maxwell: 1 engine × 32 queues = up to 32 queues

I'm 99% sure that guy doesn't work for nVidia.
 

dr_rus

Member
Won't this just put developers in the position of choosing extra performance on AMD at the expense of Nvidia chips?

It's not clear what performance we're talking about here. AoS is getting less than +30% (and by less I think they mean way less - around 5-10%), and the best example we have right now is Tomorrow's Children, which is getting around +30% on a fixed platform where they can fine-tune the code rather extensively. The latter is running only 3 compute queues (even if that's actually 3 ACEs with 8 queues each, this still gives us 24 queues, which is less than the 31 limit for Maxwell 2), which means that they didn't see much benefit in running more of them.

It may well be that while GCN can handle loads of asynchronous queues with little loss of performance, it won't actually be able to execute these queues in real time - each queue is still a program which must be executed; the more you have, the longer it will take to execute all of them. Will it be of any benefit to anyone if some code runs on a Fury at 5 fps while a 980 Ti handles it at only 1 fps? We really need more real games using the feature before we'll be able to make any conclusions.
 

frontieruk

Member
MDolenc said:
Well... That's interesting... Found a brand new behaviour on my GTX 680... Will post a new version a bit later; I still want to implement GPU timestamps, which could indicate better what's going on on GCN.

Ooh we could be getting closer...
 

DonasaurusRex

Online Ho Champ
It's not clear what performance we're talking about here. AoS is getting less than +30% (and by less I think they mean way less - around 5-10%), and the best example we have right now is Tomorrow's Children, which is getting around +30% on a fixed platform where they can fine-tune the code rather extensively. The latter is running only 3 compute queues (even if that's actually 3 ACEs with 8 queues each, this still gives us 24 queues, which is less than the 31 limit for Maxwell 2), which means that they didn't see much benefit in running more of them.

It may well be that while GCN can handle loads of asynchronous queues with little loss of performance, it won't actually be able to execute these queues in real time - each queue is still a program which must be executed; the more you have, the longer it will take to execute all of them. Will it be of any benefit to anyone if some code runs on a Fury at 5 fps while a 980 Ti handles it at only 1 fps? We really need more real games using the feature before we'll be able to make any conclusions.

System time or user time, though? Because if the user time is shorter, the gamer won't care if in actuality more CPU clock is being used.
 

bj00rn_

Banned
So glad I never upgraded to Maxwell

I'm curious; what exactly do you mean by that? I mean, even the top technical wizards around are struggling to come down clearly on either side of this matter yet. So it would be interesting to hear the details behind your conclusion :)
 

bj00rn_

Banned
Have any real technical wizards actually weighed in on this yet?

I don't know who's a legitimate technical wizard or not. That part was a bit tongue-in-cheek. But the point is still that even those who normally claim authority on the internets are unusually low key about a clear conclusion in this matter. So how is it possible to come to a conclusion like the previous poster did?
 
For anyone that bought a 980 Ti, you're playing in high-end territory now. You should have known you were gonna be obsolete really quickly. That's just how it works. FWIW, my last two cards were a 780 Ti and currently a 980 Ti. I pour hundreds of dollars down the drain because it's fun.
 

frontieruk

Member
hey that's me! lol

But anyway, I believe they were still speculating about the extra 40-50ms overhead on AMD.

That's on CPU timing though; the new test has added GPU timings, which is probably why the creator is seeing a new trend, but he's waiting for more data from the Nvidia and AMD guys in case it's just card-specific.
 

dogen

Member
That's on CPU timing though; the new test has added GPU timings, which is probably why the creator is seeing a new trend, but he's waiting for more data from the Nvidia and AMD guys in case it's just card-specific.

Yeah, that's what we're all waiting for. Should be interesting...
 

dogen

Member
First result.
290
Compute only ranges from 28ms to 420ms for 512 threads.
Graphics only is 36ms
Graphics + compute ranges from 28ms to 395ms for 512.
Graphics + compute single command list ranges from 54ms to 250ms


now a 980 ti
compute ranges from 5.7ms to 76.9ms
graphics only result is 16.5ms
graphics + compute result is 20.9ms to 92.4ms
graphics + compute single command list result:
20.6ms to over 3000ms (after 454 the timer seems to bug out)
 

FtsH

Member
First result.
290
Compute only ranges from 28ms to 420ms for 512 threads.
Graphics only is 36ms
Graphics + compute ranges from 28ms to 395ms for 512.
Graphics + compute single command list ranges from 54ms to 250ms

Now we just need a geforce to compare it to.

I need some education in layman's terms here. How should I relate these numbers to the GPU's capability? Is there some simple way to just say "lower is better" or "narrower range means async"?
 

frontieruk

Member
First result.
290
Compute only ranges from 28ms to 420ms for 512 threads.
Graphics only is 36ms
Graphics + compute ranges from 28ms to 395ms for 512.
Graphics + compute single command list ranges from 54ms to 250ms


now a 980 ti
compute ranges from 5.7ms to 76.9ms
graphics only result is 16.5ms
graphics + compute result is 20.9ms to 92.4ms
graphics + compute single command list result:
20.6ms to over 3000ms (after 454 the timer seems to bug out)

I was just looking over that. :D
 

dogen

Member
I need some education in layman's terms here. How should I relate these numbers to the GPU's capability? Is there some simple way to just say "lower is better" or "narrower range means async"?

A graphics + compute result (processing time in ms) being smaller than the equivalent separate graphics and compute tests combined would imply async compute, as long as the tasks themselves are conducive to async compute (i.e. have different bottlenecks).
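
For anyone who wants to eyeball their own results with that rule of thumb, here's a trivial helper - a sketch only, and the values in the example call are made up rather than taken from any card in this thread:

```python
# Reads three timings from the benchmark the way described above: if running
# graphics + compute together is clearly cheaper than the two separate runs
# added up, some overlap (async) is happening; if it's about equal, the work
# effectively ran serially.

def async_verdict(graphics_ms, compute_ms, combined_ms, tolerance=0.05):
    serial_estimate = graphics_ms + compute_ms
    saving = serial_estimate - combined_ms
    if saving > tolerance * serial_estimate:
        return f"overlapping: ~{saving:.1f} ms saved vs running serially"
    if saving < -tolerance * serial_estimate:
        return f"slower than serial by ~{-saving:.1f} ms"
    return "roughly serial (bars are additive)"

# hypothetical example values, purely to show the shape of the comparison:
print(async_verdict(graphics_ms=20.0, compute_ms=10.0, combined_ms=21.0))
```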
 

Irobot82

Member
A Fury X result got posted. That 980ti one bugged out before it could finish?

Also: The compiled results thus far in visual form.

Here

Thanks! So that makes sense.

Is there any significance to the numbers themselves? Not trying to get into a fanboy war, but is it possible to conclude something like "even if X card doesn't support async, it still delivers better real-world performance than Y"?

In the real world, it means nothing right now, as there are ZERO DX12 games on the market. So... who knows!
 

FtsH

Member
A graphics + compute result (processing time in ms) being smaller than the equivalent separate graphics and compute tests combined would imply async compute, as long as the tasks themselves are conducive to async compute (i.e. have different bottlenecks).

Thanks! So that makes sense.

Is there any significance to the numbers themselves? Not trying to get into a fanboy war, but is it possible to conclude something like "even if X card doesn't support async, it still delivers better real-world performance than Y"?
 

Blanquito

Member
I love this place. Thanks for your time and future input.

A graphics + compute result (processing time in ms) being smaller than the equivalent separate graphics and compute tests combined would imply async compute, as long as the tasks themselves are conducive to async compute (i.e. have different bottlenecks).

So, looking at those results... it appears the Nvidia card isn't showing async compute, correct?
 

frontieruk

Member
First result.
290
Compute only ranges from 28ms to 420ms for 512 threads.
Graphics only is 36ms
Graphics + compute ranges from 28ms to 395ms for 512.
Graphics + compute single command list ranges from 54ms to 250ms


now a 980 ti
compute ranges from 5.7ms to 76.9ms
graphics only result is 16.5ms
graphics + compute result is 20.9ms to 92.4ms
graphics + compute single command list result:
20.6ms to over 3000ms (after 454 the timer seems to bug out)
Fury X results
Compute only
26.03ms - 467.68ms

Graphics only: 26.01ms (64.50G pixels/s)

Graphics + compute:
26.70ms (62.83G pixels/s) to 512. 443.43ms (3.78G pixels/s)

Graphics, compute single commandlist:
52.04ms (32.24G pixels/s) [25.97] {64.50 G pixels/s} to 512. 234.23ms (7.16G pixels/s)
 

KKRT00

Member
Fresh Win 10, new GTX 970, 355.60 drivers
This is what GPU utilization looked like on this benchmark:

ei0wWLM.png
 

dogen

Member
Thanks! So that makes sense.

Is there any significance to the numbers themselves? Not trying to get into a fanboy war, but is it possible to conclude something like "even if X card doesn't support async, it still delivers better real-world performance than Y"?

I don't think it's a good real world test. I don't remember what it's doing, it might not be a similar workload to games at all.

https://forum.beyond3d.com/threads/dx12-performance-thread.57188/page-9#post-1869028

So the graphics part is just pushing triangles and is fillrate bound. That's very likely why nvidia is winning in the graphics-only portion; they have a much higher fillrate (right?). I think that's also a good indicator that it's not necessarily a very real-world test; I don't think games are often extremely fillrate bound (portions of the render process might be, though).
 

Macrotus

Member
Fresh Win 10, new GTX 970, 355.60 drivers
This is what GPU utilization looked like on this benchmark:

ei0wWLM.png

I'm not familiar with these types of things, but is that graph a positive one or a negative one?
I'm concerned because I also use a GTX 970.
 

FtsH

Member
In the real world, it means nothing right now, as there are ZERO DX12 games on the market. So... who knows!

Well...that's true.....

And what's the reason for Nvidia cards to show the ladder pattern while GCN cards give flat numbers across the test?
 

FtsH

Member
I don't think it's a good real world test. I don't remember what it's doing, it might not be a similar workload to games at all.

Cool. So what's the reason for Nvidia cards to show this ladder pattern while GCN cards give flat numbers across the test?
 

dogen

Member
Cool. So what's the reason for Nvidia cards to show this ladder pattern while GCN cards give flat numbers across the test?

Not sure.


edit - from my last post

So the graphics part is just pushing triangles and is fillrate bound. That's very likely why nvidia is winning in the graphics-only portion; they have a much higher fillrate (right?). I think that's also a good indicator that it's not necessarily a very real-world test; I don't think games are often extremely fillrate bound (portions of the render process might be, though).
 

KKRT00

Member
I'm not familiar with these types of things, but is that graph a positive one or a negative one?
I'm concerned because I also use a GTX 970.

It means that it is not working like it should, yes.
But it also seems like that's more of a driver problem.

---
Could those be the slow context switches AMD was talking about?
I really doubt it; it looks more like something is fundamentally broken, probably with the drivers.
 

FtsH

Member
Not sure.


edit - from my last post

So the graphics part is just pushing triangles and is fillrate bound. That's very likely why nvidia is winning in the graphics-only portion; they have a much higher fillrate (right?). I think that's also a good indicator that it's not necessarily a very real-world test; I don't think games are often extremely fillrate bound (portions of the render process might be, though).

Thanks again. So basically don't read too much into the results other than the info related to Async.
 

dogen

Member
Thanks again. So basically don't read too much into the results other than the info related to Async.

Even then I wouldn't say we can be completely sure, even though that's what it looks like. Maybe it really is driver related. We don't know yet.
 
NV's async granularity is lower than that of GCN but that's as much as we know right now.
It's sounding like referring to NV's approach as "granular" at all may be a bit generous.

No, it's not a trick question, as it's pretty obvious that async shaders can actually lead to _worse_ utilization than serial execution when done the wrong way - this is especially true for architectures which aren't built for fast context switching and are built for maximum throughput inside one context - which coincidentally is what Maxwell 2 is.
Well, obviously a broken implementation isn't going to help much, but that doesn't imply they wouldn't benefit from a proper one.

Async shaders are hardly a magic pill which will make everything faster everywhere; saying they will is just stupid.
Well, then it's a good thing no one is claiming that. Again, this is just a tool, and as such the results will depend on the project in question, the skill of the developer in using the tool, and as these tests are showing, the quality of the tool itself.

There's that tidbit from UE4's Fable async compute code submission which says that it should be used with caution as the results may actually be worse than without it.
It also says, "This is a good way to utilize unused GPU resources."

There's also the OP's statement on them not getting a lot of performance out of the feature at all.
That's pretty much the opposite of what it says in the OP. "Ashes uses a modest amount of [Async Compute], which gave us a noticeable perf improvement."

So it's not clear cut whether a game should even use it on PC, as it's highly dependent on the workloads in question.
It's not clear what performance we're talking about here. AoS is getting less than +30% (and by less I think they mean way less - around 5-10%), and the best example we have right now is Tomorrow's Children, which is getting around +30% on a fixed platform where they can fine-tune the code rather extensively. The latter is running only 3 compute queues (even if that's actually 3 ACEs with 8 queues each, this still gives us 24 queues, which is less than the 31 limit for Maxwell 2), which means that they didn't see much benefit in running more of them.

It may well be that while GCN can handle loads of asynchronous queues with little loss of performance, it won't actually be able to execute these queues in real time - each queue is still a program which must be executed; the more you have, the longer it will take to execute all of them. Will it be of any benefit to anyone if some code runs on a Fury at 5 fps while a 980 Ti handles it at only 1 fps? We really need more real games using the feature before we'll be able to make any conclusions.
Frankly, this is starting to sound like concern trolling. The fact that its utility varies does not diminish the technique in any way. It's a useful technique.
 

frontieruk

Member

As you're here...

Where's my Beta invite :'(

I signed up at the very very very first announcement.

My mate signed up just after E3 and is in; he can't play as he can't matchmake, but it's hella unfair :(

Joking...

maybe...

Nice to see you partaking in the fun btw ;)
 

Irobot82

Member
What the fack is that, all these bars and numbers o_O

Per the creator

Each bar in the chart shows the time it took for the async compute to finish.
The red block that floats to the top is the time it would take for the compute, by itself, to finish.
The blue block at the bottom is the time it would take for the graphics, by itself, to finish.

What we want here is for the red and blue to overlap; this signifies the async compute running faster than if you were to run the compute and graphics separately.
Sometimes we see a white gap between the two colors; this signifies that the async compute run is slower than it would have been if the two were run separately.
 
More fuel to the fire!

Maxwell cards are now also crashing out of the benchmark as they spend >3000ms trying to compute one of the workloads.

AMD_Robert said:
The author is not interpreting the results correctly.
Look at the height of the graphics bars.
Look at the height of the compute bars.

Notice how NVIDIA's async results are the height of those bars combined? This means the workloads are running serially, otherwise compute wouldn't have to wait on graphics and the bars would not be additive.
Compare that to the GCN results. Compute and graphics together, async shading bars are no higher than any other workload, demonstrating that frame latencies are not affected when the workloads are running together.
//EDIT: Asynchronous shading isn't simply whether or not a workload can contain compute and graphics. It's whether or not that workload can overlay graphics and compute, processing them both simultaneously without the pipeline latency getting any longer than the longest job. This is what GCN shows, but Maxwell does not.
//15:45 Central Edit: This benchmark has now been updated. GPU utilization of Maxwell-based graphics cards is now dropping to 0% under async compute workloads. As the workloads get more aggressive, the application ultimately crashes as the architecture cannot complete the workload before Windows terminates the thread (>3000ms hang).

https://www.reddit.com/r/pcgaming/comments/3j87qg/nvidias_maxwell_gpus_can_do_dx12_async_shading/
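
Reading the numbers posted earlier in the thread through that lens (this is just arithmetic on figures already in this thread, nothing new measured): for the 980 Ti, graphics-only (16.5ms) plus compute-only at the high end (76.9ms) comes to roughly 93.4ms, and the measured graphics + compute result is 92.4ms - essentially additive, i.e. the serial pattern described above. For the Fury X, graphics-only (26.01ms) plus compute-only at the low end (26.03ms) would be about 52ms run back to back, yet the measured combined result is 26.70ms, barely longer than either job alone - the overlapping pattern.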
 