
AMD: PlayStation 4 supports hUMA, Xbox One does not

Myshkin

Member
The only thing I don't get is paging... the GPU has to work with paging enabled, so are shaders context-bound to a thread?

I don't see where hUMA itself stops you from passing the GPU arbitrary pointers to both code and data. Where is the hardware support for memory access protections implemented?
 

IN&OUT

Banned
The PS4 has unified memory that works in line with the hUMA concept. It's not AMD's problem that MS fucked up the design of the X1 with ESRAM and slow DDR3 memory.

Now MS is angry and has started pulling strings? Why not equip the X1 with future-proof specs to avoid all this in the first place, instead of being a cheap ass and charging a premium for inferior tech!

MS is sitting on a mountain of CASH; they could've easily created the most powerful console ever conceived! But they think people are stupid, that they don't understand specs and hardware. The funny thing is that we have Sony, on the brink of bankruptcy, investing in cutting-edge and expensive RAM, equipping the PS4 with a better GPU, and devoting a five-year project to developing the PS4 with an eye to the industry's future trends and needs, to future-proof the console. And above all that, we find Sony charging less for the PS4 despite the vastly superior tech inside it compared to the X1!

MS just doesn't care; they said it themselves.
 
This should explain what hUMA is.

[Image: AMD Kaveri hUMA shared-memory diagram]


The only thing I don't get is paging... the GPU has to work with paging enabled, so are shaders context-bound to a thread?

Hmm. Maybe neither the PS4 nor the X1 has hUMA then.

On the PS4 the GPU can snoop the CPU cache, but I don't think there is a similar bus for the CPU to snoop the GPU cache. See edit below.

I don't think there is any sort of that feature in the X1 at all. Not with what the documents say, at least.

Major edit:

Look at this.


Look at the Onion/Onion+ bus.

It's bidirectional.

Onion = GPU > GPU cache > CPU cache > main RAM, and vice versa in the other direction.
Onion+ = GPU > CPU cache > main RAM, and vice versa.

Once the pipeline hits either the CPU or GPU's cache, the system SHOULD be able to directly access the CPU/GPU without having to go all the way out to main RAM.

So, it SHOULD have coherency...right?
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
So, it SHOULD have coherency...right?

Yes: since both the CPU and GPU probe the CPU's caches via Onion, if you choose to use that bus, CPU and GPU are fully cache-coherent. And thanks to the volatile tag, the GPU can bypass its own caches, meaning that they are not compromised by CPU/GPU interaction. This seems to be different on the XB1 where the GPU cache, from my understanding of the leaked documents, must be flushed.
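To make that concrete, here's a minimal sketch in plain C of what full coherency plus the volatile tag buys you. The "GPU side" is written as an ordinary function, and the no-flush behavior is assumed from the leaks rather than taken from any real SDK:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t ready;    /* CPU sets this when data is valid */
    float            data[256];
} shared_job_t;

/* CPU side: write the job, then publish it. On a fully coherent
 * bus the release store is all that's needed -- no cache flush. */
static void cpu_publish(shared_job_t *job)
{
    for (int i = 0; i < 256; i++)
        job->data[i] = (float)i;
    atomic_store_explicit(&job->ready, 1, memory_order_release);
}

/* "GPU" side: a volatile-tagged load would bypass the GPU's own
 * caches, so spinning on `ready` observes the CPU's store without
 * the GPU flushing anything. */
static float gpu_consume(shared_job_t *job)
{
    while (atomic_load_explicit(&job->ready, memory_order_acquire) == 0)
        ;   /* would be a volatile load on GCN */
    return job->data[0];
}
```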
 
Yes: since both the CPU and GPU probe the CPU's caches via Onion, if you choose to use that bus, CPU and GPU are fully cache-coherent. And thanks to the volatile tag, the GPU can bypass its own caches, meaning that they are not compromised by CPU/GPU interaction. This seems to be different on the XB1 where the GPU cache, from my understanding of the leaked documents, must be flushed.

But doesn't this mean hUMA-like computing is limited by the relatively narrow bandwidth of Onion? (Relatively narrow for the GPU, not the CPU.) Does GPGPU computing typically require a lot of bandwidth?
 
But doesn't this mean hUMA-like computing is limited by the relatively narrow bandwidth of Onion? (Relatively narrow for the GPU, not the CPU.) Does GPGPU computing typically require a lot of bandwidth?

IIRC Cerny said that you can explicitly flag different memory zones as "coherent" or "not coherent" (probably at a page level), so clever paging doesn't clog Onion without reason.
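If that's right, from a developer's point of view it might look something like this sketch (all names here are invented for illustration; the real allocation API isn't public):

```c
#include <stddef.h>
#include <stdlib.h>

enum mem_flags {
    MEM_COHERENT     = 1 << 0,   /* route accesses over Onion  */
    MEM_NON_COHERENT = 1 << 1,   /* route accesses over Garlic */
};

/* Stand-in allocator: a real one would set the page attributes
 * that steer accesses onto one bus or the other. */
static void *console_alloc(size_t bytes, enum mem_flags flags)
{
    (void)flags;
    return malloc(bytes);
}

int main(void)
{
    /* small and ping-ponged between CPU and GPU: pay for coherency */
    void *work_queue = console_alloc(64 * 1024, MEM_COHERENT);

    /* large and GPU-consumed: full Garlic bandwidth, no snooping */
    void *textures = console_alloc(256u * 1024 * 1024, MEM_NON_COHERENT);

    free(textures);
    free(work_queue);
    return 0;
}
```

The point is that only the small, chatty allocations would ever generate snoop traffic on the narrow coherent bus.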
 
CPU and GPU working in concert requires low latency more than anything; that's what Onion is for. And at 20 GB/s it even has 25% more bandwidth than the PCIe bus in a PC. For tasks that are not latency-sensitive (rendering, or GPGPU for eye candy) you can use the Garlic bus at maximum bandwidth.

And for clarification: the Onion bus was introduced with Llano a couple of years ago; it's also called the "Fusion Compute Link". Just because a system has these buses doesn't automatically mean it has hUMA. HSA is super complicated, isn't it? ^_^

Well, with hUMA, I'd imagine the GPU utilizing its many cores and high bandwidth to work on huge data structures while the CPU takes over infrequent branch conditions (since GPUs suck at those, as far as I know). So I think the GPU would still benefit a lot from high bandwidth.

But since this is a gaming system and the GPU will probably spend most of its time on graphics, hopefully Onion may be enough. 20 GB/s is overkill for the Jaguar CPUs anyway.

I'll just have to wait and buy an AMD Kaveri laptop to play around with this tech. It's gonna be very educational.
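To picture the split I mean, here's a toy sketch in plain C: the first loop stands in for a wide GPU kernel over a big buffer, the second for the CPU mopping up the rare branchy cases. Just an illustration, not real GPGPU code:

```c
#include <math.h>
#include <stddef.h>

/* "GPU" pass: uniform, branch-free work on every element, only
 * flagging the rare outliers. "CPU" pass: irregular handling of
 * the few flagged items, where branching is cheap. */
void process(float *v, size_t n, size_t *odd, size_t *n_odd)
{
    *n_odd = 0;
    for (size_t i = 0; i < n; i++) {        /* would be a GPU kernel */
        v[i] = sqrtf(fabsf(v[i]));
        if (v[i] > 1e6f)                    /* rare, branchy case    */
            odd[(*n_odd)++] = i;
    }
    for (size_t i = 0; i < *n_odd; i++)     /* CPU handles leftovers */
        v[odd[i]] = 0.0f;                   /* e.g. clamp/special-case */
}
```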
 
The PS4 has unified memory that works in line with the hUMA concept. It's not AMD's problem that MS fucked up the design of the X1 with ESRAM and slow DDR3 memory.

Now MS is angry and has started pulling strings? Why not equip the X1 with future-proof specs to avoid all this in the first place, instead of being a cheap ass and charging a premium for inferior tech!

MS is sitting on a mountain of CASH; they could've easily created the most powerful console ever conceived! But they think people are stupid, that they don't understand specs and hardware. The funny thing is that we have Sony, on the brink of bankruptcy, investing in cutting-edge and expensive RAM, equipping the PS4 with a better GPU, and devoting a five-year project to developing the PS4 with an eye to the industry's future trends and needs, to future-proof the console. And above all that, we find Sony charging less for the PS4 despite the vastly superior tech inside it compared to the X1!

MS just doesn't care; they said it themselves.

This is late, but I think it's worth addressing. It's erroneous to say MS doesn't care and prefers to sit on a mountain of cash while giving gamers shit hardware and telling them to go F themselves if they don't like it.

First, MS is publicly traded and answerable to shareholders. They already lost their shirts establishing the Xbox two gens back; a second money-losing console of that nature isn't happening. The Xbone needs a clear path to profitability.

Second, the Xbox division doesn't really make all that much money for MS. Other divisions are much more important, and at the moment they're struggling with Windows 8, fending off Google Docs as an Office competitor, losing badly with Windows Phone, and getting completely, utterly, and totally owned with Surface. They can't afford to throw money at the Xbox just to kill Sony; there's too much else at stake for them.
 
CPU and GPU working in concert requires low latency more than anything; that's what Onion is for. And at 20 GB/s it even has 25% more bandwidth than the PCIe bus in a PC. For tasks that are not latency-sensitive (rendering, or GPGPU for eye candy) you can use the Garlic bus at maximum bandwidth.

And for clarification: the Onion bus was introduced with Llano a couple of years ago; it's also called the "Fusion Compute Link". Just because a system has these buses doesn't automatically mean it has hUMA. HSA is super complicated, isn't it? ^_^
Yeah, until someone says whether there really is a unified memory address space instead of a virtual one, we won't know if it's hUMA or not. I bet it's still virtual like Llano, or Cerny would have said otherwise.
 
Are you sure Kaveri won't have an option for GDDR5? I'm sure I've read somewhere that they're going to release an APU based on the PS4's design, so they already have the memory controller worked out. GDDR5 DIMMs may not exist yet, but a lot of ultrabooks these days have memory chips on the motherboard itself. Also, if they go for an APU with, say, 6 CUs, maybe DDR3 will be enough.
 

Panajev2001a

GAF's Pleasant Genius
CPU and GPU working in concert requires low latency more than anything; that's what Onion is for. And at 20 GB/s it even has 25% more bandwidth than the PCIe bus in a PC. For tasks that are not latency-sensitive (rendering, or GPGPU for eye candy) you can use the Garlic bus at maximum bandwidth.

And for clarification: the Onion bus was introduced with Llano a couple of years ago; it's also called the "Fusion Compute Link". Just because a system has these buses doesn't automatically mean it has hUMA. HSA is super complicated, isn't it? ^_^

In a console environment there are often additional paths which, in a way, break nice and easy abstractions but provide additional performance headroom. This could be a way to explain the use of Onion, Onion+, and Garlic, as you were also explaining in your post.

From leaks and interviews, it is apparent that, depending on the memory region you store data in, you will access it through Garlic or Onion, and that if you want to skip the GPU's data caches you can access it through Onion+.

It is possible that, for hUMA to work, i.e. for addresses to be shared between CPU and GPU, AMD needed both a new GPU core compared to their previous APUs and other software customizations, including OS support, which would take much less time to enable in a custom BSD-based OS than on Windows or OS X.

Also, it is conceivable that hUMA works only when the CPU and GPU access data mapped to the Onion memory region, and not when it is accessed over Onion+ or Garlic. It would make sense, and it would not be too hard to manage either, IMHO, for developers accustomed to crazier setups :).
According to Cerny's Digital Foundry/EG interview and what he was saying about the HSA software stack, it is possible that this is a feature that will arrive in the SDK post-launch.
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
But doesn't this mean hUMA-like computing is limited by the relatively narrow bandwidth of Onion? (Relatively narrow for the GPU, not the CPU.) Does GPGPU computing typically require a lot of bandwidth?

The GPU uses 4 memory controllers, each with two 32-bit wide channels. [1] Jaguar, like Kabini [2], most likely uses a single 64-bit wide memory controller. That must limit the bandwidth on Onion.

[1] http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf
[2] http://www.anandtech.com/show/6976/...wering-xbox-one-playstation-4-kabini-temash/4

/edit: Quote from the second article

The major change between AMD’s Temash/Kabini Jaguar implementations vs. what’s done in the consoles is really all of the unified memory addressing work and any coherency that’s supported on the platforms. Memory buses are obviously very different as well, but the CPU cores themselves are pretty much identical to what we’ve outlined here.

So much for the "based on Kabini implies no hUMA" argument.
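As a back-of-the-envelope check on those bus widths (in C; the 5.5 GT/s GDDR5 data rate is my assumption, taken from the commonly cited PS4 memory spec):

```c
#include <stdio.h>

int main(void)
{
    const double gt_per_s = 5.5;        /* assumed GDDR5 data rate */
    const int    gpu_bits = 4 * 2 * 32; /* 4 controllers x 2 x 32-bit = 256-bit */

    /* bytes per transfer times transfer rate */
    double garlic_gb_s = gpu_bits / 8.0 * gt_per_s;

    printf("Garlic peak: %.0f GB/s\n", garlic_gb_s); /* -> 176 GB/s */
    printf("Onion cap:   ~20 GB/s (coherent traffic only)\n");
    return 0;
}
```

The 176 GB/s result matches the PS4's advertised figure, while coherent traffic is capped far below that regardless of what the DRAM could supply.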
 

Panajev2001a

GAF's Pleasant Genius
Yeah, until someone says whether there really is a unified memory address space instead of a virtual one, we won't know if it's hUMA or not. I bet it's still virtual like Llano, or Cerny would have said otherwise.

It would still be virtual; the CPU and GPU would share addresses and the same virtual-to-physical memory page mapping. GCN can already use virtual memory and access data outside its directly accessible physical memory through paging. What seems to be required for hUMA is an extension of that (sharing the same virtual address space, and pointers into it, with the CPU), not a complete revolution, I think.
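For anyone wondering what that looks like from the programmer's side: OpenCL 2.0's fine-grained SVM exposes exactly this "same pointer on both sides" model. To be clear, this is the later public PC API, not the PS4 SDK; a condensed host-side sketch with error checking omitted:

```c
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id   dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    /* One allocation in one shared virtual address space. */
    float *shared = (float *)clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        1024 * sizeof(float), 0);

    shared[0] = 42.0f;   /* CPU writes through the raw pointer */

    /* A kernel would receive the very same pointer value:
     *   clSetKernelArgSVMPointer(kernel, 0, shared);
     * and with fine-grained SVM its stores become visible to the
     * CPU without an unmap or flush -- the hUMA behavior above. */

    clSVMFree(ctx, shared);
    clReleaseContext(ctx);
    return 0;
}
```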
 

Finalizer

Member
This is one confusing thread :)

Is hUMA used in the PS4? Or is the jury still out on that?

Will the Xbox One have it?

We're probably not going to get a definitive answer one way or the other until these systems launch and the NDAs lift. It seems likely that the PS4 supports hUMA in some form at least, since it looks like it's got the parts in place to support it. The jury's still out on the Xbone, though personally I wouldn't be surprised if it had some sort of solution of its own... But that's a tale straight from my ass, so don't put any stock in it.

Curious about this Xbone APU presentation. I wonder if we'll get any interesting insights out of it.
 

ekim

Member
I checked this table:
http://en.wikipedia.org/wiki/Heterogenous_System_Architecture#AMD_HSA_Implementation

and if I'm not misunderstanding something, most of the listed 2013/Kaveri/2014 HSA features are indeed in the Xbox One's APU (can be validated by a mod if wanted):
- passing pointers between CPU/GPU
- GPU uses pageable system memory via CPU pointers (well, that's basically an implication of the above point)
- context switch
- pre-emption (which is basically context switching)

It really seems that only the eSRAM prevents the box from being "hUMA" by AMD's definition:
- Fully coherent memory between CPU & GPU

But MS might have their own solution for this:
from B3D (http://forum.beyond3d.com/showpost.php?p=1777116&postcount=5697)
Nick Baker (Engineer Console Architecture)
Source: http://www.youtube.com/watch?v=vg_DR0leAYw
21:50

"We had to invest a lot in coherency through the chips. There's been I/O coherency for awhile, but we really wanted to get the software out of the mode of managing caches and you know, put in hardware coherency for the first time on a mass scale in the living room on the GPU."

I guess that's what they will talk about in the hot chips session.
 

joshcryer

it's ok, you're all right now
But MS might have their own solution for this:
from B3D (http://forum.beyond3d.com/showpost.php?p=1777116&postcount=5697)


I guess that's what they will talk about in the hot chips session.

He goes on immediately after to mention nested page tables, which is probably what this is about. It sounds to me like they have a solution where you can pass a pointer between GPU and CPU at an API level using a page walker, but that's going to come with some overhead, and if the cache needs to be flushed every time, it's going to cost a lot. They may have a cache-level solution that keeps it from being flushed.

The entire OS seems to sit in its own VM, which is interesting, and which probably means that a huge chunk of memory is going to be dedicated to the OS.
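If it helps, here's the nested translation reduced to a toy model in C (flat arrays standing in for real multi-level tables; nothing here is Durango's actual scheme). The takeaway: every pointer the page walker chases pays for two lookups instead of one.

```c
#include <stdint.h>

/* One "page table" per translation level, modeled as a flat
 * array indexed by page number (4 KiB pages). */
uint64_t translate(const uint64_t *guest_pt, /* guest-virt -> guest-phys */
                   const uint64_t *host_pt,  /* guest-phys -> host-phys  */
                   uint64_t guest_virt)
{
    uint64_t off        = guest_virt & 0xFFF;
    uint64_t gpage      = guest_virt >> 12;
    uint64_t gphys_page = guest_pt[gpage];       /* first walk  */
    uint64_t hphys_page = host_pt[gphys_page];   /* second walk */
    return (hphys_page << 12) | off;
}
```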
 

cebri.one

Member
Just a reminder...

http://www.vgleaks.com/durango-memory-system-overview/

There are two types of coherency in the Durango memory system:

- Fully hardware coherent
- I/O coherent
The two CPU modules are fully coherent. The term fully coherent means that the CPUs do not need to explicitly flush in order for the latest copy of modified data to be available (except when using Write Combined access).

The rest of the Durango infrastructure (the GPU and I/O devices such as, Audio and the Kinect Sensor) is I/O coherent. The term I/O coherent means that those clients can access data in the CPU caches, but that their own caches cannot be probed.

When the CPU produces data, other system clients can choose to consume that data without any extra synchronization work from the CPU.

The total coherent bandwidth through the north bridge is limited to about 30 GB/s.

The CPU requests do not probe any other non-CPU clients, even if the clients have caches. (For example, the GPU has its own cache hierarchy, but the GPU is not probed by the CPU requests.) Therefore, I/O coherent clients must explicitly flush modified data for any latest-modified copy to become visible to the CPUs and to the other I/O coherent clients.

The GPU can perform both coherent and non-coherent memory access. Coherent read-bandwidth of the GPU is limited to 30 GB/s when there is a cache miss, and it’s limited to 10 – 15 GB/s when there is a hit. A GPU memory page attribute determines the coherency of memory access.
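To make the practical difference concrete, here's a sketch with invented function names (nothing below is the real Durango API): on an I/O-coherent GPU the flush in the middle is mandatory before the CPU can read the GPU's output, while on fully coherent (hUMA-style) hardware that call, and its cost, would simply disappear.

```c
#include <stdio.h>

static void gpu_run_kernel(float *out, int n)  /* stand-in GPU job */
{
    for (int i = 0; i < n; i++)
        out[i] = i * 0.5f;  /* pretend these stores sit in GPU L2 */
}

static void gpu_flush_caches(void)
{
    /* On an I/O-coherent GPU this is mandatory before CPU reads,
     * because the CPU cannot probe the GPU's caches. On a fully
     * coherent setup this call would not exist. */
}

int main(void)
{
    float buf[8];
    gpu_run_kernel(buf, 8);
    gpu_flush_caches();             /* the cost being argued about */
    printf("CPU sees: %f\n", buf[7]);
    return 0;
}
```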
 

ekim

Member
Dedicated GPU (in contrast to iGPU: integrated GPU).

That's why I was wondering - but I guess the person in question just did verification tests on APUs/dGPUs and the PS4/X1 APUs. I first understood it as if these consoles had an APU + a dGPU. That would have been pretty much unbelievable.

Wait what? I thought by definition an APU can't have a dGPU.

Afaik, Richland APUs' iGPUs can be used for Crossfire with a dGPU.
edit: nevermind - misread.
 

KidBeta

Junior Member
I checked this table:
http://en.wikipedia.org/wiki/Heterogenous_System_Architecture#AMD_HSA_Implementation

and if I'm not misunderstanding something, most of the listed 2013/Kaveri/2014 HSA features are indeed in the Xbox One's APU (can be validated by a mod if wanted):
- passing pointers between CPU/GPU
- GPU uses pageable system memory via CPU pointers (well, that's basically an implication of the above point)
- context switch
- pre-emption (which is basically context switching)

It really seems that only the eSRAM prevents the box from being "hUMA" by AMD's definition:
- Fully coherent memory between CPU & GPU

But MS might have their own solution for this:
from B3D (http://forum.beyond3d.com/showpost.php?p=1777116&postcount=5697)


I guess that's what they will talk about in the hot chips session.

Could you provide your evidence for context switching / pre-emption? I have yet to read or even hear anything that suggests the XBONE has them.

As for the first two points, they are standard features of GCN, so it would be surprising if the XBONE didn't have them.
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
Also interesting: the sound processor in the Xbox One has a lot of grunt. Cerny said he wants to use GPGPU for sound. Can't wait to see which solution is better.

I don't think that compares. Cerny said that the quite specific use case of raytracing for audio could be done via GPGPU. I don't see how the audio processor in the XB1 could do raytracing, since this task depends on the representation of the scene's geometry and thus needs fast access to main memory.

In addition, the XB1's audio processor explicitly has "pathways" to integrate calculations performed on the CPU, indicating that it won't work well as a general-purpose processor. I guess it's there to perform programmable tasks on audio streams. In this respect, we don't really know what the PS4's audio chip can do, since all we have are Cerny's two sentences on the issue.
 
It should be noted that the audio processing Cerny has talked about doing on the GPU is not something you can do on the Xbox One's audio processor. If the Xbox One wanted to do the same audio ray casting, it would have to use the GPU too.

EDIT: Beaten
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
What is your take on hUMA for Xbox One, guys?

The only differences I can spot are the lack of fine-grained GPU cache control in the XB1 and the ESRAM/DME. But that was all known from the leaked documents before the Hot Chips presentation, so I don't think we have gained that much more information.
 

benny_a

extra source of jiggaflops
W!CKED said:
Two compute command processors (ACEs) and most likely two compute queues for Xbox One.
Could it be that they just don't list all of them, so they actually have more and the 2x2 they display is just a stand-in?
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
Could it be that they just don't list all of them, so they actually have more and the 2x2 they display is just a stand-in?

If I remember correctly, a 2x2 setup is the standard one in GCN while the added ACEs in the PS4 are among the explicit modifications.
 