ekim · Feb 6, 2013

http://www.vgleaks.com/world-exclusive-durangos-move-engines/

Moore’s Law imposes a design challenge: How to make effective use of ever-increasing numbers of transistors without breaking the bank on power consumption? Simply packing in more instances of the same components is not always the answer. Often, a more productive approach is to move easily encapsulated, math-intensive operations into hardware.

The Durango GPU includes a number of fixed-function accelerators. Move engines are one of them.

Durango hardware has four move engines for fast direct memory access (DMA).

These accelerators are truly fixed-function, in the sense that their algorithms are embedded in hardware. They can usually be considered black boxes with no intermediate results that are visible to software. When used for their designed purpose, however, they can offload work from the rest of the system and obtain useful results at minimal cost.

The following figure shows the Durango move engines and their sub-components.

The four move engines all have a common baseline ability to move memory in any combination of the following ways:

From main RAM or from ESRAM
To main RAM or to ESRAM
From linear or tiled memory format
To linear or tiled memory format
From a sub-rectangle of a texture
To a sub-rectangle of a texture
From a sub-box of a 3D texture
To a sub-box of a 3D texture

The move engines can also be used to set an area of memory to a constant value.
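As a rough software model of the baseline operations listed above, here is a minimal sketch of a sub-rectangle copy and a constant fill between linear surfaces. The function names, the pitch parameters and the 4-bytes-per-texel default are illustrative assumptions, and tiled layouts are left out because the article does not describe them.

```python
def copy_rect(src, src_pitch, src_xy, dst, dst_pitch, dst_xy, size, bpp=4):
    """Copy a width x height rectangle of texels between two linear surfaces.

    Pitches are in bytes per row; coordinates and size are in texels.
    """
    (sx, sy), (dx, dy), (w, h) = src_xy, dst_xy, size
    row_bytes = w * bpp
    for row in range(h):
        s = (sy + row) * src_pitch + sx * bpp
        d = (dy + row) * dst_pitch + dx * bpp
        dst[d:d + row_bytes] = src[s:s + row_bytes]


def fill_rect(dst, dst_pitch, dst_xy, size, value, bpp=4):
    """Set every byte of a rectangle to a constant value (0..255)."""
    (dx, dy), (w, h) = dst_xy, size
    row_bytes = w * bpp
    for row in range(h):
        d = (dy + row) * dst_pitch + dx * bpp
        dst[d:d + row_bytes] = bytes([value]) * row_bytes


# Usage: copy a 2x2 block out of a 4x4 source into the corner of an 8x8 target,
# then fill another 2x2 block of the target with 0xFF.
src = bytearray(range(4 * 4 * 4))   # 4x4 texels, 4 bytes each
dst = bytearray(8 * 8 * 4)          # 8x8 texels, zero-initialised
copy_rect(src, 4 * 4, (1, 1), dst, 8 * 4, (0, 0), (2, 2))
fill_rect(dst, 8 * 4, (4, 4), (2, 2), 0xFF)
```

On the real hardware this work would be done by the move engine rather than by a CPU loop; the sketch only shows the addressing involved.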

DMA Performance

Each move engine can read and write 256 bits of data per GPU clock cycle, which equates to a peak throughput of 25.6 GB/s both ways. Raw copy operations, as well as most forms of tiling and untiling, can occur at the peak rate. The four move engines share a single memory path, yielding a total maximum throughput for all the move engines that is the same as for a single move engine. The move engines share their bandwidth with other components of the GPU, for instance, video encode and decode, the command processor, and the display output. These other clients are generally only capable of consuming a small fraction of the shared bandwidth.
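The figures in this paragraph can be checked with a little arithmetic. The sketch below derives the clock rate implied by 256 bits per cycle and 25.6 GB/s; the article does not state that clock, so treat it as an inference. It also shows the even split of the shared path, which is a simplifying assumption about how the engines contend.

```python
BITS_PER_CLOCK = 256            # from the article
PEAK_BYTES_PER_SEC = 25.6e9     # 25.6 GB/s, from the article

bytes_per_clock = BITS_PER_CLOCK // 8                     # 32 bytes per cycle
implied_clock_hz = PEAK_BYTES_PER_SEC / bytes_per_clock   # inferred, not stated


def per_engine_share(active_engines):
    """Bandwidth per engine if the shared path were divided evenly (assumption)."""
    return PEAK_BYTES_PER_SEC / active_engines


print(implied_clock_hz / 1e6, "MHz implied GPU clock")
print(per_engine_share(4) / 1e9, "GB/s per engine with all four busy")
```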

The careful reader may deduce that raw performance of the move engines is less than could be achieved by a shader reading and writing the same data. Theoretical peak rates are displayed in the following table.

Copy operation | Peak throughput using move engine(s) | Peak throughput using shader
RAM -> RAM | 25.6 GB/s | 34 GB/s
RAM -> ESRAM | 25.6 GB/s | 68 GB/s
ESRAM -> RAM | 25.6 GB/s | 68 GB/s
ESRAM -> ESRAM | 25.6 GB/s | 51.2 GB/s

The advantage of the move engines lies in the fact that they can operate in parallel with computation. During times when the GPU is compute bound, move engine operations are effectively free. Even while the GPU is bandwidth bound, move engine operations may still be free if they use different pathways. For example, a move engine copy from RAM to RAM would not be impacted by a shader that only accesses ESRAM.
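To make the trade-off concrete, here is a deliberately simple timing model built on the peak rates from the table. It assumes that a shader copy adds its time to the frame's compute work while a move engine copy overlaps with it completely. Both are idealisations, and the 10 ms compute load and 16 MB copy size in the usage lines are made-up example values.

```python
# (move engine GB/s, shader GB/s) per route, taken from the table above.
RATES_GBPS = {
    ("RAM", "RAM"): (25.6, 34.0),
    ("RAM", "ESRAM"): (25.6, 68.0),
    ("ESRAM", "RAM"): (25.6, 68.0),
    ("ESRAM", "ESRAM"): (25.6, 51.2),
}


def copy_ms(num_bytes, gb_per_s):
    """Milliseconds needed to copy num_bytes at a peak rate given in GB/s."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


def frame_ms(compute_ms, num_bytes, route, use_move_engine):
    """Idealised frame time: move engine copies overlap, shader copies do not."""
    engine_rate, shader_rate = RATES_GBPS[route]
    if use_move_engine:
        return max(compute_ms, copy_ms(num_bytes, engine_rate))
    return compute_ms + copy_ms(num_bytes, shader_rate)


# Hypothetical compute-bound frame: 10 ms of shader work plus a 16 MB RAM -> RAM copy.
with_engine = frame_ms(10.0, 16_000_000, ("RAM", "RAM"), use_move_engine=True)
with_shader = frame_ms(10.0, 16_000_000, ("RAM", "RAM"), use_move_engine=False)
print(with_engine, with_shader)
```

Under these assumptions the slower move engine still gives the shorter frame, because its 0.625 ms copy hides behind the compute work.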

Generic lossless compression and decompression

One move engine out of the four supports generic lossless encoding and one move engine supports generic lossless decoding. These operations act as extensions on top of the standard DMA modes. For instance, a title may decode from main RAM directly into a sub-rectangle of a tiled texture in ESRAM.

The canonical use for the LZ decoder is decompression (or transcoding) of data loaded from off-chip from, for instance, the hard drive or the network. The canonical use for the LZ encoder is compression of data destined for off-chip. Conceivably, LZ compression might also be appropriate for data that will remain in RAM but may not be used again for many frames—for instance, low latency audio clips.

The codec employed by the move engines is LZ77, the 1977 version of the Lempel-Ziv (LZ) algorithm for lossless compression. This codec is the same one used in zlib, gzip and other standard libraries. The specific standard that the encoder and decoder adhere to is RFC 1951 (DEFLATE). In other words, the encoder generates a compliant bit stream according to this standard, and the decoder can decompress certain compliant bit streams, and in particular, any bit stream generated by the encoder.

LZ compression involves a sliding window and operates in blocks. The window represents the history available to pattern-match against. A block denotes a self-contained unit, which can be decoded independently of the rest of the stream. The window size and block size are parameters of the encoder. Larger window and block sizes imply better compression ratios, while smaller sizes require less calculation and working memory. The Durango hardware encoder and decoder can support block sizes up to 4 MB. The encoder uses a window size of 1 KB, and the decoder uses a window size of 4 KB. These facts impose a constraint on offline compressors. In order for the hardware decoder to interpret a compressed bit stream, that bit stream must have been created with a window size no larger than 4 KB and a block size no larger than 4 MB. When compression ratio is more important than performance, developers may instead choose to use a larger window size and decode in software.
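For the offline side, a minimal sketch of producing a raw RFC 1951 stream with a 4 KB window using Python's zlib module might look like this. The helper names are hypothetical, and the sketch only handles the window constraint: it does not control block size, and whether the hardware decoder accepts zlib's output is an assumption the article does not confirm.

```python
import zlib

HW_DECODER_WINDOW_BITS = 12   # 2**12 bytes = 4 KB, the decoder window in the article


def compress_small_window(data, window_bits=HW_DECODER_WINDOW_BITS):
    """Compress to a raw DEFLATE stream whose back-references fit the window.

    A negative wbits value tells zlib to emit a raw RFC 1951 stream with no
    zlib header or checksum.
    """
    compressor = zlib.compressobj(9, zlib.DEFLATED, -window_bits)
    return compressor.compress(data) + compressor.flush()


def decompress_small_window(stream, window_bits=HW_DECODER_WINDOW_BITS):
    """Software stand-in for a decoder that only keeps a 4 KB history."""
    decompressor = zlib.decompressobj(-window_bits)
    return decompressor.decompress(stream) + decompressor.flush()


payload = b"durango move engine " * 1000
packed = compress_small_window(payload)
restored = decompress_small_window(packed)
print(len(payload), "->", len(packed), "bytes")
```

Raising window_bits to 15 (32 KB) would typically compress better, which is the case where the article suggests decoding in software instead.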

The LZ decoder supports a raw throughput of 200 MB/s compressed data. The LZ encoder is designed to support a throughput of 150-200 MB/s for typical texture content. The actual throughput will vary depending on the nature of the data.

JPEG decoding

The same move engine that supports LZ decoding also supports JPEG decoding. Just as with LZ, JPEG decoding operates as an extension on top of the standard DMA modes. For instance, a title may decode from main RAM directly into a sub-rectangle of a tiled texture in ESRAM. The move engines contain no hardware JPEG encoder, only a decoder.

The JPEG codec used by the move engine is known as ISO/IEC 10918-1, which was the 1994 JPEG committee standard. The hardware decoder does not support later standards, such as JPEG 2000 (wavelet encoding) or the format known variously as JPEG XR, HD Photo, or Windows Media Photo, which added a number of extensions to the base algorithm. There is no native support for grayscale-only textures or for textures with alpha.

The move engine takes as input an entire JPEG stream, including the JFIF file header. It returns as output an 8-bit luma (Y or brightness) channel and two 8-bit subsampled chroma (CbCr or color) channels. The title must convert (if desired) from YCbCr to RGB using shader instructions.
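The YCbCr to RGB step would run in a shader on the console, but the math is easier to read in plain Python. This sketch uses the full-range conversion defined by JFIF. That the title's shader would use exactly these constants is an assumption, and the function names are illustrative.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range 8-bit YCbCr sample to 8-bit RGB (JFIF constants)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


# Neutral chroma (128, 128) leaves a pure grey at the luma value.
print(ycbcr_to_rgb(128, 128, 128))
```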

The JPEG decoder supports both 4:2:2 and 4:2:0 subsampling of chroma. For illustration, see Figures 2 and 3. 4:2:2 subsampling means that each chroma channel is ½ the resolution of luma in the x direction, which implies a footprint of 2 bytes per texel. 4:2:0 subsampling means that each chroma channel is ½ the resolution of luma in both the x and y directions, which implies a footprint of 1.5 bytes per texel. The subsampling mode is a property of the compressed image, specified at encoding time.
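The footprint figures follow directly from the subsampling factors: one luma byte per texel plus two chroma channels at reduced resolution. A short check, using a 1920x1080 image purely as an example size:

```python
from fractions import Fraction


def bytes_per_texel(h_div, v_div):
    """Decoded footprint: 1 luma byte plus two chroma channels subsampled
    by h_div horizontally and v_div vertically."""
    luma = Fraction(1)
    chroma = 2 * Fraction(1, h_div * v_div)
    return luma + chroma


def decoded_bytes(width, height, h_div, v_div):
    """Total decoded size in bytes for an image of the given dimensions."""
    return int(width * height * bytes_per_texel(h_div, v_div))


print(float(bytes_per_texel(2, 1)))   # 4:2:2
print(float(bytes_per_texel(2, 2)))   # 4:2:0
print(decoded_bytes(1920, 1080, 2, 1), decoded_bytes(1920, 1080, 2, 2))
```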

In the case of 4:2:2 subsampling, the luma and chroma channels are interleaved. The GPU supports special texture formats (DXGI_FORMAT_G8R8_G8B8_UNORM) and tiling modes to allow all three channels to be fetched using a single instruction, even though they are of different resolutions.

JPEG decoder output, 4:2:2 subsampled, with chroma interleaved.

In the case of 4:2:0 subsampling, the luma and chroma channels are stored separately. Two fetches are required to read a decoded pixel—one for the luma channel and another (with different texture coordinates) for the chroma channels.
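The two fetches can be sketched as two array lookups with different coordinates. The layout here is an assumption: a full-resolution luma plane plus a half-resolution plane of interleaved Cb/Cr pairs, with an even image width. The article only says the channels are stored separately, so the real memory layout may differ.

```python
def fetch_420(luma, chroma, width, x, y):
    """Read one decoded pixel from 4:2:0 output.

    luma   : width * height bytes, one per texel
    chroma : (width // 2) * (height // 2) interleaved Cb, Cr byte pairs
    """
    # Fetch 1: luma at full-resolution coordinates.
    y_val = luma[y * width + x]
    # Fetch 2: chroma at half-resolution coordinates, shared by a 2x2 block.
    c = ((y // 2) * (width // 2) + (x // 2)) * 2
    return y_val, chroma[c], chroma[c + 1]


# A 4x2 image: 8 luma samples and a 2x1 chroma plane.
luma = bytes(range(8))
chroma = bytes([100, 200, 110, 210])
print(fetch_420(luma, chroma, 4, 3, 1))
```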

JPEG decoder output, 4:2:0 subsampled, with chroma stored separately.

Throughput of JPEG decoding is naturally much less than throughput of raw data. The following table shows examples of processing loads that approach peak theoretical throughput for each subsampling mode.

Peak theoretical rates for JPEG decoding.

System and title usage

Move engines 1, 2 and 3 are for the exclusive use of the running title.

Move engine 0 is shared between the title and the system. During the system’s GPU time slice, the system uses move engine 0. During the title’s GPU time slice, move engine 0 can be used by title code. It may also be used by Direct3D to assist in carrying out title commands. For instance, to complete a Map operation on a surface in ESRAM, Direct3D will use move engine 0 to move that surface to main memory.

Some tables aren't properly formatted. Visit the source.

Master_JO · Feb 6, 2013

it's all gibberish to me.

Someone explain it to me/us please

I wanna know what those can/cant do.

break it down for me

Thanks.

Kill Your Masters · Feb 6, 2013

So in layman terms is this the jizz or not?

reptilescorpio · Feb 6, 2013

Peter Moore made quite the mark it seems!

KidBeta · Feb 6, 2013

Hellraizer said:
So in layman terms is this the jizz or not?

Hardware compression is nice.

But this is not jizz.

Kydd BlaZe · Feb 6, 2013

The numbers mason...what do they mean?

rakka · Feb 6, 2013

is this kinect

KennyLinder · Feb 6, 2013

How many GAMECUBE's?

Bitmap Frogs · Feb 6, 2013

So these are gonna be this gen's SPE's eh?

Can't wait for the endless threads about what they can or cannot do, etc etc.

Hydrargyrus · Feb 6, 2013

I don't know how to eat this sauce...

davious88 · Feb 6, 2013

So basically, they take some load off the CPU/GPU.

KidBeta · Feb 6, 2013

Bitmap Frogs said:
So these are gonna be this gen's SPE's eh?

Can't wait for the endless threads about what they can or cannot do, etc etc.

They cant do anything aside from what was posted

They are literally fixed function

They are not programmable, nothing like a SPE.

Snowden's Secret · Feb 6, 2013

So how many gens more is this compared to the 360? The 360 clocked at 8.7 gens, how does this stack up?

Captain Tuttle · Feb 6, 2013

Subscribing but waiting until someone can break this down into layman's terms

derFeef · Feb 6, 2013

Basically optimization as expected in a closed system? Taking off load is nice, hardware de/compression is also nice I guess.

Slayer-33 · Feb 6, 2013

Captain Tuttle said:
Subscribing but waiting until someone can break this down into layman's terms

Kagebunshin without splitting power? lol

szaromir · Feb 6, 2013

Ashes · Feb 6, 2013

jpeg? Shouldn't folks move to jpeg 2000 already?

systemfehler · Feb 6, 2013

Did I read this right? Basically "move engines" only make sense if you have 2 pools of memory, so you can swap while the GPU/CPU is busy with something else? E.g. PS4 wouldn't benefit from a "move engine" because the CPU/GPU both output to the same pool of memory.

gofreak · Feb 6, 2013

Looks like the DMA units we expected. No more, no less.

Only thing that seems slightly odd is that they can't saturate the system's bandwidth, although I guess the idea is to use them for some copying around of data but not all.

Ding-Ding · Feb 6, 2013

Is there an English translation incoming.

Basically, is this wizard juice or a premature ejaculation

gaming_noob · Feb 6, 2013

fritolay · Feb 6, 2013

Does this happen behind the scenes to game programmers or is this something that will make coding harder for Durango to take advantage of?

KidBeta · Feb 6, 2013

gofreak said:
Looks like the DMA units we expected. No more, no less.

Only thing that seems slightly odd is that they can't saturate the system's bandwidth, although I guess the idea is to use them for some copying around of data but not all.

the lack of JPEG XR has me puzzled.

Better compression and its also a microsoft owned format.

Bitmap Frogs · Feb 6, 2013

KidBeta said:
They cant do anything aside from what was posted

They are literally fixed function

They are not programmable, nothing like a SPE.

That's not the point.

McHuj · Feb 6, 2013

gofreak said:
Looks like the DMA units we expected. No more, no less.

Only thing that seems slightly odd is that they can't saturate the system's bandwidth, although I guess the idea is to use them for some copying around of data but not all.

Yup, pretty much. No special sauce.

KidBeta · Feb 6, 2013

The four move engines share a single memory path, yielding a total maximum throughput for all the move engines that is the same as for a single move engine.

Wait.

So the max at any one point in time that these can process is 25.6GB/s?.

so if you use all 4, you get (25.6 / 4) 6.4GB/s on each?

Thats not a great deal of bandwidth tbh.

iamshadowlark · Feb 6, 2013

fritolay said:
Does this happen behind the scenes to game programmers or is this something that will make coding harder for Durango to take advantage of?

Probably have to hand code this stuff. A lot of juggling it seems.

Anyway I'll C/P my last post

They are fixed function DMA units. Four of them total, with 3 fully dedicated to the application (game) and one shared between the system and app. Max throughput of 25.6 GB/s and can be used in parallel. Where it gets weird is the comment about how all four combine for the max throughput of one.

Any thoughts on this

deanos · Feb 6, 2013

i read the whole thing twice, there is no wizard jizz.
take cover, proelite is gonna be pissed.

Perkel · Feb 6, 2013

It seems that this is what we expected. It should help to achieve better bandwidth overall.

Problem is that now developers will need to juggle data a lot. Overhead ?

derFeef · Feb 6, 2013

deanos said:
i read the whole thing twice, there is no wizard jizz.
take cover, proelite is gonna be pissed.

Is this going to keep popping up with every Durango leak?

spwolf · Feb 6, 2013

Ashes1396 said:
jpeg? Shouldn't folks move to jpeg 2000 already?

nobody is using jpeg 2000... so it is LZ77 and not deflate... very old school but i guess thats expected, most game engines use either that or deflate still due to speed.

gaming_noob · Feb 6, 2013

derFeef said:
Is this going to keep popping up with every Durango leak?

He has an unhealthy obsession with proelite. Should probably seek a therapist.

pharmboy044 · Feb 6, 2013

So does this tell you why people were adding bandwidths to come up with 170GB/s?

Chev · Feb 6, 2013

Hellraizer said:
So in layman terms is this the jizz or not?

In layman's terms you could find such functions hardwired on the DS already. It does mean, though, they're making some things easier for developers, which always is a nice thing.

DGRE · Feb 6, 2013

Wrong move?

Glorified G · Feb 6, 2013

derFeef said:
Is this going to keep popping up with every Durango leak?

Slayer-33 · Feb 6, 2013

pharmboy044 said:
So does this tell you why people were adding bandwidths to come up with 170GB/s?

Isn't it between the Esram and DDR3?

Shikoro · Feb 6, 2013

The only thing you're going to get with this is headaches for the developers. Judging by everything we know so far, Orbis is definitely more powerful. Deal with it. :/

Thraktor · Feb 6, 2013

Chev said:
In layman's terms you could find such functions hardwired on the DS already. It does mean, though, they're making some things easier for developers, which always is a nice thing.

DMA has been in consoles in one form or another since the NES, iirc. Not that it's not useful, but it's not a revolutionary feature by any means.

iamshadowlark · Feb 6, 2013

pharmboy044 said:
So does this tell you why people were adding bandwidths to come up with 170GB/s?

Not at all. It's still pretty disingenuous. You don't add bandwidths..

Thraktor · Feb 6, 2013

iamshadowlark said:
Not at all. It's still pretty disingenuous. You don't add bandwidths..

You can certainly add bandwidths, but only under the assumption that you can saturate both simultaneously (which may or may not be the case on Durango), and that the developer has divided data between the two pools efficiently.

KidBeta · Feb 6, 2013

Some quick maths assuming 100% peak rate and also assuming 0 usage by other parts of the system. So pretty much ideal numbers which will never be reached.

For a game running at 30FPS they can move 873.81MB/Frame.
For a game running at 60FPS they can move 436.90MB/Frame.
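The per-frame figures can be reproduced with the sketch below. The 873.81 and 436.90 values match a conversion of 1 GB = 1024 MB; with decimal units (1 GB = 1000 MB) the same peak rate gives roughly 853.33 and 426.67 MB per frame. Which convention the 25.6 GB/s figure uses is not stated in the article, so both are shown.

```python
PEAK_GB_PER_S = 25.6   # shared move engine peak from the article


def mb_per_frame(fps, mb_per_gb):
    """Data movable per frame at the peak rate, for a given GB -> MB convention."""
    return PEAK_GB_PER_S * mb_per_gb / fps


for fps in (30, 60):
    binary = mb_per_frame(fps, 1024)    # convention behind the figures above
    decimal = mb_per_frame(fps, 1000)   # decimal-unit alternative
    print(f"{fps} FPS: {binary:.2f} MB (1024-based), {decimal:.2f} MB (1000-based)")
```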

Chev · Feb 6, 2013

Thraktor said:
DMA has been in consoles in one form or another since the NES, iirc. Not that it's not useful, but it's not a revolutionary feature by any means.

it was more the jpeg/lz decompression stuff I was talking about. DMA is, indeed, completely common .

mrklaw · Feb 6, 2013

Chev said:
it was more the jpeg/lz decompression stuff I was talking about. DMA is, indeed, completely common .

can't current GPUs just use compressed textures directly from ram? so the ability to decode from a compressed storage to a texture isn't new, its just something they'd need to support anyway?

Ashes · Feb 6, 2013

spwolf said:
nobody is using jpeg 2000... so it is LZ77 and not deflate... very old school but i guess thats expected, most game engines use either that or deflate still due to speed.

I doubt that's true for all developers. But I guess this takes away the choice for X3 developers.

sleeping_dragon · Feb 6, 2013

Whats the DBZ power of these?

szaromir · Feb 6, 2013

mrklaw said:
can't current GPUs just use compressed textures directly from ram? so the ability to decode from a compressed storage to a texture isn't new, its just something they'd need to support anyway?

Yeah, DMEs just offload processing from GPU.

iamshadowlark · Feb 6, 2013

Thraktor said:
You can certainly add bandwidths, but only under the assumption that you can saturate both simultaneously (which may or may not be the case on Orbis), and that the developer has divided data between the two pools efficiently.

But it doesn't look like that's all that possible on Durango either.

cyberheater · Feb 6, 2013

KidBeta said:
Some quick maths assuming 100% peak rate and also assuming 0 usage by other parts of the system. So pretty much ideal numbers which will never be reached.

For a game running at 30FPS they can move 873.81MB/Frame.
For a game running at 60FPS they can move 436.90MB/Frame.

That's a lot of data per frame. More than enough.

VGLeaks: Durango's Move Engines