
Japanese Kutaragi Interview, on PS3, Nvidia, eDram etc.

gofreak

GAF's Bob Woodward
Sorry if old. This is actually part 3 of the Watch Impress Kutaragi interview.

http://pc.watch.impress.co.jp/docs/2005/0613/kaigai189.htm

He talks about why a Cell-based GPU wasn't used, the close bidirectional Cell<->RSX relationship, eDram, unified shaders vs dedicated shaders, why they went with NVidia, backwards compatibility and more. Unfortunately I can't see any proper translation yet, so we'll have to make do with the mess that is babelfish's for now.

Things I took from it:

- PS3 backwards compatibility is through software on Cell (perhaps Cell alone). Cell has been made bi-endian to emulate the PSX and PS2 CPUs (a rough byte-swap sketch of what that avoids is just below this list).

- A 2-Cell PS3 was considered - one acting as a GPU, one as a CPU. But Cell's nature as a CPU eats away at its efficiency for graphics relative to a dedicated GPU. So they've married Cell with a specialised GPU.

- Cell, however, can be used for graphics tasks. Cell and RSX have direct access to the results of each other's operations. Kutaragi specifically mentions displacement mapping as a sample workload the SPEs could take on. I think he mentions something about harmonising floating point rounding errors between the CPU and GPU to further ease data exchange between the two. Seems like they focussed quite a lot on harmonising communication between the GPU and CPU, and binding them quite tightly together.

- Originally there was no VRAM in PS3. The graphics chip accessed only the XDR ram.

- There is no eDram in PS3, because they couldn't put enough on the die to support a 1920x1080 frame, let alone 2 (a single 1920x1080 32-bit colour buffer is roughly 8MB before you even add a Z-buffer). Kutaragi thinks the transistor cost of eDram is too high when you consider what the same amount of transistors buys you elsewhere in terms of more shading power.

- He touches on unified shading and mentions something about problems with stalls... I don't know, it's hard to make out exactly what he's saying there.
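
A quick illustration of the bi-endian point (my own sketch, not from the interview): the MIPS CPUs in the PSX and PS2 run little-endian, while a PowerPC-family core like Cell's PPE is big-endian by default, so a purely software emulator on a fixed big-endian host would have to byte-swap on basically every guest memory access - something like this, in plain C with a hypothetical helper name:

```c
#include <stdint.h>

/* On a big-endian host, every 32-bit load from little-endian guest
   memory needs a byte swap like this (and stores need the reverse).
   A bi-endian CPU can flip its byte order in hardware and skip it. */
static uint32_t load_guest_le32(const uint8_t *guest_mem, uint32_t addr)
{
    const uint8_t *p = guest_mem + addr;
    return (uint32_t)p[0]              /* least-significant byte first */
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

A bi-endian Cell can just switch byte order in hardware and avoid paying that on every access.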



Why a Cell-based graphics chip wasn't adopted

With the PS3 architecture, what surprised people last year was that the graphics were also expected to be handled by a Cell-based architecture. Why didn't you make a Cell-based GPU?

Kutaragi: Cell's seven SPEs (Synergistic Processor Elements) can be used for graphics. In fact, several of the E3 demos, made before the graphics processor was ready, did everything in graphics right down to rendering on Cell alone. But that kind of use is a waste; there are other things you want Cell doing more of.

There was also a plan to put in two Cells and have one of them act as the graphics chip, but we dropped it because, as computers, Cell and a shader put their weight on different functions. We wanted an architecture where the shaders can specialise thoroughly in graphics. That said, something like displacement mapping can also be done on the SPEs.

Real-time 3D graphics so far have been 3D in appearance only; graphics that really exist in a 3D space are something different. Even so, at current resolutions that was good enough. Even now, the majority of the games coming out on Xbox 360 are that kind of 3D.

But I want to make 3D where deformation and the like is properly reflected in the 3D space. For that, the thinking is to share data between the CPU and GPU as much as possible, and that is why we took the current architecture. Ideally we would make the floating point units of the GPU and Cell completely identical, from precision through to rounding and error behaviour. This time they have come very close, almost the same. So the two can use each other's results bidirectionally.

- eDRAM was dropped because of full HDTV

eDRAM (DRAM embedded on-chip as graphics memory) had been expected, but hearing about two full-HDTV frames, the reason for not going with eDRAM makes sense.

Kutaragi: Originally there was no dedicated graphics memory on the GPU. Through Redwood (the high-speed interface that connects Cell and RSX) it could also reach YDRAM (the code name for XDR DRAM), because memory was unified on YDRAM.

But if you go that way, there is a problem: when the fixed-function processing and the shader computation of graphics have to reach memory that is far away, bandwidth and cycle time get wasted. There is no need to burn memory bandwidth on the Cell side for that fixed-function work. And because the shaders calculate such a tremendous quantity, they need memory close to them as well. Especially when you want to handle two or more frames of full HDTV at 2k x 1k (1,920 x 1,080) progressive, a large amount of VRAM becomes necessary.

Once that happens, eDRAM is unreasonable. Using eDRAM was fine in PS2's day, but this time it wouldn't be enough for even two frames. Suppose we had put an amount of eDRAM that could support HDTV, 200 or 300 square millimetres of it, into the die. The logic you can fit on the chip immediately drops because of the area the eDRAM takes, and the number of shaders goes down. Compared to that, using the die fully for logic and packing in a large quantity of shaders is the better approach.

- A vision of the ideal processor shared with NVIDIA

Why team up with NVIDIA, of all the GPU vendors, in the first place?

Kutaragi: Until now we did the graphics for computer entertainment ourselves, together with Toshiba, all the way to the finish, including the process technology. This time, we teamed up with NVIDIA in order to pursue the computer itself.

Among those pursuing PC graphics to the end, NVIDIA is probably the one that treats the programmable shader as a processor, the way Intel treats processors. NVIDIA pursues functionality and performance as a processor, and its developers, including David Kirk (David B. Kirk, NVIDIA's Chief Scientist), are alumni of various computer companies such as SGI. They have a character of not worrying about things like chip size, giving themselves over to pursuing what they want to do. Occasionally they overdo it, but the culture is similar to mine.

NVIDIA's approach and my approach agree on the point that, in the end, what we are pursuing is a fully programmable processor. I often get the chance to talk with Jen-Hsun (Jen-Hsun Huang, president and CEO of NVIDIA) and David, and at those times the talk turns to building the ideal processor. The ideal, naturally, is a processor that exceeds every processor of today, the current PC included.

They are heading steadily in that direction, and in that sense they share our vision. We share roadmaps as well. On top of that, they have been influenced by our architecture too. Knowing each other's temperament, and thinking that we want to do the same thing, we joined up with NVIDIA.

Another element is that displays are moving to fixed-pixel devices (LCDs and the like). Once displays become fixed-pixel, TV and PC all fuse together. So we want to support everything perfectly.

That is also why we offer backwards compatibility with PlayStation: we want to support everything, from dirty legacy graphics up to the latest shaders. For resolution, we want to go decisively above WSXGA. For all of that, rather than building it from scratch ourselves, it is quicker to take it from NVIDIA, who already have it all.

Microsoft's Xbox 360 GPU from ATI took a unified-shader type architecture, promoted as more advanced in programmability.

Kutaragi: ATI's architecture, in which the vertex shaders and pixel shaders are equal (the same hardware), looks good at first glance because the shaders are shared, but I think it is difficult. For example, where do you put the results of the vertex processing, and how do you feed them back through the shaders again for pixel processing? If anything gets plugged up anywhere, the whole thing stalls. The reality is different from what is drawn in the diagrams. If you think about realistic performance, I think NVIDIA's approach is superior.

PLAYSTATION 3
- Compatibility is maintained with a combination of hardware and software


Is compatibility with the past PlayStations realised in hardware?

Kutaragi: We take it with a combination of hardware and software. We are trying to see how far we can get with software alone, but what matters is driving towards perfectly complete compatibility.

People who develop software do unexpected things you could never imagine. For example, code that isn't logical as a program but happened to run anyway. Or code that runs, but is running for a completely different reason than the one intended. Even among code that has passed our own tests, there are times when you want to ask, "What on earth is this code?!"

We still have to take compatibility for code like that. It is a bit painful to take compatibility with software alone, because there is no logic to that kind of code; there are times when hardware becomes necessary. Even so, this time (PS3) there is enough power that, to a certain extent, places that used to need hardware can be handled with software.


When the CPU-side code is emulated in software, what happens with the CPU's endianness?


Kutaragi: That is why Cell is bi-endian; it can go either way.

Xbox 360 takes compatibility almost entirely in software. Since they don't make the chips in-house, they had no other choice, but how do you see it?

Kutaragi: For Xbox, when the new generation comes this November, the current Xbox becomes the old generation. When that happens, it means killing Xbox with their own hands. The only way to rescue it is to take 100% compatibility from day one. Whether Microsoft can commit to that, technically, is the painful part.


<Watch Impress Comments below>

- What SCEI and NVIDIA have in common

From Kutaragi's words one can sense that the relationship between SCEI and NVIDIA goes beyond a simple device-development deal to an alignment of company direction and culture. Both companies favour original ideas, are risk-takers, and pursue functionality and performance to the end even at a cost. Not always, but the tendency is strong. Both companies also currently agree on the idea of pursuing the processor itself.

Among GPU vendors, NVIDIA's drive towards programmability is especially strong. Strictly speaking, ATI Technologies and 3Dlabs also lean strongly towards programmability, but NVIDIA has been the most aggressive about raising general-purpose capability. Because of that, NVIDIA's GPU die sizes (the area of the silicon itself) keep growing, and it doesn't shrink from the soaring production costs either. It seems SCEI judged that this direction of NVIDIA's made it a suitable partner.

At present, GPUs have been racing towards becoming programmable processors specialised in stream processing (churning through large volumes of data with small program fragments). By raising the general-purpose capability of the programmable shaders at their computational core, they are trying to handle general-purpose processing beyond graphics as well. Cell's basic idea, on the other hand, is a general-purpose processor that carries sub-processors optimised for stream processing: a general-purpose processor evolved into a structure suited to the stream-type processing that will become important in the future. Put that way, SCEI and NVIDIA are approaching the same goal from different directions. Seen like that, it is not strange that SCEI and NVIDIA found their visions in agreement, and one can understand why both point to the "ideal processor" as their meeting point.

From Kutaragi's explanation one can see that several options were examined for PLAYSTATION 3's graphics architecture. First was the plan to have a single Cell processor also handle the graphics. The SPE, Cell's data-processing core, has SIMD (Single Instruction, Multiple Data) execution units and can basically do the same things as a programmable shader, which likewise has a SIMD structure. But having Cell do graphics as a proper job is not realistic, because it shaves away Cell's efficiency as a CPU.

Next was the plan to carry two Cells and dedicate one of them to graphics. This plan presumably included extending the Cell architecture for graphics, with SPEs tailored to graphics processing, in which case execution units for graphics-specific work would likely have been added as well. In any case, the Cell-based graphics chip plan is said to have been dropped at a fairly early stage.

Incidentally, even with the final PS3 architecture, Cell can be used for graphics processing. NVIDIA's Kirk has made clear that, with the combination of Cell and RSX, pre-processing and post-processing of 3D graphics can be done on Cell's SPEs. For example, things like displacement mapping, which deforms vertex data, can also be done on the SPE side.

The reason SCEI did not put eDRAM in RSX is clear from the supported display resolutions, as written here before. One can also see the thinking that the die area cannot be spent on eDRAM if high shader performance is to be achieved. This is a fundamentally different conception from PS2's Graphics Synthesizer, which took a specialised graphics architecture built around the wide bandwidth of eDRAM. Looking at the information disclosed about RSX so far, the architecture is strongly NVIDIA-flavoured.

With PlayStation 2, SCEI solved the compatibility problem in hardware, by carrying the old PS chipset as a sub-processor. That is because nearly 100% compatibility can hardly be guaranteed unless it is hardware-based; doing complete hardware emulation in software requires enormous CPU power. This is especially critical on a machine like PS2, which exposed the hardware and let developers access its resources freely.

What is clear at present is that PS3, too, basically aims to realise "nearly perfect" compatibility. Because of that, the hardware-based approach to compatibility continues to some degree even with PS3. This time, however, the high processing power of Cell is also used to take compatibility on a software (emulator) basis. The fact that Cell was specifically made bi-endian means that CPU-side compatibility is handled on Cell; SCEI is said to have requested bi-endianness for compatibility reasons at an early stage of the joint development with IBM. Incidentally, compatibility this time covers two generations, PS and PS2, and both PS and PS2 carry CPUs of the MIPS architecture.

As always, it's probably best not to read too much into anything until we have a proper translation... the above is just a rough idea of what was being said. A better translation should hopefully be on the way soon.
 

Kleegamefan

K. LEE GAIDEN
Sounds like RSX is only somewhat related to G70, which is not that surprising as RSX still hasn't taped out yet.....

So glad they went with nVidia instead of a Cell-based GPU...
 

Zaptruder

Banned
Someone translate this please!

Although you can kinda get the gist of the article reading babelfish, it's no better than reading the impressions on said babelfished article!
 

Fafalada

Fafracer forever
Well what I got out of it is that they use emulation to some extent (CPUs) but not for everything - which if true would lead me to suspect there's a GS in there.
Either way it's a definite confirmation there's no EE&GS chip in there (which I've always said was not gonna happen).

Getting a second opinion on translation right now though.
 

gofreak

GAF's Bob Woodward
Fafalada said:
Well what I got out of it is that they use emulation to some extent (CPUs) but not for everything - which if true would lead me to suspect there's a GS in there.
Either way it's a definite confirmation there's no EE&GS chip in there (which I've always said was not gonna happen).

Getting a second opinion on translation right now though.

If you know someone who has a better translation, please feel free to share! :)
 

Pimpwerx

Member
A lot has been made of the Xenos' abilities as a GPGPU. Looks like RSX was designed along the same lines too? That's based on this machine translation, anyway. I hope someone can get us a proper translation today. This has been the most interesting part of the interview IMO. PEACE.
 

Kleegamefan

K. LEE GAIDEN
Hey Faf...they *HAVE* to have a GS in there right?

Since the PS3 doesn't have enough memory bandwidth to emulate the eDRAM on GS, they must have a mini-GS in hardware somewhere to get PS2 compatibility, no?

GS eDRAM is 48GB/sec and RSX @ 35GB/sec is still 13GB/sec shy of the GS....
 

gofreak

GAF's Bob Woodward
Pimpwerx said:
A lot has been made of the Xenos' abilities as a GPGPU. Looks like RSX was designed along the same lines too?

The general trend in GPUs is toward more general processing (well, balancing the fine line between more general processing and blazing performance). It's not something specific to RSX or Xenos. What is new with X360 and more particularly PS3 is the amount of bandwidth between the GPU and CPU to make use of the GPU for tasks other than graphics more feasible. I'm not sure how often they'll be used in such a manner though..
 

Fafalada

Fafracer forever
Ok, looks like I jumped the gun a bit - it seems he never actually implies what part is software and what part is hardware "emulated", just that they use both (which was true of PS2 also, so that doesn't say a whole lot).
Anyway, I can't really get a written translation (I just asked my boss, who's Japanese, to explain the parts I wanted to know about).

Ken apparently also feels that ATIs approach looks pretty on paper but doesn't translate into practical benefits so well, and NVidia's way is better in practice. (yes I know it's obvious he'd say that, but that paragraph sounded confusing so I just thought some might be interested in what it actually meant).

KLee said:
Hey Faf...they *HAVE* to have a GS in there right?
Well, no. Depending on the size and configuration of caches in RSX, emulation may or may not be trivially possible at full speed regardless of external memory bandwidth.

And RSX may have a total of 35+22GB/sec of external bandwidth available also.
 

gofreak

GAF's Bob Woodward
Fafalada said:
And RSX may have a total 35+22GB/sec of external bandwith available also.

Something I've wondered for a while, but would the first figure, 35GB/s, not ultimately be limited by XDR's bandwidth? Or are we differentiating between external bandwidth (regardless of where the data goes) and external bandwidth to memory..?
 

Pimpwerx

Member
Kleegamefan said:
Hey Faf...they *HAVE* to have a GS in there right?

Since the PS3 doesn't have enough memory bandwidth to emulate the eDRAM on GS, they must have a mini-GS in hardware somewhere to get PS2 compatibility, no?

GS eDRAM is 48GB/sec and RSX @ 35GB/sec is still 13GB/sec shy of the GS....

RSX can read/write to both pools of memory, so the aggregate read bandwidth would be 22GB/s + 35GB/s = 57GB/s. Then again, I don't know if they'll go that route. I hope that they have either an EE or GS in there just for the possibilities. I assume the 4MB of eDRAM and 1.2GT/s fillrate could be useful for something. PEACE.
 

Drek

Member
What is new with X360 and more particularly PS3 is the amount of bandwidth between the GPU and CPU to make use of the GPU for tasks other than graphics more feasible. I'm not sure how often they'll be used in such a manner though..
Hmm, I'm personally expecting more of the reverse, the CPU being used to supplement the GPU, at least from the PS3. With the 7 SPEs and the fast bi-directional bandwidth Sony can effectively cheat to get noticeably better visuals than the X360. They'll have to sacrifice general computing and non-graphics in-game operations like physics and AI, but they'll still be miles ahead of last generation. Smart move by Sony if that is their plan, since the average consumer equates graphics with overall system power, and also is much less capable of noticing differences in physics, AI, and other general computing functions.
 

Fafalada

Fafracer forever
gofreak said:
Something I've wondered for a while, but would the first figure, 35GB/s, not ultimately be limited by XDR's bandwidth? Or are we differentiating between external bandwidth (regardless of where the data goes) and external bandwidth to memory..?
Well, from what IBM tells us, FlexIO connects to the EIB, meaning we can get the entire 35GB/s from/to SPE local memories and L2 cache, never touching the XDR bandwidth at all.
 

Kleegamefan

K. LEE GAIDEN
A lot has been made of the Xenos' abilities as a GPGPU. Looks like RSX was designed along the same lines too? That's based on this machine translation, anyway. I hope someone can get us a proper translation today. This has been the most interesting part of the interview IMO. PEACE.

I would think RSX would offer more flexibility as a GPGPU than Xenos...

Xenos only has part of 22GB/sec access to the tri-core XeCPU and even then, it seems like it only has access to the write buffers of the 1MB L2 cache on XeCPU (correct me if I am wrong, though)

It would seem to me that if you have multiple threads from three cores and the Xenos all fighting for 1MB of L2 cache, that would imply there is going to either be a big fight between different sources (CPU/GPU) for that 1MB or a big task for the developer to schedule things efficiently enough to keep things flowing well....again, this is just my initial impression of things..

With PS3, you have a faster GPU (RSX) with a bigger pipe (35GB/sec) to a CPU that can feed it more data (CELL) *and* RSX has direct access to the 3.5MB of SPE local SRAM + the 256MB of XDR RAM through CELL....seems like a lot more flexibility to me....again, correct me if I am wrong in all of this....
 

gofreak

GAF's Bob Woodward
Fafalada said:
Well, from what IBM tells us, FlexIO connects to the EIB, meaning we can get the entire 35GB/s from/to SPE local memories and L2 cache, never touching the XDR bandwidth at all.

This is a very good thing. I was hoping that was the case, and it made sense, but I was never quite sure.

Hmm...possibilities...scene postprocessing on Cell seems very feasible after all then. In practical terms, RSX can write data directly out to Cell local sram without touching memory at all? What I'm thinking about here is if it's possible for RSX to write the framebuffer directly to Cell local memory without touching memory...(?) I'm thinking that 1.7-2.3MB of local memory would require tiling which might mess up the "don't touch memory" aim..

Drek said:
Hmm, I'm personally expecting more of the reverse, the CPU being used to supplement the GPU, at least from the PS3. With the 7 SPEs and the fast bi-directional bandwidth Sony can effectively cheat to get noticeably better visuals than the X360. They'll have to sacrifice general computing and non-graphics in-game operations like physics and AI, but they'll still be miles ahead of last generation. Smart move by Sony if that is their plan, since the average consumer equates graphics with overall system power, and also is much less capable of noticing differences in physics, AI, and other general computing functions.

Yeah, it can work that way too, the CPU helping the GPU, and I think that might be more feasible. The beauty of that is, it's completely up to the developer and the aims of their game. RSX on its own will likely deliver at least the same level of fidelity as Xenos, if not more, but then you have a tonne of "extra" headroom on the CPU side that can also be tapped if you're graphically inclined to push things even further.
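
To make the "Cell helping RSX" idea a bit more concrete, here's the sort of vertex pre-pass Kutaragi and Kirk mention (displacement mapping), written as a plain C sketch with made-up types and names rather than any real PS3/SPE API - the CPU pushes each vertex along its normal by a height sampled from a map before the GPU ever sees the mesh:

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

typedef struct {
    Vec3  position;
    Vec3  normal;   /* assumed unit length */
    float u, v;     /* texture coords into the height map, 0..1 */
} Vertex;

/* Push each vertex out along its normal by a height sampled from a
   greyscale map.  This is the kind of data-parallel loop that could
   run on the CPU (or an SPE) before the mesh is handed to the GPU. */
void displace_mesh(Vertex *verts, size_t count,
                   const float *height_map, int map_w, int map_h,
                   float scale)
{
    for (size_t i = 0; i < count; ++i) {
        int tx = (int)(verts[i].u * (float)(map_w - 1));
        int ty = (int)(verts[i].v * (float)(map_h - 1));
        float h = height_map[ty * map_w + tx] * scale;

        verts[i].position.x += verts[i].normal.x * h;
        verts[i].position.y += verts[i].normal.y * h;
        verts[i].position.z += verts[i].normal.z * h;
    }
}
```

Nothing exotic, just a lot of independent per-vertex work, which is exactly what you'd want to hand to spare SPE time.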
 

Squeak

Member
Maybe my query/suggestion (B3D) as to whether RSX could have 4MB eDRAM for GS compatibility in PS2 mode, and a texture buffer/tile buffer for PS3 games, will turn out to be true?
 

Kleegamefan

K. LEE GAIDEN
Hmm...possibilities...scene postprocessing on Cell seems very feasible after all then. In practical terms, RSX can write data directly out to Cell local sram without touching memory at all? What I'm thinking about here is if it's possible for RSX to write the final framebuffer directly to Cell local memory without touching memory...(?)

When you say scene postprocessing, what would that entail exactly..

To me, that implies procedural special effects (Explosions, DOF, particles) but wouldn't they be fillrate bound??

And if so, why not just do it on the GPU??

:confused??
 

MetalAlien

Banned
I'm just glad it's not pure software emulation. It's hard enough to get right even when you've got parts of the actual hardware in there.
 

gofreak

GAF's Bob Woodward
Kleegamefan said:
When you say scene postprocessing, what would that entail exactly..

To me, that implies procedural special effects (Explosions, DOF, particles) but wouldn't they be fillrate bound??

Things like HDR, bloom, depth of field etc. Open up photoshop and look at filters ;)

Depth of field is one trick for simulating the optics of a camera. But there are more that can be tapped to give an even more realistic look. According to Edge, The Getaway demo was using automatic white balancing and auto-focussing amongst other optical effects to simulate how London might look from a tourist's camcorder..and that certainly gave it a very realistic and distinctive look, no? And, apparently that whole demo was all being done on Cell, so presumably that included scene postprocessing.

That all said, I'm not sure how much better Cell might be for things like that versus the GPU...perhaps it'd be easier to do some stuff on a CPU versus the confines of a GPU shader or whatever? It may also save some memory bandwidth if such processing is eating internal Cell bandwidth (of which there is a HUGE amount) instead of far more limited (and thus precious) VRAM bandwidth. Even if it really wasn't any better, you could be saving the GPU some work so that it can spend more time doing other things.
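
Just to illustrate how simple some of this post-processing is at its core, here's a toy "bright pass" (the step that usually feeds a bloom blur) over a plain RGBA8 buffer - my own sketch in C, nothing console-specific, and a real implementation would obviously be SIMD'd and tiled:

```c
#include <stdint.h>
#include <stddef.h>

/* Keep only pixels brighter than a threshold; everything else goes
   black.  The result would then be blurred and added back over the
   frame to get the bloom/glow.  Luma weights approximate Rec.601. */
void bright_pass(const uint8_t *src_rgba, uint8_t *dst_rgba,
                 size_t pixel_count, unsigned threshold)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        const uint8_t *p = src_rgba + i * 4;
        uint8_t *q = dst_rgba + i * 4;
        /* integer approximation of 0.299R + 0.587G + 0.114B */
        unsigned luma = (77u * p[0] + 150u * p[1] + 29u * p[2]) >> 8;
        if (luma >= threshold) {
            q[0] = p[0]; q[1] = p[1]; q[2] = p[2];
        } else {
            q[0] = q[1] = q[2] = 0;
        }
        q[3] = p[3];   /* carry alpha through untouched */
    }
}
```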

These are just thoughts though, I'd appreciate feedback on the feasibility of such stuff..
 

Vince

Banned
Fafalada said:
Ken apparently also feels that ATIs approach looks pretty on paper but doesn't translate into practical benefits so well, and NVidia's way is better in practice. (yes I know it's obvious he'd say that, but that paragraph sounded confusing so I just thought some might be interested in what it actually meant)

This is an interesting question as, outside of ATI's PR about how it's more efficient than dedicated resources, it's unknown how efficient a unified shader will be, and we have no clue as there is little publicly known about the ALU organization and how the data flow is arbitrated.
 

Kleegamefan

K. LEE GAIDEN
That all said, I'm not sure how much better Cell might be for things like that versus the GPU...perhaps it'd be easier to do some stuff on a CPU versus the confines of a GPU shader or whatever? It may also save some memory bandwidth if such processing is eating internal Cell bandwidth (of which there is a HUGE amount) vs far more limited (and thus precious) VRAM bandwidth. Even if it really wasn't any better, you could be saving the GPU some work so that it can spend more time doing other things.

Perhaps with RSX+Cell it doesn't have to be an either/or thing.....perhaps they are designed to be used in conjunction??

Like adding lighting values (angle, absorption, diffusion, reflection) to pixel data...I would think you would be able to get some really good GI, radiosity and subscatter effects if that is the case....
 

gofreak

GAF's Bob Woodward
Kleegamefan said:
subscatter effects

IIRC the Doc Oc demo was using Cell for lighting calculations, including light transmission through the skin and scattering beneath the skin. The possibility of using Cell for generating data for lighting is definitely something interesting..
 

Kleegamefan

K. LEE GAIDEN
Vince said:
This is an interesting question as, outside of ATI's PR about how it's more efficient than dedicated resources, it's unknown how efficient a unified shader will be, and we have no clue as there is little publicly known about the ALU organization and how the data flow is arbitrated.



Yeah, I think efficiency will be the rub with Xenos..... it's a big hurdle that would be nice to see them overcome, but I am not holding my breath...

What pisses me off somewhat is the fact ATI/MS was presenting Xenos as a GPU that could arbitrarily adapt to whatever vertex/pixel balance a developer would have via the unified shaders…..they even talk about just “keeping all the shaders busy and you will get 100% efficiency” and that would be great, but it seems that the structure of Xenos' ALUs would make that somewhat unlikely (3 SIMDs with 16 ALUs each, so the most pixel or vertex ALUs you could have is 32)…

They mention 96 shader ops but that would be 48 pixel ops and 48 vertex ops at the same time, which is not going to happen…

Sony get a bad rap for bullshitting specs, but I think the MS/ATI crew could have been a little more forthcoming as well..

Hmm, I'm personally expecting more of the reverse, the CPU being used to supplement the GPU, at least from the PS3. With the 7 SPEs and the fast bi-directional bandwidth Sony can effectively cheat to get noticeably better visuals than the X360. They'll have to sacrifice general computing and non-graphics in-game operations like physics and AI, but they'll still be miles ahead of last generation. Smart move by Sony if that is their plan, since the average consumer equates graphics with overall system power, and also is much less capable of noticing differences in physics, AI, and other general computing functions.

I think Sony 1st and 2nd party will very much cheat this way...

*IF* (big if) the SPEs can do a majority of the kinds of traditional work the PPE/Xenon XeCPU is doing (AI, Physics, game data, sound) and at a decent speed, they don't have to use all SPEs to match the Tri-core XeCPU....

If CELL can get XeCPU-type performance out of, say, the PPE+4 or 5 SPEs then they can use resources from the other 2-3 SPEs to assist RSX.......if true, that would be significant, IMO....
 

Pimpwerx

Member
gofreak said:
IIRC the Doc Oc demo was using Cell for lighting calculations, including light transmission through the skin and scattering beneath the skin. The possibility of using Cell for generating data for lighting is definitely something interesting..
Here's a little fuel on the fire. ;) NVidia has their Cg shader language. SPEs are able to handle C-code. :) And one thing present in most of the PS3 demos was great lighting. I'm really interested to know if they get SSS and HDR performance to a practical, useable level. That RSX demo from the NVidia guy also highlighted SSS and HDR, so... :) PEACE.
 

Pimpwerx

Member
I think DaveB mentioned that you could have all 48 ALUs doing vertex or pixel work. There was a theory that they could only handle so many instructions per clock, but I think DaveB already debunked that. So you can have 16, 32 or 48 ALUs devoted to a single task at a time. Efficiency should be higher than normal, but I don't think 100%. 3x the granularity won't get you to 100% util. PEACE.
 

Kleegamefan

K. LEE GAIDEN
Shogmaster said:
Err... what.........

*clickity click click*

My calculator says 48........


48 with either 100% vertex op or 100% pixel ops, which will never happen...

Here is what I am saying:

3 SIMD ALU engines

16 ALUs per engine for a total of 48 ALUs

Game would use either 2 SIMDs for pixel ops (32 ALUs) and 1 SIMD for vertex ops (16 ALUs) or 2 SIMDs for vertex ops (32 ALUs) and 1 SIMD for pixel ops (16)

The max you would use for vertex or pixel ops at any one time is 32, it seems, not 48 or some arbitrary portion of that, as we were first led to believe...


Pimpwerx said:
I think DaveB mentioned that you could have all 48 ALUs doing vertex or pixel work. There was a theory that they could only handle so many instructions per clock, but I think DaveB already debunked that. So you can have 16, 32 or 48 ALUs devoted to a single task at a time. Efficiency should be higher than normal, but I don't think 100%. 3x the granularity won't get you to 100% util. PEACE.


And just like that, I'm shot down :D
 

gofreak

GAF's Bob Woodward
Shogmaster said:
Err... what.........

*clickity click click*

My calculator says 48........

I think what he means is that the most you could have working on pixels or vertices is 32..16+16 on pixels and 16 on vertices, or 16+16 on vertices and 16 on pixels. Of course theoretically you could have all 48 on either, but I'm not sure how often that'd happen.

I'm also not sure how granular the architecture is in terms of how work is split. It may be split at the 3 SIMD engines, but there's some suggestion that internally there are 2 clusters of 8 ALUs in each engine, and those clusters can work on either pixel or vertex. But it's not arbitrarily granular, no. You can't have, say 15 ALUs working on vertices and the balance working on pixels. You have to assign on a per SIMD engine or per "cluster" level. (Not sure yet which).

Kleegamefan said:
They mention 96 shader ops but that would be 48 pixel ops and 48 vertex ops at the same time, which is not going to happen

It's 48 vector and 48 scalar ops. 1 vector op and 1 scalar op per cycle per ALU. If you're using your vector op on an ALU for vertices (or pixels), you have to use the scalar op for vertices (or pixels) too, if you're going to use it at all. But if you find a use for every execution unit, scalar or vector, then yeah, you'd be getting 96 shader ops. Everyone counts a vector or scalar op as a shader op, so it's fair.
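
Rough arithmetic on that, using the 500MHz clock mentioned later in this thread and counting one vector issue plus one scalar issue per ALU per cycle (so treat it as back-of-envelope, not official numbers):

```c
#include <stdio.h>

int main(void)
{
    const int simd_engines    = 3;
    const int alus_per_engine = 16;
    const int alus            = simd_engines * alus_per_engine;  /* 48 */
    const int ops_per_cycle   = alus * 2;   /* 1 vector + 1 scalar each */
    const double clock_hz     = 500e6;      /* widely reported Xenos clock */

    printf("shader ops per cycle : %d\n", ops_per_cycle);   /* 96 */
    printf("shader ops per second: %.0f billion\n",
           ops_per_cycle * clock_hz / 1e9);                  /* 48 */
    return 0;
}
```

So "96 shader ops" holds up per cycle, but at that clock it works out to 48 billion per second, which is worth keeping in mind whenever per-second figures get thrown around.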
 

gofreak

GAF's Bob Woodward
CrimsonSkies said:
So am I supposed to believe Sony over ATI on ATI's design?

No. Obviously Sony/NVidia have one viewpoint on things, ATi/MS has another. Both give credible opinions (well, FUD aside anyway..specifically I'm thinking here just in terms of unified shading vs dedicated shading). Who's right? Who's wrong? For now, there is no right or wrong, since we're still in the realm of the theoretical and can only talk in terms of design choices (rather than good or bad choices). We'll see who was closer to the truth when hardware is out in the hands of independents who can talk about it.
 

Shogmaster

Pimpwerx said:
I think DaveB mentioned that you could have all 48 ALUs doing vertex or pixel work. There was a theory that they could only handle so many instructions per clock, but I think DaveB already debunked that. So you can have 16, 32 or 48 ALUs devoted to a single task at a time. Efficiency should be higher than normal, but I don't think 100%. 3x the granularity won't get you to 100% util. PEACE.


100% of anything is wishful thinking, and I don't think MS/ATi is literally promising 100% efficiency.

Having said that, since Xenos can, per cycle, switch from vertex ops to pixel ops for the 3 SIMD units (I don't think they can be split up to do vert and pix ops per cycle), depending on the vert/pix ops requirements for the scene (which for the argument say lasts for 1/60th of a second), it would be vastly more efficient than current setups since it can theoretically shift from vertex ops to pixel ops (or vice versa) 8.3 million times in order to achieve the best balance of ops for that scene.

I mean it would be ridiculous to assume that the vert/pix ops requirement for any particular scene would be exactly divided like it is on the graphics card, right? And to expand that for every scene.... You know what I mean.
 

Kleegamefan

K. LEE GAIDEN
Question:

So with Xenos, your vertex/pixel mix can be:

0+48
or
16+32
or
32+16
or
48+0


Is this all correct??


So if you ever encounter a 0+48 or 48+0 balance, how would you be able to produce the vertex/pixel ops you lack?

Via XeCPU??
 

Shogmaster

gofreak said:
I think he means is that the most you could have working on pixel or vertices is 32..16+16 on pixels and 16 on vertices or 16+16 on vertices and 16 on pixels. Of course theoretically you could have all 48 on either, but I'm not sure how often that'd happen.

I'm also not sure how granular the architecture is in terms of how work is split. It may be split at the 3 SIMD engines, but there's some suggestion that internally there are 2 clusters of 8 ALUs in each engine, and those clusters can work on either pixel or vertex. But it's not arbitrarily granular, no. You can't have, say 15 ALUs working on vertices and the balance working on pixels. You have to assign on a per SIMD engine or per "cluster" level. (Not sure yet which).



It's 48 vector and 48 scalar ops. 1 vector op and 1 scalar op per cycle per ALU. If your using your vector op on an ALU for vertices (or pixels), you have to use the scalar op for vertices (or pixels) too, if you're going to use it at all. But if you find a use for every execution unit, scalar or vector, then yeah, you'd be getting 96 shader ops. Everyone counts a vector or scalar op as a shader op, so it's fair.

I thought everyone knew that you can't split up the ALUs for vert/pix ops like that! It was specifically mentioned in one of the many, many articles I read on Xenos.

AFAIK, it's all or nothing operations for the 3 SIMD units with the 48 ALUs. They all have to be doing vert ops, or they all have to be doing pix ops. What's dynamic about it is that per cycle it can switch the ALUs from doing one or the other automatically, depending on the requirements of the scene being rendered.


Kleegamefan said:
Question:

So with Xenos, your vertex/pixel mix can be:

0+48
or
16+32
or
32+16
or
48+0


Is this all correct??


So if you ever encounter a 0+48 or 48+0 balance, how would you be able to produce the vertex/pixel ops you lack?

Via XeCPU??

I remember reading distinctly that it's per cycle shifting between the ops, not splitting up the ALUs.
 

gofreak

GAF's Bob Woodward
Shogmaster said:
I mean it would be ridiculous to assume that the vert/pix ops requirement for any particular scene would be exactly divided like it is on the graphics card, right? And to expand that for every scene.... You know what I mean.

This is true, and this is what unified shaders are aimed at. But the instruction mix being sent to the card doesn't have to be "dumb" I don't think, it could be managed on some levels (though not all, obviously..you can't always know what the camera is looking at, but you can design for what is in view most of the time). The split enshrined in fixed hardware isn't arbitrary either..it's arrived at after pretty close examination of "typical" instruction mixes, so it is biased towards the common case.

The argument surrounding all of this is the tradeoff of efficiency on one level (within each ALU) for better efficiency/utilisation on another level (between ALUs). Specifically between Xenos and RSX, one also has to consider that there looks to be more shading silicon on RSX than on Xenos (as indicated by the higher publicised shader op count, and the difference in silicon budgets once you take out the eDram on Xenos). So the higher utilisation on Xenos not only has to overcome some unknown level of inefficiency within its ALUs compared to RSX's shaders, it also has to be high enough to match or exceed a possibly greater amount of shading power working at a lower level of utilisation but a higher clock rate. And of course, that's likely a gross simplification as is..


Shogmaster said:
I thought everyone knew that you can't split up the ALUs for vert/pix ops like that! It was specifically mentioned in one of the many, many articles I read on Xenos.

I know, but very originally, when I first heard about the idea, I thought of it as a pool of execution units that could adapt arbitrarily to the instructions coming in. Obviously things are a little different with this first implementation, but how and ever..

Shogmaster said:
AFAIK, it's all or nothing operations for the 3 SIMD units with the 48 ALUs. They all have to be doing vert ops, or they all have to be doing pix ops. What's dynamic about it is that per cycle it can switch the ALUs from doing one or the other automatically, depending on the requirements of the scene being rendered.

I know it's been reported on a couple of sites that ALL the ALUs have to be doing either vertex or pixel ops at any one time, but I'm not sure if that's the case. Some of the guys at B3D seem sure it's broken up on the per "SIMD engine" or per "cluster" level within each cycle.
 

Shogmaster

Since when are we using some arbitrary guesses of random B3D members over information passed on from actual ATi engineers to tech site reporters?

It seems clear to me that the ALUs can't be split up. That article (or any of the others) does not mince words in that regard.
 

gofreak

GAF's Bob Woodward
Shogmaster said:
Since when are we using some arbitrary guesses of random B3D members over information passed on from actual ATi engineers to tech site reporters?

It seems clear to me that the ALUs can't be split up. That article (or any of the others) does not mince words in that regard.

A number of articles on Xenos post-E3 have contained factual errors, IIRC. Or at least points of confusion. Remember the 96bn shader ops per sec info from Hardocp? True, on this point it seems hard to see how they could have arrived at this conclusion accidentally, but I'd prefer further clarification. Hopefully the beyond3d article will be a ray of light, and a non-bugged one at that.
 
Of course not even Xenos will ever be 100% efficient. If ATI-MS has achieved 90 to 95% efficiency, that would be a real improvement over the 70% or less of current GPUs.
 

Shogmaster

gofreak said:
A number of articles on Xenos post-E3 have contained factual errors, IIRC. Or at least points of confusion. Remember the 96bn shader ops per sec info from Hardocp? True, on this point it seems hard to see how they could have arrived at this conclusion accidentally, but I'd prefer further clarification. Hopefully the beyond3d article will be a ray of light, and a non-bugged one at that.

I don't know dude. I think on this point, we don't have to wait for Dave B's article. Pretty clear cut. Another thing is, I will have to say that the per cycle switching of shader ops for Xenos is as close to 100% efficiency in that regard as we could conceivably get.

And I also don't know why so many at B3D (and certainly here) are quick to question ATi's contention on the Xenos, yet are nowhere near as critical of nVidia's claims on the RSX. Shit, the only benefit RSX has going for it at this point IMO is that almost nothing is known about it. Yet, many are so quick to accept its claims based on ridiculous pre-rendered BS.

Also, some of you will have to keep in mind that there are 500,000,000 cycles per second in which the Xenos can do its work. That's 8,333,333 cycles within a 60th of a second to balance between vertex and pixel ops. Even if Xenos had to separate the ops between the 48 ALUs (32:16, 16:32, whatever), that's plenty of opportunities to balance the shader ops approaching 100%. Don't think in such 2 dimensional terms, Klee. :p
 

gofreak

GAF's Bob Woodward
Shogmaster said:
I don't know dude. I think on this point, we don't have to wait for Dave B's article. Pretty clear cut. Another thing is, I will have to say that the per cycle switching of shader ops for Xenos is as close to 100% efficiency in that regard as we could conceivably get.

Perhaps, but 100% will never happen. If during a vertex or pixel cycle there isn't enough data ready to fully feed all the ALUs for that cycle, for example, some may go idle. Which is why better utilisation might be possible if the ALUs can be split within a cycle too.

How does the switching work? I assume the programmer has control over when it switches?

Shogmaster said:
And I also don't know why so many at B3D (and certainly here) are quick to question ATi's contention on the Xenos, yet are nowhere near as critical of nVidia's claims on the RSX. Shit, the only benefit RSX has going for it at this point IMO is that almost nothing is known about it. Yet, many are so quick to accept its claims based on ridiculous pre-rendered BS.

It's natural that a more critical eye will be cast on ATi's offering, since they're the ones making bolder claims, architecturally anyway. What claims, specifically, do you think are going under-examined as far as RSX is concerned? As you say yourself, so little of it has been disclosed. But some of the stuff that has been talked about has come under sceptical scrutiny - for example, the usefulness of 128-bit blending on the framebuffer.

BTW, I'm quite hopeful both NVidia and Sony will open up about RSX after NVidia unveils their next-gen (next week?). I think NVidia is the one holding up information release on RSX, not Sony, in order to maintain competitive secrecy for their PC parts. For example, at E3, David Kirk was happy to talk about the relationship between Cell and RSX, but not specifics on RSX itself - and we got no detail at all on the internal makeup of RSX either. I think Sony was ready to go and talk about everything (they've already talked very openly about Cell anyway), but NVidia asked to hold back on the specifics until they were ready to discuss their PC parts..the timing for the latter was a bit more sensitive. ATi doesn't have that issue with Xenos because it's on a completely separate track from their upcoming PC parts, so they've been able to discuss it in more depth immediately. Fingers crossed there'll be a bit more information flowing about in a week or so..
 

kaching

"GAF's biggest wanker"
Shogmaster said:
Don't think in such 2 dimensional terms, Klee. :p
Now, just make sure developers get the same message. Not going to amount to much if Klee is the only one getting the message ;)
 

Shogmaster

gofreak said:
Perhaps, but 100% will never happen.

One of the first things I mentioned in this thread is that the notion of 100% anything is ridiculous.

If during a vertex or pixel cycle there isn't enough data ready to fully feed all the ALUs for that cycle, for example, some may go idle. Which is why better utilisation might be possible if the ALUs can be split within a cycle too.

It seems like the X360 data flow is plenty efficient to keep the ALUs fed. And that 8.3M cycles for 60fps is enough to waste some cycles waiting for data.

How does the switching work? I assume the programmer has control over when it switches?

Apparently no. It's all balanced automatically (see articles).

It's natural that a more critical eye will be cast on ATi's offering, since they're the ones making bolder claims. What claims, specifically, do you think are going under-examined as far as RSX is concerned? As you say yourself, so little of it has been disclosed. But some of the stuff that has been talked about has come under sceptical scrutiny - for example, the usefulness of 128-bit blending on the framebuffer.

I can understand certain claims are a bit much to digest at once, but the overall sense I get is that ATI has everything to prove, and nVidia has nothing to prove. Tell me if I'm off on that.

Anyways, seeing how Xenos has 330M trannies (250M for the main shader unit + 20M for the AA, stencil, Z-sort ROPs on the daughter die and 80M for 10MB of eDRAM minus the ROPs on the daughter die), and RSX has 300M (260M for the next-gen part and 40M for a GS perhaps?), it's hard to swallow the "RSX has 50% more rendering power than Xenos" BS Sony and nVidia are throwing around, especially looking at the efficiency built into Xenos.
 

gofreak

GAF's Bob Woodward
Shogmaster said:
It seems like the X360 data flow is plenty efficient to keep the ALUs fed. And that 8.3M cycles for 60fps is enough to waste some cycles waiting for data.

Well that's the point, wasted cycles will eat into your efficiency. But I appreciate now you're not saying 100% utilisation.


Shogmaster said:
Apparently no. It's all balanced automatically (see articles).

Hmmm...if the hardware gets it wrong, it could be messy. It'd be nice if the dev could control it or at least give the hardware hints.

How does the hardware decide what to do for this cycle? What if 51% of the instructions it can see are vertex and 49% are pixel...which does it choose? If the proportions are awkward, there may not be enough of either pixel or vertex ops to keep all the ALUs busy for that cycle, you could have a lot of waste in such situations. That's where splitting ALUs between vertex and pixel within a cycle makes sense. I guess the GPU can resolve dependencies and reorder instructions to try and maximise the number of instructions going through each cycle? I guess it'd also help if the programmer fed it intelligently, but it could be more useful again if the programmer had some control over it all.
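
Purely to illustrate the kind of decision I'm wondering about (this is a toy model of my own, not how Xenos actually arbitrates), a per-cycle chooser might look like:

```c
/* Toy model only: each cycle, hand all the ALUs to whichever kind of
   work has more ready, independent instructions.  If the winner can't
   fill the machine, the leftover ALUs sit idle that cycle - which is
   exactly the waste being discussed above. */
typedef struct {
    int ready_vertex_ops;   /* independent vertex instructions ready */
    int ready_pixel_ops;    /* independent pixel instructions ready  */
} WorkQueues;

typedef enum { ISSUE_VERTEX, ISSUE_PIXEL } IssueKind;

IssueKind choose_cycle(const WorkQueues *q, int total_alus, int *idle_alus)
{
    IssueKind kind = (q->ready_vertex_ops >= q->ready_pixel_ops)
                         ? ISSUE_VERTEX : ISSUE_PIXEL;
    int issued = (kind == ISSUE_VERTEX) ? q->ready_vertex_ops
                                        : q->ready_pixel_ops;
    if (issued > total_alus)
        issued = total_alus;
    *idle_alus = total_alus - issued;   /* wasted ALU slots this cycle */
    return kind;
}
```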

Shogmaster said:
I can understand certain claims are a bit much to digest at once, but the overall sense I get is that ATI has everything to prove, and nVidia has nothing to prove. Tell me if I'm off on that.

Well I don't think it's that their claims are hard to digest ("100% efficiency" aside ;)), just that as always there's the story they'll tell you, and then the rest of the story. As with all companies. ATi's story, so to speak, is newer, so there's more to figure out and question.
 