It's also a fallacy to see the divisions between the regions of PS3 memory as rigid barriers. The problem was what happened to load/store cycle times when data was accessed in the wrong way by the wrong piece of hardware - essentially it demands very specific data paths, as going the wrong way exacts a hideous performance penalty and/or stalling. ...
True. But it's the programmer's job to know the system he is working on.
What irritates me quite a bit is how you talk about uniform vs. non-uniform memory systems in some of your recent posts, somehow stating that the former is better than the latter. So I will take this on for the rest of the post.
The point you've given above doesn't justify the claim that a uniform memory system is any better. If anything, it's better from a programmer's point of view, since it eases the programming model, but it's not better in terms of overall efficiency, i.e. in scalability, latency, and bandwidth on multicore systems.
Having each processor on a separate bus to its own local memory solves the bus contention problem of a uniform memory system. Further, you'd better bring the memory very close to the processor to reduce latency (see the SPEs' local stores), and preferably not burden it with control logic such as cache logic. And since the memory lives on different busses in a non-uniform memory system, one has to use explicit communication (DMA) to access data from another processor's memory. This usually makes programming such a system more complex at first. But if you look at it, you will see that the computational efficiency of the large problems one wants to solve (esp. in computer graphics, physical simulations, etc.) depends heavily on data layout and memory transfer. So you'd better be in full control of how your data is laid out and how it is transferred throughout the system. One has to program for the data to gain computational efficiency, which seems to be a new concept to many people, esp. those coming from the PC era only.

The latency induced by accessing non-local memory in a non-uniform memory system can often be hidden by a technique called multi-buffering, i.e. you DMA in the next block of data while the processor operates on the current one. However, the same issue exists on uniform memory systems. Why, for example, do you think Intel introduced the prefetch instructions? Hence, explicit memory transfer (DMA) lets you adapt the dataflow of your problem much better to a multicore system, which in general yields higher computational efficiency - if done right, programming-wise.
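The multi-buffering idea can be sketched in plain Python. This is only an illustration of the buffer rotation pattern, not of real asynchrony: the hypothetical fetch/compute functions stand in for what would be asynchronous DMA gets and SPE work on Cell.

```python
# Double buffering: fetch block i+1 while block i is being processed.
# On an SPE the fetch would be an asynchronous DMA get into a second
# local-store buffer, overlapped with computation on the first.

def process_stream(blocks, fetch, compute):
    """fetch(i) returns block i's data; compute(data) returns a result."""
    results = []
    buf = fetch(0)                                           # prime buffer A
    for i in range(len(blocks)):
        nxt = fetch(i + 1) if i + 1 < len(blocks) else None  # "DMA" next block
        results.append(compute(buf))                         # work on current
        buf = nxt                                            # rotate buffers
    return results

# Toy usage: blocks are just numbers, compute squares them.
data = [1, 2, 3, 4]
out = process_stream(data, fetch=lambda i: data[i], compute=lambda x: x * x)
print(out)  # [1, 4, 9, 16]
```

Serial Python gains nothing from this, of course; the point is the shape of the loop, where the fetch of the next block is issued before the current one is consumed.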
And it pays off. The design of the Cell processor led to the first PetaFlop computer in the world in 2008:
http://www.top500.org/list/2008/06/100.
Anyway, let's contrast this with Intel's Larrabee. Guess why it failed? It failed because its cache-coherent shared memory model wasn't able to deliver data fast enough to the computational units. Let me give you an example.
The PowerXCell 8i's peak performance, counting only the SPEs (neglecting the PPU), computes as follows:

8 SPEs @ 3.2 GHz
= 8 * (8 flops/cycle * 3.2 GHz)
= 8 * 25.6 GFLOPS (SP)
= 204.8 GFLOPS (SP)

(SP := single precision)
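As a quick sanity check, the same arithmetic in Python (the 8 flops/cycle come from the SPE's 4-wide SIMD fused multiply-add: 4 muls + 4 adds per cycle):

```python
# Peak single-precision throughput of the PowerXCell 8i's SPE array.
spes = 8
flops_per_cycle = 8        # 4-wide SIMD FMA: 4 multiplies + 4 adds
clock_ghz = 3.2

per_spe = flops_per_cycle * clock_ghz    # 25.6 GFLOPS per SPE
peak = spes * per_spe                    # 204.8 GFLOPS for the array
print(per_spe, peak)  # 25.6 204.8
```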
Now to the interesting part. The PowerXCell 8i performs at 202 GFLOPS on a 4k x 4k SGEMM kernel utilizing 8 SPEs, i.e. a 4096x4096 matrix multiplication in single precision. This is a well-known test for judging the computational efficiency of a multicore system, since the SGEMM kernel finds many applications in a lot of mathematical and physical computations.
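For reference, the flop count behind such SGEMM figures: a straightforward n x n matrix multiply takes 2*n^3 flops (one multiply and one add per inner-product step), and sustained GFLOPS is that count divided by runtime. A quick back-of-the-envelope in Python:

```python
# Flop count of an n x n SGEMM: each of the n*n output elements
# needs n multiplies and n adds, i.e. 2*n**3 flops in total.
n = 4096
flops = 2 * n ** 3           # 137,438,953,472 flops (~137.4 Gflop)

# The quoted 202 GFLOPS therefore corresponds to a runtime of roughly:
runtime_s = flops / 202e9    # ~0.68 seconds for the whole multiply
print(flops, round(runtime_s, 2))  # 137438953472 0.68
```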
Hence, the PowerXCell 8i, as well as the Cell/B.E. processor inside the PS3, performs the SGEMM kernel at ~99% of its peak performance! No other processor in existence can match this number. To put the PowerXCell 8i in perspective to Larrabee with respect to the number of cores, we have to take two PowerXCell 8i to get 16 SPEs. Two PowerXCell 8i processors perform the SGEMM kernel at 406.04 GFLOPS, which amounts to ~99% of the theoretical peak performance of 409.60 GFLOPS; see Daniel Hackenberg, "Fast Matrix Multiplication on Cell (SMP) Systems",
http://tu-dresden.de/die_tu_dresden...alyse_von_hochleistungsrechnern/cell//matmul/
Larrabee performs the 4k x 4k SGEMM kernel with 16 cores at 2 GHz at 825 GFLOPS, as shown by Intel, which is only about twice as fast as two PowerXCell 8i processors (16 SPEs), and one has to consider that Larrabee's vector length is 16 while the Cell processor's is only 4.
What's the theoretical peak performance of the Larrabee configuration Intel ran the test on? Here it is:

16 cores @ 2.0 GHz
= 16 * (32 flops/cycle * 2.0 GHz)
= 16 * 64 GFLOPS (SP)
= 1024 GFLOPS (SP)
Now we can compute the efficiency of the SGEMM kernel for Larrabee:

(825 GFLOPS * 100) / 1024 GFLOPS = ~81%

Hence, we have:

2 PowerXCell 8i @ SGEMM (4k x 4k) = 406.04 GFLOPS; efficiency = ~99%
Larrabee @ SGEMM (4k x 4k) = 825 GFLOPS; efficiency = ~81%
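The two efficiency figures, computed the same way side by side:

```python
# Sustained SGEMM throughput vs. theoretical peak for both chips.
def efficiency(sustained_gflops, peak_gflops):
    return 100.0 * sustained_gflops / peak_gflops

cell_eff = efficiency(406.04, 409.60)  # two PowerXCell 8i (16 SPEs)
lrb_eff = efficiency(825.0, 1024.0)    # Larrabee, 16 cores @ 2 GHz
print(round(cell_eff, 1), round(lrb_eff, 1))  # 99.1 80.6
```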
This shows that Larrabee's computational units get starved for data - a weak spot of its uniform memory architecture. It seems that Larrabee's memory model, an implicit cache-coherent shared memory model, can't deliver the data fast enough. The explicit non-uniform memory model of the Cell processor is what makes it so efficient.
Last but not least, I encourage you to read:
S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick,
"Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms",
International Parallel & Distributed Processing Symposium (IPDPS), 2008.
[PDF]:
http://bebop.cs.berkeley.edu/pubs/williams2008-multicore-lbmhd.pdf
From the section Summary and Conclusion:
"... Results show that the Cell processor offered (by far) the highest raw
performance and power efficiency for LBMHD, despite having peak
double-precision performance, memory bandwidth, and sustained system power
that is comparable to other platforms in our study. The key architectural
feature of Cell is explicit software control of data movement between the
local store (cache) and main memory. However, this impressive computational
efficiency comes with a high price — a difficult programming environment
that is a major departure from conventional programming. Nonetheless, these
performance disparities point to the deficiencies of existing
automatically-managed coherent cache hierarchies, even for architectures
with sophisticated hardware and software prefetch capabilities. The
programming effort required to compensate for these deficiencies demolishes
their initial productivity advantage. ...".
I'm not against uniform memory. If performance is not of utmost importance, one can spend system resources to simplify the architecture for various reasons. The (casual) market is somehow becoming saturated with performance. And since there is enough performance for the casual (gaming) market, architectures (what you see as a programmer) can be simplified. However, don't expect major breakthroughs or leaps. Perhaps that's also the reason John Carmack said he was "not all that excited" by next-gen hardware.