Iwilliams3 approached me with a question whose answer I think is worth sharing in this thread:
Where do those 192 threads (or ALUs) fit in in what we currently know about Latte?
An ATI SIMD unit from the VLIW era can handle as many threads per clock as it has VLIW units. For instance, an 80-SP unit from the R700 series consists of 16 VLIW5 units, and so can do 16 threads/clock (16 x VLIW5 = 80 SPs). Let's call this number of threads per clock Tc. Now, a wavefront is a group (or, in ATI terms, a 'vector') of 4 x Tc threads, which executes across 4 clocks on a SIMD - without going into too much detail, this grouping is due to the way the SIMD register file is accessed. Let's call this fundamental constant of 4 clocks Q.
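The arithmetic above can be sketched in a few lines (the function name and framing are mine, not ATI's):

```python
# A SIMD's wavefront size follows from its VLIW unit count and the
# 4-clock register-file access pattern: (SPs / VLIW width) * Q.

def wavefront_size(num_sps: int, vliw_width: int, q: int = 4) -> int:
    """Threads per wavefront on one SIMD unit."""
    tc = num_sps // vliw_width   # VLIW units = Tc = threads issued per clock
    return tc * q

print(wavefront_size(80, 5))   # R700-style 80-SP SIMD: Tc = 16, wavefront = 64
print(wavefront_size(40, 5))   # 8 VLIW5 units: Tc = 8, wavefront = 32
```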
A GPU comprising 32 VLIW units can handle 128 threads over the timespan Q; or, put another way, a 16-VLIW-unit SIMD has a wavefront of 64 threads over 4 clocks, just like an 8-VLIW-unit SIMD (like the one I'm typing this on) has a wavefront of 32 threads over 4 clocks, etc, etc. Now, there's a catch here - those threads are not just a random bunch of threads: they are identical across the entire SIMD, and the entire SIMD unit executes the same VLIW op from those identical threads over the span of a wavefront, i.e. Q. That quantity Tc x Q is fundamental for the SIMD of a GPU, for the entire design of that SIMD is made with a single goal - that Tc x Q threads could be in-flight (i.e. running or ready to be run with zero latency) at any given moment. Basically, a VLIW SIMD unit keeps the state (read: registers) of Tc x Q threads, while it runs Tc of those at any given clock. So far so good?
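That lockstep execution can be sketched as pseudocode (this is my illustration, not real hardware behavior): one SIMD with Tc = 16 runs the same VLIW op for all Tc x Q = 64 in-flight threads, Tc threads per clock, over Q = 4 clocks.

```python
TC, Q = 16, 4                        # 16 VLIW units, 4-clock wavefront
threads = list(range(TC * Q))        # the 64 threads whose registers the SIMD holds

def run_wavefront(op, threads):
    """Issue the same op for every in-flight thread, Tc per clock."""
    for clock in range(Q):
        batch = threads[clock * TC:(clock + 1) * TC]
        for t in batch:
            op(t)                    # same instruction, different thread

executed = []
run_wavefront(executed.append, threads)
print(len(executed))                 # 64 - every in-flight thread ran the op
```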
Now, there were guesses earlier in the thread that 192 could be the number of threads corresponding to the above Tc x Q quantity of Latte, plus some 'spares' - threads all loaded to a SIMD (or to both SIMDs) and ready to run or running at any time. Let's see how viable that is, but for that we need to introduce another fundamental paradigm of the VLIW era - the clauses.

A clause in our context is a sequence of ops which are of the same 'nature' - say, ALU, or fetch, or export. The AMD shader compiler groups the ops from a shader/kernel into such clauses while honoring the logical flow of the shader, and the thread scheduler (AKA 'sequencer', or SQ in ATI terms) schedules threads based on the readiness of those clauses to run. Once a clause is run on a given SIMD unit, it does not get preempted until it voluntarily yields, i.e. the multithreading on the SIMD is cooperative (yes, there are watchdog timers to take care of clauses hogging the SIMD).

Say, in a given shader, clause A does some tex fetches, followed by clause B which does computations with the results from those fetches. The sequencer would see the data dependence between the two (i.e. A->B), schedule clause A first, and mark clause B as ready-to-run once the results from A have been delivered. Now, this happens per kernel, and if you remember, a SIMD unit can only execute the same op from the kernel across all threads. So if our SIMD just finished with a 64-wide wavefront, and during that time the results for some of the threads waiting on clause B arrived, then perhaps it could schedule those ready-to-run threads for the next wavefront? Not really, because that would create a divergence in the SIMD - it would start executing ops from clause B, then, some wavefronts down the line, the rest of the threads waiting to proceed with clause B would unblock, but lo and behold, they cannot join their siblings already running B on the SIMD, because those are several instructions ahead!
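A toy sketch of that scheduling constraint (my own framing, not the actual SQ logic): since all threads in a wavefront advance through clauses in lockstep, the sequencer can mark clause B ready only when every thread's fetch result from clause A has arrived - a partial wavefront cannot be launched early.

```python
# Clause B is ready-to-run for a wavefront only when *all* of its threads
# have received their clause-A fetch results; any straggler blocks the lot.

def can_schedule_clause_b(fetch_done: list) -> bool:
    """True only when the whole wavefront can advance to clause B."""
    return all(fetch_done)

# 64-thread wavefront with fetches for a few threads still outstanding:
status = [True] * 60 + [False] * 4
print(can_schedule_clause_b(status))          # False - must keep waiting
print(can_schedule_clause_b([True] * 64))     # True  - whole wavefront runs B
```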
So the fact that we chose to run some threads (out of the pool of on-board threads) as early as possible actually proved disruptive! Or, to sum it up: of all the threads aboard (i.e. in-flight on) a SIMD, you want to either run all of them, or not run them at all! Now, how wide was our wavefront? It was 64 threads wide (or 128 threads across both SIMDs) - do you see now why keeping 96 threads per SIMD (192 across both SIMDs) in flight would not be justified?
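A quick check of those numbers (my arithmetic, under the same assumptions): a useful in-flight pool should be a whole multiple of the 64-thread wavefront, and 96 per SIMD is not.

```python
wavefront = 64
for pool_per_simd in (64, 96, 128):
    whole, leftover = divmod(pool_per_simd, wavefront)
    print(pool_per_simd, whole, leftover)
# 64  -> 1 full wavefront, 0 left over
# 96  -> 1 full wavefront, 32 stranded threads (a partial wavefront
#        that could not run without causing the divergence described above)
# 128 -> 2 full wavefronts, 0 left over
```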
Ok, I'm taking a break here - enough of a wall-of-text for now.