Thank you very much for the links and the information!
Edit: If it's not asking too much, I have a question about the pipeline depth/width of Gekko/Broadway compared to Bobcat, and about the DMIPS results of both CPUs.
Bobcat has, like the Gekko, two ALUs with different capabilities (in both CPUs one can execute any instruction while the other handles every instruction except divisions and multiplications), and from what I understand, integer divisions on Bobcat are resolved using part of the FPU circuitry and carry a higher penalty.
The Bobcat has better OOE circuitry, but from what I've read it's strictly dual issue, while the Gekko is dual issue but can issue up to 3 instructions per cycle when one of them is a branch (I've looked for that feature on the Bobcat CPUs, but either it's so basic that it's a given nowadays and not explained, or the 2 instructions issued per cycle already include the branch instructions).
From what wsippel explained on previous pages, the Gekko also has more integer registers than the Bobcat, and a much shorter pipeline, which helps performance per clock (fewer cycles wasted when the pipeline stalls).
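To put a rough shape on the pipeline point, here's a minimal sketch of the usual back-of-the-envelope model: average cycles per instruction is the base CPI plus the cycles lost to branch-mispredict flushes. Every number in it (branch fraction, mispredict rate, flush penalties) is a made-up placeholder for illustration, not a measured figure for Gekko or Bobcat, and it ignores the fact that a better branch predictor can offset a longer pipeline.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction including branch-mispredict flushes.

    base_cpi: CPI with no mispredicts at all
    branch_fraction: share of instructions that are branches
    mispredict_rate: share of branches that are mispredicted
    flush_penalty_cycles: cycles lost per mispredict (grows with pipeline depth)
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical values only: same workload, same predictor quality,
# only the flush penalty differs between the two imaginary cores.
short_pipe = effective_cpi(1.0, 0.2, 0.1, flush_penalty_cycles=4)
long_pipe = effective_cpi(1.0, 0.2, 0.1, flush_penalty_cycles=14)

print(f"short pipeline: {short_pipe:.2f} CPI")
print(f"long pipeline:  {long_pipe:.2f} CPI")
```

With identical branch behaviour the deeper pipeline always comes out worse in this model, which is why the higher per-clock result on the Bobcat side would have to come from somewhere else.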
To make things even worse (for the Bobcat), its L1 cache is only 2-way set-associative, while on the Gekko it was 8-way set-associative.
But despite all that, it seems that the Bobcat has higher integer performance per cycle.
Could that be due to the more advanced OOE system of the Bobcat? Or is there something else that I've missed or misinterpreted? Maybe the micro-ops of the Bobcat are more powerful than the instructions in the PPC instruction set (meaning that what is a single instruction on the Bobcat has to be done with multiple instructions on the Gekko/Broadway architecture)?
I doubt that the bigger L2 cache has much impact on Dhrystone tests like those (it's 512 KB on the Bobcat vs 256 KB on the Gekko/Broadway, but as far as I know those tests run entirely within the 32+32 KB of L1 cache).
The Bobcat information I have is from here, so I doubt it's wrong:
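For reference, this is how I understand the per-clock comparison is normally done: a Dhrystone score is divided by 1757 (the Dhrystones per second of the VAX 11/780 reference machine) to get DMIPS, and then by the clock in MHz. A minimal sketch follows; the two "CPUs" and their scores are invented placeholders, not real Gekko or Bobcat results.

```python
# Dhrystones per second of the VAX 11/780, which is defined as 1 DMIPS.
VAX_11_780_DHRYSTONES_PER_SEC = 1757


def dmips(dhrystones_per_sec):
    """Convert a raw Dhrystone score into DMIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
    """Normalize DMIPS by clock speed to compare work done per cycle."""
    return dmips(dhrystones_per_sec) / clock_mhz


# Hypothetical scores and clocks, chosen only to make the arithmetic easy.
cpu_a = dmips_per_mhz(1_757_000, clock_mhz=500)   # 1000 DMIPS at 500 MHz
cpu_b = dmips_per_mhz(3_514_000, clock_mhz=1600)  # 2000 DMIPS at 1600 MHz

print(f"CPU A: {cpu_a:.2f} DMIPS/MHz")
print(f"CPU B: {cpu_b:.2f} DMIPS/MHz")
```

In this made-up case CPU B has the higher absolute score but CPU A does more per clock, so it matters which of the two figures a given comparison is quoting.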
http://www.agner.org/optimize/microarchitecture.pdf
Edit 2: Besides the comparison between Gekko/Broadway and Bobcat, what things do you think could be changed in the Espresso that would be easy to implement without breaking BC? Maybe bigger OOE circuitry that could reorder more than 6 instructions (in other words, an instruction queue larger than 6 entries) could be feasible? More registers, to reduce how often data has to be fetched from the L1 data cache?
I know that I'm asking a ton of questions... sorry!!! XD