Scaled comparison of Llano and Trinity, using the I/O pads on the left side for reference:
Some observations on the layout of the SIMD multi-processors -- the placement of the register file banks in the ALU array is different in Trinity, as is the whole layout of the texture unit.
Here are the differences (so far) on the CPU side -- BD vs. Piledriver cores:
Those banks are most probably the pre-decode bits (used for the BTB, branch selector, end bits, etc.) that AMD has been using ever since the first K7 architecture to aid the instruction decode flow. And since they are located in the branch prediction area of the front-end block, I guess AMD is aiming to improve this aspect of the architecture in particular.
For more on the BTB and branch prediction, the Real World Technologies article "AMD's Bulldozer Microarchitecture", linked in the OP, covers them on page four.
This is a comparison of the IGP "uncore" sections of Llano and Trinity -- SIMDs are cut out too. Trinity's section takes 40% more area than Llano's.
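For anyone who wants to repeat this kind of measurement on their own die shots, here is a rough Python sketch of the arithmetic. It assumes you measure a common reference feature (such as the I/O pads) in both images to bring them to the same scale, then compare the pixel areas of the regions of interest. The function names and the pixel values in the example are hypothetical, chosen only so that the result comes out at 40%; they are not my actual measurements.

```python
def scale_factor(ref_px_a, ref_px_b):
    """Factor that rescales lengths in image B so its reference
    feature has the same pixel length as in image A."""
    return ref_px_a / ref_px_b


def relative_area(area_px_a, area_px_b, ref_px_a, ref_px_b):
    """Area of region B divided by area of region A, after putting
    both images on the same scale. Areas scale with the square of
    the linear scale factor."""
    s = scale_factor(ref_px_a, ref_px_b)
    return (area_px_b * s ** 2) / area_px_a


if __name__ == "__main__":
    # Hypothetical pixel measurements, for illustration only.
    ratio = relative_area(
        area_px_a=80_000,   # region outlined in image A
        area_px_b=175_000,  # same kind of region outlined in image B
        ref_px_a=400,       # reference feature length in image A
        ref_px_b=500,       # reference feature length in image B
    )
    print(f"Region B is {ratio - 1:.0%} larger than region A")
```

Keep in mind that the result is only as good as the outlines you draw and the reference feature you pick, so treat the figure as an estimate.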