I really don't agree with this. But I've seen "GP" cover so many different aspects that I don't have a clear idea what you may be alluding to.
Well, colloquially, the GPGPU domain is considered to be everything that until recently would run on CPUs but today is being migrated to GPUs, simply because GPUs do these tasks better now. The fact that a given 'GP' task historically started life on CPUs does not mean CPUs were well suited for it, or that the task was 'GP' as per today's understanding - CPUs are the default choice for a lot of tasks just for historical reasons, i.e. until somebody discovers/designs a common part that does a given task better.
GPGPUs are pretty much a huge collection of SIMD units. A lot like the Cell SPEs, except with vastly more horsepower.
Entire domains of tasks could benefit from a huge collection of ALUs, SIMD or otherwise.
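As a minimal illustration of that kind of task - a hedged CUDA sketch, names hypothetical, not any particular production code - here's the classic SAXPY, where every element becomes an independent thread and the work spreads trivially across however many ALU/SIMD lanes the part happens to have:

```cuda
#include <cstdio>

// Hypothetical example: SAXPY (y = a*x + y), the textbook data-parallel
// task. Each thread handles exactly one element, so the same source
// scales across any width of ALU array with no per-lane control logic.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // One thread per element; a block size of 256 is a common default.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]); // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```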
It shares the same drawbacks though.
1. Memory Latency. The hardest one to shake because GPUs are sorta built to tolerate latency rather than combat it.
I'm not sure I follow. What's the difference between tolerate and combat in this case?
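(For illustration, one common way to read that distinction: a CPU combats latency by making each individual access faster - caches, prefetch, out-of-order execution - while a GPU tolerates it by keeping enough threads resident that the scheduler always has other work to issue while a load is in flight. A hedged CUDA sketch of the 'tolerate' side:)

```cuda
// Hedged sketch: a deliberately memory-bound kernel. Nothing here
// "fights" the DRAM latency with caching or prefetch; instead the
// hardware scheduler keeps many warps resident per SM, and whenever
// one warp stalls on its load, another is issued in its place.
// The latency is overlapped (tolerated), not reduced (combated).
__global__ void gather(const float *src, const int *idx, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]]; // dependent load: a long stall per warp
}

// Launch far more threads than there are ALUs - that oversubscription
// is precisely what gives the scheduler stalls to cover, e.g.:
// gather<<<(n + 255) / 256, 256>>>(src, idx, dst, n);
```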
2. Wide SIMD lanes, so it will generally struggle with anything logical.
You mean it won't be efficient. But whether it will struggle or not depends on the logical code under consideration and on the GPU ISA.
3. A lack of branch HW. This was already the case this gen, so it won't be that big of a change next generation, but the few developers I know aren't fond of this at all.
I assume you mean lack of branch prediction hw, since GPUs have had flow control hw for some time now. It's not equivalent to CPU branch hw, but it could not be - branching in a massively parallel processor is an entirely different problem to branching in a single control flow.
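To make that difference concrete, here's a hedged CUDA sketch of SIMT-style divergence handling - the general masking scheme, not any specific vendor's implementation:

```cuda
// Hedged sketch: the threads of a warp share one program counter, so
// when they disagree on a branch the hardware runs the taken path with
// the other lanes masked off, then the not-taken path, and reconverges
// afterwards. The cost is roughly the sum of both paths - there is
// nothing to "predict" here, only divergence to manage.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)            // lanes that disagree here diverge:
        out[i] = in[i] * 2.0f;   // pass 1: positive lanes active
    else
        out[i] = -in[i];         // pass 2: the remaining lanes active
    // ...execution reconverges here.
}
```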
Personally, I would rather just integrate a few decent SIMD units in the CPU and let my GPU do its thing, but perspective is always interesting.
What would you rather see?
Of course a CPU needs its 'private' SIMD units, if nothing else just for sporadic low-latency tasks. The tricky question, though, is: How much is enough? Are you sure you're spending your transistor budget wisely by placing those SIMD units with the CPU?
Surprisingly (or not), here's what Intel more or less did with Larrabee - they took some mature-design cores, slapped some advanced-design SIMD silicon on them (and by that I mean lots and lots of it), packed them all together on a coherent, fat SMP infrastructure, and handed that contraption to some very clever sw guys (game industry veterans, et al.) to 'do something amazing with it'. We know how that ended - Larrabee could run its own debugger, but at typical GPU tasks its performance/watt could not compete with actual GPUs. So a CPU with tons of SIMD resources is not the panacea some saw in it.
I started my career with software rasterizers, so I'm not cold to GP-friendly SIMD arrays myself. Heck, I fancy them. I might even think they're the future of graphics. But fat SIMD arrays have their own specific needs, which may not align well with the needs of a common CPU design - BW being a very apparent discrepancy, another being the lack of massive hw schedulers in the CPU (hyperthreading barely scratches the surface of GPU schedulers). By putting such SIMD resources too deep in the CPU domain you inadvertently subject them to the 'ways of the CPU' - sw scheduling, small mem pools of big BW, and large pools of not-so-big, perhaps abysmal (for the SIMDs) BW.