Hi guys,
For "general" deformation using mode 7, it is quite hopeless to get more performance with software rendering.
Hardware acceleration would also be very difficult to implement and would most likely result in SLOWER performance than the current software implementation.
I optimized it as much as could be done for both the general and the special cases.
In 0.2c, I had already optimized some stuff in the mode7.
This time I have optimized even more...
But the result is not as good as I would have expected.
(Personally, I expected a gain of +5% to +10% for the general cases of mode7 compared to 0.2c.)
I think I got +1 or 2% compared to the 0.2c version of mode7 for the general cases. No more...
Now in the new version, I have optimized the cases where the rendering has 0 degrees of rotation, which is the case in many RPG maps (Zelda, FF, etc...).
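To show why the 0 degree case can be cheaper, here is a rough sketch in C. It is NOT the emulator's actual code: the function names, the 64x64 texture and the 8.8 fixed-point layout are all assumptions made up for illustration. The idea is that without rotation the vertical texture step along a scanline is zero, so the texture row can be picked once per line instead of once per pixel.

```c
#include <stdint.h>

#define TEX_SIZE 64              /* hypothetical power-of-two texture size */
#define TEX_MASK (TEX_SIZE - 1)  /* wraps coordinates around the texture */
#define FP_SHIFT 8               /* assumed 8.8 fixed-point coordinates */

/* General case: both u and v advance on every pixel of the scanline. */
void draw_line_general(uint8_t *dst, int width, const uint8_t *tex,
                       uint32_t u, uint32_t v, int32_t du, int32_t dv)
{
    for (int x = 0; x < width; x++) {
        uint32_t tx = (u >> FP_SHIFT) & TEX_MASK;
        uint32_t ty = (v >> FP_SHIFT) & TEX_MASK;
        dst[x] = tex[ty * TEX_SIZE + tx];
        u += (uint32_t)du;       /* unsigned wrap-around is well defined */
        v += (uint32_t)dv;
    }
}

/* 0 degree case: dv == 0, so the texture row is fixed for the whole line. */
void draw_line_norot(uint8_t *dst, int width, const uint8_t *tex,
                     uint32_t u, uint32_t v, int32_t du)
{
    /* Row lookup hoisted out of the loop: done once per scanline. */
    const uint8_t *row = tex + ((v >> FP_SHIFT) & TEX_MASK) * TEX_SIZE;
    for (int x = 0; x < width; x++) {
        dst[x] = row[(u >> FP_SHIFT) & TEX_MASK];
        u += (uint32_t)du;
    }
}

/* Dispatcher: take the fast path whenever there is no vertical step. */
void draw_line(uint8_t *dst, int width, const uint8_t *tex,
               uint32_t u, uint32_t v, int32_t du, int32_t dv)
{
    if (dv == 0)
        draw_line_norot(dst, width, tex, u, v, du);
    else
        draw_line_general(dst, width, tex, u, v, du, dv);
}
```

Both paths produce the same pixels when dv is 0; the fast path just does less work per pixel (no v update, no row multiply inside the loop).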
In these cases, the performance boost is simply huge.
Roughly 30-40%. Now of course, it is a 30% boost ONLY IN DRAWING... As the emulator is also doing the audio and CPU emulation, don't expect a HUGE general performance jump. But a good one!!!
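To put a number on "only in drawing", here is a small back-of-the-envelope sketch in C using Amdahl's law. The function name is made up, and the share of frame time spent drawing (50% in the example) is purely a hypothetical figure, not something I measured. I also assume "30% boost" means the drawing code runs 1.3 times as fast.

```c
/* Overall speedup when only part of the frame time gets faster (Amdahl's law).
 * draw_fraction: share of frame time spent drawing, 0.0 to 1.0 (hypothetical input).
 * draw_speedup:  how many times faster the drawing code became (e.g. 1.3). */
double overall_speedup(double draw_fraction, double draw_speedup)
{
    return 1.0 / ((1.0 - draw_fraction) + draw_fraction / draw_speedup);
}
```

For example, overall_speedup(0.5, 1.3) gives about 1.13, so if drawing were half of the frame time, a 30% drawing boost would be roughly a 13% boost overall.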
Now, on the FF6 map, with the character walking on the map, audio deactivated, 333 MHz, pure software approximation...
I reached a whopping 82~84 fps.
I asked yoyo if he could "blend" the approx mode of mode7 with the rendering of the other parts from the accurate mode, because I believe we could probably run FF6 at a high frame rate and still have the map blended in the corner (which is not the case in approx mode, where it disappears).
But once yoyo releases that, we will not work on the mode7 anymore. There is no room left for optimization.
I even took a look at the assembly code and started to play with it. I am no MIPS expert, but the code seems to be quite efficient (17 instructions per pixel in the best case).
The number of instructions per pixel is quite low, and the register dependencies and pipeline usage seem to be well optimized by the GCC compiler.
I even started to optimize some stuff manually in assembly (register dependencies between instructions) and just managed to break the pipeline, getting -5 fps in my test compared to the compiled C code.