Well, yes, that's actually what I'm trying to explain to
you. Here, try reading it like this:
[0060] In the technique 200b depicted in FIG. 2B, the vertex shader 210 may perform vertex shader computations 214, which may include manipulating various parameters of each vertex in the image. The resulting parameters may be compressed at 240 into a smaller data format so that bottlenecks associated with storage and throughput of large numbers may be minimized. The compressed parameters P0', P1', P2' may be written to a parameter cache 236 for temporary storage, and may occupy a smaller amount of the total cache than uncompressed parameters to thereby minimize potential bottlenecks in the cache hardware. The compressed parameters P0', P1', P2' may be copied to a local memory unit 237 on a GPU, which may be a memory unit known as a "local data share" (LDS). The compressed parameter values may be accessed from the local data share with a pixel shader 212 implemented by a GPU.
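To make the storage side concrete, here's a rough sketch of the compression at 240. The patent doesn't say what the "smaller data format" is, so I'm just assuming IEEE 754 half precision (FP16) packing as a stand-in; `compress_params` and `uncompressed_size` are made-up helper names, and the bytearray is only a pretend LDS, not any real GPU API.

```python
import struct

def compress_params(params):
    """Hypothetical compression step (240): pack floats as little-endian FP16.

    FP16 is an assumed format chosen for illustration only.
    """
    return struct.pack(f"<{len(params)}e", *params)

def uncompressed_size(params):
    """Bytes the same parameters would occupy stored as FP32."""
    return struct.calcsize(f"<{len(params)}f")

# One parameter per vertex of a triangle: P0, P1, P2.
P = [0.25, 0.5, 1.0]
packed = compress_params(P)           # P0', P1', P2'

# Pretend local data share (LDS): just a byte buffer the packed data is copied into.
lds = bytearray(64)
lds[0:len(packed)] = packed

print(len(packed), uncompressed_size(P))  # 6 12
```

The only win here is size: the packed form takes half the bytes, which is the whole point of squeezing more into the parameter cache and LDS.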
[0061] The compressed parameter values P0', P1', P2' may be decompressed at 242 by the pixel shader 212, thereby granting access to the raw parameter values P0, P1, P2 by the pixel shader. The pixel shader 212 may then interpolate the parameter values at 216 using coordinates i,j obtained from the barycentric coordinate generator 238 in order to determine corresponding parameter values at the pixel locations within each primitive. Because the pixel shader 212 has access to the raw parameter values for each vertex of each triangle, the pixel shader 212 may also perform certain other manipulations of the vertex parameters (not pictured) and the visuals of the vertices in virtual space on a per-pixel basis before interpolation 216 to translate the values to screen space. The pixel shader may then perform pixel shader computations 218 to further manipulate the pixel data and the visuals of the pixels before outputting the final pixel data, e.g., to a frame buffer.
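And here's roughly what the pixel shader side (242 then 216) amounts to, again assuming FP16 packing. The interpolation formula is my assumption too: the patent only says i,j come from the barycentric coordinate generator 238, so I'm using the common convention P = P0 + i*(P1 - P0) + j*(P2 - P0). Function names are hypothetical.

```python
import struct

def decompress_params(packed, count):
    """Hypothetical decompression step (242): unpack FP16 bytes back into floats."""
    return list(struct.unpack(f"<{count}e", packed))

def interpolate(p0, p1, p2, i, j):
    """Interpolation step (216) using barycentric coordinates i, j.

    Assumed convention: weights (1 - i - j, i, j) on (P0, P1, P2).
    """
    return p0 + i * (p1 - p0) + j * (p2 - p0)

# What the vertex shader left in the LDS (assumed FP16 packing).
packed = struct.pack("<3e", 0.25, 0.5, 1.0)

# The pixel shader can't use 'packed' as-is; it has to unpack first...
p0, p1, p2 = decompress_params(packed, 3)

# ...and only then can it interpolate for a given pixel.
print(interpolate(p0, p1, p2, 0.25, 0.5))  # 0.6875
```

Note the order: unpack first, then interpolate. Neither step is free, and both land on the shader.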
Now see here: after compression by the vertex shader, the data is then
decompressed by the pixel shader. (Yes, that part is important.)
See how the pixel shader needs to decompress the parameters at 242
before it's able to work with them at all? It's not working with FP32 values that have been magically transformed into FP16 values. Instead, it's taking data that's been zipped by the vertex shader and
unzipping it for its own use; it's not able to work with the compressed data directly.
Also notice that compressing the data bypasses the GPU's interpolation hardware used in the "traditional" method shown here,
so after performing the additional step of decompressing the values it has just loaded, the pixel shader must also perform
its own interpolation at 216.
So rather than granting the GPU a blanket ability to perform work in half the normal time, we've instead bestowed
extra work upon the shaders in hopes of boosting overall throughput
just a bit by making more efficient use of the necessarily limited caches.
Storage efficiency is increased, but execution efficiency is actually
decreased. (The hope is that fitting more compressed data into the caches will reduce the number of misses, and that the extra hits will outweigh all of the additional computation the shaders must do to work with the compressed data in the first place.)
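If you want to see that trade-off as arithmetic, here's a toy cost model. Every number in it is invented purely for illustration (nothing here comes from the patent or from real hardware), and `extra_alu` is a hypothetical lump sum standing in for the decompress-plus-interpolate work the shaders take on.

```python
def avg_cost(hit_rate, hit_cost, miss_cost, extra_alu=0.0):
    """Expected cost per parameter fetch, in arbitrary cycle units.

    extra_alu is a made-up per-fetch charge for the shader-side
    decompression and interpolation in the compressed path.
    """
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost + extra_alu

def break_even_hit_rate(base_hit_rate, hit_cost, miss_cost, extra_alu):
    """Hit rate the compressed path needs just to tie the uncompressed path."""
    return base_hit_rate + extra_alu / (miss_cost - hit_cost)

# Illustrative numbers only.
uncompressed = avg_cost(hit_rate=0.80, hit_cost=4, miss_cost=200)
compressed = avg_cost(hit_rate=0.90, hit_cost=4, miss_cost=200, extra_alu=12)
print(f"{uncompressed:.1f} {compressed:.1f}")  # 43.2 35.6

# Compression only pays off if the hit rate climbs past this point.
needed = break_even_hit_rate(0.80, 4, 200, 12)
```

With these made-up figures the compressed path wins at a 90% hit rate, but if compression only nudged the hit rate from 80% to 82% the extra shader work would make things slower overall. That's the gamble: more hits have to buy back the extra computation.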