Mesh Shaders
Mesh shaders give developers more programmability than ever before. By bringing the full power of generalized GPU compute to the geometry pipeline, they allow developers to build more detailed and dynamic worlds.
Prior to mesh shaders, the GPU geometry pipeline hid the parallel nature of GPU hardware execution behind a simplified programming abstraction that only gave developers access to seemingly linear shader functions. For instance, the developer writes a vertex shader function that is called once for each vertex in a model, implying serial execution. Behind the scenes, however, the hardware packs adjacent vertices to fill a SIMD wave, then executes 32 or 64 vertex shader invocations in parallel on a single shader core. This model has worked extremely well for many years, but by hiding what the hardware is really doing from developers, it leaves performance and flexibility on the table.
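To make the wave packing concrete, here is a minimal C++ sketch of the arithmetic. The helper names `wavesNeeded` and `idleLanesInLastWave` are hypothetical and exist only for this illustration; real hardware scheduling involves more than a ceiling division.

```cpp
#include <cassert>
#include <cstdint>

// Number of SIMD waves needed to shade `vertexCount` vertices when
// `waveSize` vertex shader invocations are packed into each wave.
// (Hypothetical helper for illustration only.)
std::uint32_t wavesNeeded(std::uint32_t vertexCount, std::uint32_t waveSize) {
    assert(waveSize > 0);
    return (vertexCount + waveSize - 1) / waveSize;  // ceiling division
}

// Lanes that sit idle in the final, partially filled wave.
std::uint32_t idleLanesInLastWave(std::uint32_t vertexCount, std::uint32_t waveSize) {
    assert(waveSize > 0);
    const std::uint32_t remainder = vertexCount % waveSize;
    return remainder == 0 ? 0 : waveSize - remainder;
}
```

Under these assumptions, a 100-vertex model on 32-wide hardware occupies four waves, and 28 lanes of the last wave do no useful work.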
Mesh shaders change this by making geometry processing behave more like compute shaders. Rather than a single function that shades one vertex or one primitive, mesh shaders operate across an entire compute thread group, with access to group shared memory and advanced compute features such as cross-lane wave intrinsics that provide even more fine-grained control over actual hardware execution. All these threads work together to shade a small indexed triangle list, called a ‘meshlet’. Typically there will be a phase of the mesh shader where each thread works on a separate vertex, then another phase where each thread works on a separate primitive, but this model is completely flexible, allowing data to be shared across threads, new vertices or primitives to be created as needed, existing primitives to be clipped or culled, and so on.
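The two-phase pattern can be sketched on the CPU in C++. This is not shader code: each loop iteration stands in for one thread of the group, and the `Meshlet`, `Triangle`, and `shadeMeshlet` names are hypothetical. The translation used as "shading" and the counter-clockwise front-face convention are assumptions made only to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Indices are local to the meshlet's own vertex array, so 8 bits suffice.
struct Triangle { std::uint8_t a, b, c; };

// A meshlet: a small indexed triangle list.
struct Meshlet {
    std::vector<Vec3> vertices;
    std::vector<Triangle> triangles;
};

struct MeshletOutput {
    std::vector<Vec3> shadedVertices;
    std::vector<Triangle> visibleTriangles;
};

// CPU stand-in for one mesh shader thread group.
MeshletOutput shadeMeshlet(const Meshlet& m, Vec3 offset) {
    MeshletOutput out;

    // Phase 1: one "thread" per vertex. The shading here is a plain translation.
    out.shadedVertices.reserve(m.vertices.size());
    for (const Vec3& v : m.vertices) {
        out.shadedVertices.push_back({v.x + offset.x, v.y + offset.y, v.z + offset.z});
    }

    // On a GPU, a group barrier would separate the two phases so that
    // every thread sees the shared vertex results.

    // Phase 2: one "thread" per primitive. Each reads the shared vertices and
    // culls triangles that are back-facing or degenerate in the XY plane
    // (assumes counter-clockwise winding is front-facing).
    const std::size_t n = out.shadedVertices.size();
    for (const Triangle& t : m.triangles) {
        assert(t.a < n && t.b < n && t.c < n);
        const Vec3& a = out.shadedVertices[t.a];
        const Vec3& b = out.shadedVertices[t.b];
        const Vec3& c = out.shadedVertices[t.c];
        const float signedArea2 = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        if (signedArea2 > 0.0f) {
            out.visibleTriangles.push_back(t);
        }
    }
    return out;
}
```

The primitive phase reads results produced by other "threads" in the vertex phase, which is the kind of cross-thread data sharing that the earlier one-function-per-vertex model did not expose.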
Along with this new flexibility of thread allocation comes flexibility of input data formats. The mesh shader pipeline no longer uses the Input Assembler block, which was previously responsible for fetching index and vertex data from memory. Instead, shader code is free to read whatever data it needs, in any format it likes. This enables novel techniques such as index buffer compression, or the use of multiple different index buffers for different channels of vertex data. Such approaches can reduce memory usage and also reduce the memory bandwidth used during rendering, thus boosting performance.
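As one illustration of what a custom index format might look like, the C++ sketch below packs the three meshlet-local indices of a triangle into a single 32-bit word. The 10-bit field width and the `packTriangle` / `unpackTriangle` names are assumptions chosen for this example, not a format defined by any API.

```cpp
#include <cassert>
#include <cstdint>

// Three meshlet-local vertex indices for one triangle.
struct LocalTriangle { std::uint32_t i0, i1, i2; };

// Assumed layout: 10 bits per index, so a meshlet may hold up to 1024 vertices.
constexpr std::uint32_t kIndexBits = 10;
constexpr std::uint32_t kIndexMask = (1u << kIndexBits) - 1u;

// Pack three local indices into one 32-bit word (the top 2 bits stay unused).
std::uint32_t packTriangle(LocalTriangle t) {
    assert(t.i0 <= kIndexMask && t.i1 <= kIndexMask && t.i2 <= kIndexMask);
    return t.i0 | (t.i1 << kIndexBits) | (t.i2 << (2 * kIndexBits));
}

// Recover the three local indices, as shader code would after loading the word.
LocalTriangle unpackTriangle(std::uint32_t packed) {
    return {
        packed & kIndexMask,
        (packed >> kIndexBits) & kIndexMask,
        (packed >> (2 * kIndexBits)) & kIndexMask,
    };
}
```

With this layout each triangle costs 4 bytes of index data, compared with 6 bytes for three 16-bit indices or 12 bytes for three 32-bit indices. The indices only fit in so few bits because they are relative to the meshlet's own vertex list.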
Although more flexible than the previous geometry pipeline, the mesh shader model is also much simpler:
Along with mesh shader comes an optional new shader stage called the Amplification Shader. This runs before the mesh shader, runs some computations, determines how many mesh shader thread groups are needed, and then launches that many mesh shaders:
Amplification shaders are especially useful for culling, as they can determine which meshlets are visible before deciding whether to launch a mesh shader thread group for each one. Each meshlet, a set of between 32 and 256 triangles, can be tested using a geometric bounding volume, a normal cone, or more advanced techniques such as portal visibility planes. Previously, culling was typically performed at a coarser per-mesh level to decide whether to draw an object at all, and also at a finer per-triangle level at the end of the geometry pipeline. This new intermediate level of culling improves performance when drawing models that are only partially visible. For instance, if a character is on screen except for one arm, an amplification shader can cull that entire arm after much less computation than it would have taken to shade all the triangles within it.
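The bounding-volume variant of this per-meshlet test can be sketched in C++ as follows. This is a CPU stand-in rather than shader code. The `Plane`, `Sphere`, `sphereVisible`, and `selectVisibleMeshlets` names are hypothetical, and the bounding-sphere-against-planes test is just one simple choice of bounding volume test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plane in the form n.p + d = 0; points with n.p + d >= 0 count as inside.
// The normal (nx, ny, nz) is assumed to be unit length.
struct Plane { float nx, ny, nz, d; };

// Bounding sphere of one meshlet.
struct Sphere { float cx, cy, cz, radius; };

// Conservative test: returns false only when the sphere lies entirely
// behind at least one plane.
bool sphereVisible(const Sphere& s, const std::vector<Plane>& planes) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : planes) {
        const float dist = p.nx * s.cx + p.ny * s.cy + p.nz * s.cz + p.d;
        if (dist < -s.radius) {
            return false;
        }
    }
    return true;
}

// CPU stand-in for an amplification shader: collects the indices of the
// meshlets that survive culling. The size of the result corresponds to the
// number of mesh shader thread groups that would be launched.
std::vector<std::uint32_t> selectVisibleMeshlets(const std::vector<Sphere>& bounds,
                                                 const std::vector<Plane>& planes) {
    std::vector<std::uint32_t> visible;
    for (std::size_t i = 0; i < bounds.size(); ++i) {
        if (sphereVisible(bounds[i], planes)) {
            visible.push_back(static_cast<std::uint32_t>(i));
        }
    }
    return visible;
}
```

Because the test is conservative, a meshlet that straddles a plane is kept, so later per-triangle culling still has work to do. The saving comes from never launching thread groups for meshlets that are entirely outside.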