Transform feedback is terrible, so why are we doing it?

In the latest Vulkan spec update from Khronos (version 1.1.88), there’s a new extension called VK_EXT_transform_feedback. Some of you might be thinking, “Finally! Why’d it take them so long to add this obviously useful feature? It should have been there on day 1.” The answer to that question is that transform feedback (or streamout in D3D lingo) is a terrible feature that we all regret putting into OpenGL and OpenGL ES and we didn’t want that baggage in Vulkan.

Why is transform feedback terrible?

Transform feedback didn’t start off terrible. When it was first added to OpenGL in 2006, it provided some very useful functionality. You could now take the result of your geometry pipeline and use it for whatever you wanted. You could read it from the CPU and feed it back into your physics engine, or you could re-use it directly on the GPU and feed it back into another draw call. In some ways, this was OpenGL’s first form of compute shaders. Since the only other way to get data out of shaders prior to transform feedback was glReadPixels and friends, it was a pretty neat feature.

The real difficulty with transform feedback is a subtle requirement, never explicitly stated in the spec, that the data land in the transform feedback buffer in the same order as the input data. The OpenGL and Vulkan graphics pipelines are specified in terms of a theoretical pipeline which is executed one primitive at a time. Even though a modern GPU has thousands of shader cores all executing in parallel and potentially out-of-order, the end result has to be as if they executed serially. This is very important for things such as blending and depth/stencil testing because those calculations are potentially non-commutative and you can’t get consistent results without controlling the order in which those calculations occur. The reality, however, is that GPUs don’t have to accomplish this by processing the primitives in-order; the only real requirement is that they process the blending operations in-order on a per-pixel basis. So, while the GPU is blending primitive 17 in one part of the image, it may be blending primitive 182 in some other part of the image.

With transform feedback, you have a similar ordering requirement. Without this requirement the feature would be almost useless since you wouldn’t be able to match input data to output data. However, this requirement is also the feature’s Achilles’ heel. While the serialization required for blending only occurs at the very end and happens on a per-pixel basis, the serialization required for transform feedback happens much earlier in the pipeline and serializes across the entire draw call and not just per-pixel. In 2006, when the feature was first added to OpenGL, GPUs still had lots of fixed-function hardware and very few shader cores. On modern GPUs with thousands of shaders in-flight at any given time, the primitive ordering requirement becomes much more painful.

You may be thinking, “What’s the big deal? You know the order the data came in, so can’t you just write out-of-order but in the right spot in the buffer?” If only life were that easy… With a simple pipeline containing only a vertex shader, yes, you can do that. However, transform feedback also has to interact with geometry and tessellation shaders, which produce an unknown number of primitives. Since transform feedback is specified using OpenGL’s theoretical serial execution model, that means that you first get all the primitives resulting from input primitive 0, followed by all the primitives resulting from input primitive 1, followed by 2, etc. Because you have no idea up-front how many output primitives will be produced from any given input primitive until the entire pipeline has been run, you really do have to wait until the last shader stage has executed for primitive 41 before you know where to put the data resulting from primitive 42. Most desktop GPU vendors are carrying special hardware just to sort all this out without running the entire pipeline serially.
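
To make the problem concrete, here’s a minimal, purely illustrative geometry shader sketch. The subdivLevel input is made up; the point is only that the number of emitted primitives depends on run-time data, so nothing upstream can know how much transform feedback space each input primitive will consume until the shader has actually run.

```glsl
#version 150

layout(triangles) in;
layout(triangle_strip, max_vertices = 12) out;

// Hypothetical per-vertex value driving how much geometry gets emitted.
in float subdivLevel[];

void main()
{
    // Emit between 1 and 4 copies of the input triangle based on
    // run-time data. Purely illustrative: the point is that the number
    // of output primitives is unknown until this shader actually runs.
    int count = int(clamp(subdivLevel[0], 1.0, 4.0));
    for (int i = 0; i < count; i++) {
        for (int v = 0; v < 3; v++) {
            gl_Position = gl_in[v].gl_Position;
            EmitVertex();
        }
        EndPrimitive();
    }
}
```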

There’s a second issue that has arisen since 2006 which is also somewhat non-obvious: the rise of tiled architectures. Tiling GPU architectures have been around for a long time, but in 2006 tiling had fallen out of favor and all three of the major desktop GPU vendors implementing OpenGL were building immediate-mode renderers. On a tiled architecture, you frequently run part of the vertex pipeline up-front to perform the binning step and then re-run the pipeline a second time per-tile to actually generate all the information needed by the fragment shader. This means that the vertex shader may get run multiple times for any particular vertex. It may sound crazy to do duplicate work like that, but it does end up being more efficient on those architectures, and it’s allowed because the vertex shader doesn’t have any side-effects and the only thing it does is dump data into the fragment shader. The moment transform feedback is enabled, all that goes out the window because you have to process all the primitives in full (you can’t drop any output) and in order. This leads to a significant performance drop because the GPU can no longer play all its binning games and keep that data on-chip. It’s worth noting that tiling architectures do run into similar issues without transform feedback if you enable a geometry or tessellation shader, but transform feedback certainly isn’t helping.

To sum it all up, transform feedback isn’t as great as it looks on the surface. In the modern world, we have compute shaders which can do basically everything that people actually need transform feedback to do. Want to transform some geometry and feed it back into your physics engine? Use a compute shader. Want to compute some geometry for use in a future draw call? Use a compute shader. There isn’t nearly as much need for transform feedback now as there was then. Transform feedback does provide one bit of functionality over compute shaders, which is that you can generate an arbitrary amount of output data from a single piece of input data and it’s guaranteed to be in-order. However, that feature isn’t nearly as useful as it sounds because you can’t figure out where any given piece of data is in the output stream without starting at the beginning and adding up all the geometry shader outputs.
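
To give a rough idea of what that alternative looks like, here’s a hypothetical compute shader for the simple case: it applies a matrix to every vertex and writes the results into an SSBO that you can then read back on the CPU or bind as a vertex buffer for a later draw. The buffer bindings, the Params block, and the mvp matrix are all made-up names for the sake of the sketch.

```glsl
#version 430

layout(local_size_x = 64) in;

// Hypothetical layout: one vec4 position per vertex, in and out.
layout(std430, binding = 0) readonly buffer InPositions   { vec4 in_pos[];  };
layout(std430, binding = 1) writeonly buffer OutPositions { vec4 out_pos[]; };

layout(std140, binding = 2) uniform Params {
    mat4 mvp;
    uint vertex_count;
};

void main()
{
    uint i = gl_GlobalInvocationID.x;
    if (i >= vertex_count)
        return;

    // Each invocation knows exactly which output slot is its own, so no
    // ordering guarantees (and no special hardware) are required.
    out_pos[i] = mvp * in_pos[i];
}
```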

In light of the fact that transform feedback is painful to implement, comes at a significant performance cost on some architectures, and doesn’t provide significant functionality over compute shaders, we decided not to put it in Vulkan. This is a decision I supported then and I still support now. It should be considered legacy functionality and not used in new software.

So why are we implementing it?

Hopefully, the last section convinced you that transform feedback is a terrible legacy feature and doesn’t belong in a modern graphics API. The question then naturally arises, “Why are we adding it now?”

The answer is API translation. Over the course of the last year, many projects have arisen which attempt to translate other graphics APIs to Vulkan: DXVK, VKD3D, ANGLE, Zink, and GLOVE, just to name a few. One thing that’s common among all of them is that the API they are attempting to translate has some form of transform feedback. There are also tools such as RenderDoc that use transform feedback to capture the result of the geometry pipeline for debugging purposes.

For simple geometry pipelines containing only vertex shaders, or where you can statically determine the number of primitives produced by the geometry shader, there are other options. If the Vulkan implementation supports the vertexPipelineStoresAndAtomics feature, you can simply add SSBO writes to the last shader stage and compute the offset in the buffer to write based on gl_VertexIndex or gl_PrimitiveID. If the implementation does not support SSBO writes from vertex and geometry shaders, you can still translate the work into a compute shader at fairly little cost. For the more complex geometry and tessellation shader cases, however, the ordering guarantees come into play and cause significant headaches.
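
As a sketch of that first option, here’s a hypothetical Vulkan GLSL vertex shader that captures its own output into an SSBO. The Capture buffer, its binding, and the push-constant block are invented for illustration, and only gl_Position is captured; a real translation layer would also have to handle multiple captured varyings, buffer strides, and indexed draws.

```glsl
#version 450

layout(location = 0) in vec4 a_position;

// Hypothetical capture buffer standing in for the transform feedback buffer.
layout(std430, set = 0, binding = 0) writeonly buffer Capture {
    vec4 captured_position[];
};

layout(push_constant) uniform Push {
    mat4 mvp;
};

void main()
{
    vec4 pos = mvp * a_position;
    gl_Position = pos;

    // With only a vertex shader, there is exactly one output per input
    // vertex and its slot is known up front, so gl_VertexIndex gives the
    // write offset directly (assuming a non-indexed draw starting at 0).
    captured_position[gl_VertexIndex] = pos;
}
```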

Initially, our answer to these complex use-cases was the same as our answer to new application developers: “Use compute shaders.” While compute shaders are a better fit for most applications, taking an entire geometry pipeline which has already been described in terms of vertex, tessellation, and geometry shaders and translating it into a compute shader is a giant pain. Such a translation is also likely to be significantly slower than what the GPU’s dedicated hardware can do. If we were only looking at one or two translation layers and we didn’t care about performance, that would likely still be the answer. However, with people wanting to run D3D games at full frame-rate on Vulkan via layers like DXVK and VKD3D, that’s not really a good answer.

In the end, then, we decided that the functionality was needed badly enough that we begrudgingly drafted the extension and accepted the burden of supporting the legacy functionality. As is explicitly stated in the extension text, the intention is that VK_EXT_transform_feedback will likely never become core Vulkan functionality and that new applications (or even Vulkan ports of old ones) should find some other way to transform geometry on the GPU. However, for those cases where it really is needed, the functionality is now there.