Chasing Triangles in a Tile-based Rasterizer

If you followed along or just came across my Rasterization in One Weekend blog series without having actually compiled and run the demo workloads, you’d be in for a big surprise if I told you now how slow they turned out to be. I hinted at some of the existing techniques at the end that one could employ to speed up a painfully slow rasterizer, but that was all. Now is time to go ahead and get a taste of how they’d be applied in practice.

As part of this work, I implemented Tyler, a tile-based rasterizer which we will dissect below to understand better. It’s been my aim for this project that the rasterizer be scalable, configurable and understandable by people who’d want to know a little more about the subject and ultimately experiment with it. With that said, what I’d like to show here relies quite a bit on what I laid out in Rasterization in One Weekend, so it’d be the best if you read through it. I will try not to make big assumptions, however, there’ll be less hand-holding and more high-level explanations and I want not to repeat stuff that can be found there or elsewhere easily, in order to keep it as tidy as possible here. In any case, feel free to shoot your questions my way, I’d be happy to answer them.


Tile-based rendering (or tiled rendering) is an enhancement over traditional immediate-mode rendering where a render target (RT) is divided into tiles (i.e. sub-regions of a frame buffer), each of which contains primitives that can be rendered into the tiles separately.

Note the word separately here, as it emphasizes one of the biggest selling points of this technique over immediate-mode: Trapping all accesses inside a tile to color/depth buffers which reside in “slow” DRAM by utilizing an exclusive on-chip tile cache. It is of course no coincidence that such bandwidth-saving technology has been utilized the most by mobile low-power GPUs. See how in below diagram (top) vertices are processed, rasterized and shaded immediately and in bottom they are deferred?

Immediate-mode vs Tile-based architecture (PowerVR)

On a high level, for tile-based rendering to work, all what we need to do in addition to the usual machinery of a rasterizer is for each tile, to build a list of primitives which overlap a given tile and then shade all primitives in all tiles either one tile at a time or even better, in parallel. Of course, the complexity which we bypass with such simple definition, is in the details, so we will look at implementation details of Tyler in multiple parts below.

This rasterizer unsurprisingly only implements a small subset of modern 3D rendering pipeline and consists of following stages that triangles go through as they get moved from vertex buffers to be pixels on a screen:

  • Vertex Shader
  • Clipper (full-triangle only!)
  • Triangle Setup & Cull
  • Binner
  • Rasterizer
  • Fragment Shader

Tyler includes a pseudo API for rendering stuff which imitates how a proper graphics API would look like, so let’s review first how it can be used to set up the rasterizer and submit drawcalls before entering its bowels. Note that I’ll be referring to Scene Rendering sample most of the time, which loads a .OBJ scene file and renders multiple meshes visualizing scaled normals for simplicity and optionally presents in a window.

Rasterizer context setup
// Create and initialize the rasterizer rendering context
RenderContext* pRenderContext = new RenderContext(config);

// Allocate color & depth buffers
uint8_t* pColorBuffer = ... // Color buffer format == R8G8B8A8_UNORM
float* pDepthBuffer = ... // Depth buffer format == D32_FLOAT

// Set up main FBO
Framebuffer fbo = {};
fbo.m_pColorBuffer = pColorBuffer;
fbo.m_pDepthBuffer = pDepthBuffer;
fbo.m_Width = opt.m_ScreenWidth;
fbo.m_Height = opt.m_ScreenHeight;

// Single vec3 attribute is used to pass vertex normal VS -> FS
ShaderMetadata metadata = { 0, 1 /*vertex normal*/, 0 };

// Emulate passing of constants data to shaders
Constants cb;
cb.m_ModelViewProj = proj * view * model;

// We will have single index and vertex buffer to draw indexed mesh
std::vector<Vertex> vertexBuffer;
std::vector<uint32_t> indexBuffer;
// Store data of all scene objects to be drawn
std::vector<Mesh> objects;

// Load .OBJ scene model data and generate vertex/index buffers, etc.
InitializeSceneObjects(opt.m_OBJName, objects, vertexBuffer, indexBuffer);

// Bind FBO to be used in subsequent render pass once

// Bind VBO and set buffer stride
pRenderContext->BindVertexBuffer(, sizeof(Vertex));
// Bind IBO

// Bind shader constants

// Bind shader, constant buffer, texture(s)
pRenderContext->BindShaders(VS, FS, metadata);
Main rendering loop
                // Clear RT
                    true, /*clearColor*/
                    glm::vec4(0, 0, 0, 1) /*colorValue*/,
                    true, /*clearDepth*/
                    FLT_MAX /*depthValue*/);

                // Draw meshes
                for (uint32_t obj = 0; obj < objects.size(); obj++)
                    Mesh& mesh = objects[obj];

                    view = glm::lookAt(testParams.m_EyePos, testParams.m_LookAtPos, glm::vec3(0, 1, 0));
                    //model = glm::rotate(model, glm::radians(0.5f), glm::vec3(0, 1, 0));
                    cb.m_ModelViewProj = proj * view * model;

                    // Kick off draw. Note that it blocks callee until drawcall is completed
                    pRenderContext->DrawIndexed(mesh.m_IdxCount, mesh.m_IdxOffset);


There is no magic going on the application side: After rendering context is set up with default configuration, usual binding of frame buffer, vertex/index buffers, shaders and all that is done and objects are rendered one mesh at a time. Where the tile-based rasterizer does its thing and we’ll focus on is on line 18, where a drawcall is triggered. Notice the comment that the rasterizer blocks the caller thread until current drawcall is processed and complete. Obviously, that is a grave difference from how real graphics APIs and HW rasterizers actually work: They never process one drawcall at a time and block until running to completion; that’d be awfully inefficient. Instead, they process batches of commands which applications record in a command buffer and submit to the HW with help of graphics drivers. Primary reasons to do this are 1) to minimize CPU <-> GPU work scheduling overhead and 2) to increase utilization of GPU resources.

Now that we have skimmed through how the sample makes a drawcall, we’re ready to enter the rasterizer side of things. If you look at the project source directory, you can see quite a few more files this time around. However, most of the important stuff is basically contained within Pipeline Thread and Render Engine and the rest just provides auxiliary functionality. Render Engine handles drawcall setup and various resources needed for the tile-based rasterizer such as per-tile bins and coverage mask buffers and provides synchronization between Pipeline Threads, each of which spawns its own thread to execute the above-mentioned pipeline stages in parallel. So how does Render Engine prepare a drawcall for the pipeline threads to execute exactly?

// Prepare for next drawcall

uint32_t numRemainingPrims = primCount;

uint32_t drawElemsPrev = 0u;
uint32_t numIter = 0;

while (numRemainingPrims > 0)
    // Prepare for next draw iteration

    // How many prims are to be processed this iteration & prims per thread
    uint32_t iterationSize = (numRemainingPrims >= m_RenderConfig.m_MaxDrawIterationSize) ? m_RenderConfig.m_MaxDrawIterationSize : numRemainingPrims;
    uint32_t perIterationRemainder = iterationSize % m_RenderConfig.m_NumPipelineThreads;
    uint32_t primsPerThread = iterationSize / m_RenderConfig.m_NumPipelineThreads;

    for (uint32_t threadIdx = 0; threadIdx < m_RenderConfig.m_NumPipelineThreads; threadIdx++)
        uint32_t currentDrawElemsStart = drawElemsPrev;
        uint32_t currentDrawElemsEnd = (threadIdx == (m_RenderConfig.m_NumPipelineThreads - 1)) ?
            // If number of remaining primitives in iteration is not multiple of number of threads, have the last thread cover the remaining range
            (currentDrawElemsStart + primsPerThread + perIterationRemainder) :
            currentDrawElemsStart + primsPerThread;

        // Threads must have been initialized and idle by now!
        PipelineThread* pThread = m_PipelineThreads[threadIdx];

        // Assign computed draw elems range for thread
        pThread->m_ActiveDrawParams.m_ElemsStart = currentDrawElemsStart;
        pThread->m_ActiveDrawParams.m_ElemsEnd = currentDrawElemsEnd;
        pThread->m_ActiveDrawParams.m_VertexOffset = vertexOffset;
        pThread->m_ActiveDrawParams.m_IsIndexed = isIndexed;

        // PipelineThread drawcall input prepared, it can start processing of drawcall
        pThread->, std::memory_order_release);

        drawElemsPrev = currentDrawElemsEnd;
        numRemainingPrims -= (currentDrawElemsEnd - currentDrawElemsStart);

    // All threads are assigned draw parameters, let them work now, std::memory_order_release);

    // Stall main thread until all active threads complete given draw iteration

What we do here is that we partition incoming primitives equally among pipeline threads to scale the rasterizer as number of available threads go up. You may be asking, what’s with this iteration thing? It stems from the fact that we have no way of knowing whether next drawcall contains 6 triangles or 5 million triangles in advance and we absolutely don’t want to dynamically allocate or resize our internal buffers; that’d make the whole thing an offline rasterizer! Instead we set an upper bound (can be arbitrarily large, we’ll ponder on what could be a good iteration size) for our internal buffers, allocate them once and at draw time simply iterate over the ranges of primitives that are sliced per iteration, which is logically equal to doing it all in one iteration. That’s akin to how HW rasterizers handle this unknown, too.

Exemplary drawcall setup: 3 threads processing 6 primitives per iteration in parallel

A few corner cases such as number of primitives not being multiple of number of threads or even less than number of threads could be present, so we handle them here, too. Having an upper bound for our internal buffers is well worth it, but it also necessitates us to do little extra book-keeping like invalidating stale data in our internal data structures and caches before every iteration. One last thing to note here is that we want all threads to enter their respective pipelines no earlier than when there is at least one thread whose drawcall parameters are not assigned yet, to not cause any synchronization issues, so we use an atomic flag, m_DrawcallSetupComplete to stall the threads until the drawcall setup is done. After this, threads are green-lit and we wait until all iterations of current drawcall are processed by the threads to return execution back to the caller thread.

Now that we have seen how the threads get triggered to process a drawcall, we can take a closer look at the pipeline itself.

Geometry Processing

Geometry processing refers to handling of vertices that are fetched from user buffers and comprises of vertex-shading, clipping, triangle setup and binning stages. Remember that by this point all threads are going to be running in parallel!

Geometry processing whereby each thread iterates over primitives in its assigned range
for (uint32_t drawIdx = m_ActiveDrawParams.m_ElemsStart, primIdx = m_ActiveDrawParams.m_ElemsStart % m_RenderConfig.m_MaxDrawIterationSize;
    drawIdx < m_ActiveDrawParams.m_ElemsEnd;
    drawIdx++, primIdx++)
    // drawIdx = Assigned prim indices which will be only used to fetch indices
    // primIdx = Prim index relative to current iteration

    // Clip-space vertices to be retrieved from VS
    glm::vec4 v0Clip, v1Clip, v2Clip;

    // VS
    ExecuteVertexShader<IsIndexed>(drawIdx, primIdx, &v0Clip, &v1Clip, &v2Clip);

    // Bbox of the primitive which will be computed during clipping
    Rect2D bbox;

    // CLIPPER
    if (!ExecuteFullTriangleClipping(primIdx, v0Clip, v1Clip, v2Clip, &bbox))
        // Triangle clipped, proceed iteration with next primitive

    if (!ExecuteTriangleSetupAndCull(primIdx, v0Clip, v1Clip, v2Clip))
        // Triangle culled, proceed iteration with next primitive

    // BINNER
    ExecuteBinner(primIdx, v0Clip, v1Clip, v2Clip, bbox);

The pipeline, as before, starts with Vertex Shader, whose job it is to fetch vertices/indices from bound vertex/index buffers and invoke a user-defined VS routine that will return clip-space positions as well as feeding attributes to shaded vertices. In addition to that, it leverages a small vertex shader cache, which is exclusive to each thread. If a vertex of an indexed mesh will be present in the VS$, instead of re-invoking VS, cached clip-space position and vertex attributes will be copied from the cache directly, which can save up significant amount of computation on vertex-heavy workloads. Otherwise, we will invoke the VS as usual and place a copy of the returned vertex data in the VS$. After clip-space position and attributes are collected, we also call CalculateInterpolationCoefficients(), which will calculate and store interpolation data that we will need to interpolate vertex attributes before passing them onto the FS routine. Notice that we keep a copy of transformed vertices in the loop, it is crucial for efficiency that those be kept in L1/L2/L3 (best case -> worst case) while a thread works over the geometry stages.

After completing VS and collecting vertex attributes, we continue with clipping triangles against view frustum and computing bounding boxes of primitives, i.e. screen-space enclosing area. As in the name of the function, we clip primitives at whole triangle granularity. Great, but what does it mean?

Looking at above drawing of X-W plane (from Blinn’s Calculating Screen Coverage), the top dark gray region represents points in front of the eye, hence completely visible whereas the bottom region represents points behind the eye. What we check for in clipper is whether all three vertices of a triangle are inside the completely visible, aka Trivial-Accept (TA) region or completely outside, aka Trivial-Reject (TR) region. If TA, we can safely compute bounding box of the primitive by applying homogeneous division (e.g. divide-by-W), else if TR, we can safely discard the whole primitive. Otherwise primitive must have vertices in different parts of the view frustum, which means that it needs to be clipped against the planes and drawn partially. However, it’s a feature of the homogeneous rasterization algorithm that its scan conversion method can render even partially visible primitives, which is something detailed in previous posts and the original paper. Hence, if Must-Clip condition occurs (i.e. neither TR nor TA), we will set the bounding box (overly conservative!) as the whole screen extent.

After full-triangle clipping is done, we continue with Triangle Setup and Culling where we calculate the familiar edge equation coefficients. Contrary to Rasterization In One Weekend implementation, this time I implemented the optimizations to the original paper, which cleverly gets rid of the need to invert vertex matrix:

    // First, transform clip-space (x, y, z, w) vertices to device-space 2D homogeneous coordinates (x, y, w)
    const glm::vec4 v0Homogen = TO_HOMOGEN(v0Clip, fbWidth, fbHeight);
    const glm::vec4 v1Homogen = TO_HOMOGEN(v1Clip, fbWidth, fbHeight);
    const glm::vec4 v2Homogen = TO_HOMOGEN(v2Clip, fbWidth, fbHeight);

    // To calculate EE coefficients, we need to set up a "vertex matrix" and invert it
    // M = |  x0  x1  x2  |
    //     |  y0  y1  y2  |
    //     |  w0  w1  w2  |

    // Alternatively, we can rely on the following relation between an inverse and adjoint of a matrix: inv(M) = adj(M)/det(M)
    // Since we use homogeneous coordinates, it's sufficient to only compute adjoint matrix:
    // A = |  a0  b0  c0  |
    //     |  a1  b1  c1  |
    //     |  a2  b2  c2  |

    float a0 = (v2Homogen.y * v1Homogen.w) - (v1Homogen.y * v2Homogen.w);
    float a1 = (v0Homogen.y * v2Homogen.w) - (v2Homogen.y * v0Homogen.w);
    float a2 = (v1Homogen.y * v0Homogen.w) - (v0Homogen.y * v1Homogen.w);

    float b0 = (v1Homogen.x * v2Homogen.w) - (v2Homogen.x * v1Homogen.w);
    float b1 = (v2Homogen.x * v0Homogen.w) - (v0Homogen.x * v2Homogen.w);
    float b2 = (v0Homogen.x * v1Homogen.w) - (v1Homogen.x * v0Homogen.w);

    float c0 = (v2Homogen.x * v1Homogen.y) - (v1Homogen.x * v2Homogen.y);
    float c1 = (v0Homogen.x * v2Homogen.y) - (v2Homogen.x * v0Homogen.y);
    float c2 = (v1Homogen.x * v0Homogen.y) - (v0Homogen.x * v1Homogen.y);

    // Additionally,
    // det(M) == 0 -> degenerate/zero-area triangle
    // det(M) < 0  -> back-facing triangle
    float detM = (c0 * v0Homogen.w) + (c1 * v1Homogen.w) + (c2 * v2Homogen.w);

    // Assign computed EE coefficients for given primitive
    m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 0] = { a0, b0, c0 };
    m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 1] = { a1, b1, c1 };
    m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 2] = { a2, b2, c2 };

Another benefit of this technique is that it lets us use the same edge equation coefficients for both scan conversion and attribute interpolation. What we did last time is that we computed a parameter interpolation vector for each distinct parameter, which could quickly bottleneck if you’ll have more attributes than just a few scalars, i.e. normals + texture coordinates. After getting Triangle Setup done, we reach a very critical part of the tile-based rasterizer, which is Binning (many concepts of which are explained beautifully by none other than Michael Abrash) stage. The role of the Binning in the pipeline is enormous: It needs to find which tiles cover which primitives, or put another way, which primitives overlap which tiles. Note that a tile is a sub-region of RT which comprises of 8×8 blocks by default.

For this, we will simply look at a single triangle and how it’d be processed, but remember, again, that all that is being done in parallel by multiple threads. And the way we achieve this is simple: For each tile, we allocate an array of overlapping primitives per thread. That is, by having each thread bin primitives in its own per-thread bin, we can keep them working simultaneously with no need to synchronize. That is also how we can preserve rendering order of primitives, by later going over binned primitives in thread order.

Colorful arrows represent edge normals while the enclosing rectangle is bounding box of the triangle.

Given a primitive, its bounding box and edge equation coefficients, the information that binning will provide us with is:

  • Trivial-Accept: Tile within triangle’s bounding box is completely covered by triangle
  • Trivial-Reject: Tile within triangle’s bounding box is completely outside triangle
  • Overlap: Tile within triangle’s bounding box is intersecting triangle

And how is that useful exactly? Assuming we choose a tile size of 64×64 pixels, trivially rejecting a tile means we can skip 64×64 per-pixel tests, all at once. Similarly, trivially accepting a tile means 64×64 pixels are visible, just draw the whole tile! Overlap is our least favorite result here, as it means that this tile and primitive need further processing in Rasterization stage where we will descent into block level and apply edge tests at finer level.

We start the binning process by first finding minimum and maximum tile indices that the bounding box of the triangle covers which we’ll loop over:

    // Given a tile size and frame buffer dimensions, find min/max range of the tiles that fall within bbox computed above
    // which we're going to iterate over, in order to determine if the primitive should be binned or not

    // Use floor(), min indices are inclusive
    uint32_t minTileX = static_cast<uint32_t>(glm::floor(bbox.m_MinX / m_RenderConfig.m_TileSize));
    uint32_t minTileY = static_cast<uint32_t>(glm::floor(bbox.m_MinY / m_RenderConfig.m_TileSize));

    // Use ceil(), max indices are exclusive
    uint32_t maxTileX = static_cast<uint32_t>(glm::ceil(bbox.m_MaxX / m_RenderConfig.m_TileSize));
    uint32_t maxTileY = static_cast<uint32_t>(glm::ceil(bbox.m_MaxY / m_RenderConfig.m_TileSize));

After fetching edge equation coefficients, we proceed with setting up TR and TA corners of each edge:

    // Fetch edge equation coefficients computed in triangle setup
    glm::vec3 ee0 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 0];
    glm::vec3 ee1 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 1];
    glm::vec3 ee2 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 2];

    // Normalize edge functions
    ee0 /= (glm::abs(ee0.x) + glm::abs(ee0.y));
    ee1 /= (glm::abs(ee1.x) + glm::abs(ee1.y));
    ee2 /= (glm::abs(ee2.x) + glm::abs(ee2.y));

    // Indices of tile corners:
    // LL -> 0  LR -> 1
    // UL -> 2  UR -> 3

    static const glm::vec2 scTileCornerOffsets[] =
        { 0.f, 0.f},                                            // LL
        { m_RenderConfig.m_TileSize, 0.f },                     // LR
        { 0.f, m_RenderConfig.m_TileSize },                     // UL
        { m_RenderConfig.m_TileSize, m_RenderConfig.m_TileSize} // UR

    // (x, y) -> sample location | (a, b, c) -> edge equation coefficients
    // E(x, y) = (a * x) + (b * y) + c
    // E(x + s, y + t) = E(x, y) + (a * s) + (b * t)

    // Based on edge normal n=(a, b), set up tile TR corners for each edge
    const uint8_t edge0TRCorner = (ee0.y >= 0.f) ? ((ee0.x >= 0.f) ? 3u : 2u) : (ee0.x >= 0.f) ? 1u : 0u;
    const uint8_t edge1TRCorner = (ee1.y >= 0.f) ? ((ee1.x >= 0.f) ? 3u : 2u) : (ee1.x >= 0.f) ? 1u : 0u;
    const uint8_t edge2TRCorner = (ee2.y >= 0.f) ? ((ee2.x >= 0.f) ? 3u : 2u) : (ee2.x >= 0.f) ? 1u : 0u;

    // TA corner is the one diagonal from TR corner calculated above
    const uint8_t edge0TACorner = 3u - edge0TRCorner;
    const uint8_t edge1TACorner = 3u - edge1TRCorner;
    const uint8_t edge2TACorner = 3u - edge2TRCorner;
TR corner of each tile for edge 2 marked (X extends right, Y points downward!)

TR corner of a tile is the one that’s most inside a given edge whereas TA corner is the one that’s most outside. The reason we need TR and TA corners is that if we can conclude that TR corner of any edge is outside an edge, the whole tile must be outside the triangle, hence can be trivially rejected. Similarly, if TA corners of all edges are inside respective edges, the whole tile must be inside the triangle (or vice versa), hence can be trivially accepted.

The way we find these corners in above code fragment is one of the most important parts for the tile-based rasterizer: We select TR/TA corners based on the slope of edge normal=(a, b) (a, b being x and y components of the normal). In above figure, we can see that for edge 2 (e.g. blue arrow), TR corner for all tiles is marked lower right. Why? For edge 2, it holds that (a > 0), which means that the edge must extend towards left, which means one of the right corners (lower or upper) must be closer to the edge than left ones. Similarly, (b > 0) holds true, which means that TR corner must be the lower right corner. To find TA corner, we have two options: Either apply the same logic or rely on the fact that the furthest corner from TR corner will be most outside, which is diagonal from TR corner, which’d be more optimal.

// Iterate over calculated range of tiles
for (uint32_t ty = minTileY, tyy = 0; ty < maxTileY; ty++, tyy++)
    for (uint32_t tx = minTileX, txx = 0; tx < maxTileX; tx++, txx++)
        // Using EE coefficients calculated in TriangleSetup stage and positive half-space tests, determine one of three cases possible for each tile:
        // 1) TrivialReject -- tile within tri's bbox does not intersect tri -> move on
        // 2) TrivialAccept -- tile within tri's bbox is completely within tri -> emit a full-tile coverage mask
        // 3) Overlap       -- tile within tri's bbox intersects tri -> bin the triangle to given tile for further rasterization where block/pixel-level coverage masks will be emitted

        // (txx, tyy) = how many steps are done per dimension
        const float txxOffset = static_cast<float>(txx * m_RenderConfig.m_TileSize);
        const float tyyOffset = static_cast<float>(tyy * m_RenderConfig.m_TileSize);

        // Step from edge function computed above for the first tile in bbox
        float edgeFuncTR0 = edgeFunc0 + ((ee0.x * (scTileCornerOffsets[edge0TRCorner].x + txxOffset)) + (ee0.y * (scTileCornerOffsets[edge0TRCorner].y + tyyOffset)));
        float edgeFuncTR1 = edgeFunc1 + ((ee1.x * (scTileCornerOffsets[edge1TRCorner].x + txxOffset)) + (ee1.y * (scTileCornerOffsets[edge1TRCorner].y + tyyOffset)));
        float edgeFuncTR2 = edgeFunc2 + ((ee2.x * (scTileCornerOffsets[edge2TRCorner].x + txxOffset)) + (ee2.y * (scTileCornerOffsets[edge2TRCorner].y + tyyOffset)));

        // If TR corner of the tile is outside any edge, reject whole tile
        bool TRForEdge0 = (edgeFuncTR0 < 0.f);
        bool TRForEdge1 = (edgeFuncTR1 < 0.f);
        bool TRForEdge2 = (edgeFuncTR2 < 0.f);
        if (TRForEdge0 || TRForEdge1 || TRForEdge2)
            LOG("Tile %d TR'd by thread %d\n", m_pRenderEngine->GetGlobalTileIndex(tx, ty), m_ThreadIdx);

            // TrivialReject
            // Tile is completely outside of one or more edges
            // Tile is partially or completely inside one or more edges, do TrivialAccept tests first

            // Compute edge functions at TA corners based on edge function at first tile origin
            float edgeFuncTA0 = edgeFunc0 + ((ee0.x * (scTileCornerOffsets[edge0TACorner].x + txxOffset)) + (ee0.y * (scTileCornerOffsets[edge0TACorner].y + tyyOffset)));
            float edgeFuncTA1 = edgeFunc1 + ((ee1.x * (scTileCornerOffsets[edge1TACorner].x + txxOffset)) + (ee1.y * (scTileCornerOffsets[edge1TACorner].y + tyyOffset)));
            float edgeFuncTA2 = edgeFunc2 + ((ee2.x * (scTileCornerOffsets[edge2TACorner].x + txxOffset)) + (ee2.y * (scTileCornerOffsets[edge2TACorner].y + tyyOffset)));

            // If TA corner of the tile is outside all edges, accept whole tile
            bool TAForEdge0 = (edgeFuncTA0 >= 0.f);
            bool TAForEdge1 = (edgeFuncTA1 >= 0.f);
            bool TAForEdge2 = (edgeFuncTA2 >= 0.f);
            if (TAForEdge0 && TAForEdge1 && TAForEdge2)
                // TrivialAccept
                // Tile is completely inside of the triangle, no further rasterization is needed,
                // whole tile will be fragment-shaded!

                LOG("Tile %d TA'd by thread %d\n", m_pRenderEngine->GetGlobalTileIndex(tx, ty), m_ThreadIdx);

                // Append tile to the rasterizer queue
                m_pRenderEngine->EnqueueTileForRasterization(m_pRenderEngine->GetGlobalTileIndex(tx, ty));

                CoverageMask mask;
                mask.m_SampleX = static_cast<uint32_t>(tilePosX + txxOffset); // Based off of first tile position calculated above
                mask.m_SampleY = static_cast<uint32_t>(tilePosY + tyyOffset); // Based off of first tile position calculated above
                mask.m_PrimIdx = primIdx;
                mask.m_Type = CoverageMaskType::TILE;

                // Emit full-tile coverage mask
                    m_pRenderEngine->GetGlobalTileIndex(tx, ty),
                LOG("Tile %d binned by thread %d\n", m_pRenderEngine->GetGlobalTileIndex(tx, ty), m_ThreadIdx);

                // Overlap
                // Tile is partially covered by the triangle, bin the triangle for the tile
                    m_pRenderEngine->GetGlobalTileIndex(tx, ty),

Once we find TR and TA corners for all three edges, we enter the loop where we iterate over tiles within the range [minTile{X|Y}, maxTile{X|Y}) to apply inside/outside tests for each tile. That’s simply the same test we’ve seen before, only difference being that the test is being performed for tile TR/TA corners. If a tile is TR’d, we move on. If a tile is TA’d, we emit a coverage mask for the whole tile (and make sure the tile is in the rasterizer queue) and continue. In the unfortunate event that a tile happens to be overlapping a tile, we bin the primitive to tile’s per-thread bin, which is as simple as doing so:

void RenderEngine::BinPrimitiveForTile(uint32_t threadIdx, uint32_t tileIdx, uint32_t primIdx)
    // Add primIdx to the per-thread bin of a tile

    std::vector<uint32_t>& tileBin = m_BinList[tileIdx][threadIdx];

    if (tileBin.empty())
        // First encounter of primitive for tile, enqueue it for rasterization
        // Tile must have been already appended to the work queue

    // Append primIdx to the tile's bin
void RenderEngine::EnqueueTileForRasterization(uint32_t tileIdx)
    // Append the tile to the rasterizer queue if not already done
    if (!m_TileList[tileIdx].m_IsTileQueued.test_and_set(std::memory_order_acq_rel))
        // Tile not queued up for rasterization, do so now

Rasterizer queue is a simple FIFO of tile indices that are waiting to be rasterized (in case a tile was overlapping a primitive) or fragment-shaded (TA’d tiles) in next stages. With that we can wrap up the geometry processing and proceed with rasterization. For reference, here is Hello, Triangle! sample by default vs trivially-accepted tiles no-op’d:


Before any thread can proceed to the rasterizer stage, they stall until all threads have reached post-binning stage, which is necessary to establish that all primitives have been binned to tiles before rasterizer goes and does its thing. As soon as all threads complete all stages of geometry processing, they start processing tiles from the rasterizer queue.

// To preserve rendering order, we must ensure that all threads finish binning primitives to tiles
// before rasterization is started. To do that, we will stall all threads to sync @DRAWCALL_RASTERIZATION
// Set state to post binning and stall until all PipelineThreads complete binning, std::memory_order_release);

The essence of Rasterizer is remarkably simple if you grasped the idea of binning: We apply the same set of TR/TA tests at block level this time, descending from tile level:

Threads fetch next available tile index from the rasterizer queue and operate on given tile by applying above-mentioned tests for all primitives binned to this tile in parallel again.

As in the binning stage, using the bounding box, we find the minimum and maximum block indices which overlap the primitive, determine block TR/TA corners for all edges and loop over the range to see what blocks could be TR’d or TA’d or ignored altogether. If a block is TR’d, again, we move on. If it’s TA’d, we emit a coverage mask for the whole block and continue. Otherwise, we need to descend once more, into pixel level to perform edge tests to emit coverage masks for pixels. The nice thing is that we can rasterize multiple pixels in parallel by leveraging SIMD:

// Position of the block that we're testing at pixel level
float blockPosX = (firstBlockWithinBBoxX + bxxOffset);
float blockPosY = (firstBlockWithinBBoxY + byyOffset);

// Compute E(x, y) = (x * a) + (y * b) c at block origin once
__m128 sseEdge0FuncAtBlockOrigin = _mm_set1_ps(ee0.z + ((ee0.x * blockPosX) + (ee0.y * blockPosY)));
__m128 sseEdge1FuncAtBlockOrigin = _mm_set1_ps(ee1.z + ((ee1.x * blockPosX) + (ee1.y * blockPosY)));
__m128 sseEdge2FuncAtBlockOrigin = _mm_set1_ps(ee2.z + ((ee2.x * blockPosX) + (ee2.y * blockPosY)));

// Store edge 0 equation coefficients
__m128 sseEdge0A4 = _mm_set_ps1(ee0.x);
__m128 sseEdge0B4 = _mm_set_ps1(ee0.y);

// Store edge 1 equation coefficients
__m128 sseEdge1A4 = _mm_set_ps1(ee1.x);
__m128 sseEdge1B4 = _mm_set_ps1(ee1.y);

// Store edge 2 equation coefficients
__m128 sseEdge2A4 = _mm_set_ps1(ee2.x);
__m128 sseEdge2B4 = _mm_set_ps1(ee2.y);

// Generate masks used for tie-breaking rules (not to double-shade along shared edges)
__m128 sseEdge0A4PositiveOrB4NonNegativeA4Zero = _mm_or_ps(_mm_cmpgt_ps(sseEdge0A4, _mm_setzero_ps()),
    _mm_and_ps(_mm_cmpge_ps(sseEdge0B4, _mm_setzero_ps()), _mm_cmpeq_ps(sseEdge0A4, _mm_setzero_ps())));

__m128 sseEdge1A4PositiveOrB4NonNegativeA4Zero = _mm_or_ps(_mm_cmpgt_ps(sseEdge1A4, _mm_setzero_ps()),
    _mm_and_ps(_mm_cmpge_ps(sseEdge1B4, _mm_setzero_ps()), _mm_cmpeq_ps(sseEdge1A4, _mm_setzero_ps())));

__m128 sseEdge2A4PositiveOrB4NonNegativeA4Zero = _mm_or_ps(_mm_cmpgt_ps(sseEdge2A4, _mm_setzero_ps()),
    _mm_and_ps(_mm_cmpge_ps(sseEdge2B4, _mm_setzero_ps()), _mm_cmpeq_ps(sseEdge2A4, _mm_setzero_ps())));

for (uint32_t py = 0; py < g_scPixelBlockSize; py++)
    // Store Y positions in current row (all samples on the same row has the same Y position)
    __m128 sseY4 = _mm_set_ps1(py + 0.5f);

    for (uint32_t px = 0; px < g_scNumEdgeTestsPerRow; px++)
        // E(x, y) = (x * a) + (y * b) + c
        // E(x + s, y + t) = E(x, y) + s * a + t * b

        // Store X positions of 4 consecutive samples
        __m128 sseX4 = _mm_setr_ps(
            g_scSIMDWidth * px + 0.5f,
            g_scSIMDWidth * px + 1.5f,
            g_scSIMDWidth * px + 2.5f,
            g_scSIMDWidth * px + 3.5f);

        // a * s
        __m128 sseEdge0TermA = _mm_mul_ps(sseEdge0A4, sseX4);
        __m128 sseEdge1TermA = _mm_mul_ps(sseEdge1A4, sseX4);
        __m128 sseEdge2TermA = _mm_mul_ps(sseEdge2A4, sseX4);

        // b * t
        __m128 sseEdge0TermB = _mm_mul_ps(sseEdge0B4, sseY4);
        __m128 sseEdge1TermB = _mm_mul_ps(sseEdge1B4, sseY4);
        __m128 sseEdge2TermB = _mm_mul_ps(sseEdge2B4, sseY4);

        // E(x+s, y+t) = E(x,y) + a*s + t*b
        __m128 sseEdgeFunc0 = _mm_add_ps(sseEdge0FuncAtBlockOrigin, _mm_add_ps(sseEdge0TermA, sseEdge0TermB));
        __m128 sseEdgeFunc1 = _mm_add_ps(sseEdge1FuncAtBlockOrigin, _mm_add_ps(sseEdge1TermA, sseEdge1TermB));
        __m128 sseEdgeFunc2 = _mm_add_ps(sseEdge2FuncAtBlockOrigin, _mm_add_ps(sseEdge2TermA, sseEdge2TermB));

        //E(x, y):
        //    E(x, y) > 0
        //        ||
        //    !E(x, y) < 0 && (a > 0 || (a = 0 && b >= 0))

        // Edge 0 test
        __m128 sseEdge0Positive = _mm_cmpgt_ps(sseEdgeFunc0, _mm_setzero_ps());
        __m128 sseEdge0Negative = _mm_cmplt_ps(sseEdgeFunc0, _mm_setzero_ps());
        __m128 sseEdge0FuncMask = _mm_or_ps(sseEdge0Positive,
            _mm_andnot_ps(sseEdge0Negative, sseEdge0A4PositiveOrB4NonNegativeA4Zero));

        // Edge 1 test
        __m128 sseEdge1Positive = _mm_cmpgt_ps(sseEdgeFunc1, _mm_setzero_ps());
        __m128 sseEdge1Negative = _mm_cmplt_ps(sseEdgeFunc1, _mm_setzero_ps());
        __m128 sseEdge1FuncMask = _mm_or_ps(sseEdge1Positive,
            _mm_andnot_ps(sseEdge1Negative, sseEdge1A4PositiveOrB4NonNegativeA4Zero));

        // Edge 2 test
        __m128 sseEdge2Positive = _mm_cmpgt_ps(sseEdgeFunc2, _mm_setzero_ps());
        __m128 sseEdge2Negative = _mm_cmplt_ps(sseEdgeFunc2, _mm_setzero_ps());
        __m128 sseEdge2FuncMask = _mm_or_ps(sseEdge2Positive,
            _mm_andnot_ps(sseEdge2Negative, sseEdge2A4PositiveOrB4NonNegativeA4Zero));

        // Combine resulting masks of all three edges
        __m128 sseEdgeFuncResult = _mm_and_ps(sseEdge0FuncMask,
            _mm_and_ps(sseEdge1FuncMask, sseEdge2FuncMask));

        uint16_t maskInt = static_cast<uint16_t>(_mm_movemask_ps(sseEdgeFuncResult));

        // If at least one sample is visible, emit coverage mask for the tile
        if (maskInt != 0x0)
            // Quad mask points to the first sample
            CoverageMask mask;
            mask.m_SampleX = static_cast<uint32_t>(blockPosX + (g_scSIMDWidth * px));
            mask.m_SampleY = static_cast<uint32_t>(blockPosY + py);
            mask.m_PrimIdx = primIdx;
            mask.m_Type = CoverageMaskType::QUAD;
            mask.m_QuadMask = maskInt;

            // Emit a quad mask
            m_pRenderEngine->AppendCoverageMask(m_ThreadIdx, nextTileIdx, mask);

Assuming you’ll excuse this ugly SIMD soup (I blame partly MSVC for generating code that’s rather non-optimal for such trivial arithmetic operations, sigh), it should be fairly easy to follow here: We pick 4 pixels in a row of a block, which is 8×8 pixels by default, perform edge tests and generate a coverage mask for the 4-fragment group. If such group is found to contain at least one sample who’s visible (e.g if (maskInt != 0x0 check), we emit a coverage mask for this group of fragments and continue until we have iterated all pixels in a block. Rinse and repeat for all blocks in the range and there is our Rasterizer!

The tricky thing here is that it’ll only pay off to incur overhead for rasterization at such finer level as long as majority of the binned triangles in a tile will be bigger than a single block, as otherwise we end up throwing away all the computations done at higher levels, which is why HW implementations usually have multiple mitigations for small triangles.

Fragment Shading

Having reached post-rasterization stage, we encounter another synchronization point:

// Rasterization completed, set state to post raster and
// stall until all PipelineThreads complete rasterization.
// We need this sync because when (N-x) threads finish rasterization and
// reach the end of tile queue while x threads are still busy rasterizing tile blocks,
// we must ensure that none of the (N-x) non-busy threads will go ahead and start fragment-shading tiles
// whose blocks could be currently still rasterized by x remaining threads, std::memory_order_release);

Ensuring that all threads reach FS stage at the same time after stalling until all of them complete rasterization, we enter the last stage of the pipeline. As in rasterization stage, FS will operate on tiles fetched from the rasterizer queue in parallel by consuming coverage masks that we’ve emitted first in Binning and then Rasterization stages.

auto& currentSlot = pCoverageMaskBuffer->m_AllocationList[numAlloc];

for (uint32_t numMask = 0; numMask < currentSlot.m_AllocationCount; numMask++)
    ASSERT(pCoverageMaskBuffer->m_AllocationList[numAlloc].m_pData != nullptr);

    CoverageMask* pMask = &currentSlot.m_pData[numMask];

    // In many cases, next N coverage masks will have been generated for the same primitive
    // that we're fragment-shading at tile, block or fragment levels here,
    // it could be optimized so that the EE coefficients of the same primitive won't be fetched
    // from memory over and over again, unsure what gain, if anything it'd yield...

    // First fetch EE coefficients that will be used (in addition to edge in/out tests) for perspective-correct interpolation of vertex attributes
    const glm::vec3 ee0 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * pMask->m_PrimIdx + 0];
    const glm::vec3 ee1 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * pMask->m_PrimIdx + 1];
    const glm::vec3 ee2 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * pMask->m_PrimIdx + 2];

    // Store edge 0 coefficients
    __m128 sseA4Edge0 = _mm_set_ps1(ee0.x);
    __m128 sseB4Edge0 = _mm_set_ps1(ee0.y);
    __m128 sseC4Edge0 = _mm_set_ps1(ee0.z);

    // Store edge 1 equation coefficients
    __m128 sseA4Edge1 = _mm_set_ps1(ee1.x);
    __m128 sseB4Edge1 = _mm_set_ps1(ee1.y);
    __m128 sseC4Edge1 = _mm_set_ps1(ee1.z);

    // Store edge 2 equation coefficients
    __m128 sseA4Edge2 = _mm_set_ps1(ee2.x);
    __m128 sseB4Edge2 = _mm_set_ps1(ee2.y);
    __m128 sseC4Edge2 = _mm_set_ps1(ee2.z);

    const SIMDEdgeCoefficients simdEERegs =

    switch (pMask->m_Type)
    case CoverageMaskType::TILE:
        LOG("Thread %d fragment-shading tile %d\n", m_ThreadIdx, nextTileIdx);
        FragmentShadeTile(pMask->m_SampleX, pMask->m_SampleY, pMask->m_PrimIdx, simdEERegs);
    case CoverageMaskType::BLOCK:
        LOG("Thread %d fragment-shading blocks\n", m_ThreadIdx);
        FragmentShadeBlock(pMask->m_SampleX, pMask->m_SampleY, pMask->m_PrimIdx, simdEERegs);
    case CoverageMaskType::QUAD:
        LOG("Thread %d fragment-shading coverage masks\n", m_ThreadIdx, ee0, ee1, ee2);
        FragmentShadeQuad(pMask, simdEERegs);

Before we invoke the user-defined FS routine, first we compute what’s called parameter basis functions, which is a direct implementation of the method described in the paper I’ve linked before. The essence of this is that instead of computing an interpolation vector per parameter, they find basis functions by which any parameter can be (perspective-correct) interpolated:

void PipelineThread::ComputeParameterBasisFunctions(
    uint32_t sampleX,
    uint32_t sampleY,
    const SIMDEdgeCoefficients& simdEERegs,
    __m128* pSSEf0XY,
    __m128* pSSEf1XY)
    // R(x, y) = F0(x, y) + F1(x, y) + F2(x, y)
    // r = 1/(F0(x, y) + F1(x, y) + F2(x, y))

    // Store X positions of 4 consecutive samples
    __m128 sseX4 = _mm_setr_ps(
        sampleX + 0.5f,
        sampleX + 1.5f,
        sampleX + 2.5f,
        sampleX + 3.5f); // x x+1 x+2 x+3

    // Store Y positions of 4 samples in a row (constant)
    __m128 sseY4 = _mm_set_ps1(sampleY); // y y y y

    // Compute F0(x,y)
    __m128 sseF0XY4 = _mm_add_ps(simdEERegs.m_SSEC4Edge0,
            _mm_mul_ps(sseY4, simdEERegs.m_SSEB4Edge0),
            _mm_mul_ps(sseX4, simdEERegs.m_SSEA4Edge0)));

    // Compute F1(x,y)
    __m128 sseF1XY4 = _mm_add_ps(simdEERegs.m_SSEC4Edge1,
            _mm_mul_ps(sseY4, simdEERegs.m_SSEB4Edge1),
            _mm_mul_ps(sseX4, simdEERegs.m_SSEA4Edge1)));

    // Compute F2(x,y)
    __m128 sseF2XY4 = _mm_add_ps(simdEERegs.m_SSEC4Edge2,
            _mm_mul_ps(sseY4, simdEERegs.m_SSEB4Edge2),
            _mm_mul_ps(sseX4, simdEERegs.m_SSEA4Edge2)));

    // Compute F(x,y) = F0(x,y) + F1(x,y) + F2(x,y)
    __m128 sseR4 = _mm_add_ps(sseF2XY4, _mm_add_ps(sseF0XY4, sseF1XY4));

    // Compute perspective correction factor
    sseR4 = _mm_rcp_ps(sseR4);

    // Assign final f0(x,y) & f1(x,y)
    *pSSEf0XY = _mm_mul_ps(sseR4, sseF0XY4);
    *pSSEf1XY = _mm_mul_ps(sseR4, sseF1XY4);

    // Basis functions f0, f1, f2 sum to 1, e.g. f0(x,y) + f1(x,y) + f2(x,y) = 1 so we'll skip computing f2(x,y) explicitly

Implementation for different coverage masks (tile/block/quads) is almost the same so I will refer to the quad mask for completeness. Note that this is the tightest code path, so we try to keep as SIMD-powered as possible.

void PipelineThread::FragmentShadeQuad(CoverageMask* pMask, const SIMDEdgeCoefficients& simdEERegs)
    FragmentShader FS = m_pRenderEngine->m_FragmentShader;

    // Vertex attributes to be interpolated and passed to FS
    InterpolatedAttributes interpolatedAttribs;

    // Parameter interpolation basis functions
    __m128 ssef0XY, ssef1XY;

    // Calculate basis functions f0(x,y) & f1(x,y) once

    // Interpolate depth values prior to depth test
    __m128 sseZInterpolated = InterpolateDepthValues(pMask->m_PrimIdx, ssef0XY, ssef1XY);

    // Load current depth buffer contents
    __m128 sseDepthCurrent = m_pRenderEngine->FetchDepthBuffer(pMask->m_SampleX, pMask->m_SampleY);

    // Perform LESS_THAN_EQUAL depth test
    __m128 sseDepthRes = _mm_cmple_ps(sseZInterpolated, sseDepthCurrent);

    // Interpolate active vertex attributes
    InterpolateVertexAttributes(pMask->m_PrimIdx, ssef0XY, ssef1XY, &interpolatedAttribs);

    // 4-sample fragment colors
    FragmentOutput fragmentOutput;

    // Invoke FS and update color/depth buffer with fragment output
    FS(&interpolatedAttribs, m_pRenderEngine->m_pConstantBuffer, &fragmentOutput);

    // Generate color mask from 4-bit int mask set during rasterization
    __m128i sseColorMask = _mm_setr_epi32(
        pMask->m_QuadMask & g_scQuadMask0,
        pMask->m_QuadMask & g_scQuadMask1,
        pMask->m_QuadMask & g_scQuadMask2,
        pMask->m_QuadMask & g_scQuadMask3);

    sseColorMask = _mm_cmpeq_epi32(sseColorMask,
        _mm_set_epi64x(0x800000004, 0x200000001));

    // AND depth mask & coverage mask for quads of fragments
    __m128 sseWriteMask = _mm_and_ps(sseDepthRes, _mm_castsi128_ps(sseColorMask));

    // Write interpolated Z values
    m_pRenderEngine->UpdateDepthBuffer(sseWriteMask, sseZInterpolated, pMask->m_SampleX, pMask->m_SampleY);

    // Write fragment output
    m_pRenderEngine->UpdateColorBuffer(sseWriteMask, fragmentOutput, pMask->m_SampleX, pMask->m_SampleY);

After computing basis functions, we first interpolate Z values and perform depth test. Noting the result of depth test, we interpolate all other user-defined vertex attributes which we previously set up after VS, pass them into the FS routine which we invoke next and collect the 4-sample fragment color output. For tile/blocks, we have a single write mask which is the depth result, as we know for sure that all TA’d tiles/blocks are visible. For quad of fragments, we also use the coverage mask to mask-store depth and color values back to the frame buffer. And with that, we complete our journey into a simple tile-based rasterizer.


Now that we have completed a tour around the internals of the tile-based rasterizer, a few ideas to play with and ultimately improve it come to mind:

  • Playing with rasterizer configuration, trying out different tile sizes, number of threads, iteration size, etc. to see how it all affects the performance
  • Swapping TR results with TA results to see TR’d tiles being rendered
  • Injecting a tileIdx variable into shader metadata and passing it to FS to color mesh parts based on the index of tile in which they are rendered
  • Integrating AVX (or even AVX-512!) into the pipeline, see how it affects performance on each workload (spoiler: I did it for AVX, gives around 30-50% performance gain on most scenes)
  • Finding performance bottlenecks for different scenes and trying to optimize them
  • Just attaching debugger and tracing triangles in the pipeline step-by-step

Therefore, I highly recommend for everyone to compile the projects and give it a try! If you’d like to just see it all in action, find Hello, Triangle! sample here and Scene Rendering sample here (and assets pack here if you wish to render more complex scenes)

Mandatory Sponza scene of ~275k triangles rendered on my laptop w/ Intel i7 6700-HQ at around 60 ms


Most, if not all of this stuff was inspired/taken shamelessly/adapted by following articles/papers so if you want to ultimately know more or understand better, for reference:

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s