Mandelbrot in floating-point vs fixed-point

Feeling a lack of intuition about how a fixed-point implementation of a naturally floating-point algorithm would fare against native floating-point performance on modern GPUs, I set out to do a few simple experiments. As it stands, I don't feel I have much insight on this yet, at least at a high level, as opposed to the u-arch details of one very specific workload. So, if you asked me about this right now, my reply would be "it depends" — as any good engineer without much idea (or with too much of it!) about the topic at hand should answer.

The simple experiments I've done are fixed-point and floating-point implementations of the well-known Mandelbrot set, run on the CPU and on the GPU in compute-only mode. The Mandelbrot set is not only visually and mathematically interesting to deal with, but it's also embarrassingly parallel: the computation for each pixel is independent of every other, so it can easily be extended to a GPGPU implementation. All code can be found here.
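To make the workload concrete, here is a minimal scalar escape-time sketch of the kind of kernel being benchmarked (my own sketch for illustration, not the code from the repository):

```cpp
#include <cstdint>

// Classic escape-time iteration: returns how many iterations it took
// z = z^2 + c to escape the radius-2 circle, or maxIter if it never did
// (in which case the point is assumed to belong to the set).
uint32_t MandelbrotIterations(float cr, float ci, uint32_t maxIter)
{
    float zr = 0.0f, zi = 0.0f;
    uint32_t i = 0;
    while (i < maxIter && (zr * zr + zi * zi) <= 4.0f)
    {
        float newZr = zr * zr - zi * zi + cr; // Re(z^2 + c)
        zi = 2.0f * zr * zi + ci;             // Im(z^2 + c)
        zr = newZr;
        ++i;
    }
    return i;
}
```

Per-pixel, this loop is all there is; the fixed-point variant replaces the float multiplies and adds with integer Q-format arithmetic.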

The CPU implementation is used both for single-threaded vs multi-threaded comparisons and for scalar vs SIMD (AVX2). The GPU implementation similarly comprises native floating-point and fixed-point compute shaders: nothing fancy, just dispatching one compute shader thread per pixel (local size = 16) to evaluate Mandelbrot set inclusion. I didn't strive to optimize any of this much, as the isolated results of these super-micro benchmarks are not the point I intend to make here; the overall comparisons are. I ran the experiments on my Windows 10 laptop, which has an Intel i7-6700HQ CPU with a GEN9/SKL integrated GPU (iGPU) and an Nvidia 970M discrete GPU (dGPU). Without further ado, here are the results I put together:

A few quick observations we can make based on these:

  • The fixed-point AVX2 implementation consistently remains the 2nd slowest despite operating on 8 elements at a time; only the single-threaded scalar native version is slower. I have yet to profile it, but the bottleneck most likely lies in my costly mul implementation, which requires extracting the lower and upper halves of the AVX registers to handle overflow correctly.
  • Not surprisingly, both GPUs favor floating-point code over fixed-point by far: GPU execution pipelines, which run graphics-specific floating-point routines (e.g. shaders) more often than not, are heavily optimized for floating-point operations.
  • Fixed-point math on the CPU side beats floating-point by a good amount in single-threaded code; in a multi-threaded context, however, the difference is much smaller, if not zero or slightly negative.
  • In contrast, the fixed-point vs floating-point gap in the AVX2 code improves significantly as we go from a single thread to multiple threads.
  • It gets increasingly difficult to beat the dGPU on the CPU side as the amount of work increases. That is not the case with the iGPU, though. 😦 However, this point owes a bit to the fact that the way I ran the compute shader is neither typical nor optimal GPU shader code.
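To illustrate the first point, here is the scalar equivalent of a fixed-point multiply (a Q16.16 format is my assumption for illustration; the repository's format may differ). The 64-bit intermediate is exactly the step that has no direct per-lane AVX2 equivalent, which forces the lower/upper-half extraction mentioned above:

```cpp
#include <cstdint>

// Scalar Q16.16 fixed-point multiply sketch. The product of two 32-bit
// fixed-point values needs a 64-bit intermediate before shifting back down,
// or the result overflows.
constexpr int kFracBits = 16;

int32_t FixedMul(int32_t a, int32_t b)
{
    int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    return static_cast<int32_t>(wide >> kFracBits); // arithmetic shift back to Q16.16
}

int32_t ToFixed(float f)
{
    return static_cast<int32_t>(f * (1 << kFracBits));
}
```

In scalar code this compiles to one widening multiply and a shift; in AVX2, where no 32x32→64 per-lane multiply of this shape exists, each vector has to be split and recombined.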


Despite virtually all desktop and mobile processors having had FPUs for a long time now, fixed-point math still has its place in the computing world. For starters, the embedded CPU one needs for a project might not have an FPU to begin with, or the FPU might be costly to use, in which case you can either emulate floating-point math or go with fixed-point. Another primary reason to use fixed-point is the need for deterministic and compact results across all platforms.

Let me know if you run the code on different platforms/processors and come across more interesting results!


I really encourage anyone checking this out to also compile (with slight modifications required here and there) and run it all on Linux with both GCC and Clang! For the record, I am aware of multiple shortcomings of MSVC that lead to horrible SIMD code generation at times, but I had NOT expected a 20-30% performance difference. Clang even beats GCC on the fixed-point code by a large margin.

Vulkan: Bad Practices

I’ve been working with the Vulkan API for quite a while now, so I wanted to compile a non-exhaustive list of things I find to be bad practices with regard to performance, functionality and validity of Vulkan workloads.

As with most advice, the best thing to do could be to pass it on — or act upon it, in which case the burden of investigating/profiling is on you! Note that these may or may not apply to your API of choice. With that said, here it goes.

Enabling all device features/extensions at all times

The Vulkan API call to create a logical device is vkCreateDevice, which takes a pointer to the enabled features and extensions. While from the spec's point of view it's perfectly legal to enable every extension and feature a Vulkan-capable GPU supports, doing so may lead to performance loss and unnecessary allocation of feature/extension-related data structures in drivers. So, unless you intend to use a feature, don't enable it! This bit is important enough that the spec asserts it:

Some features, such as robustBufferAccess, may incur a run-time performance cost. Application writers should carefully consider the implications of enabling all supported features.
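As a sketch of what that means in code (illustrative only, assuming a VkPhysicalDevice is already in hand; this is a config fragment, not a complete device-creation routine):

```cpp
#include <vulkan/vulkan.h>

// Start from an all-zero (all-disabled) feature struct and opt in only what
// the renderer actually uses, instead of passing the full result of
// vkGetPhysicalDeviceFeatures straight back to the driver.
void FillEnabledFeatures(VkPhysicalDevice physicalDevice, VkPhysicalDeviceFeatures* pEnabled)
{
    VkPhysicalDeviceFeatures supported = {};
    vkGetPhysicalDeviceFeatures(physicalDevice, &supported);

    *pEnabled = {}; // everything VK_FALSE by default
    // Example opt-in: we actually sample textures anisotropically
    pEnabled->samplerAnisotropy = supported.samplerAnisotropy;
    // robustBufferAccess deliberately left disabled; per the spec quote above,
    // it may incur a run-time performance cost.
}
// Later: VkDeviceCreateInfo::pEnabledFeatures = the struct filled above, and
// keep ppEnabledExtensionNames limited to extensions the code actually uses.
```

The same opt-in discipline applies to device extensions and, in newer Vulkan versions, to the pNext feature structs.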

Overdoing command buffer recording

One of the best features that modern, explicit graphics APIs such as D3D12 and Vulkan offer to developers is the ability to truly multi-thread the recording of GPU commands and their submission to HW queues.

While that sounds great, using hundreds of thousands of command buffers per frame is not ideal. Memory pressure is one thing — various options are available for short-lived command buffers, albeit with non-trivial overhead, e.g. vkTrimCommandPool and vkResetCommandPool. The bigger concern is missed optimization opportunities for drivers. The thing is, unlike OpenGL or D3D11 drivers, newer drivers don't track active GPU configuration across threads — there is no single global context. N command buffers which program the same heavy-weight HW state in parallel incur a bigger cost than if a more moderate number of command buffers were used.

The other side of the story is the submission of those command buffers to HW queues. Queue submission is one of the most CPU-overhead-incurring operations in a Vulkan driver, as noted by the spec:

Submission can be a high overhead operation, and applications should attempt to batch work together into as few calls to vkQueueSubmit as possible.

In many cases, submission of commands requires implicit flushes and synchronization overhead (e.g. the paging queue, other HW contexts, etc.) which may cause bubbles in the pipeline. So it's better done as few times as possible, by batching, as the spec asserts.

Overdoing pipeline barriers just to be safe

I am not aware of any validation layer that can detect unnecessary pipeline barriers (which on its own would be a pretty difficult thing to achieve anyway), so these bad boys can go unnoticed easily. The presence of an unnecessary pipeline barrier is one thing; setting source/destination stages simply to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT to be conservative is another. Again, drivers may need to take additional steps, in vain, simply because of one unnecessary stage bit or access mask. Note that the same applies to pipeline barrier access scopes.

Don’t keep your barriers as broad as possible just for the sake of it. It leads to missed opportunities for drivers to optimize around all sorts of data hazards.

Skipping Specialization Constants where applicable

This one may or may not be a trivial thing to do, given the complexities that some engines/applications employ in their shader source -> binary pipeline; however, it's a very nice feature that SPIR-V enables for us.

By paying the price of creating multiple graphics/compute pipelines of the same configuration, one per variant of a compile-time constant, you can easily turn dynamic uniform control flow into branch-free code.
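On the API side the plumbing is small. A hedged sketch (the constant ID and value are made-up examples, and this is a config fragment rather than runnable code):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Feed one compile-time constant into a pipeline via specialization
// constants. *pValue must stay alive until pipeline creation.
void FillSpecializationInfo(const uint32_t* pValue,
                            VkSpecializationMapEntry* pEntry,
                            VkSpecializationInfo* pInfo)
{
    pEntry->constantID = 0;        // matches layout(constant_id = 0) in the shader
    pEntry->offset = 0;
    pEntry->size = sizeof(uint32_t);

    pInfo->mapEntryCount = 1;
    pInfo->pMapEntries = pEntry;
    pInfo->dataSize = sizeof(uint32_t);
    pInfo->pData = pValue;
}
// Point VkPipelineShaderStageCreateInfo::pSpecializationInfo at the filled
// VkSpecializationInfo when creating each pipeline variant; the shader
// compiler can then fold the constant and eliminate the branch entirely.
```

One pipeline is created per constant value, which is exactly the price mentioned above.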


  • Don’t presume support for any image format, device feature or extension, query it!
  • Don’t wait idle on the whole freaking queue just because you want to know when some set of commands has completed execution; use the finer-grained synchronization mechanisms the API provides.
  • Be aware of the minimum/maximum limits of whatever you make use of. Where present, they’ll help you make more informed decisions ahead of time.


None of what I have presented above is endorsed by Intel, Khronos or any other group. They are merely my own personal observations.

A Short Introduction To SIMD-ifying Scalar Code

I finally took the opportunity that comes disguised as a lazy Sunday to write down a quick’n’dirty post on how one could convert a super simple scalar function to SIMD, for those who want to experiment with SIMD but feel intimidated by all that nasty-looking stuff. It’s mostly for absolute beginners, so it’ll be neither comprehensive nor eye-opening, but here it goes.

SIMD stands for Single Instruction Multiple Data: a SIMD-capable machine (e.g. a CPU or GPU) executes a single instruction on multiple elements simultaneously. Thus, at least from a theoretical point of view, it offers a nice speed-up over traditional scalar code, which operates on a single element at a time. The actual speed-up will vary depending on various factors.

In this post, we’ll use the ubiquitous SSE instruction set, but I am hoping that by the end, converting the same code to AVX (Advanced Vector Extensions) will be a simple job (and a nice exercise!). If you work with Visual Studio, be sure to check out the VS Intrinsics Dude add-on, which is a godsend for more serious SIMD development.

Now, for the sake of this post, assume that we profiled our existing scalar code below and concluded with the not-so-nice fact that it’s been optimized to death, and the only way to optimize it further is to put the idle SIMD execution units residing inside modern CPUs to work.

void ComputeResultsScalar(
    const int32_t* pAValues,
    const int32_t* pXValues,
    const int32_t* pBValues,
    int32_t* pResults)
{
    for (uint32_t i = 0; i < g_sNumInput; i++)
    {
        // Load a, x and b values
        int32_t a = pAValues[i];
        int32_t x = pXValues[i];
        int32_t b = pBValues[i];

        // Compute y = a * x + b
        int32_t y = a * x + b;

        int32_t result;
        if (y > 0)
            result = 1;
        else if (y == 0)
            result = 0;
        else
            result = -1;

        // Store result of single computation
        pResults[i] = result;
    }
}
To reiterate what the above scalar code does: for each input element, it simply evaluates the linear equation y = ax + b and outputs 1 if the result is positive, -1 if it's negative and 0 otherwise. Now, let’s try to directly convert the above code to SIMD using the SSE4 instruction set.

Whereas the scalar code iterates over the input range one element at a time, with SSE we will process 4 elements at a time. The reason is that SSE instructions operate on 4 single-precision floats or 4 32-bit integers at a time, and we need 32-bit integers to compute our results. That is, we could have been just fine processing, say, 16 8-bit chars at a time, had that suited our use case. Another option would be the AVX instruction set, which has 8 lanes of single-precision FP or 32-bit integers, or even AVX-512, which has 16 such lanes!

void ComputeResultsSIMD4(
    const int32_t* pAValues,
    const int32_t* pXValues,
    const int32_t* pBValues,
    int32_t* pResults)
{
    for (uint32_t i = 0; i < g_sNumInput; i += g_scSIMDWidth) // g_scSIMDWidth == 4 for SSE
    {
        // Here be dragons
    }
}

As in the scalar version, we start by fetching values for current iteration from input arrays:

__m128i sseA4 = _mm_load_si128(reinterpret_cast<const __m128i*>(pAValues + i));
__m128i sseX4 = _mm_load_si128(reinterpret_cast<const __m128i*>(pXValues + i));
__m128i sseB4 = _mm_load_si128(reinterpret_cast<const __m128i*>(pBValues + i));

_mm_load_si128, as its name may suggest, loads 128 bits of data from the memory location given by its mem_addr parameter. What’s the returned variable, though? It’s not an int or a pointer to int (where 4 32-bit integers were loaded, maybe?). What’s going on is that SIMD-capable CPUs contain, in addition to the general-purpose architectural registers used for scalar code execution, SIMD registers. These registers are used to store multiple elements (4 32-bit integers or floats in the case of SSE, for instance) during execution of SIMD operations. In x64, there are 16 of them, named XMM0, XMM1, XMM2, …, XMM15. Hence, data fetched from memory via _mm_load_si128 will be loaded into one of those registers.

It’s also helpful to look at a SIMD intrinsic piece-wise: the _mm prefix used here denotes that it belongs to the SSE instruction set (well, MMX too), load is the operation and si128 is the type of data (a 128-bit signed integer vector) the instruction operates on. For example, the identical instruction for single-precision FP would be _mm_load_ps, where ps stands for packed single-precision. It helps quite a bit to internalize that format for quick iterations.

Now that input data is loaded, we can proceed with doing the actual computation y = a * x + b:

__m128i sseY4 = _mm_mullo_epi32(sseA4, sseX4);
sseY4 = _mm_add_epi32(sseY4, sseB4);

Apart from the rather awkward and long function names to do what takes a single token in scalar land, it’s fairly straightforward if you parse it as above again. First, we multiply 4 a values with 4 x values and then add 4 b values to the result. That is, we first ask for a multiplication and then an addition on the SIMD registers which contain the a, x and b values for the current iteration.

Now that we are done computing the y values (remember, 4 of them!), we’d like to examine them so we can write out 1, 0 or -1 depending on their signs. How do we do that?!

You may have heard before that SIMD doesn’t mesh well with branchy code. Here’s the reason. If we were to individually check each y to determine its sign in order to store the expected value, that’d lead to 4 branches in each iteration! 4! Let alone the overhead of moving data between scalar and SIMD registers repeatedly, branches are one of the biggest enemies of high performance on modern OoO super-scalar CPUs to begin with, so it kind of defeats the purpose of using SIMD for performance (use of SIMD doesn’t come free, either).

[Of course, that’s not to say that if your algorithm has unavoidable branches and you still need to use SIMD, you can’t. You sure can. Just be very cautious of plans to SIMD-ify branch-heavy scalar code.]

Hence, to avoid conditionals in SIMD, other tactics must be utilized. Key to all of this is masking. What SIMD masks principally do is help us mask out values which don’t fit a given criterion (e.g. a condition or predicate). As it happens, SSE and other SIMD instruction sets also include instructions which take a mask as input, such that they perform the operation only on those lanes for which the given mask holds true. It’s one of those things that are easier done than said!

The code below uses SSE comparison instructions to generate 3 masks. We need 3 masks because there are 3 possible cases for y: it can be positive, negative or equal to zero.

__m128i sseMaskPositive = _mm_cmpgt_epi32(sseY4, _mm_set1_epi32(0));
__m128i sseMaskZero = _mm_cmpeq_epi32(sseY4, _mm_set1_epi32(0));
__m128i sseMaskNegative = _mm_cmplt_epi32(sseY4, _mm_set1_epi32(0));

sseMaskPositive contains 0xFFFFFFFF in each lane where the corresponding y value was positive (i.e. greater than 0) and 0x0 where it was not (i.e. less than or equal to 0). Similarly, sseMaskZero contains 0xFFFFFFFF in each lane where the corresponding y value was exactly zero, etc.

Now that the masks are ready, we store the values we want by write-masking them with the corresponding masks.

_mm_maskmoveu_si128(_mm_set1_epi32(1), sseMaskPositive, reinterpret_cast<char*>(pResults + i));
_mm_maskmoveu_si128(_mm_set1_epi32(0), sseMaskZero, reinterpret_cast<char*>(pResults + i));
_mm_maskmoveu_si128(_mm_set1_epi32(-1), sseMaskNegative, reinterpret_cast<char*>(pResults + i));

We somewhat strangely use the _mm_maskmoveu_si128 instruction to do this. Why? It turns out (and this happens not so rarely in SIMD development) that exactly what we want is not present in the instruction set: an instruction which’d take a mask and write 32-bit integers only to lanes where the mask is non-zero, something like _mm_maskstore_epi32, which only arrived with AVX2. Instead, _mm_maskmoveu_si128 can be used to write-mask 16 8-bit integers, which amounts to the same bytes here since our comparison masks set all bytes within a lane. Notice the u suffix? It denotes that the instruction can handle unaligned accesses.

I listed another method to achieve the exact same results in a slightly more optimal way; give it a look and try to understand why it works.
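For what it’s worth, here is one branch-free alternative (my own variant; it may or may not be the method referenced above): a true SSE comparison mask is all ones, i.e. -1 as a 32-bit integer, so the sign can be formed by subtracting two masks and stored with a plain unmasked store:

```cpp
#include <emmintrin.h> // SSE2 suffices for this variant
#include <cstdint>

// pResults[i] = 1, 0 or -1 depending on the sign of lane i of sseY4.
void StoreSigns4(__m128i sseY4, int32_t* pResults)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i maskPositive = _mm_cmpgt_epi32(sseY4, zero); // -1 where y > 0
    __m128i maskNegative = _mm_cmplt_epi32(sseY4, zero); // -1 where y < 0
    // Per lane: 0 - (-1) = 1 (positive), (-1) - 0 = -1 (negative), 0 - 0 = 0
    _mm_storeu_si128(reinterpret_cast<__m128i*>(pResults),
                     _mm_sub_epi32(maskNegative, maskPositive));
}
```

This drops the zero mask and all three masked stores, at the cost of always writing all 4 lanes.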

The Untold

Wrapping up our tour around Intro-To-SIMD land, a few things are left to be noted.

First and foremost, writing SIMD code manually can be mentally taxing, tiring, error-prone and, up front, sub-optimal compared to scalar code, which compilers are pretty good at optimizing. You also won’t reach peak performance if you don’t transform your data into a cache-friendly layout.

Next, not only are compilers good at optimizing scalar code, they also happen to be moderately good at auto-vectorizing scalar code to SIMD. Focusing on helping the compiler(s) at hand succeed at this could yield much better performance at the end of the day.

With that said, looking at the assembly generated from our sample, you can’t really rely on that. There you can see that while GCC is perfectly fine with converting the scalar code to SIMD, even though it contains a (simple) branch, MSVC (maybe unsurprisingly) fails to do so. Interestingly, it succeeds on a version with a 3-line change, which sort of proves my point! 🙂

A solution which covers most, if not all, of these points is ISPC. Although I have never used it myself, it’s highly praised and reportedly both helpful and close to optimal.

The End

Find the code for this little experiment here:

Here are a few ideas for you to exercise what we just went through:

  • Find a third way to store 1, 0 or -1 depending on the sign of y
  • Implement the same scalar code with AVX instructions
  • Convert a slightly more complicated scalar code, maybe solve a quadratic equation?

Chasing Triangles in a Tile-based Rasterizer

If you followed along or just came across my Rasterization in One Weekend blog series without having actually compiled and run the demo workloads, you’d be in for a big surprise if I told you now how slow they turned out to be. I hinted at the end at some of the existing techniques one could employ to speed up a painfully slow rasterizer, but that was all. Now it’s time to go ahead and get a taste of how they’d be applied in practice.

As part of this work, I implemented Tyler, a tile-based rasterizer which we will dissect below. It’s been my aim for this project that the rasterizer be scalable, configurable and understandable by people who’d like to know a little more about the subject and ultimately experiment with it. That said, what I’d like to show here relies quite a bit on what I laid out in Rasterization in One Weekend, so it’d be best if you read through that first. I will try not to make big assumptions; however, there will be less hand-holding and more high-level explanation, as I don’t want to repeat material that can be found there or elsewhere easily, in order to keep things tidy here. In any case, feel free to shoot your questions my way, I’d be happy to answer them.


Tile-based rendering (or tiled rendering) is an enhancement over traditional immediate-mode rendering whereby a render target (RT) is divided into tiles (i.e. sub-regions of a frame buffer), each of which collects the primitives that overlap it so they can be rendered per tile, separately.

Note the word separately here, as it emphasizes one of the biggest selling points of this technique over immediate-mode: trapping all color/depth buffer accesses inside a tile, instead of going out to “slow” DRAM, by utilizing an exclusive on-chip tile cache. It is of course no coincidence that such bandwidth-saving technology has been utilized most by mobile low-power GPUs. See how in the diagram below (top) vertices are processed, rasterized and shaded immediately, while at the bottom they are deferred?

Immediate-mode vs Tile-based architecture (PowerVR)

At a high level, for tile-based rendering to work, all we need to do in addition to the usual machinery of a rasterizer is, for each tile, build a list of primitives which overlap it, then shade all primitives in all tiles, either one tile at a time or, even better, in parallel. Of course, the complexity that such a simple definition bypasses is in the details, so we will look at the implementation details of Tyler in multiple parts below.
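The list-building step can be condensed into a tiny standalone sketch (structure and names are mine, not Tyler’s): given a primitive’s screen-space bounding box, append its index to the bin of every tile the box overlaps.

```cpp
#include <cstdint>
#include <vector>

struct Bbox { int minX, minY, maxX, maxY; }; // inclusive pixel coordinates

// Append primIdx to the bin of every tileSize x tileSize tile its bbox touches.
void BinPrimitive(uint32_t primIdx, const Bbox& box, int tileSize,
                  int tilesPerRow, std::vector<std::vector<uint32_t>>& bins)
{
    for (int ty = box.minY / tileSize; ty <= box.maxY / tileSize; ty++)
        for (int tx = box.minX / tileSize; tx <= box.maxX / tileSize; tx++)
            bins[ty * tilesPerRow + tx].push_back(primIdx);
}
```

Shading then simply walks each tile’s bin, never touching pixels outside the tile.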

This rasterizer unsurprisingly implements only a small subset of the modern 3D rendering pipeline and consists of the following stages that triangles go through as they move from vertex buffers to pixels on the screen:

  • Vertex Shader
  • Clipper (full-triangle only!)
  • Triangle Setup & Cull
  • Binner
  • Rasterizer
  • Fragment Shader

Tyler includes a pseudo-API for rendering which imitates what a proper graphics API looks like, so let’s first review how it can be used to set up the rasterizer and submit drawcalls before entering its bowels. Note that I’ll be referring to the Scene Rendering sample most of the time, which loads an .OBJ scene file, renders multiple meshes visualizing scaled normals for simplicity, and optionally presents in a window.

Rasterizer context setup
// Create and initialize the rasterizer rendering context
RenderContext* pRenderContext = new RenderContext(config);

// Allocate color & depth buffers
uint8_t* pColorBuffer = ... // Color buffer format == R8G8B8A8_UNORM
float* pDepthBuffer = ... // Depth buffer format == D32_FLOAT

// Set up main FBO
Framebuffer fbo = {};
fbo.m_pColorBuffer = pColorBuffer;
fbo.m_pDepthBuffer = pDepthBuffer;
fbo.m_Width = opt.m_ScreenWidth;
fbo.m_Height = opt.m_ScreenHeight;

// Single vec3 attribute is used to pass vertex normal VS -> FS
ShaderMetadata metadata = { 0, 1 /*vertex normal*/, 0 };

// Emulate passing of constants data to shaders
Constants cb;
cb.m_ModelViewProj = proj * view * model;

// We will have single index and vertex buffer to draw indexed mesh
std::vector<Vertex> vertexBuffer;
std::vector<uint32_t> indexBuffer;
// Store data of all scene objects to be drawn
std::vector<Mesh> objects;

// Load .OBJ scene model data and generate vertex/index buffers, etc.
InitializeSceneObjects(opt.m_OBJName, objects, vertexBuffer, indexBuffer);

// Bind FBO to be used in subsequent render pass once

// Bind VBO and set buffer stride
pRenderContext->BindVertexBuffer(vertexBuffer.data(), sizeof(Vertex));
// Bind IBO

// Bind shader constants

// Bind shader, constant buffer, texture(s)
pRenderContext->BindShaders(VS, FS, metadata);
Main rendering loop
                // Clear RT
                    true, /*clearColor*/
                    glm::vec4(0, 0, 0, 1) /*colorValue*/,
                    true, /*clearDepth*/
                    FLT_MAX /*depthValue*/);

                // Draw meshes
                for (uint32_t obj = 0; obj < objects.size(); obj++)
                    Mesh& mesh = objects[obj];

                    view = glm::lookAt(testParams.m_EyePos, testParams.m_LookAtPos, glm::vec3(0, 1, 0));
                    //model = glm::rotate(model, glm::radians(0.5f), glm::vec3(0, 1, 0));
                    cb.m_ModelViewProj = proj * view * model;

                    // Kick off draw. Note that it blocks the caller until the drawcall is completed
                    pRenderContext->DrawIndexed(mesh.m_IdxCount, mesh.m_IdxOffset);


There is no magic going on on the application side: after the rendering context is set up with the default configuration, the usual binding of frame buffer, vertex/index buffers, shaders and all that is done, and objects are rendered one mesh at a time. Where the tile-based rasterizer does its thing, and where we’ll focus, is the DrawIndexed() call which triggers a drawcall. Notice the comment that the rasterizer blocks the calling thread until the current drawcall is processed and complete. Obviously, that is a stark difference from how real graphics APIs and HW rasterizers actually work: they never process one drawcall at a time and block until it runs to completion; that’d be awfully inefficient. Instead, they process batches of commands which applications record into command buffers and submit to the HW with the help of graphics drivers. The primary reasons to do this are 1) to minimize CPU <-> GPU work-scheduling overhead and 2) to increase utilization of GPU resources.

Now that we have skimmed through how the sample makes a drawcall, we’re ready to enter the rasterizer side of things. If you look at the project source directory, you’ll see quite a few more files this time around. However, most of the important stuff is contained within Pipeline Thread and Render Engine; the rest provides auxiliary functionality. Render Engine handles drawcall setup and the various resources the tile-based rasterizer needs, such as per-tile bins and coverage-mask buffers, and provides synchronization between Pipeline Threads, each of which spawns its own thread to execute the above-mentioned pipeline stages in parallel. So how exactly does Render Engine prepare a drawcall for the pipeline threads to execute?

// Prepare for next drawcall

uint32_t numRemainingPrims = primCount;

uint32_t drawElemsPrev = 0u;
uint32_t numIter = 0;

while (numRemainingPrims > 0)
    // Prepare for next draw iteration

    // How many prims are to be processed this iteration & prims per thread
    uint32_t iterationSize = (numRemainingPrims >= m_RenderConfig.m_MaxDrawIterationSize) ? m_RenderConfig.m_MaxDrawIterationSize : numRemainingPrims;
    uint32_t perIterationRemainder = iterationSize % m_RenderConfig.m_NumPipelineThreads;
    uint32_t primsPerThread = iterationSize / m_RenderConfig.m_NumPipelineThreads;

    for (uint32_t threadIdx = 0; threadIdx < m_RenderConfig.m_NumPipelineThreads; threadIdx++)
        uint32_t currentDrawElemsStart = drawElemsPrev;
        uint32_t currentDrawElemsEnd = (threadIdx == (m_RenderConfig.m_NumPipelineThreads - 1)) ?
            // If number of remaining primitives in iteration is not multiple of number of threads, have the last thread cover the remaining range
            (currentDrawElemsStart + primsPerThread + perIterationRemainder) :
            currentDrawElemsStart + primsPerThread;

        // Threads must have been initialized and idle by now!
        PipelineThread* pThread = m_PipelineThreads[threadIdx];

        // Assign computed draw elems range for thread
        pThread->m_ActiveDrawParams.m_ElemsStart = currentDrawElemsStart;
        pThread->m_ActiveDrawParams.m_ElemsEnd = currentDrawElemsEnd;
        pThread->m_ActiveDrawParams.m_VertexOffset = vertexOffset;
        pThread->m_ActiveDrawParams.m_IsIndexed = isIndexed;

        // PipelineThread drawcall input prepared, it can start processing of drawcall
        pThread->m_DrawParamsReady.store(true, std::memory_order_release); // (member name approximated; release ensures the params above are visible to the thread)

        drawElemsPrev = currentDrawElemsEnd;
        numRemainingPrims -= (currentDrawElemsEnd - currentDrawElemsStart);

    // All threads are assigned their draw parameters, let them work now
    m_DrawcallSetupComplete.store(true, std::memory_order_release);

    // Stall main thread until all active threads complete given draw iteration

What we do here is partition incoming primitives equally among the pipeline threads, so that the rasterizer scales as the number of available threads goes up. You may be asking: what’s with this iteration thing? It stems from the fact that we have no way of knowing in advance whether the next drawcall contains 6 triangles or 5 million, and we absolutely don’t want to dynamically allocate or resize our internal buffers; that’d make the whole thing an offline rasterizer! Instead we set an upper bound for our internal buffers (it can be arbitrarily large; we’ll ponder what a good iteration size could be), allocate them once, and at draw time simply iterate over the ranges of primitives sliced per iteration, which is logically equal to doing it all in one iteration. That’s akin to how HW rasterizers handle this unknown, too.

Exemplary drawcall setup: 3 threads processing 6 primitives per iteration in parallel

A few corner cases could be present, such as the number of primitives not being a multiple of the number of threads, or even being less than the number of threads, so we handle those here, too. Having an upper bound for our internal buffers is well worth it, but it also requires a little extra book-keeping, like invalidating stale data in our internal data structures and caches before every iteration. One last thing to note here: no thread may enter its pipeline while there is still a thread whose drawcall parameters haven’t been assigned yet, to avoid synchronization issues, so we use an atomic flag, m_DrawcallSetupComplete, to stall the threads until the drawcall setup is done. After this, the threads are green-lit and we wait until all iterations of the current drawcall are processed before returning execution to the caller thread.
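The per-thread partitioning, including the remainder corner case, can be condensed into a standalone sketch (my own condensation, not Tyler’s code):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Split `iterationSize` primitives across `numThreads` half-open ranges
// [start, end); the last thread absorbs the remainder when the count
// doesn't divide evenly.
std::vector<std::pair<uint32_t, uint32_t>> PartitionPrims(uint32_t iterationSize, uint32_t numThreads)
{
    uint32_t perThread = iterationSize / numThreads;
    uint32_t remainder = iterationSize % numThreads;

    std::vector<std::pair<uint32_t, uint32_t>> ranges;
    uint32_t start = 0;
    for (uint32_t t = 0; t < numThreads; t++)
    {
        uint32_t end = start + perThread + ((t == numThreads - 1) ? remainder : 0);
        ranges.emplace_back(start, end);
        start = end;
    }
    return ranges;
}
```

Every primitive lands in exactly one range, so no triangle is binned twice or dropped.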

Now that we have seen how the threads get triggered to process a drawcall, we can take a closer look at the pipeline itself.

Geometry Processing

Geometry processing refers to the handling of vertices fetched from user buffers and comprises the vertex-shading, clipping, triangle setup and binning stages. Remember that by this point all threads are running in parallel!

Geometry processing whereby each thread iterates over primitives in its assigned range
for (uint32_t drawIdx = m_ActiveDrawParams.m_ElemsStart, primIdx = m_ActiveDrawParams.m_ElemsStart % m_RenderConfig.m_MaxDrawIterationSize;
    drawIdx < m_ActiveDrawParams.m_ElemsEnd;
    drawIdx++, primIdx++)
{
    // drawIdx = Assigned prim index, used only to fetch indices
    // primIdx = Prim index relative to current iteration

    // Clip-space vertices to be retrieved from VS
    glm::vec4 v0Clip, v1Clip, v2Clip;

    // VS
    ExecuteVertexShader<IsIndexed>(drawIdx, primIdx, &v0Clip, &v1Clip, &v2Clip);

    // Bbox of the primitive, computed during clipping
    Rect2D bbox;

    // CLIPPER
    if (!ExecuteFullTriangleClipping(primIdx, v0Clip, v1Clip, v2Clip, &bbox))
    {
        // Triangle clipped, proceed with next primitive
        continue;
    }

    // TRIANGLE SETUP & CULL
    if (!ExecuteTriangleSetupAndCull(primIdx, v0Clip, v1Clip, v2Clip))
    {
        // Triangle culled, proceed with next primitive
        continue;
    }

    // BINNER
    ExecuteBinner(primIdx, v0Clip, v1Clip, v2Clip, bbox);
}

The pipeline, as before, starts with the Vertex Shader, whose job is to fetch vertices/indices from the bound vertex/index buffers and invoke a user-defined VS routine that returns clip-space positions as well as feeding attributes to shaded vertices. In addition, it leverages a small vertex shader cache (VS$), exclusive to each thread. If a vertex of an indexed mesh is present in the VS$, instead of re-invoking the VS, the cached clip-space position and vertex attributes are copied from the cache directly, which can save a significant amount of computation in vertex-heavy workloads. Otherwise, we invoke the VS as usual and place a copy of the returned vertex data in the VS$. After the clip-space position and attributes are collected, we also call CalculateInterpolationCoefficients(), which calculates and stores the interpolation data we will need to interpolate vertex attributes before passing them on to the FS routine. Notice that we keep a copy of the transformed vertices in the loop; it is crucial for efficiency that those be kept in L1/L2/L3 (best case -> worst case) while a thread works through the geometry stages.
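A toy direct-mapped cache in the spirit just described might look like this (size and layout are my choices for illustration; Tyler’s real VS$ differs in detail):

```cpp
#include <cstdint>
#include <cstring>

// Direct-mapped vertex shader cache keyed by vertex index: on a hit we
// reuse the shaded result instead of re-running the VS. Only the clip-space
// position is cached here; a real cache would also hold attributes.
struct CachedVertex { uint32_t index; float clipPos[4]; bool valid; };

struct VertexCache
{
    static const uint32_t kEntries = 64;
    CachedVertex entries[kEntries] = {};

    bool Lookup(uint32_t vertexIdx, float outClip[4]) const
    {
        const CachedVertex& e = entries[vertexIdx % kEntries];
        if (!e.valid || e.index != vertexIdx)
            return false; // slot empty or holds a different vertex
        std::memcpy(outClip, e.clipPos, sizeof(e.clipPos));
        return true;
    }

    void Insert(uint32_t vertexIdx, const float clip[4])
    {
        CachedVertex& e = entries[vertexIdx % kEntries]; // evict whatever is there
        e.index = vertexIdx;
        std::memcpy(e.clipPos, clip, sizeof(e.clipPos));
        e.valid = true;
    }
};
```

Because tile lists revisit the same indexed vertices often, even a tiny per-thread cache like this can achieve a useful hit rate without any cross-thread synchronization.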

After completing VS and collecting vertex attributes, we continue with clipping triangles against the view frustum and computing bounding boxes of primitives, i.e. their screen-space enclosing area. As the name of the function suggests, we clip primitives at whole-triangle granularity. Great, but what does that mean?

Looking at the above drawing of the X-W plane (from Blinn’s Calculating Screen Coverage), the top dark gray region represents points in front of the eye, hence completely visible, whereas the bottom region represents points behind the eye. What we check for in the clipper is whether all three vertices of a triangle are inside the completely visible, aka Trivial-Accept (TA) region or completely outside, aka the Trivial-Reject (TR) region. If TA, we can safely compute the bounding box of the primitive by applying the homogeneous division (i.e. divide-by-W); if TR, we can safely discard the whole primitive. Otherwise the primitive must have vertices in different parts of the view frustum, which means that it needs to be clipped against the planes and drawn partially. However, it’s a feature of the homogeneous rasterization algorithm that its scan conversion method can render even partially visible primitives, something detailed in previous posts and the original paper. Hence, if the Must-Clip condition occurs (i.e. neither TR nor TA), we simply set the bounding box (overly conservative!) to the whole screen extent.
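The TA/TR/Must-Clip decision can be sketched with classic outcodes. This is a scalar illustration only, not the engine's actual clipper code, and it assumes a D3D-style z range of [0, w]; a GL-style clipper would test -w <= z instead.

```cpp
#include <cstdint>

struct Vec4 { float x, y, z, w; };

enum class ClipResult { TrivialAccept, TrivialReject, MustClip };

// Outcode: one bit per frustum plane (-w <= x, x <= w, -w <= y, y <= w, 0 <= z, z <= w)
static uint32_t ComputeOutcode(const Vec4& v)
{
    uint32_t code = 0;
    if (v.x < -v.w) code |= 1u << 0;
    if (v.x >  v.w) code |= 1u << 1;
    if (v.y < -v.w) code |= 1u << 2;
    if (v.y >  v.w) code |= 1u << 3;
    if (v.z <  0.f) code |= 1u << 4;
    if (v.z >  v.w) code |= 1u << 5;
    return code;
}

// Whole-triangle clipping decision: TA if every vertex is inside all planes,
// TR if all three vertices are outside the same plane, otherwise Must-Clip
// (in which case the bbox is set to the whole screen extent)
static ClipResult ClassifyTriangle(const Vec4& v0, const Vec4& v1, const Vec4& v2)
{
    const uint32_t c0 = ComputeOutcode(v0);
    const uint32_t c1 = ComputeOutcode(v1);
    const uint32_t c2 = ComputeOutcode(v2);

    if ((c0 | c1 | c2) == 0) return ClipResult::TrivialAccept; // all vertices in
    if ((c0 & c1 & c2) != 0) return ClipResult::TrivialReject; // all out of one plane
    return ClipResult::MustClip; // partially visible
}
```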

After full-triangle clipping is done, we continue with Triangle Setup and Culling where we calculate the familiar edge equation coefficients. Contrary to the Rasterization in One Weekend implementation, this time I implemented the optimizations from the original paper, which cleverly get rid of the need to invert the vertex matrix:

    // First, transform clip-space (x, y, z, w) vertices to device-space 2D homogeneous coordinates (x, y, w)
    const glm::vec4 v0Homogen = TO_HOMOGEN(v0Clip, fbWidth, fbHeight);
    const glm::vec4 v1Homogen = TO_HOMOGEN(v1Clip, fbWidth, fbHeight);
    const glm::vec4 v2Homogen = TO_HOMOGEN(v2Clip, fbWidth, fbHeight);

    // To calculate EE coefficients, we need to set up a "vertex matrix" and invert it
    // M = |  x0  x1  x2  |
    //     |  y0  y1  y2  |
    //     |  w0  w1  w2  |

    // Alternatively, we can rely on the following relation between an inverse and adjoint of a matrix: inv(M) = adj(M)/det(M)
    // Since we use homogeneous coordinates, it's sufficient to only compute adjoint matrix:
    // A = |  a0  b0  c0  |
    //     |  a1  b1  c1  |
    //     |  a2  b2  c2  |

    float a0 = (v2Homogen.y * v1Homogen.w) - (v1Homogen.y * v2Homogen.w);
    float a1 = (v0Homogen.y * v2Homogen.w) - (v2Homogen.y * v0Homogen.w);
    float a2 = (v1Homogen.y * v0Homogen.w) - (v0Homogen.y * v1Homogen.w);

    float b0 = (v1Homogen.x * v2Homogen.w) - (v2Homogen.x * v1Homogen.w);
    float b1 = (v2Homogen.x * v0Homogen.w) - (v0Homogen.x * v2Homogen.w);
    float b2 = (v0Homogen.x * v1Homogen.w) - (v1Homogen.x * v0Homogen.w);

    float c0 = (v2Homogen.x * v1Homogen.y) - (v1Homogen.x * v2Homogen.y);
    float c1 = (v0Homogen.x * v2Homogen.y) - (v2Homogen.x * v0Homogen.y);
    float c2 = (v1Homogen.x * v0Homogen.y) - (v0Homogen.x * v1Homogen.y);

    // Additionally,
    // det(M) == 0 -> degenerate/zero-area triangle
    // det(M) < 0  -> back-facing triangle
    float detM = (c0 * v0Homogen.w) + (c1 * v1Homogen.w) + (c2 * v2Homogen.w);

    // Assign computed EE coefficients for given primitive
    m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 0] = { a0, b0, c0 };
    m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 1] = { a1, b1, c1 };
    m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 2] = { a2, b2, c2 };

Another benefit of this technique is that it lets us use the same edge equation coefficients for both scan conversion and attribute interpolation. What we did last time was compute a parameter interpolation vector for each distinct parameter, which could quickly become a bottleneck once you have more attributes than just a few scalars, e.g. normals + texture coordinates. After getting Triangle Setup done, we reach a very critical part of the tile-based rasterizer: the Binning stage (many concepts of which are explained beautifully by none other than Michael Abrash). The role of Binning in the pipeline is enormous: it needs to find which tiles cover which primitives, or put another way, which primitives overlap which tiles. Note that a tile is a sub-region of the RT which comprises 8×8 blocks by default.

For this, we will simply look at a single triangle and how it’d be processed, but remember, again, that all of this is being done in parallel by multiple threads. The way we achieve this is simple: for each tile, we allocate an array of overlapping primitives per thread. That is, by having each thread bin primitives into its own per-thread bin, we can keep threads working simultaneously with no need to synchronize. That is also how we preserve the rendering order of primitives: by later going over the binned primitives in thread order.
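A toy sketch of that bin layout (hypothetical types, not the engine's actual ones): each tile owns one bin per thread, threads append to their own bin without locking, and consuming the bins in thread order afterwards restores submission order, assuming primitives were handed to threads in order.

```cpp
#include <cstdint>
#include <vector>

// Per-tile, per-thread bins: bins[threadIdx] holds the primitive indices
// that a given thread binned to this tile. No synchronization needed while binning.
using TileBins = std::vector<std::vector<uint32_t>>;

// Consume bins in thread order to restore the original submission order
// (assumes primitives were distributed to threads in contiguous, ordered chunks)
static std::vector<uint32_t> CollectInOrder(const TileBins& bins)
{
    std::vector<uint32_t> ordered;
    for (const auto& threadBin : bins)     // thread 0, 1, 2, ... in order
        for (uint32_t primIdx : threadBin) // each thread binned in ascending order
            ordered.push_back(primIdx);
    return ordered;
}
```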

Colorful arrows represent edge normals while the enclosing rectangle is bounding box of the triangle.

Given a primitive, its bounding box and edge equation coefficients, the information that binning will provide us with is:

  • Trivial-Accept: Tile within triangle’s bounding box is completely covered by triangle
  • Trivial-Reject: Tile within triangle’s bounding box is completely outside triangle
  • Overlap: Tile within triangle’s bounding box is intersecting triangle

And how is that useful exactly? Assuming we choose a tile size of 64×64 pixels, trivially rejecting a tile means we can skip 64×64 per-pixel tests, all at once. Similarly, trivially accepting a tile means 64×64 pixels are visible, just draw the whole tile! Overlap is our least favorite result here, as it means that this tile and primitive need further processing in the Rasterization stage, where we will descend to block level and apply edge tests at a finer granularity.
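To put rough numbers on it (assuming a 1280×720 RT and the 64×64-pixel tiles mentioned above; both values are just illustrative):

```cpp
#include <cstdint>

// Back-of-the-envelope: how many tiles cover the RT, and how many per-pixel
// edge tests does a single trivially-rejected/accepted tile spare us?
constexpr uint32_t tileSize = 64;
constexpr uint32_t fbWidth = 1280, fbHeight = 720;

// Ceil-divide so partially covered tiles at the borders are counted too
constexpr uint32_t tilesX = (fbWidth + tileSize - 1) / tileSize;
constexpr uint32_t tilesY = (fbHeight + tileSize - 1) / tileSize;
constexpr uint32_t testsSkippedPerTile = tileSize * tileSize;
```

That is a 20×12 tile grid, and every TR'd or TA'd tile replaces 4096 per-pixel tests with a single corner test.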

We start the binning process by first finding the minimum and maximum tile indices covered by the triangle’s bounding box, which we’ll loop over:

    // Given a tile size and frame buffer dimensions, find min/max range of the tiles that fall within bbox computed above
    // which we're going to iterate over, in order to determine if the primitive should be binned or not

    // Use floor(), min indices are inclusive
    uint32_t minTileX = static_cast<uint32_t>(glm::floor(bbox.m_MinX / m_RenderConfig.m_TileSize));
    uint32_t minTileY = static_cast<uint32_t>(glm::floor(bbox.m_MinY / m_RenderConfig.m_TileSize));

    // Use ceil(), max indices are exclusive
    uint32_t maxTileX = static_cast<uint32_t>(glm::ceil(bbox.m_MaxX / m_RenderConfig.m_TileSize));
    uint32_t maxTileY = static_cast<uint32_t>(glm::ceil(bbox.m_MaxY / m_RenderConfig.m_TileSize));

After fetching edge equation coefficients, we proceed with setting up TR and TA corners of each edge:

    // Fetch edge equation coefficients computed in triangle setup
    glm::vec3 ee0 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 0];
    glm::vec3 ee1 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 1];
    glm::vec3 ee2 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * primIdx + 2];

    // Normalize edge functions
    ee0 /= (glm::abs(ee0.x) + glm::abs(ee0.y));
    ee1 /= (glm::abs(ee1.x) + glm::abs(ee1.y));
    ee2 /= (glm::abs(ee2.x) + glm::abs(ee2.y));

    // Indices of tile corners:
    // LL -> 0  LR -> 1
    // UL -> 2  UR -> 3

    static const glm::vec2 scTileCornerOffsets[] =
    {
        { 0.f, 0.f },                                            // LL
        { m_RenderConfig.m_TileSize, 0.f },                      // LR
        { 0.f, m_RenderConfig.m_TileSize },                      // UL
        { m_RenderConfig.m_TileSize, m_RenderConfig.m_TileSize } // UR
    };

    // (x, y) -> sample location | (a, b, c) -> edge equation coefficients
    // E(x, y) = (a * x) + (b * y) + c
    // E(x + s, y + t) = E(x, y) + (a * s) + (b * t)

    // Based on edge normal n=(a, b), set up tile TR corners for each edge
    const uint8_t edge0TRCorner = (ee0.y >= 0.f) ? ((ee0.x >= 0.f) ? 3u : 2u) : (ee0.x >= 0.f) ? 1u : 0u;
    const uint8_t edge1TRCorner = (ee1.y >= 0.f) ? ((ee1.x >= 0.f) ? 3u : 2u) : (ee1.x >= 0.f) ? 1u : 0u;
    const uint8_t edge2TRCorner = (ee2.y >= 0.f) ? ((ee2.x >= 0.f) ? 3u : 2u) : (ee2.x >= 0.f) ? 1u : 0u;

    // TA corner is the one diagonal from TR corner calculated above
    const uint8_t edge0TACorner = 3u - edge0TRCorner;
    const uint8_t edge1TACorner = 3u - edge1TRCorner;
    const uint8_t edge2TACorner = 3u - edge2TRCorner;
TR corner of each tile for edge 2 marked (X extends right, Y points downward!)

The TR corner of a tile is the one that’s most inside a given edge whereas the TA corner is the one that’s most outside. The reason we need TR and TA corners is that if the TR corner of a tile is outside any edge, the whole tile must be outside the triangle, hence it can be trivially rejected. Similarly, if the TA corners for all edges are inside their respective edges, the whole tile must be inside the triangle, hence it can be trivially accepted.

The way we find these corners in the above code fragment is one of the most important parts of the tile-based rasterizer: we select TR/TA corners based on the direction of the edge normal n=(a, b) (a, b being the x and y components of the normal). In the above figure, we can see that for edge 2 (i.e. the blue arrow), the TR corner of every tile is marked at the lower right. Why? For edge 2, (a > 0) holds, which means that the edge extends towards the left, so one of the right corners (lower or upper) must be closer to the edge than the left ones. Similarly, (b > 0) holds, which means that the TR corner must be the lower right corner (remember, Y points downward!). To find the TA corner, we have two options: either apply the same logic, or rely on the fact that the corner furthest from the TR corner is the most outside one, which is the corner diagonal from the TR corner; the latter is cheaper.
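The corner-selection rule can be distilled into a tiny standalone helper mirroring the snippet above (hypothetical function names; corner indices 0..3 = LL, LR, UL, UR as in the comment):

```cpp
#include <cstdint>

// Pick the tile corner most inside an edge with normal n=(a, b):
// b >= 0 selects an upper-index corner (2/3), a >= 0 selects a right corner (1/3)
static uint8_t SelectTRCorner(float a, float b)
{
    return (b >= 0.f) ? ((a >= 0.f) ? 3u : 2u)
                      : ((a >= 0.f) ? 1u : 0u);
}

// The TA corner is always the corner diagonally opposite the TR corner
static uint8_t SelectTACorner(uint8_t trCorner)
{
    return static_cast<uint8_t>(3u - trCorner);
}
```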

// Iterate over calculated range of tiles
for (uint32_t ty = minTileY, tyy = 0; ty < maxTileY; ty++, tyy++)
{
    for (uint32_t tx = minTileX, txx = 0; tx < maxTileX; tx++, txx++)
    {
        // Using EE coefficients calculated in TriangleSetup stage and positive half-space tests, determine one of three cases possible for each tile:
        // 1) TrivialReject -- tile within tri's bbox does not intersect tri -> move on
        // 2) TrivialAccept -- tile within tri's bbox is completely within tri -> emit a full-tile coverage mask
        // 3) Overlap       -- tile within tri's bbox intersects tri -> bin the triangle to given tile for further rasterization where block/pixel-level coverage masks will be emitted

        // (txx, tyy) = how many steps are done per dimension
        const float txxOffset = static_cast<float>(txx * m_RenderConfig.m_TileSize);
        const float tyyOffset = static_cast<float>(tyy * m_RenderConfig.m_TileSize);

        // Step from edge function computed above for the first tile in bbox
        float edgeFuncTR0 = edgeFunc0 + ((ee0.x * (scTileCornerOffsets[edge0TRCorner].x + txxOffset)) + (ee0.y * (scTileCornerOffsets[edge0TRCorner].y + tyyOffset)));
        float edgeFuncTR1 = edgeFunc1 + ((ee1.x * (scTileCornerOffsets[edge1TRCorner].x + txxOffset)) + (ee1.y * (scTileCornerOffsets[edge1TRCorner].y + tyyOffset)));
        float edgeFuncTR2 = edgeFunc2 + ((ee2.x * (scTileCornerOffsets[edge2TRCorner].x + txxOffset)) + (ee2.y * (scTileCornerOffsets[edge2TRCorner].y + tyyOffset)));

        // If TR corner of the tile is outside any edge, reject whole tile
        bool TRForEdge0 = (edgeFuncTR0 < 0.f);
        bool TRForEdge1 = (edgeFuncTR1 < 0.f);
        bool TRForEdge2 = (edgeFuncTR2 < 0.f);
        if (TRForEdge0 || TRForEdge1 || TRForEdge2)
        {
            // TrivialReject
            // Tile is completely outside of one or more edges
            LOG("Tile %d TR'd by thread %d\n", m_pRenderEngine->GetGlobalTileIndex(tx, ty), m_ThreadIdx);
            continue;
        }

        // Tile is partially or completely inside one or more edges, do TrivialAccept tests first

        // Compute edge functions at TA corners based on edge function at first tile origin
        float edgeFuncTA0 = edgeFunc0 + ((ee0.x * (scTileCornerOffsets[edge0TACorner].x + txxOffset)) + (ee0.y * (scTileCornerOffsets[edge0TACorner].y + tyyOffset)));
        float edgeFuncTA1 = edgeFunc1 + ((ee1.x * (scTileCornerOffsets[edge1TACorner].x + txxOffset)) + (ee1.y * (scTileCornerOffsets[edge1TACorner].y + tyyOffset)));
        float edgeFuncTA2 = edgeFunc2 + ((ee2.x * (scTileCornerOffsets[edge2TACorner].x + txxOffset)) + (ee2.y * (scTileCornerOffsets[edge2TACorner].y + tyyOffset)));

        // If TA corner of the tile is inside all edges, accept whole tile
        bool TAForEdge0 = (edgeFuncTA0 >= 0.f);
        bool TAForEdge1 = (edgeFuncTA1 >= 0.f);
        bool TAForEdge2 = (edgeFuncTA2 >= 0.f);
        if (TAForEdge0 && TAForEdge1 && TAForEdge2)
        {
            // TrivialAccept
            // Tile is completely inside of the triangle, no further rasterization is needed,
            // whole tile will be fragment-shaded!
            LOG("Tile %d TA'd by thread %d\n", m_pRenderEngine->GetGlobalTileIndex(tx, ty), m_ThreadIdx);

            // Append tile to the rasterizer queue
            m_pRenderEngine->EnqueueTileForRasterization(m_pRenderEngine->GetGlobalTileIndex(tx, ty));

            CoverageMask mask;
            mask.m_SampleX = static_cast<uint32_t>(tilePosX + txxOffset); // Based off of first tile position calculated above
            mask.m_SampleY = static_cast<uint32_t>(tilePosY + tyyOffset); // Based off of first tile position calculated above
            mask.m_PrimIdx = primIdx;
            mask.m_Type = CoverageMaskType::TILE;

            // Emit full-tile coverage mask
            m_pRenderEngine->AppendCoverageMask(
                m_ThreadIdx,
                m_pRenderEngine->GetGlobalTileIndex(tx, ty),
                mask);
        }
        else
        {
            // Overlap
            // Tile is partially covered by the triangle, bin the triangle for the tile
            LOG("Tile %d binned by thread %d\n", m_pRenderEngine->GetGlobalTileIndex(tx, ty), m_ThreadIdx);

            m_pRenderEngine->BinPrimitiveForTile(
                m_ThreadIdx,
                m_pRenderEngine->GetGlobalTileIndex(tx, ty),
                primIdx);
        }
    }
}

Once we find the TR and TA corners for all three edges, we enter the loop where we iterate over the tiles within the range [minTile{X|Y}, maxTile{X|Y}) and apply inside/outside tests for each tile. That’s simply the same test we’ve seen before, the only difference being that it’s performed at the tile TR/TA corners. If a tile is TR’d, we move on. If a tile is TA’d, we emit a coverage mask for the whole tile (and make sure the tile is in the rasterizer queue) and continue. In the unfortunate event that a tile happens to only partially overlap the primitive, we bin the primitive to the tile’s per-thread bin, which is as simple as:

void RenderEngine::BinPrimitiveForTile(uint32_t threadIdx, uint32_t tileIdx, uint32_t primIdx)
{
    // Add primIdx to the per-thread bin of a tile
    std::vector<uint32_t>& tileBin = m_BinList[tileIdx][threadIdx];

    if (tileBin.empty())
    {
        // First encounter of the primitive for this tile,
        // make sure the tile has been appended to the work queue
        EnqueueTileForRasterization(tileIdx);
    }

    // Append primIdx to the tile's bin
    tileBin.push_back(primIdx);
}

void RenderEngine::EnqueueTileForRasterization(uint32_t tileIdx)
{
    // Append the tile to the rasterizer queue if not already done
    if (!m_TileList[tileIdx].m_IsTileQueued.test_and_set(std::memory_order_acq_rel))
    {
        // Tile not queued up for rasterization, do so now
        // ... append tileIdx to the rasterizer work queue ...
    }
}

The rasterizer queue is a simple FIFO of tile indices waiting to be rasterized (tiles that partially overlap a primitive) or fragment-shaded (TA’d tiles) in the next stages. With that we can wrap up geometry processing and proceed with rasterization. For reference, here is the Hello, Triangle! sample by default vs with trivially-accepted tiles no-op’d:


Before any thread can proceed to the rasterizer stage, it stalls until all threads have reached the post-binning stage, which is necessary to guarantee that all primitives have been binned to tiles before the rasterizer goes and does its thing. As soon as all threads complete all stages of geometry processing, they start processing tiles from the rasterizer queue.
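The sync point boils down to a barrier. Here's a minimal one-shot spin barrier sketching the idea (hypothetical; the engine uses its own thread-state machinery rather than this exact type): no thread may enter rasterization until every thread has finished binning.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One-shot spin barrier: the last arriving thread releases everyone else
struct SpinBarrier
{
    explicit SpinBarrier(uint32_t numThreads) : m_Count(numThreads) {}

    void ArriveAndWait()
    {
        if (m_Count.fetch_sub(1, std::memory_order_acq_rel) == 1)
        {
            // Last thread to finish binning: publish completion
            m_Done.store(true, std::memory_order_release);
        }
        else
        {
            // Stall until all threads have arrived
            while (!m_Done.load(std::memory_order_acquire))
                std::this_thread::yield();
        }
    }

    std::atomic<uint32_t> m_Count;
    std::atomic<bool>     m_Done{ false };
};
```

Each worker would call `ArriveAndWait()` between its binning and rasterization phases; the acquire/release pair ensures every thread observes all bins written before the barrier.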

// To preserve rendering order, we must ensure that all threads finish binning primitives to tiles
// before rasterization is started. To do that, we will stall all threads to sync @DRAWCALL_RASTERIZATION

// Set state to post-binning and stall until all PipelineThreads complete binning
// (the state transition is published with an atomic store, std::memory_order_release)

The essence of the Rasterizer is remarkably simple once you’ve grasped the idea of binning: we apply the same set of TR/TA tests, this time at block level, descending from tile level:

Threads fetch next available tile index from the rasterizer queue and operate on given tile by applying above-mentioned tests for all primitives binned to this tile in parallel again.

As in the binning stage, using the bounding box, we find the minimum and maximum block indices which overlap the primitive, determine block TR/TA corners for all edges and loop over the range to see what blocks could be TR’d or TA’d or ignored altogether. If a block is TR’d, again, we move on. If it’s TA’d, we emit a coverage mask for the whole block and continue. Otherwise, we need to descend once more, into pixel level to perform edge tests to emit coverage masks for pixels. The nice thing is that we can rasterize multiple pixels in parallel by leveraging SIMD:

// Position of the block that we're testing at pixel level
float blockPosX = (firstBlockWithinBBoxX + bxxOffset);
float blockPosY = (firstBlockWithinBBoxY + byyOffset);

// Compute E(x, y) = (x * a) + (y * b) + c at block origin once
__m128 sseEdge0FuncAtBlockOrigin = _mm_set1_ps(ee0.z + ((ee0.x * blockPosX) + (ee0.y * blockPosY)));
__m128 sseEdge1FuncAtBlockOrigin = _mm_set1_ps(ee1.z + ((ee1.x * blockPosX) + (ee1.y * blockPosY)));
__m128 sseEdge2FuncAtBlockOrigin = _mm_set1_ps(ee2.z + ((ee2.x * blockPosX) + (ee2.y * blockPosY)));

// Store edge 0 equation coefficients
__m128 sseEdge0A4 = _mm_set_ps1(ee0.x);
__m128 sseEdge0B4 = _mm_set_ps1(ee0.y);

// Store edge 1 equation coefficients
__m128 sseEdge1A4 = _mm_set_ps1(ee1.x);
__m128 sseEdge1B4 = _mm_set_ps1(ee1.y);

// Store edge 2 equation coefficients
__m128 sseEdge2A4 = _mm_set_ps1(ee2.x);
__m128 sseEdge2B4 = _mm_set_ps1(ee2.y);

// Generate masks used for tie-breaking rules (not to double-shade along shared edges)
__m128 sseEdge0A4PositiveOrB4NonNegativeA4Zero = _mm_or_ps(_mm_cmpgt_ps(sseEdge0A4, _mm_setzero_ps()),
    _mm_and_ps(_mm_cmpge_ps(sseEdge0B4, _mm_setzero_ps()), _mm_cmpeq_ps(sseEdge0A4, _mm_setzero_ps())));

__m128 sseEdge1A4PositiveOrB4NonNegativeA4Zero = _mm_or_ps(_mm_cmpgt_ps(sseEdge1A4, _mm_setzero_ps()),
    _mm_and_ps(_mm_cmpge_ps(sseEdge1B4, _mm_setzero_ps()), _mm_cmpeq_ps(sseEdge1A4, _mm_setzero_ps())));

__m128 sseEdge2A4PositiveOrB4NonNegativeA4Zero = _mm_or_ps(_mm_cmpgt_ps(sseEdge2A4, _mm_setzero_ps()),
    _mm_and_ps(_mm_cmpge_ps(sseEdge2B4, _mm_setzero_ps()), _mm_cmpeq_ps(sseEdge2A4, _mm_setzero_ps())));

for (uint32_t py = 0; py < g_scPixelBlockSize; py++)
{
    // Store Y positions in current row (all samples in the same row have the same Y position)
    __m128 sseY4 = _mm_set_ps1(py + 0.5f);

    for (uint32_t px = 0; px < g_scNumEdgeTestsPerRow; px++)
    {
        // E(x, y) = (x * a) + (y * b) + c
        // E(x + s, y + t) = E(x, y) + (a * s) + (b * t)

        // Store X positions of 4 consecutive samples
        __m128 sseX4 = _mm_setr_ps(
            g_scSIMDWidth * px + 0.5f,
            g_scSIMDWidth * px + 1.5f,
            g_scSIMDWidth * px + 2.5f,
            g_scSIMDWidth * px + 3.5f);

        // a * s
        __m128 sseEdge0TermA = _mm_mul_ps(sseEdge0A4, sseX4);
        __m128 sseEdge1TermA = _mm_mul_ps(sseEdge1A4, sseX4);
        __m128 sseEdge2TermA = _mm_mul_ps(sseEdge2A4, sseX4);

        // b * t
        __m128 sseEdge0TermB = _mm_mul_ps(sseEdge0B4, sseY4);
        __m128 sseEdge1TermB = _mm_mul_ps(sseEdge1B4, sseY4);
        __m128 sseEdge2TermB = _mm_mul_ps(sseEdge2B4, sseY4);

        // E(x+s, y+t) = E(x,y) + (a * s) + (b * t)
        __m128 sseEdgeFunc0 = _mm_add_ps(sseEdge0FuncAtBlockOrigin, _mm_add_ps(sseEdge0TermA, sseEdge0TermB));
        __m128 sseEdgeFunc1 = _mm_add_ps(sseEdge1FuncAtBlockOrigin, _mm_add_ps(sseEdge1TermA, sseEdge1TermB));
        __m128 sseEdgeFunc2 = _mm_add_ps(sseEdge2FuncAtBlockOrigin, _mm_add_ps(sseEdge2TermA, sseEdge2TermB));

        // A sample is inside an edge E(x, y) if:
        //    E(x, y) > 0
        //        ||
        //    (E(x, y) == 0 && (a > 0 || (a == 0 && b >= 0)))

        // Edge 0 test
        __m128 sseEdge0Positive = _mm_cmpgt_ps(sseEdgeFunc0, _mm_setzero_ps());
        __m128 sseEdge0Negative = _mm_cmplt_ps(sseEdgeFunc0, _mm_setzero_ps());
        __m128 sseEdge0FuncMask = _mm_or_ps(sseEdge0Positive,
            _mm_andnot_ps(sseEdge0Negative, sseEdge0A4PositiveOrB4NonNegativeA4Zero));

        // Edge 1 test
        __m128 sseEdge1Positive = _mm_cmpgt_ps(sseEdgeFunc1, _mm_setzero_ps());
        __m128 sseEdge1Negative = _mm_cmplt_ps(sseEdgeFunc1, _mm_setzero_ps());
        __m128 sseEdge1FuncMask = _mm_or_ps(sseEdge1Positive,
            _mm_andnot_ps(sseEdge1Negative, sseEdge1A4PositiveOrB4NonNegativeA4Zero));

        // Edge 2 test
        __m128 sseEdge2Positive = _mm_cmpgt_ps(sseEdgeFunc2, _mm_setzero_ps());
        __m128 sseEdge2Negative = _mm_cmplt_ps(sseEdgeFunc2, _mm_setzero_ps());
        __m128 sseEdge2FuncMask = _mm_or_ps(sseEdge2Positive,
            _mm_andnot_ps(sseEdge2Negative, sseEdge2A4PositiveOrB4NonNegativeA4Zero));

        // Combine resulting masks of all three edges
        __m128 sseEdgeFuncResult = _mm_and_ps(sseEdge0FuncMask,
            _mm_and_ps(sseEdge1FuncMask, sseEdge2FuncMask));

        uint16_t maskInt = static_cast<uint16_t>(_mm_movemask_ps(sseEdgeFuncResult));

        // If at least one sample is visible, emit coverage mask for the quad
        if (maskInt != 0x0)
        {
            // Quad mask points to the first sample
            CoverageMask mask;
            mask.m_SampleX = static_cast<uint32_t>(blockPosX + (g_scSIMDWidth * px));
            mask.m_SampleY = static_cast<uint32_t>(blockPosY + py);
            mask.m_PrimIdx = primIdx;
            mask.m_Type = CoverageMaskType::QUAD;
            mask.m_QuadMask = maskInt;

            // Emit a quad mask
            m_pRenderEngine->AppendCoverageMask(m_ThreadIdx, nextTileIdx, mask);
        }
    }
}

Assuming you’ll excuse this ugly SIMD soup (I blame partly MSVC for generating code that’s rather non-optimal for such trivial arithmetic operations, sigh), it should be fairly easy to follow here: We pick 4 pixels in a row of a block, which is 8×8 pixels by default, perform edge tests and generate a coverage mask for the 4-fragment group. If such group is found to contain at least one sample who’s visible (e.g if (maskInt != 0x0 check), we emit a coverage mask for this group of fragments and continue until we have iterated all pixels in a block. Rinse and repeat for all blocks in the range and there is our Rasterizer!

The tricky thing here is that incurring rasterization overhead at such a fine level only pays off as long as the majority of the binned triangles in a tile are bigger than a single block; otherwise we end up throwing away all the computation done at the higher levels, which is why HW implementations usually have multiple mitigations for small triangles.

Fragment Shading

Having reached post-rasterization stage, we encounter another synchronization point:

// Rasterization completed, set state to post-raster and
// stall until all PipelineThreads complete rasterization.
// We need this sync because when (N-x) threads finish rasterization and
// reach the end of the tile queue while x threads are still busy rasterizing tile blocks,
// we must ensure that none of the (N-x) non-busy threads will go ahead and start fragment-shading tiles
// whose blocks could still be being rasterized by the x remaining threads
// (again, the state transition is published with an atomic store, std::memory_order_release)

Having ensured that all threads reach the FS stage only after every one of them completes rasterization, we enter the last stage of the pipeline. As in the rasterization stage, FS operates on tiles fetched from the rasterizer queue in parallel, consuming the coverage masks that we emitted first in the Binning and then the Rasterization stages.

auto& currentSlot = pCoverageMaskBuffer->m_AllocationList[numAlloc];

for (uint32_t numMask = 0; numMask < currentSlot.m_AllocationCount; numMask++)
{
    ASSERT(currentSlot.m_pData != nullptr);

    CoverageMask* pMask = &currentSlot.m_pData[numMask];

    // In many cases, the next N coverage masks will have been generated for the same primitive
    // that we're fragment-shading at tile, block or fragment levels here;
    // this could be optimized so that the EE coefficients of the same primitive won't be fetched
    // from memory over and over again -- unsure what gain, if any, it'd yield...

    // First fetch EE coefficients that will be used (in addition to edge in/out tests) for perspective-correct interpolation of vertex attributes
    const glm::vec3 ee0 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * pMask->m_PrimIdx + 0];
    const glm::vec3 ee1 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * pMask->m_PrimIdx + 1];
    const glm::vec3 ee2 = m_pRenderEngine->m_SetupBuffers.m_pEdgeCoefficients[3 * pMask->m_PrimIdx + 2];

    // Store edge 0 equation coefficients
    __m128 sseA4Edge0 = _mm_set_ps1(ee0.x);
    __m128 sseB4Edge0 = _mm_set_ps1(ee0.y);
    __m128 sseC4Edge0 = _mm_set_ps1(ee0.z);

    // Store edge 1 equation coefficients
    __m128 sseA4Edge1 = _mm_set_ps1(ee1.x);
    __m128 sseB4Edge1 = _mm_set_ps1(ee1.y);
    __m128 sseC4Edge1 = _mm_set_ps1(ee1.z);

    // Store edge 2 equation coefficients
    __m128 sseA4Edge2 = _mm_set_ps1(ee2.x);
    __m128 sseB4Edge2 = _mm_set_ps1(ee2.y);
    __m128 sseC4Edge2 = _mm_set_ps1(ee2.z);

    // Pack the coefficients into SIMD registers to be passed around the FS helpers
    const SIMDEdgeCoefficients simdEERegs =
    {
        sseA4Edge0, sseB4Edge0, sseC4Edge0,
        sseA4Edge1, sseB4Edge1, sseC4Edge1,
        sseA4Edge2, sseB4Edge2, sseC4Edge2
    };

    switch (pMask->m_Type)
    {
    case CoverageMaskType::TILE:
        LOG("Thread %d fragment-shading tile %d\n", m_ThreadIdx, nextTileIdx);
        FragmentShadeTile(pMask->m_SampleX, pMask->m_SampleY, pMask->m_PrimIdx, simdEERegs);
        break;
    case CoverageMaskType::BLOCK:
        LOG("Thread %d fragment-shading blocks\n", m_ThreadIdx);
        FragmentShadeBlock(pMask->m_SampleX, pMask->m_SampleY, pMask->m_PrimIdx, simdEERegs);
        break;
    case CoverageMaskType::QUAD:
        LOG("Thread %d fragment-shading coverage masks\n", m_ThreadIdx);
        FragmentShadeQuad(pMask, simdEERegs);
        break;
    }
}

Before we invoke the user-defined FS routine, we first compute what’s called parameter basis functions, a direct implementation of the method described in the paper I linked before. The essence of this is that instead of computing an interpolation vector per parameter, we find basis functions with which any parameter can be (perspective-correctly) interpolated:

void PipelineThread::ComputeParameterBasisFunctions(
    uint32_t sampleX,
    uint32_t sampleY,
    const SIMDEdgeCoefficients& simdEERegs,
    __m128* pSSEf0XY,
    __m128* pSSEf1XY)
{
    // R(x, y) = F0(x, y) + F1(x, y) + F2(x, y)
    // r = 1/(F0(x, y) + F1(x, y) + F2(x, y))

    // Store X positions of 4 consecutive samples
    __m128 sseX4 = _mm_setr_ps(
        sampleX + 0.5f,
        sampleX + 1.5f,
        sampleX + 2.5f,
        sampleX + 3.5f); // x x+1 x+2 x+3

    // Store Y positions of 4 samples in a row (constant)
    __m128 sseY4 = _mm_set_ps1(sampleY + 0.5f); // y y y y

    // Compute F0(x,y)
    __m128 sseF0XY4 = _mm_add_ps(simdEERegs.m_SSEC4Edge0,
        _mm_add_ps(
            _mm_mul_ps(sseY4, simdEERegs.m_SSEB4Edge0),
            _mm_mul_ps(sseX4, simdEERegs.m_SSEA4Edge0)));

    // Compute F1(x,y)
    __m128 sseF1XY4 = _mm_add_ps(simdEERegs.m_SSEC4Edge1,
        _mm_add_ps(
            _mm_mul_ps(sseY4, simdEERegs.m_SSEB4Edge1),
            _mm_mul_ps(sseX4, simdEERegs.m_SSEA4Edge1)));

    // Compute F2(x,y)
    __m128 sseF2XY4 = _mm_add_ps(simdEERegs.m_SSEC4Edge2,
        _mm_add_ps(
            _mm_mul_ps(sseY4, simdEERegs.m_SSEB4Edge2),
            _mm_mul_ps(sseX4, simdEERegs.m_SSEA4Edge2)));

    // Compute R(x,y) = F0(x,y) + F1(x,y) + F2(x,y)
    __m128 sseR4 = _mm_add_ps(sseF2XY4, _mm_add_ps(sseF0XY4, sseF1XY4));

    // Compute perspective correction factor r = 1/R(x,y)
    sseR4 = _mm_rcp_ps(sseR4);

    // Assign final f0(x,y) & f1(x,y)
    *pSSEf0XY = _mm_mul_ps(sseR4, sseF0XY4);
    *pSSEf1XY = _mm_mul_ps(sseR4, sseF1XY4);

    // Basis functions f0, f1, f2 sum to 1, i.e. f0(x,y) + f1(x,y) + f2(x,y) = 1,
    // so we skip computing f2(x,y) explicitly
}
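As a scalar sanity check of the same math (hypothetical helper, one lane of the SIMD code above): once f0 and f1 are known, f2 falls out of the sum-to-one property, and any vertex parameter interpolates directly from them.

```cpp
// Scalar equivalent of the basis-function computation plus one attribute interpolation:
// given edge functions F0, F1, F2 at a sample, r = 1/(F0+F1+F2), fi = r*Fi,
// and any vertex parameter u interpolates as u = f0*u0 + f1*u1 + f2*u2, f2 = 1 - f0 - f1
static float InterpolateParam(float F0, float F1, float F2,
                              float u0, float u1, float u2)
{
    const float r  = 1.f / (F0 + F1 + F2); // perspective correction factor
    const float f0 = r * F0;
    const float f1 = r * F1;
    const float f2 = 1.f - f0 - f1;        // basis functions sum to 1
    return (f0 * u0) + (f1 * u1) + (f2 * u2);
}
```

At a vertex (e.g. F1 = F2 = 0), the interpolated value collapses to that vertex's parameter, as expected.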

The implementation for the different coverage mask types (tile/block/quad) is almost the same, so I will walk through the quad path for completeness. Note that this is the hottest code path, so we try to keep it as SIMD-powered as possible.

void PipelineThread::FragmentShadeQuad(CoverageMask* pMask, const SIMDEdgeCoefficients& simdEERegs)
{
    FragmentShader FS = m_pRenderEngine->m_FragmentShader;

    // Vertex attributes to be interpolated and passed to FS
    InterpolatedAttributes interpolatedAttribs;

    // Parameter interpolation basis functions
    __m128 ssef0XY, ssef1XY;

    // Calculate basis functions f0(x,y) & f1(x,y) once
    ComputeParameterBasisFunctions(pMask->m_SampleX, pMask->m_SampleY, simdEERegs, &ssef0XY, &ssef1XY);

    // Interpolate depth values prior to depth test
    __m128 sseZInterpolated = InterpolateDepthValues(pMask->m_PrimIdx, ssef0XY, ssef1XY);

    // Load current depth buffer contents
    __m128 sseDepthCurrent = m_pRenderEngine->FetchDepthBuffer(pMask->m_SampleX, pMask->m_SampleY);

    // Perform LESS_THAN_EQUAL depth test
    __m128 sseDepthRes = _mm_cmple_ps(sseZInterpolated, sseDepthCurrent);

    // Interpolate active vertex attributes
    InterpolateVertexAttributes(pMask->m_PrimIdx, ssef0XY, ssef1XY, &interpolatedAttribs);

    // 4-sample fragment colors
    FragmentOutput fragmentOutput;

    // Invoke FS and update color/depth buffer with fragment output
    FS(&interpolatedAttribs, m_pRenderEngine->m_pConstantBuffer, &fragmentOutput);

    // Generate color mask from 4-bit int mask set during rasterization
    __m128i sseColorMask = _mm_setr_epi32(
        pMask->m_QuadMask & g_scQuadMask0,
        pMask->m_QuadMask & g_scQuadMask1,
        pMask->m_QuadMask & g_scQuadMask2,
        pMask->m_QuadMask & g_scQuadMask3);

    sseColorMask = _mm_cmpeq_epi32(sseColorMask,
        _mm_set_epi64x(0x800000004, 0x200000001));

    // AND depth mask & coverage mask for quads of fragments
    __m128 sseWriteMask = _mm_and_ps(sseDepthRes, _mm_castsi128_ps(sseColorMask));

    // Write interpolated Z values
    m_pRenderEngine->UpdateDepthBuffer(sseWriteMask, sseZInterpolated, pMask->m_SampleX, pMask->m_SampleY);

    // Write fragment output
    m_pRenderEngine->UpdateColorBuffer(sseWriteMask, fragmentOutput, pMask->m_SampleX, pMask->m_SampleY);
}

After computing the basis functions, we first interpolate the Z values and perform the depth test. Noting the result of the depth test, we interpolate all other user-defined vertex attributes which we previously set up after VS, pass them into the FS routine, which we invoke next, and collect the 4-sample fragment color output. For tiles/blocks, we have a single write mask, which is the depth result, as we know for sure that all TA’d tiles/blocks are visible. For quads of fragments, we also AND in the coverage mask to mask-store depth and color values back to the frame buffer. And with that, we complete our journey into a simple tile-based rasterizer.


Now that we have completed a tour around the internals of the tile-based rasterizer, a few ideas to play with and ultimately improve it come to mind:

  • Playing with rasterizer configuration, trying out different tile sizes, number of threads, iteration size, etc. to see how it all affects the performance
  • Swapping TR results with TA results to see TR’d tiles being rendered
  • Injecting a tileIdx variable into shader metadata and passing it to FS to color mesh parts based on the index of tile in which they are rendered
  • Integrating AVX (or even AVX-512!) into the pipeline to see how it affects performance on each workload (spoiler: I did it for AVX; it gives around a 30-50% performance gain on most scenes)
  • Finding performance bottlenecks for different scenes and trying to optimize them
  • Just attaching debugger and tracing triangles in the pipeline step-by-step

Therefore, I highly recommend that everyone compile the projects and give them a try! If you'd like to just see it all in action, find the Hello, Triangle! sample here and the Scene Rendering sample here (and the assets pack here if you wish to render more complex scenes).

Mandatory Sponza scene of ~275k triangles rendered on my laptop w/ Intel i7-6700HQ in around 60 ms


Most, if not all, of this stuff was inspired by/shamelessly taken from/adapted from the following articles and papers, so if you ultimately want to know more or understand it better, for reference:

Rasterization in One Weekend, Part III

Code for this part can be found here.

Go Wild!

In Part II, we saw how we could utilize the algorithm for rendering 3D objects without perspective-divide with the aid of a depth buffer. We left off saying that vertex attributes and perspective-correct interpolation are next, so let's get to it!

If you are wondering what the heck vertex attributes are to begin with: a vertex attribute is simply any user-defined value that is attached to each vertex of an object, such as a color, a normal, or texture coordinates. It's up to you how you define them and put them to use by passing them between shaders. For instance, using per-fragment normals and texture coordinates, one could implement normal mapping to improve objects' appearances. Remember, they are completely optional and not required for the rasterization algorithm itself.

As we have seen in earlier posts, to rasterize objects, we transform them from 3D object-space to camera-space first (applying perspective projection in the process) and then all the way down to 2D screen-space. As such, the above-mentioned vertex attributes need to be interpolated across a triangle in screen-space, so that these values are available at every fragment inside a triangle as we shade it.

Thankfully, to interpolate these values in screen-space, we don't need to start anew: it's the same process & properties that we used to set up edge functions in Part I that we are going to extend here. Put another way, edge functions are just another application of the same algorithm that is used to interpolate parameters (which must not come as a shock to anyone who's read the paper I've linked), so let's revisit the original paper once again:

That last sentence is the catch here — the coefficients a, b, and c are the same in all three equations. Which means that, once we calculate those coefficients a, b and c, we can compute u/w value in screen-space using (3).

In Part I, that’s exactly how we set up our edge functions. First, we said that we have three edge functions defined at each vertex of a triangle:

And then went on to calculating those coefficients by using what we called the vertex matrix and the interpolation constraints, which simply state that each edge function is constant and equal to 1 on one edge while 0 at the opposite vertex:

And lastly, by inverting vertex matrix on the left and multiplying it by our interpolation constraints on the right, we ended up with what we were after: the coefficients of our edge functions, which we could feed into eq. (3) above now (i.e. what we do as we loop over all pixels and compute edge functions).

Notice that for this to work, division by w (aka homogeneous division) is applied to eq. (2). That is the property that gives us the ability to interpolate parameters across a triangle in screen-space!

If it's not clear: there is nothing special about our edge functions that makes this all work; any three values defined at the vertices of a triangle will do. One very important fact to note here is that, after applying homogeneous division to eq. (2), we end up with functions of pixel coordinates (X, Y), and the values themselves (i.e. the results of those functions) are multiples of 1/w. You may wonder, how did this work in previous parts then? The answer is simple: we didn't care, because all we were interested in was that the edge functions tell us whether a point is inside or outside a triangle, by comparing signs and not the values themselves! To recover the true parameter values (e.g. u and not just u/w), we are going to make use of what we made up and dubbed the constant function in Part II. How, you say? As I alluded to, we can make up and attach any value to the vertices of the triangles that we want to rasterize. So, let's make one and assign the constant value 1 to every vertex:

By using the same mechanism of applying homogeneous division to the parameter function, we end up with:

Oh, that’s great! It’s a function of pixel coordinates again so we can just interpolate it at every fragment (as for all other parameters) as we rasterize triangles and divide our parameters by 1/w value to recover the values themselves:

Note how we interpolate both parts, u/w and 1/w, separately and only divide at the end, in order to recover the parameter u correctly.
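To make this concrete, here is a small glm-free sketch of perspective-correct interpolation. The weights l0, l1, l2 are hypothetical screen-space barycentric weights that sum to 1 (stand-ins for the evaluated edge functions); given a parameter u and clip-space w at each vertex, we interpolate u/w and 1/w linearly and divide at the end:

```cpp
#include <cmath>

// Interpolate parameter u perspective-correctly across a triangle:
// both u/w and 1/w are linear in screen-space, their ratio recovers u
float InterpolatePerspective(const float u[3], const float w[3],
                             float l0, float l1, float l2) {
    float uOverW   = l0 * (u[0] / w[0]) + l1 * (u[1] / w[1]) + l2 * (u[2] / w[2]);
    float oneOverW = l0 * (1.f  / w[0]) + l1 * (1.f  / w[1]) + l2 * (1.f  / w[2]);
    return uOverW / oneOverW; // (u/w) * w -> u
}
```

At a vertex the function returns the vertex's own u exactly; in the interior it differs from a naive screen-space lerp, which is precisely the perspective-correction effect.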

To convert this theoretical knowledge into practice, we are going to render a more complex scene this time, the well-known Sponza scene, which has ~260k textured triangles with texture coordinates and per-vertex normals, so let's get into calculating those interpolation coefficients and functions we've just discussed:

// Interpolate 1/w at current fragment
float oneOverW = (C.x * sample.x) + (C.y * sample.y) + C.z;

// w = 1/(1/w)
float w = 1.f / oneOverW;

This time around, we also use a z-buffer instead of 1/w to perform the depth test in order to resolve visibility, so let's interpolate and recover it, too:

// Interpolate z that will be used for depth test
float zOverW = (Z.x * sample.x) + (Z.y * sample.y) + Z.z;
float z = zOverW * w;

Handling of normals:

// Calculate normal interpolation vector
glm::vec3 PNX = M * glm::vec3(fi0.Normal.x, fi1.Normal.x, fi2.Normal.x);
glm::vec3 PNY = M * glm::vec3(fi0.Normal.y, fi1.Normal.y, fi2.Normal.y);
glm::vec3 PNZ = M * glm::vec3(fi0.Normal.z, fi1.Normal.z, fi2.Normal.z);
// Interpolate normal
float nxOverW = (PNX.x * sample.x) + (PNX.y * sample.y) + PNX.z;
float nyOverW = (PNY.x * sample.x) + (PNY.y * sample.y) + PNY.z;
float nzOverW = (PNZ.x * sample.x) + (PNZ.y * sample.y) + PNZ.z;

glm::vec3 normal = glm::vec3(nxOverW, nyOverW, nzOverW) * w; // {nx/w, ny/w, nz/w} * w -> { nx, ny, nz}

Handling of texture coordinates (or UVs):

// Calculate UV interpolation vector
glm::vec3 PUVS = M * glm::vec3(fi0.TexCoords.s, fi1.TexCoords.s, fi2.TexCoords.s);
glm::vec3 PUVT = M * glm::vec3(fi0.TexCoords.t, fi1.TexCoords.t, fi2.TexCoords.t);
// Interpolate texture coordinates
float uOverW = (PUVS.x * sample.x) + (PUVS.y * sample.y) + PUVS.z;
float vOverW = (PUVT.x * sample.x) + (PUVT.y * sample.y) + PUVT.z;

glm::vec2 texCoords = glm::vec2(uOverW, vOverW) * w; // {u/w, v/w} * w -> {u, v}

As we did in Part II, we will feed all the vertex data that we fetch from the vertex buffer as input to the Vertex Shader and return from it a vec4 position which will be used to rasterize the triangle:

// Vertex Shader to apply perspective projections and also pass vertex attributes to Fragment Shader
glm::vec4 VS(const VertexInput& input, const glm::mat4& MVP, FragmentInput& output)
{
    // Simply pass normal and texture coordinates directly to FS
    output.Normal = input.Normal;
    output.TexCoords = input.TexCoords;

    // Output a clip-space vec4 that will be used to rasterize parent triangle
    return (MVP * glm::vec4(input.Pos, 1.0f));
}

Now that we are done handling all parameters, we can feed the data output from the Vertex Shader as input into what is usually called a Fragment Shader (or Pixel Shader in DX parlance), execute it and write the new color value at the given fragment:

// Pass interpolated normal & texture coordinates to FS
FragmentInput fsInput = { normal, texCoords };

// Invoke fragment shader to output a color for each fragment
glm::vec3 outputColor = FS(fsInput, pTexture);

// Write new color at this fragment
frameBuffer[x + y * g_scWidth] = outputColor;

Our Fragment Shader will consist of code to fetch texel (i.e. texture element) values from the loaded texture of a primitive using its texture coordinates, which we pass to it from our Vertex Shader after interpolating them correctly (as we just did above):

// Fragment Shader that will be run at every visible pixel on triangles to shade fragments
// Fragment Shader that will be run at every visible pixel on triangles to shade fragments
glm::vec3 FS(const FragmentInput& input, Texture* pTexture)
{
#if 1 // Render textured polygons
    // By using the fractional part of texture coordinates only, we will REPEAT (or WRAP) the same texture multiple times
    uint32_t idxS = static_cast<uint32_t>((input.TexCoords.s - static_cast<int64_t>(input.TexCoords.s)) * pTexture->m_Width - 0.5f);
    uint32_t idxT = static_cast<uint32_t>((input.TexCoords.t - static_cast<int64_t>(input.TexCoords.t)) * pTexture->m_Height - 0.5f);
    uint32_t idx = (idxT * pTexture->m_Width + idxS) * pTexture->m_NumChannels;

    float r = static_cast<float>(pTexture->m_Data[idx++] * (1.f / 255));
    float g = static_cast<float>(pTexture->m_Data[idx++] * (1.f / 255));
    float b = static_cast<float>(pTexture->m_Data[idx++] * (1.f / 255));

    return glm::vec3(r, g, b);
#else // Render interpolated normals
    return (input.Normal) * glm::vec3(0.5) + glm::vec3(0.5); // transform normal values [-1, 1] -> [0, 1] to visualize better
#endif
}
Texturing alone is a complex topic with all its tricky details like address control modes, filtering, aliasing and whatnot, so I don't want to get into it and get lost. Put simply, it's the process of mapping the provided texture coordinates to image coordinates and looking up texels from an image resource in order to texture polygons as they get shaded. Many resources on the topic exist, created by people far more knowledgeable than me, so I hope that you'll excuse my severe hand-waving here!
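The nearest-neighbor, WRAP-mode lookup used in the FS above can be sketched standalone. This is a hypothetical single-channel variant (the function name and texture layout are made up for illustration); note it wraps negative coordinates via floor, a slight variation on the fractional-part trick in the blog code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Nearest-neighbor texel fetch with REPEAT (WRAP) addressing:
// keeping only the fractional part of (s, t) repeats the image outside [0, 1)
uint8_t SampleNearestWrap(const std::vector<uint8_t>& texels,
                          int width, int height, float s, float t) {
    s -= std::floor(s); // wrap into [0, 1), including negative coordinates
    t -= std::floor(t);
    int x = std::min(static_cast<int>(s * width),  width  - 1);
    int y = std::min(static_cast<int>(t * height), height - 1);
    return texels[y * width + x];
}
```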

If you followed along without any issues, you should be able to render a frame as follows:

Sponza with interpolated normals
Sponza with textured polygons

If you run the project right away to render one frame of Part III, you might be surprised by how utterly slow it turns out to be, so let's quickly mention some optimization opportunities. In the meanwhile, you may want to profile the source code yourself and see where the bottlenecks lie.

One of the biggest performance killers in our little rasterizer is that, even though many triangles cover a relatively small region on screen, we blindly loop over all pixels to check which ones overlap a given triangle. This could be optimized heavily by what's called binning. The process of binning is simple: partition the whole frame buffer into sub-regions and discard, all at once, those sub-regions that do not overlap a given triangle.
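A first step toward binning can be sketched as follows: compute the screen-space bounding box of a triangle and map it to a range of fixed-size tiles, so every tile outside that range is skipped at once. The tile size and names here are made-up illustration, not the repository's actual binning code:

```cpp
#include <algorithm>

constexpr int kTileSize = 64; // hypothetical tile size in pixels

struct TileRange { int x0, y0, x1, y1; }; // inclusive tile indices

// Map a triangle's screen-space bounding box to the tiles it may touch
TileRange BinTriangle(const float xs[3], const float ys[3],
                      int screenW, int screenH) {
    float minX = std::min({xs[0], xs[1], xs[2]});
    float maxX = std::max({xs[0], xs[1], xs[2]});
    float minY = std::min({ys[0], ys[1], ys[2]});
    float maxY = std::max({ys[0], ys[1], ys[2]});

    TileRange r;
    r.x0 = std::max(0, static_cast<int>(minX) / kTileSize);
    r.y0 = std::max(0, static_cast<int>(minY) / kTileSize);
    r.x1 = std::min((screenW - 1) / kTileSize, static_cast<int>(maxX) / kTileSize);
    r.y1 = std::min((screenH - 1) / kTileSize, static_cast<int>(maxY) / kTileSize);
    return r;
}
```

A full binner would additionally test tile corners against the edge functions, since a bounding box overestimates coverage for thin diagonal triangles.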

The second optimization is elegantly simple, once you discover it: instead of evaluating the edge functions from scratch at every fragment, we could compute them once per triangle and rely on the following property, as noted by Juan Pineda in A Parallel Algorithm for Polygon Rasterization, to evaluate edge functions incrementally:
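The property boils down to this: for an edge function E(x, y) = a*x + b*y + c, stepping one pixel to the right adds a and stepping one row down adds b, so the inner loop needs a single addition instead of a full dot product per pixel. A minimal sketch:

```cpp
// Edge function E(x, y) = a*x + b*y + c
struct Edge { float a, b, c; };

inline float Evaluate(const Edge& e, float x, float y) {
    return e.a * x + e.b * y + e.c;
}

// Incremental property: E(x+1, y) = E(x, y) + a and E(x, y+1) = E(x, y) + b,
// so a scanline loop can carry a running value and just add a (or b) per step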

The third point is that, if you look at the DrawIndexed() function, you should see that all the computations for different triangles are independent of one another, with two exceptions: performing the depth test on a global depth buffer and, if it passes, writing a new color value to the global frame buffer. That should give you enough of a clue for how you could go about drawing many triangles (be it of the same primitive or different ones) in parallel. Of course, as with many things, in real-world rasterization some restrictions would apply, one of which is primitive order.
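One hedged way around the shared depth/color hazard, instead of parallelizing over triangles, is to give each thread a disjoint band of frame-buffer rows: no two threads ever touch the same pixel, so no synchronization is needed. The sketch below just fills rows with a value as a stand-in for per-pixel rasterization work:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Partition the frame buffer into horizontal bands, one per thread.
// Each thread writes only its own rows, so writes never conflict.
void FillRowsParallel(std::vector<int>& frame, int width, int height,
                      int numThreads) {
    std::vector<std::thread> pool;
    int rowsPerThread = (height + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        pool.emplace_back([&, t] {
            int y0 = t * rowsPerThread;
            int y1 = std::min(height, y0 + rowsPerThread);
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x)
                    frame[y * width + x] = y; // stand-in for per-pixel work
        });
    }
    for (auto& th : pool) th.join();
}
```

This preserves primitive order for free within each pixel, since each pixel is still processed by exactly one thread.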

Fourth point of optimization is detailed in Incremental and Hierarchical Hilbert Order Edge Equation Polygon Rasterization.


Phew, that was quite a journey! Starting from the foundation of the algorithm in Part I, we saw how we could render a single triangle. In Part II, we examined how the algorithm could be extended to 3D. And finally in Part III, we discussed how to utilize the same algorithm to handle vertex attributes & interpolation and finished off rendering a non-trivial scene with many polygons correctly. Thanks & congratulations to all who made it to the end with me!

Rasterization in One Weekend, Part II

Code for this part can be found here.

Go 3D!

In Part I: Hello, Triangle!, we rendered a two-dimensional triangle defined directly in normalized-device-coordinates space [-1, 1]. As simple as it can get, it was already perfect for projecting onto our two-dimensional screen. Before we get into adding a third dimension to our virtual world, let's have a quick refresher on the various spaces and coordinate systems used in computer graphics, again referencing the awesome resource that is

Local-space (or object-space) is the space in which objects are usually modeled in 3D software such as Autodesk Maya or Blender. It's simply an intermediary space that allows you to decouple objects' position/scaling/rotation from their final location in your virtual world/camera/screen. That's what we'll do in this part, too. We're gonna use the vertex data of a single cube in object-space and apply different transformations to render multiple cubes. Here are our pseudo vertex and index buffers:

        // Setup vertices & indices to draw an indexed cube
        glm::vec3 vertices[] =
        {
            { 1.0f, -1.0f, -1.0f },
            { 1.0f, -1.0f, 1.0f },
            { -1.0f, -1.0f, 1.0f },
            { -1.0f, -1.0f, -1.0f },
            { 1.0f, 1.0f, -1.0f },
            { 1.0f, 1.0f, 1.0f },
            { -1.0f, 1.0f, 1.0f },
            { -1.0f, 1.0f, -1.0f },
        };

        uint32_t indices[] =
        {
            1,3,0, 7,5,4, 4,1,0, 5,2,1, 2,7,3, 0,7,4, 1,2,3, 7,6,5, 4,5,1, 5,6,2, 2,6,7, 0,3,7
        };

If the idea of using an index buffer is new to you: it's simply a look-up table that maps indices to actual vertices. For instance, if you look at a geometric shape like a cube, you can see many vertices that are shared by multiple triangles. So the idea is that, instead of storing the same vertices over and over again, we build a buffer of unique vertices and look up the real vertices at run-time through the index buffer while rendering the objects. Index buffers are fundamental for saving buffer space and accelerating the drawing of complex objects in 3D. Note how, this time, we have real z coordinates.
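To make the savings concrete, here is a tiny back-of-the-envelope sketch for our cube (8 unique vertices, 36 indices), assuming 32-bit floats and 32-bit indices as in the code above:

```cpp
#include <cstddef>
#include <cstdint>

// Bytes used by an indexed mesh: unique vertices plus an index list
size_t IndexedSize(size_t uniqueVerts, size_t indexCount) {
    return uniqueVerts * sizeof(float) * 3   // one 3-component position per vertex
         + indexCount * sizeof(uint32_t);    // one uint32_t per index
}

// Bytes used without indexing: every triangle corner stored explicitly
size_t ExpandedSize(size_t indexCount) {
    return indexCount * sizeof(float) * 3;
}
```

For the cube this is 240 vs. 432 bytes already; the gap widens drastically for meshes where vertices are shared by many more triangles.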

Okay, we have our vertex & index buffers set up, so let's apply some random rotations and translations to them to render multiple cubes in 3D, so that we can test our implementation and have something more than a dull triangle to show off:

        // Let's draw multiple objects
        std::vector<glm::mat4> objects;
        InitializeSceneObjects(objects);

    // Construct a scene of a few cubes randomly positioned
    void InitializeSceneObjects(std::vector<glm::mat4>& objects)
    {
        const glm::mat4 identity(1.f);

        glm::mat4 M0 = glm::translate(identity, glm::vec3(0, 0, 2.f));
        M0 = glm::rotate(M0, glm::radians(45.f), glm::vec3(0, 1, 0));

        glm::mat4 M1 = glm::translate(identity, glm::vec3(-3.75f, 0, 0));
        M1 = glm::rotate(M1, glm::radians(30.f), glm::vec3(1, 0, 0));

        glm::mat4 M2 = glm::translate(identity, glm::vec3(3.75f, 0, 0));
        M2 = glm::rotate(M2, glm::radians(60.f), glm::vec3(0, 1, 0));

        glm::mat4 M3 = glm::translate(identity, glm::vec3(0, 0, -2.f));
        M3 = glm::rotate(M3, glm::radians(90.f), glm::vec3(0, 0, 1));

        objects.push_back(M0);
        objects.push_back(M1);
        objects.push_back(M2);
        objects.push_back(M3);
    }
What we add to the array of objects here is basically the model matrix of each cube that we want to render. What does a model matrix do? It simply transforms an object that is defined in object-space into the 3D world-space. By applying a different model matrix to the same vertex buffer that we defined above in object-space, we will be able to have multiple cubes rendered simultaneously.

Now that we have positioned our 3D objects in our virtual world, let's look next at how to build the view and projection matrices.

        // Build view & projection matrices (right-handed system)
        float nearPlane = 0.1f;
        float farPlane = 100.f;
        glm::vec3 eye(0, 3.75, 6.5);
        glm::vec3 lookat(0, 0, 0);
        glm::vec3 up(0, 1, 0);

        glm::mat4 view = glm::lookAt(eye, lookat, up);
        glm::mat4 proj = glm::perspective(glm::radians(60.f), (float)g_scWidth / (float)g_scHeight, nearPlane, farPlane);

View (or camera) and projection matrices will allow us to view our virtual world in 3D similarly to how we see the real world, in perspective. By applying the view and projection matrices to each vertex, we will transform it from world-space (remember, vertices were just moved from object-space to world-space by use of the model matrix!) to clip-space.

So in essence, the purpose of all these transformations is to project objects to image-plane where w=1. Let’s get into the details of how to do that with inclusion of third dimension now.

Unlike Part I, where we had a single triangle consisting of three vertices to render, this time around we're gonna render multiple cubes consisting of 12 triangles each, so two more nested loops come in:

        // Loop over objects in the scene
        for (size_t n = 0; n < objects.size(); n++)
            // Loop over triangles in a given object and rasterize them one by one
            for (uint32_t idx = 0; idx < ARR_SIZE(indices) / 3; idx++)

Now is the time to put those vertex & index buffers we’ve allocated above to use by fetching vertices of each triangle that we’re trying to render:

            // We're gonna fetch object-space vertices from the vertex buffer indexed by the values in index buffer
            // and pass them directly to each VS invocation
            const glm::vec3& v0 = vertices[indices[idx * 3]];
            const glm::vec3& v1 = vertices[indices[idx * 3 + 1]];
            const glm::vec3& v2 = vertices[indices[idx * 3 + 2]];

If you are already familiar with the concept of shaders, you must know what we should do next: Invoke a vertex shader to have each vertex (that we’ve just fetched) transformed to clip-space.

            // Invoke function for each vertex of the triangle to transform them from object-space to clip-space (-w, w)
            glm::vec4 v0Clip = VS(v0, objects[n], view, proj);
            glm::vec4 v1Clip = VS(v1, objects[n], view, proj);
            glm::vec4 v2Clip = VS(v2, objects[n], view, proj);

Why do we do that, you may ask if you don’t know about shaders. We already hinted at why: To apply all the necessary transformations that move objects from object-space to clip-space, which is simply what we’ll do in VS() call:

glm::vec4 VS(const glm::vec3& pos, const glm::mat4& M, const glm::mat4& V, const glm::mat4& P)
{
    return (P * V * M * glm::vec4(pos, 1.0f));
}

Notice how we pass the same model, view and projection matrices to each VS invocation but the position is different for each vertex of triangle, which is how we form a triangle after all — three distinct points.

Now, we say that the result of each VS invocation will be transformed vertices in clip-space, but what does that really mean? Put simply, clip-space is yet another space in our convenience toolbox, where clipping is done in the 3D pipeline.

One important thing to realize is that, by this point, what we have is homogeneous coordinates (x, y, z, w) of 3D points in clip-space, where the w coordinate is usually non-zero. Unlike NDC-space, which we define to be valid in the range [-1, 1], clip-space homogeneous coordinates are contained in [-w, w].

If you are familiar with other types of rasterization algorithms, this is generally the point at which perspective-division takes place. Perspective-division is simply dividing the (x, y, z) values in clip-space by their w coordinate, so that we end up in [-1, 1] NDC-space. That operation is actually what gives us the perspective look.
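For reference, the step our pipeline skips is tiny; here is a glm-free sketch of it. Same clip-space x, but a larger w (a point farther away), lands closer to the center of the screen after the division, which is the perspective effect:

```cpp
#include <cmath>

struct Vec4 { float x, y, z, w; };
struct Vec3 { float x, y, z; };

// Divide clip-space (x, y, z) by w to land in NDC [-1, 1];
// undefined for w == 0, which is exactly the hazard discussed below
Vec3 PerspectiveDivide(const Vec4& clip) {
    return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
}
```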

If you read the 2D homogeneous rasterization paper that I've linked to in the introduction, you must know that the rasterization algorithm we utilize here differs in that it does not require the perspective-division we just talked about. Why not just do this division and be done with it? The reason is that division by w complicates the pipeline: it requires explicit clipping of triangles against the view frustum because, as it happens, there may be triangles whose vertices have w=0, in which case we'd end up with a division by zero. Besides the fact that a division is slower than an addition or a multiplication, more so on heavily pipelined architectures such as GPUs, this is the reason we try to avoid it. We'll now see how we can get away without it and still apply perspective to our scene.

Similarly to Part I, we apply the viewport transformation to projected vertices, albeit still in clip-space, in order to set up the base vertex matrix:

            // Apply viewport transformation
            glm::vec4 v0Homogen = TO_RASTER(v0Clip);
            glm::vec4 v1Homogen = TO_RASTER(v1Clip);
            glm::vec4 v2Homogen = TO_RASTER(v2Clip);
            // Base vertex matrix (notice that glm is itself column-major!)
            glm::mat3 M =
            {
                { v0Homogen.x, v1Homogen.x, v2Homogen.x },
                { v0Homogen.y, v1Homogen.y, v2Homogen.y },
                { v0Homogen.w, v1Homogen.w, v2Homogen.w },
            };

The TO_RASTER macro contains some modifications, so let's have a look at it:

#define TO_RASTER(v) glm::vec4((g_scWidth * (v.x + v.w) / 2), (g_scHeight * (v.w - v.y) / 2), v.z, v.w)

Notice that we still haven't applied any perspective division, which means that v here is in the range [-w, w]. Do you see why it's called 2D homogeneous rasterization and not 3D? Look at the vertex matrix: there is no mention of the z coordinate. A key realization here is that clip-space w values, under normal perspective transformations (regardless of what type of rasterization algorithm you use), will be equal to view-space z, or the distance from our camera.

The property that clip-space w equals view-space z is what allows us to interpolate linear functions across triangles in screen-space, as we did with edge functions in Part I, but also any other values defined at vertices in 3D.
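This claim can be checked directly. With a glm-style right-handed perspective matrix, the last row is (0, 0, -1, 0), so the output w is -z_view: the positive distance of the point in front of the camera (which looks down -z). The stripped-down matrix below is a stand-in for glm::perspective, spelling out only the rows we need (f is cot(fovy/2), here 1 for illustration):

```cpp
#include <cmath>

struct Vec4 { float x, y, z, w; };

// Apply a right-handed, column-vector perspective transform to a view-space point
Vec4 PerspectiveTransform(const Vec4& v, float f, float aspect,
                          float zn, float zf) {
    return {
        (f / aspect) * v.x,
        f * v.y,
        ((zf + zn) / (zn - zf)) * v.z + ((2.f * zf * zn) / (zn - zf)) * v.w,
        -v.z // the row (0, 0, -1, 0): w_clip = -z_view
    };
}
```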

The determinant of vertex matrix happens to also encode additional information such as orientation of a triangle (i.e. back-facing or front-facing) so we’ll put it to some use:

            // If det(M) == 0.0f, we'd perform division by 0 when calculating the inverse matrix,
            // whereas (det(M) > 0) implies a back-facing triangle
            float det = glm::determinant(M);
            if (det >= 0.0f)
                continue; // skip back-facing and degenerate triangles

            // Compute the inverse of vertex matrix to use it for setting up edge & constant functions
            M = inverse(M);

After that, we calculate edge functions as in Part I, in addition to constant function that is new here:

            // Set up edge functions based on the vertex matrix
            glm::vec3 E0 = M[0];
            glm::vec3 E1 = M[1];
            glm::vec3 E2 = M[2];

            // Calculate constant function to interpolate 1/w
            glm::vec3 C = M * glm::vec3(1, 1, 1);

We know what the edge functions do already: they help us determine which pixels are inside the triangles that we're projecting to the image-plane. But what does a constant function do? It's something that we make up, actually, in order to be able to interpolate 1/w values across triangles (don't worry if it doesn't make sense right now; we'll detail how and why we use this constant function in Part III!).

We're almost there, but with the inclusion of a third dimension in our world, we have another problem to solve, called the visibility problem or hidden surface removal. What this problem is about is quite simple: given a set of objects, determine which ones are closer to the camera.
Various solutions to this problem exist; they incur different trade-offs and offer different performance characteristics and implementation convenience. If you have done any work in 3D APIs such as OpenGL or Direct3D, you already know the concept of a depth buffer. That is the mechanism by which we'll resolve our visibility problem, albeit in a slightly different way in this part.

As in Part I, we start by scanning pixels again to see which ones are covered:

                // Start rasterizing by looping over pixels to output a per-pixel color
                for (auto y = 0; y < g_scHeight; y++)
                    for (auto x = 0; x < g_scWidth; x++)
                        // Sample location at the center of each pixel
                        glm::vec3 sample = { x + 0.5f, y + 0.5f, 1.0f };

                        // Evaluate edge functions at current fragment
                        bool inside0 = EvaluateEdgeFunction(E0, sample);
                        bool inside1 = EvaluateEdgeFunction(E1, sample);
                        bool inside2 = EvaluateEdgeFunction(E2, sample);

                        // If sample is "inside" of all three half-spaces bounded by the three edges of the triangle, it's 'on' the triangle
                        if (inside0 && inside1 && inside2)
                            // Interpolate 1/w at current fragment
                            float oneOverW = (C.x * sample.x) + (C.y * sample.y) + C.z;

                            // Perform depth test with interpolated 1/w value
                            // Notice that the greater 1/w value is, the closer the object would be to our virtual camera,
                            // hence "less_than_equal" comparison is actually oneOverW >= depthBuffer[sample] and not oneOverW <= depthBuffer[sample] here
                            if (oneOverW >= depthBuffer[x + y * g_scWidth])
                                // Depth test passed; update depth buffer value
                                depthBuffer[x + y * g_scWidth] = oneOverW;

                                // Write new color at this fragment
                                frameBuffer[x + y * g_scWidth] = colors[indices[3 * idx] % 6];

Again, we evaluate the edge functions at every fragment and check whether the fragment is covered. If so, we interpolate the 1/w value at the given fragment using the constant function that we just set up above. To resolve visibility now, we perform what's called the depth test. It's basically a comparison of the current depth value of a pixel with the incoming depth value, in order to determine whichever is closer to the camera. Once we know the relation between them, we can simply update our depth buffer and write the new fragment color. Notice that we use a depth buffer of interpolated 1/w values and not z, as is usually done in OpenGL or Direct3D. Why? Because we already have this data available, thanks to the fact that we can interpolate any function defined across triangles. In Part III, we'll see how we can actually use z, among other things. Also note how, because we interpolate 1/w, a less_than_equal depth test corresponds to >=.
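The reversed comparison is easy to sanity-check in isolation: since 1/w shrinks with distance, "closer" means a larger 1/w, so less_than_equal on z becomes greater_than_equal on 1/w. A trivial sketch:

```cpp
// Depth test on a 1/w buffer: pass if the incoming fragment is
// at least as close as the stored one (larger 1/w means closer)
bool DepthTestOneOverW(float incomingOneOverW, float storedOneOverW) {
    return incomingOneOverW >= storedOneOverW;
}
```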

As we did with frameBuffer, we allocate a depth buffer of the same resolution outside the loop over objects:

    // Allocate and clear the depth buffer (not z-buffer but 1/w-buffer in this part!) to 0.0
    std::vector<float> depthBuffer(g_scWidth * g_scHeight, 0.0);

Note that this time, we improved our edge function a bit so as not to process pixels on shared edges twice, because that leads not only to wasted computation but also to wrong results (think of transparent objects).
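The idea behind such fill rules can be sketched as follows. Two triangles sharing an edge see edge functions that are exact negations of each other, so any consistent tie-break on the coefficients assigns an on-edge pixel to exactly one of the two. The (a, b) test below is one arbitrary-but-consistent choice for illustration, not necessarily the exact rule used in the repository:

```cpp
// Edge function E(x, y) = a*x + b*y + c
struct Edge { float a, b, c; };

// Accept strictly-inside points; break ties on the edge itself using the
// sign of (a, b), so shared-edge pixels are claimed by exactly one triangle
bool EvaluateEdgeWithTieBreak(const Edge& e, float x, float y) {
    float v = e.a * x + e.b * y + e.c;
    if (v != 0.f) return v > 0.f;                     // strictly inside/outside
    return (e.a > 0.f) || (e.a == 0.f && e.b > 0.f);  // on-edge tie-break
}
```

Production rasterizers phrase this as the top-left fill rule, but the mechanism is the same: a deterministic tie-break on the edge coefficients.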

If everything went well, we should be welcomed by a bunch of rendered cubes:

See you in Part III: Go Wild! where we will improve our visuals quite a bit by handling vertex attributes, as well as their perspective-correct interpolation.

Rasterization in One Weekend, Part I

Code for this part can be found here.

Hello, Triangle!

Mission statement: Given a set of three vertices which form a triangle, we want to rasterize the triangle onto our 2D screen.

Now, how do we define those vertices? Two of the options are raster- (or pixel-) space and normalized-device-coordinates (NDC) space. In raster-space, vertex coordinates are defined in terms of the actual screen resolution, whereas NDC works directly with the [-1, 1] range to simplify things, i.e. to be independent of screen resolution and pixel density. Here're the vertices of our first triangle:

    // 2DH (x, y, w) coordinates of our triangle's vertices, in counter-clockwise order
    glm::vec3 v0(-0.5, 0.5, 1.0);
    glm::vec3 v1(0.5, 0.5, 1.0);
    glm::vec3 v2(0.0, -0.5, 1.0);

And now, we apply the viewport transformation so that the vertices are in raster-space before we rasterize them. Why do we even need this, you may ask. The reason is that you should always ensure not to mix values from different spaces in the same equations. This is sometimes tricky, because some values will magically work even if you are not careful about conversions, leading to the illusion that all the math is correct.

// Apply viewport transformation
v0 = TO_RASTER(v0);
v1 = TO_RASTER(v1);
v2 = TO_RASTER(v2);

Here’s the dumb macro to apply viewport transformation:

// Transform a given vertex in NDC [-1,1] to raster-space [0, {w|h}]
#define TO_RASTER(v) glm::vec3((g_scWidth * (v.x + 1.0f) / 2), (g_scHeight * (v.y + 1.f) / 2), 1.0f)

Next comes the part where we set up a vertex matrix, which we'll use to test whether a given screen-space point is inside our triangle or not. Wait, what? Why? It's built upon the fact that any point in a triangle can be defined as a weighted linear combination of its three vertices:

It follows that, alongside the three vertices we're given, if we have three other values defined at each vertex, we can interpolate them at any point using the barycentric coordinates at a given pixel, provided that the point is inside our triangle. Now, that's a really powerful property, but how do we find these three coefficients? What we're trying to do here is basically solve a system of linear equations:

Before we get into how we can solve such system, let’s clarify why we need them in the first place: They are the ingredients of one of the most fundamental equations in computer graphics, namely, edge functions.

An edge function, simply put, is a way of telling whether a point is to the left or right of (or exactly on) a given edge. And how's that useful? We have three edges and a point to test, which means that the point will be inside our triangle if and only if it's on the same side of all three edges. The "same-sidedness" part simply amounts to checking that each edge function has the same sign (which sign counts as inside depends on how you define your vertices' winding order).
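Before we express edge functions through the vertex matrix below, here is the idea in its most basic cross-product form, as a small standalone sketch: for the edge from (x0, y0) to (x1, y1), the sign of the 2D cross product tells which side of the edge the point (px, py) lies on, and zero means exactly on the edge:

```cpp
// 2D cross product of (edge vector) x (vector from edge start to the point):
// its sign encodes which side of the edge the point is on
float EdgeFunction(float x0, float y0, float x1, float y1,
                   float px, float py) {
    return (px - x0) * (y1 - y0) - (py - y0) * (x1 - x0);
}
```

The matrix-inverse machinery we build next computes the same functions, just normalized so they double as interpolation weights.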

If what we just described was at least somewhat intuitive, let's see how these findings fit into our goal of solving the system of linear equations we've just set up above.

To solve the above-mentioned system of linear equations with 9 unknowns, we need 9 knowns. Utilizing the edge functions we just set up, we can get exactly that! How, you say? Here's how:

As you can probably (if barely!) see (sorry for the image quality!), each edge function will be constant and equal to 1 on one edge while equal to 0 on the opposite edges. That's three known values per edge function, which gives us the nice 9 known values we were after. Now, with our vertex matrix and interpolation constraints all set up, we are ready to solve our system of linear equations:

After initializing our base vertex matrix, to compute the coefficients alpha, beta and gamma, we will simply invert our vertex matrix (the leftmost one) once per triangle and use the result as edge equations. Note that the (1, 0, 0), (0, 1, 0) and (0, 0, 1) values stem from each edge function being one on its edge and zero on the opposite ones.

// Initialize base vertex matrix
glm::mat3 M =
{
    { v0.x, v1.x, v2.x },
    { v0.y, v1.y, v2.y },
    { v0.z, v1.z, v2.z },
};

// Compute the inverse of vertex matrix to use it for setting up edge functions
M = glm::inverse(M);

// Calculate edge functions based on the vertex matrix
glm::vec3 E0 = M * glm::vec3(1, 0, 0);
glm::vec3 E1 = M * glm::vec3(0, 1, 0);
glm::vec3 E2 = M * glm::vec3(0, 0, 1);
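
To sanity-check that the columns of the inverted vertex matrix really behave as edge functions (1 at “their” vertex, 0 at the other two), here’s an illustrative glm-free sketch with a hand-rolled 3×3 inverse; the row-major vertex matrix has the vertices (x_i, y_i, 1) as rows, and the triangle values are made up:

```cpp
#include <cassert>
#include <cmath>

struct Mat3 { float m[3][3]; }; // row-major: m[row][col]

// Inverse via the adjugate; assumes a non-singular matrix (non-degenerate triangle)
static Mat3 inverse3x3(const Mat3& a)
{
    const auto& m = a.m;
    float det = m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    float inv = 1.0f / det;
    Mat3 r;
    r.m[0][0] =  (m[1][1] * m[2][2] - m[1][2] * m[2][1]) * inv;
    r.m[0][1] = -(m[0][1] * m[2][2] - m[0][2] * m[2][1]) * inv;
    r.m[0][2] =  (m[0][1] * m[1][2] - m[0][2] * m[1][1]) * inv;
    r.m[1][0] = -(m[1][0] * m[2][2] - m[1][2] * m[2][0]) * inv;
    r.m[1][1] =  (m[0][0] * m[2][2] - m[0][2] * m[2][0]) * inv;
    r.m[1][2] = -(m[0][0] * m[1][2] - m[0][2] * m[1][0]) * inv;
    r.m[2][0] =  (m[1][0] * m[2][1] - m[1][1] * m[2][0]) * inv;
    r.m[2][1] = -(m[0][0] * m[2][1] - m[0][1] * m[2][0]) * inv;
    r.m[2][2] =  (m[0][0] * m[1][1] - m[0][1] * m[1][0]) * inv;
    return r;
}

// Evaluate edge function i (column i of M^-1) at sample point (x, y, 1)
static float edgeEval(const Mat3& invM, int i, float x, float y)
{
    return invM.m[0][i] * x + invM.m[1][i] * y + invM.m[2][i] * 1.0f;
}
```

For the triangle (0, 0), (4, 0), (0, 4), edge function 0 evaluates to 1 at vertex 0, to 0 at vertex 1, and to 1/3 at the centroid — exactly the barycentric behavior we wanted.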

And now the last part, where we simply walk along the pixels in our frame buffer, check whether they are inside our triangle and, if so, output a color to see something resembling a triangle. I believe this is the least complicated part, so let’s let the code speak:

    // Start rasterizing by looping over pixels to output a per-pixel color
    for (auto y = 0; y < g_scHeight; y++)
    {
        for (auto x = 0; x < g_scWidth; x++)
        {
            // Sample location at the center of each pixel
            glm::vec3 sample = { x + 0.5f, y + 0.5f, 1.0f };

            // Evaluate edge functions at every fragment
            float alpha = glm::dot(E0, sample);
            float beta = glm::dot(E1, sample);
            float gamma = glm::dot(E2, sample);

            // If the sample is "inside" all three half-spaces bounded by the three edges of our triangle, it's 'on' the triangle
            if ((alpha >= 0.0f) && (beta >= 0.0f) && (gamma >= 0.0f))
                frameBuffer[x + y * g_scWidth] = glm::vec3(1, 0, 0);
        }
    }

If all went well, the output of a single frame should look something like this:

And just like that, we completed our entrée into the world of rasterization! Stay tuned for Part II: Go 3D!, where we will get a little sense of 3D and depth.

Rasterization in One Weekend

Welcome to mini-series of Rasterization in One Weekend!

Having lately gotten tired of (well, not really) chasing instructions and signals inside GPUs, I set out to do a small project. As I recently got into making another SW-based rasterizer, one that utilizes modern techniques efficiently rather than pretending it’s ’65, I wanted to add some basic, beginner-friendly code/information/documents here that one can use to make their very first rasterizer, and maybe realize along the way that the fundamentals of the math behind these very powerful machines can be understood without too much difficulty after all.

Here, we’ll be closely following a method called “Triangle Scan Conversion using 2D Homogeneous Coordinates” (from one of my all-time favorite papers). Note that we strictly restrict ourselves to rasterization of triangles; different algorithms can be used for different primitive types such as points, lines, etc.

Remember also that, although what lies below can provide good hints about the way GPUs actually work, it’s all immensely simplified. The path from ‘1 static object with 12 triangles’ to ‘100,000 animated objects with 10 million triangles’ on-screen is a long and tedious one. And unlike its rival, rasterization is not the kind of algorithm whose parallelism would embarrass you. That’s why actual implementations employ many, many smart algorithms, difficult and expensive to implement, that have taken very smart people many years of R&D, sweat and tears to develop in order to provide the required rendering capabilities. With that said, I hope that by the end I’ll at least have helped you scratch the surface a little bit! So dive in:

All code can be found here.


Why Clip Transformed Vertices in Clip-Space?

As a noob in rendering pipelines and computer graphics in general, something that was very puzzling to me, and that I’d just accepted as-is without truly grasping, was that in a 3D rendering pipeline the clipping of transformed vertices takes place in another space, oddly named “clip-space”, before the perspective division is done. Why the hell?!

As you may know, GPUs invoke a user-defined program called the vertex shader for each vertex of a given primitive, to transform the primitive from whatever space it was defined or modeled in to clip-space. The result of this invocation is a 4-element vector in homogeneous coordinates that is used during the clip test: only those primitives (more like their vertices, but anyways) that survive the clip test are passed down the pipeline, while the rest are clipped, i.e. removed from the pipeline. GPUs do clipping not only to save compute and bandwidth on primitives that would otherwise be wasted work because they fall outside the viewport, but also to guarantee that rasterizers, poor guys having to deal with lots of corner cases already, work within a well-defined distance from the eye/camera/viewpoint and don’t have to handle arbitrarily positioned vertices (think of numerical precision). That is the clipping operation in a nutshell; however, it might not be immediately obvious to you, as it was not to me, why we invent yet another confusing space only to apply clipping in it.
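
To make the clip test concrete, here’s a tiny sketch (my own illustration with GL-style conventions, not how any particular HW implements it) of the per-vertex “trivially inside” test against the view volume in clip-space:

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// GL convention: a clip-space point is inside the view volume
// iff -w <= x, y, z <= w (which implies w > 0 for any non-zero point)
static bool insideClipVolume(const Vec4& v)
{
    return (-v.w <= v.x && v.x <= v.w) &&
           (-v.w <= v.y && v.y <= v.w) &&
           (-v.w <= v.z && v.z <= v.w);
}
```

Note that for w < 0 the condition -w <= x <= w cannot hold, so vertices behind the camera fail this test automatically, with no division involved.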

If you think about it, there are 3 candidate locations in the whole 3D pipeline where we could handle the clipping:

  1. Before the perspective/orthographic/what-have-you projection, using the planes of the view frustum
  2. In 4D clip-space, before the perspective division
  3. In 3D space after the perspective division, i.e. working in NDC

The first doesn’t seem so suitable to me, as the calculations would be tied to the way the camera (Is it perspective? Is it orthographic?) is modeled in a 3D application. How about you?

The second, hmm let’s set it aside for a second.

The third looks like a good candidate: we are already done with projecting the vertices and dividing by w to give the good old feeling of three-dimensional perspective, and we’d be able to work within the view-volume cuboid, handling all degenerate cases like NaNs/Infs nice and easy. But let’s see what happens to the vertices behind the eye/camera/viewpoint:


What is going on in this picture is that the point behind the origin, Q2, with a negative w value, when simply transformed via the vertex shader to projection-space first and then perspective-divided, projects to a point in front of the viewer as if it were visible, which is wrong. What we want is to detect such points with negative w values before applying the perspective division, because unless we do something to preserve the fact that this point had w < 0 and was behind the camera, we will lose that information: after the perspective division, it will have been projected to a point in front of the eye.
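
To put numbers on it (made-up values, a minimal sketch): dividing a w < 0 point by its w flips the sign of every component, so it lands at a perfectly “valid-looking” NDC position, which is exactly the information loss described above:

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };
struct Vec3 { float x, y, z; };

// Perspective division: clip-space -> NDC
static Vec3 perspectiveDivide(const Vec4& v)
{
    return { v.x / v.w, v.y / v.w, v.z / v.w };
}
```

For example, a behind-the-camera point (-1, -2, -3, -2) divides to NDC (0.5, 1.0, 1.5); nothing in the result betrays that it started out with a negative w.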

Besides the fact that applying clipping in 4D clip-space greatly simplifies the math for the inside/outside tests against the clip planes, it also lets us clip primitives behind the camera rather easily. And if you think it’s a rare case for vertices to end up behind the camera in a 3D application, think again!

Now, even though what’s actually done in GPU HW can be waaaay different from what’s outlined here in terms of actual computations or implementations, it’s not that far from the truth and from what you’d see in the wild.

Please note that I over-simplified a lot for the sake of clarity, skipping questions like: what happens to partially visible primitives? what about guardband clipping? does HW really check the intersection against each half-plane? Really?! Maybe these will get to be an excuse for me to dust off my writing again.

Quick Introduction to Remote Debugging via gdb

I have recently been doing some remote app debugging on Linux-based systems, i.e. the host is on Ubuntu and the target machine is an Android-based device. I got to learn some of gdb’s remote debugging features along the way, so I wanted to write a quick post about it for newbies who might struggle, or for those who are used to the Visual Studio & Windows platform, like myself!

Let me get some basic terminology out of the way:

  • Host machine: The machine you debug your application from. It’s also where you generally keep the unstripped binaries.
  • Target device: The device which your app that you want to debug runs on. That could be any kind of computing platform such as an Android tablet.
  • Executable: This is the binary of your app, which we’ll later attach gdb to.
  • gdbserver: This is what we’re gonna use to let our host & target machines communicate. It must be installed on the target device beforehand.

The simple way we’re gonna realize the remote debugging is simply this:

  1. Debug-build your app with optimizations off, assertions enabled, etc.
  2. Run gdbserver on the target machine to expose the running app to gdb on the host machine
  3. Run gdb on the host machine and attach it to our app running on the target device
  4. Debug happily!

Now let’s imagine that you have already debug-built your app on your host machine to run it on your target device; let’s go with an Android device for our example. If your target device is a tiny embedded system or doesn’t have much storage and can’t hold your debug-built app, you can strip the binary before pushing it to the device:

strip -o my_stripped_binary my_binary # Keep the unstripped original on the host

adb push my_stripped_binary data/local/data/my_app/ # Push it to Android device

We’re ready to go!

On our target Android device:

cd data/local/data/my_app/

gdbserver :DEBUG_PORT my_stripped_binary app_arguments

Here, we cd into the app directory and call gdbserver with DEBUG_PORT, the port through which we’re going to debug remotely, followed by our binary and its arguments. DEBUG_PORT should be an open port, so if you’re on an office network or such, check what’s available!

On our host machine:

cd my_apps_directory


gdb # Start gdb on the host machine

file my_binary # Load the unstripped debug-built binary

target remote localhost:DEBUG_PORT

b main # Set a breakpoint in main() function

b my_function_to_debug.cpp:100 # Here we set a breakpoint in a FILE:LINE pattern

continue # Resume the app; it will stop at our breakpoints

If you try to debug an app now, you’ll hopefully find that the gdb/gdbserver pair on the two machines connect to each other, yet we can’t really see anything meaningful, because we’re missing an essential piece of info here: the .gdbinit file.

This file basically describes the debugging configuration under which we’re gonna run our app. You can find lots of info about it on Google, but the main things your .gdbinit file should cover are where your debug-built libraries are (so gdb can point us at the corresponding source), assembly style, which signals to ignore or stop at, etc. I generally create slightly different .gdbinit files for each app and put the one I currently want to use with gdb under $HOME/. Otherwise, you need to put it alongside your app on the target device. I intentionally don’t want to go much into the details of the .gdbinit file here because, for me, that’s how I figured out much of the gdb stuff, so I think if you tackle its usage, what flag does what and all, you’ll get the gist of gdb remote debugging pretty quickly. Otherwise, shoot me a question here and we’ll see what the problem is, maybe!

At any time, to see which libraries have been loaded by gdb, enter the following in the gdb console:

info sharedlibrary # Lists all libraries the app refers to
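
For reference, a minimal .gdbinit along these lines might look like the following sketch; the paths are hypothetical placeholders you’d adapt to your setup, while the commands themselves are standard gdb ones:

```
# Where gdb should look for the debug-built shared libraries (hypothetical path)
set solib-search-path /path/to/my_app/debug_libs

# Source directories, so FILE:LINE breakpoints resolve to source
directory /path/to/my_app/src

# Assembly style, in case you drop down to disassembly
set disassembly-flavor intel

# Example: don't stop on a signal the app handles itself
handle SIGPIPE nostop noprint pass
```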

That about does it here. As you may guess, there’s much more to learn and also much more that can go wrong, but gdb is heavily used, so it’s well covered all over the internet in fine detail.

Any suggestion and corrections are more than welcome! Thank you for your time!