A Short Introduction To SIMD-ifying Scalar Code

I finally took the opportunity that comes disguised as a lazy Sunday to write down a quick’n dirty post on how one could convert a super simple scalar function to SIMD for those who want to experiment with SIMD but feel intimidated by all those nasty-looking stuff. It’s mostly for absolute beginners, so it’ll be neither comprehensive nor eye-opening, but here it goes.

SIMD stands for Single Instruction Multiple Data where a SIMD-capable machine (e.g. a CPU or GPU) executes single instruction on multiple elements simultaneously. Thus, at least from a theoretical point of view, it offers a nice speed-up over traditional scalar code which operates on single element. Actual speed-up will vary depending on various factors.

In this post, we’ll use ubiquitous SSE instruction set, but I am hoping that by the end, converting the same code to AVX (Advanced Vector EXtensions) will be a simple job (and a nice exercise!). If you work with Visual Studio, be sure to check the VS Intrinsics Dude add-on out, which is a godsend for more serious SIMD development.

Now for the sake of this post, assume that we profiled our existing scalar code below and concluded with the not-so-nice fact that it’s been optimized to death and the only way to optimize it further is to put the idle SIMD execution units residing inside modern CPUs to work.

void ComputeResultsScalar(
    const int32_t* pAValues,
    const int32_t* pXValues,
    const int32_t* pBValues,
    int32_t* pResults)
    for (uint32_t i = 0; i < g_sNumInput; i++)
        // Load  a, x and b values
        int32_t a = pAValues[i];
        int32_t x = pXValues[i];
        int32_t b = pBValues[i];

        // Compute y = a * x + b
        int32_t y = a * x + b;

        int result;
        if (y > 0)
            result = 1;
        else if (y == 0)
            result = 0;
            result = -1;
        // Store result for single computation
        pResults[i] = result;

To re-iterate what the above scalar code does; it simply evaluates the linear equation y = ax + b and outputs a 1 if result is positive, -1 if result is negative, 0 otherwise for each input element. Now, let’s try to directly convert the above code to SIMD using SSE4 instruction set.

Whereas scalar code iterates over the input range one element at a time, with SSE we will be processing 4 elements at a time. The reason for this is that SSE instructions can operate on 4 single-precision or 32-bit integers at a time and we need 32-bit integers to compute our results. That is, we could have been just fine processing say, 16 8-bit chars at a time had it been suited to our use case. Another option would be to use AVX instruction set which has 8 lanes of single-precision FP or 32-bit integers or even AVX-512 which has 16 integer lanes!

void ComputeResultsSIMD4(
    const int32_t* pAValues,
    const int32_t* pXValues,
    const int32_t* pBValues,
    int32_t* pResults)
    for (uint32_t i = 0; i < g_sNumInput; i += g_scSIMDWidth)
        // Here be dragons

As in the scalar version, we start by fetching values for current iteration from input arrays:

__m128i sseA4 = _mm_load_si128(reinterpret_cast<const __m128i*>(pAValues + i));
__m128i sseX4 = _mm_load_si128(reinterpret_cast<const __m128i*>(pXValues + i));
__m128i sseB4 = _mm_load_si128(reinterpret_cast<const __m128i*>(pBValues + i));

mm_load_si128, as its name may suggest, loads 128 bits of data from memory specified via mem_addr parameter. What’s the returned variable, though? It’s not an int or a pointer to int (where 4 32-bit integers were loaded, maybe?). What’s going on is, SIMD-capable CPUs contain, in addition to general-purpose and architectural registers that are used for scalar code execution, SIMD registers. These registers are used to store multiple elements (4 32-bit integers or FP in case of SSE, for instance) during execution of SIMD operations. In x64 SSE, there are 16 of them, named XMM0, XMM1, XMM2, …, XMM15. Hence, data fetched from memory via _mm_load_si128 will be loaded in one of those registers.

It’d be also helpful to look at the SIMD instruction piece-wise: _mm prefix used here denotes it’s of SSE instruction set (well, MMX too) whereas load is the function and si128 is the type (integer) of data given instruction operates on. For example, identical instruction for single-precision FP would be _mm_load_ps where ps stands for packed single-precision. It helps quite a bit to internalize that format for quick iterations.

Now that input data is loaded, we can proceed with doing the actual computation y = a * x + b:

_m128i sseY4 = _mm_mullo_epi32(sseA4, sseX4);
sseY4 = _mm_add_epi32(sseY4, sseB4);

Apart from rather awkward and long function names to do what takes a single token in scalar land, it’s fairly straightforward if you try to parse it as above again. First, we multiply 4 a values with 4 x values and then add 4 b values to the result. That is, we call for first multiplication and then addition to happen on SIMD registers which contain a, x and b values for current iteration.

Now that we are done computing y (remember 4 of them!) values, we’d like to compare them to write out 1, 0 or -1 depending on their signs. How do we do that?!

You may have heard before that SIMD code doesn’t mesh well with branchy code. Here’s the reason. If we were to individually check each y to determine its sign in order to store expected value, that’d lead to 4 branches in each iteration! 4! Let alone the overhead of moving data between scalar and SIMD registers repeatedly, branches are one of the biggest enemies of high performance on modern OoO super-scalar CPUs to begin with, so it kind of defeats the purpose of using SIMD for performance (use of SIMD doesn’t come free either).

[Of course, that’s not to say that if your algorithm has unavoidable branches and you still need to use SIMD, you can’t. You sure can. Just be very cautious of plans to SIMD-ify branch-heavy scalar code.]

Hence, to combat the need to use conditionals in SIMD, some other tactics must better be utilized. Key to all of this is masking. What SIMD masks do principally is that they help us mask out values which won’t fit a given criteria (e.g. condition or predicate). As it happens, SSE and other SIMD instruction sets also include instructions which take a mask as input, such that they’d only perform the instruction only on those lanes for which given mask holds true. It’s one of those things that are easier done than said!

Below code uses the SSE comparison instructions to generate 3 masks. We need 3 masks, because we have 3 possible conditions for y to be: They can be positive, negative or equal to zero.

__m128i sseMaskPositive = _mm_cmpgt_epi32(sseY4, _mm_set1_epi32(0));
__m128i sseMaskZero = _mm_cmpeq_epi32(sseY4, _mm_set1_epi32(0));
__m128i sseMaskNegative = _mm_cmplt_epi32(sseY4, _mm_set1_epi32(0));

sseMaskPositive contains 0xFFFFFFFF on each lane where corresponding y value was positive (e.g. greater-than 0) and 0x0 where it was not (e.g. less-than or equal-to 0). Similarly, sseMaskZero contains 0xFFFFFFFF on each lane where corresponding y value was exactly zero, etc.

Now that masks are done, we store the values that we want by write-masking them with corresponding masks.

_mm_maskmoveu_si128(_mm_set1_epi32(1), sseMaskPositive, reinterpret_cast<char*>(pResults + i));
_mm_maskmoveu_si128(_mm_set1_epi32(0), sseMaskZero, reinterpret_cast<char*>(pResults + i));
_mm_maskmoveu_si128(_mm_set1_epi32(-1), sseMaskNegative, reinterpret_cast<char*>(pResults + i));

We strangely use _mm_maskmoveu_si128 instruction to do this. Why? It turns out and happens not so rarely in SIMD development, what we exactly want to do is not present in the instruction set — an instruction which’d take a mask to write 32-bit integers only on lanes where mask is non-zero, something like to __mm_maskstore_epi32. Instead, _mm_maskmoveu_si128 can be used to write-mask 16 8-bit integers. Notice the u suffix? It denotes that this instruction can handle unaligned data accesses.

I listed another method to achieve the exact same results in a slightly more optimal way, give it a look and try to understand why it works.

The Untold

Wrapping up our tour around Intro-To-SIMD land, a few things are left to be noted.

First and foremost, writing SIMD code manually can be mentally taxing, tiring, error-prone and up-front sub-optimal compared to scalar code which compilers are pretty good at to optimize. You won’t reach peak performance if you don’t transform your data to be in a cache-friendly layout.

Next, not only compilers are good at optimizing scalar code, they also happen to be moderately good at auto-vectorizing scalar code to SIMD. Focusing on having compiler(s) at hand succeed at this could yield a much better performance at the end of the day.

With that said and looking at generated assembly from our sample, you can’t really rely on that: https://godbolt.org/z/N8MWjx . Here, you can see that while GCC is perfectly fine with converting the scalar code to SIMD — even if it has a branch in it, simple as it may be, MSVC (maybe unsurprisingly) fails do so. It interestingly succeeds at this version with 3-line change: https://godbolt.org/z/QSSqbA which sort of proves my point! 🙂

A solution which covers most, if not all of these points is ISPC. Although I have never used it yet, it’s highly praised, found helpful and optimal.

The End

Find the code for this little experiment here: https://github.com/NotCamelCase/ShortIntroToSIMD

Here are a few ideas for you to exercise what we just went through:

  • Find a third way to store 1, 0 or -1 depending on the sign of y
  • Implement the same scalar code with AVX instructions
  • Convert a slightly more complicated scalar code, maybe solve a quadratic equation?