FizzBuzz – SIMD Style!


To see how the Vector API can help with improving the performance of some calculation, let’s consider FizzBuzz. Originally, FizzBuzz is a game to help teaching children division; but interestingly, it also serves as entry-level interview question for hiring software engineers in some places. In any case, it’s a nice example for exploring how some calculation can benefit from vectorization. The rules of FizzBuzz are simple:

  • Numbers are counted and printed out: 1, 2, 3, …​

  • If a number if divisible by 3, instead of printing the number, print "Fizz"

  • If a number if divisible by 5, print "Buzz"

  • If a number if divisible by 3 and 5, print "FizzBuzz"

As the Vector API concerns itself with numeric values instead of strings, rather than "Fizz", "Buzz", and "FizzBuzz", we’re going to emit -1, -2, and -3, respectively. The input of the program will be an array with the numbers from 0 …​ 256, the output an array with the FizzBuzz sequence:

1, 2, -1, 4, -2, -1, 7, 8, -1, -2, 11, -1, 13, 14, -3, 16, ...

The task is easily solved using a plain for loop processing scalar values one by one:

private static final int FIZZ = -1;
private static final int BUZZ = -2;
private static final int FIZZ_BUZZ = -3; public int[] scalarFizzBuzz(int[] values) { int[] result = new int[values.length]; for (int i = 0; i < values.length; i++) { int value = values[i]; if (value % 3 == 0) { if (value % 5 == 0) { (1) result[i] = FIZZ_BUZZ; } else { result[i] = FIZZ; (2) } } else if (value % 5 == 0) { result[i] = BUZZ; (3) } else { result[i] = value; (4) } } return result;
}
1 The current number is divisible by 3 and 5: emit FIZZ_BUZZ (-3)
2 The current number is divisible by 3: emit FIZZ (-1)
3 The current number is divisible by 5: emit BUZZ (-2)
4 The current number is divisible by neither 3 nor 5: emit the number itself

As a baseline, this implementation can be executed ~2.2M times per second in a simple JMH benchmark running on my Macbook Pro 2019, with a 2.6 GHz 6-Core Intel Core i7 CPU:

Benchmark (arrayLength) Mode Cnt Score Error Units
FizzBuzzBenchmark.scalarFizzBuzz 256 thrpt 5 2204774,792 ± 76581,374 ops/s

Now let’s see how this calculation could be vectorized and what performance improvements can be gained by doing so. When looking at the incubating Vector API, you may be overwhelmed at first by its large API surface. But it’s becoming manageable once you realize that all the types like IntVector, LongVector, etc. essentially expose the same set of methods, only specific for each of the supported data types (and indeed, as per the JavaDoc, all these classes were not hand-written by some poor soul, but generated, from some sort of parameterized template supposedly).

Amongst the plethora of API methods, there is no modulo operation, though (which makes sense, as for instance there isn’t such instruction in any of the x86 SIMD extensions). So what could we do to solve the FizzBuzz task? After skimming through the API for some time, the method blend​(Vector<Integer> v, VectorMask<Integer> m) caught my attention:

Replaces selected lanes of this vector with corresponding lanes from a second input vector under the control of a mask. […​]

  • For any lane set in the mask, the new lane value is taken from the second input vector, and replaces whatever value was in the that lane of this vector.

  • For any lane unset in the mask, the replacement is suppressed and this vector retains the original value stored in that lane.

This sounds pretty useful; The pattern of expected -1, -2, and -3 values repeats every 15 input values. So we can "pre-calculate" that pattern once and persist it in form of vectors and masks for the blend() method. While stepping through the input array, the right vector and mask are obtained based on the current position and are used with blend() in order to mark the values divisible by 3, 5, and 15 (another option could be min(Vector<Integer> v), but I decided against it, as we’d need some magic value for representing those numbers which should be emitted as-is).

Here is a visualization of the approach, assuming a vector length of eight elements ("lanes"):

Determining FizzBuzz Values Via Vector Blending

So let’s see how we can implement this using the Vector API. The mask and second input vector repeat every 120 elements (least common multiple of 8 and 15), so 15 masks and vectors need to be determined. They can be created like so:

public class FizzBuzz { private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; (1) private final List<VectorMask<Integer>> resultMasks = new ArrayList<>(15); private final IntVector[] resultVectors = new IntVector[15]; public FizzBuzz() { List<VectorMask<Integer>> threes = Arrays.asList( (2) VectorMask.<Integer>fromLong(SPECIES, 0b00100100), VectorMask.<Integer>fromLong(SPECIES, 0b01001001), VectorMask.<Integer>fromLong(SPECIES, 0b10010010) ); List<VectorMask<Integer>> fives = Arrays.asList( (3) VectorMask.<Integer>fromLong(SPECIES, 0b00010000), VectorMask.<Integer>fromLong(SPECIES, 0b01000010), VectorMask.<Integer>fromLong(SPECIES, 0b00001000), VectorMask.<Integer>fromLong(SPECIES, 0b00100001), VectorMask.<Integer>fromLong(SPECIES, 0b10000100) ); for(int i = 0; i < 15; i++) { (4) VectorMask<Integer> threeMask = threes.get(i%3); VectorMask<Integer> fiveMask = fives.get(i%5); resultMasks.add(threeMask.or(fiveMask)); (5) resultVectors[i] = IntVector.zero(SPECIES) (6) .blend(FIZZ, threeMask) .blend(BUZZ, fiveMask) .blend(FIZZ_BUZZ, threeMask.and(fiveMask)); } }
}
1 A vector species describes the combination of an vector element type (in this case Integer) and a vector shape (in this case 256 bit); i.e. here we’re going to deal with vectors that hold 8 32 bit int values
2 Vector masks describing the numbers divisible by three (read the bit values from right to left)
3 Vector masks describing the numbers divisible by five
4 Let’s create the fifteen required result masks and vectors
5 A value in the output array should be set to another value if it’s divisible by three or five
6 Set the value to -1, -2, or -3, depending on whether its divisible by three, five, or fifteen, respectively; otherwise set it to the corresponding value from the input array

With this infrastructure in place, we can implement the actual method for calculating the FizzBuzz values for an arbitrarily long input array:

public int[] simdFizzBuzz(int[] values) { int[] result = new int[values.length]; int i = 0; int upperBound = SPECIES.loopBound(values.length); (1) for (; i < upperBound; i += SPECIES.length()) { (2) IntVector chunk = IntVector.fromArray(SPECIES, values, i); (3) int maskIdx = (i/SPECIES.length())%15; (4) IntVector fizzBuzz = chunk.blend(resultValues[maskIdx], resultMasks[maskIdx]); (5) fizzBuzz.intoArray(result, i); (6) } for (; i < values.length; i++) { (7) int value = values[i]; if (value % 3 == 0) { if (value % 5 == 0) { result[i] = FIZZ_BUZZ; } else { result[i] = FIZZ; } } else if (value % 5 == 0) { result[i] = BUZZ; } else { result[i] = value; } } return result;
}
1 determine the maximum index in the array that’s divisible by the species length; e.g. if the input array is 100 elements long, that’d be 96 in the case of vectors with eight elements each
2 Iterate through the input array in steps of the vector length
3 Load the current chunk of the input array into an IntVector
4 Obtain the index of the right result vector and mask
5 Determine the FizzBuzz numbers for the current chunk (i.e. that’s the actual SIMD instruction, processing all eight elements of the current chunk at once)
6 Copy the result values at the right index into the result array
7 Process any remainder (e.g. the last four remaining elements in case of an input array with 100 elements) using the traditional scalar approach, as those values couldn’t fill up another vector instance

To reiterate what’s happening here: instead of processing the values of the input array one by one, they are processed in chunks of eight elements each by means of the blend() vector operation, which can be mapped to an equivalent SIMD instruction of the CPU. In case the input array doesn’t have a length that’s a multiple of the vector length, the remainder is processed in the traditional scalar way. The resulting duplication of the logic seems a bit inelegant, we’ll discuss in a bit what can be done about that.

For now, let’s see whether our efforts pay off; i.e. is this vectorized approach actually faster then the basic scalar implementation? Turns out it is! Here are the numbers I get from JMH on my machine, showing through-put increasing by factor 3:

Benchmark (arrayLength) Mode Cnt Score Error Units
FizzBuzzBenchmark.scalarFizzBuzz 256 thrpt 5 2204774,792 ± 76581,374 ops/s
FizzBuzzBenchmark.simdFizzBuzz 256 thrpt 5 6748723,261 ± 34725,507 ops/s

Is there anything that could be further improved? I’m pretty sure, but as said I’m not an expert here, so I’ll leave it to smarter folks to point out more efficient implementations in the comments. One thing I figured is that the division and modulo operation for obtaining the current mask index isn’t ideal. Keeping a separate loop variable that’s reset to 0 after reaching 15 proved to be quite a bit faster:

public int[] simdFizzBuzz(int[] values) { int[] result = new int[values.length]; int i = 0; int j = 0; int upperBound = SPECIES.loopBound(values.length); for (; i < upperBound; i += SPECIES.length()) { IntVector chunk = IntVector.fromArray(SPECIES, values, i); IntVector fizzBuzz = chunk.blend(resultValues[j], resultMasks[j]); fizzBuzz.intoArray(result, i); j++; if (j == 15) { j = 0; } } // processing of remainder...
}
Benchmark (arrayLength) Mode Cnt Score Error Units
FizzBuzzBenchmark.scalarFizzBuzz 256 thrpt 5 2204774,792 ± 76581,374 ops/s
FizzBuzzBenchmark.simdFizzBuzz 256 thrpt 5 6748723,261 ± 34725,507 ops/s
FizzBuzzBenchmark.simdFizzBuzzSeparateMaskIndex 256 thrpt 5 8830433,250 ± 69955,161 ops/s

This makes for another nice improvement, yielding 4x the throughput of the original scalar implementation. Now, to make this a true apple-to-apple comparison, a mask-based approach can also be applied to the purely scalar implementation, only that each value needs to be looked up individually:

private int[] serialMask = new int[] {0, 0, -1, 0, -2, -1, 0, 0, -1, -10, 0, -1, 0, 0, -3}; public int[] serialFizzBuzzMasked(int[] values) { int[] result = new int[values.length]; int j = 0; for (int i = 0; i < values.length; i++) { int res = serialMask[j]; result[i] = res == 0 ? values[i] : res; j++; if (j == 15) { j = 0; } } return result;
}

Indeed, this implementation is quite a bit better than the original one, but still the SIMD-based approach is more than twice as fast:

Benchmark (arrayLength) Mode Cnt Score Error Units
FizzBuzzBenchmark.scalarFizzBuzz 256 thrpt 5 2204774,792 ± 76581,374 ops/s
FizzBuzzBenchmark.scalarFizzBuzzMasked 256 thrpt 5 4156751,424 ± 23668,949 ops/s
FizzBuzzBenchmark.simdFizzBuzz 256 thrpt 5 6748723,261 ± 34725,507 ops/s
FizzBuzzBenchmark.simdFizzBuzzSeparateMaskIndex 256 thrpt 5 8830433,250 ± 69955,161 ops/s