AVX512 operates on 64-byte register. The extension AVX512VBMI introduced
instruction `VPERMB` (`_mm512_permutexvar_epi32`) which similarly to
`PSHUFB` can change order of bytes within an AVX512 register. For readers
who are not familiar with AVX512 nuances, the extension AVX512BW also defines
a byte shuffling instruction, but it operates on the register lanes (16-byte
subvectors) not the whole register.

Since AVX512 registers are so wide, use of a lookup table to fetch
shuffle patterns is simply impossible. Such table would occupy
2^{70} bytes (2^{40} is 1TB).

Instead of precalculating shuffle patterns I propose to build them in
**runtime**. This might seem not optimal on the first sight, but the
evaluation shows that it's not that bad.

In AVX512 code we process 64-byte block. From the input vector we obtain a
64-bit mask for spaces, and then modify **the shuffle vector** using this
mask.

Initially the shuffle vector defines identity transformation, i.e. if applied to the shuffle instruction it would copy all i-th input bytes onto i-th output byte. Technically, the vector contains sequence of bytes from 0 to 63.

Let's assume there's exactly one space in the input vector,
say at the position 5; this will be our **building block** for the
rest of algorithm.

The shuffle vector:

shuffle = [0, 1, 2, 3, 4, 5, 6, 7, 8, ...] ^

must become:

shuffle = [0, 1, 2, 3, 4, 6, 7, 8, 9, ...]

In other words we must perform following vector addition:

[0, 1, 2, 3, 4, 5, 6, 7, 8, ...] + [0, 0, 0, 0, 0, 1, 1, 1, 1, ...]

To do this we use a nice AVX512 facility, **the masked add**.

const __m512i = _mm512_set1_epi8(1); const uint64_t addmask = /* ??? */ shuffle = _mm512_mask_add_epi8(shuffle, addmask, shuffle, ones);

But how to cheaply calculate a proper mask? From mask `000...000100000`,
we have to get `111...111100000`, i.e. all bits above the set bit must
also be ones. The input mask has exactly one bit set; we subtract 1:

000...00100000 - 1 = 000...000011111

Now all bits below become 1, thus a bit negation yields the required bit pattern. The full expression in C is like this:

const uint64_t addmask = ~(mask - 1);

Now let's consider a more complex case. The input vector contains two
non-adjacent spaces. Assume the first one is at index 2, and the
second one at 5, thus the bit mask is `000...000100100`.

First we isolate the lowest bit set using expression `(x & -x)`,
or `x & (~x + 1)`; on an AVX512VBMI CPU this expression should be
compiled into single instruction `BLSI`:

// mask = 000...00000100100 // first = 000...00000000100 uint64_t first = (mask & -mask);

Since the mask `first` has exactly one bit set, we use the procedure
described in the previous section to modify the shuffle pattern:

shuffle = [0, 1, 2, 3, 4, 5, 6, 7, 8, ...] + [0, 0, 1, 1, 1, 1, 1, 1, 1, ...] = [0, 1, 3, 4, 5, 6, 7, 8, 9, ...]

Now, we reset the first bit set from mask:

// mask = 000...00000100100 // ^ 000...00000000100 // = 000...00000100000 mask = mask ^ first;

And we can again extract the lowest bit set. Hold on, can we? No, it's not possible as the shuffle pattern has just been changed, thus our initial 5th bit doesn't indicate the space character. Since one character was skipped, the another space character is at index 4.

To reflect this change the mask must be **shifted right by 1**:

mask >>= 1;

Now, we might safely extract the lowest bit set and modify the shuffle pattern:

shuffle = [0, 1, 3, 4, 5, 6, 7, 8, 9, ...] + [0, 0, 0, 0, 1, 1, 1, 1, 1, ...] = [0, 1, 3, 4, 6, 7, 8, 9, 10, ...]

Obviously, if there are more ones in the mask, we need to carry on the above procedure (extract bit, reset, shift). If mask becomes zero we stop modifying the shuffle vector.

There's still one problem to solve, what if there are more spaces in a row.
For instance, the run has three ones starts at index 2: `000...00011100`.

We need to modify shuffle vector starting from index 2, but increment is 3 not 1:

shuffle = [0, 1, 2, 3, 4, 5, 6, 7, 8, ...] + [0, 0, 3, 3, 3, 3, 3, 3, 3, ...] = [0, 1, 5, 6, 7, 8, 9, 10, 11, ...]

Firstly, we must save the position (mask) for the first bit of run. Then we need detect if the next set bit (a) continues the run, or (b) starts a new one. If it's a continuation, we increment by one a vector that holds run's length. If it's a new run, the shuffle vector is modified with length vector.

uint64_t first; uint64_t curr; __m128i increment = ones; first = (mask & -mask); mask = (mask ^ first) >> 1; while (/* some condition*/) { curr = (mask & -mask); mask = (mask ^ curr) >> 1; if (/* run continues */) increment = _mm512_add_epi8(increment, ones); else { /* finalize the previous run */ shuffle = _mm512_mask_add_epi8(shuffle, ~(first - 1), shuffle, increment); /* initialize a new one */ first = curr increment = ones; } }

How to detect the continuation? We need to keep the previously extracted
mask, if it's **equal** to the currently extracted mask, it's a
continuation. Equality works because after each extraction the mask is
shifted.

Below is sequence of values which appear during analysing the second
bit of a run; as we see the mask `curr` is equal to `prev`.

// mask = 000...00011100 // prev = 000...00000100 prev = (mask & -mask) // mask = 000...00001100 mask = (mask ^ curr) >> 1; // curr = 000...00000100 curr = (mask & -mask);

Below is an actual AVX512VBMI code which implements all the techniques presented above.

char* remove_spaces__avx512vbmi(const char* src, char* dst, size_t n) { assert(n % 64 == 0); // values 0..63 const __m512i no_gaps_indices = _mm512_setr_epi32( 0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c, 0x13121110, 0x17161514, 0x1b1a1918, 0x1f1e1d1c, 0x23222120, 0x27262524, 0x2b2a2928, 0x2f2e2d2c, 0x33323130, 0x37363534, 0x3b3a3938, 0x3f3e3d3c ); const __m512i ones = _mm512_set1_epi8(1); const __m512i NL = _mm512_set1_epi8('\n'); const __m512i CR = _mm512_set1_epi8('\r'); const __m512i spaces = _mm512_set1_epi8(' '); size_t len; for (size_t i=0; i < n; i += 64) { const __m512i input = _mm512_loadu_si512((const __m512i*)(src + i)); __m512i output; uint64_t mask = _mm512_cmpeq_epi8_mask(input, spaces) | _mm512_cmpeq_epi8_mask(input, NL) | _mm512_cmpeq_epi8_mask(input, CR); if (mask) { len = 64 - __builtin_popcountll(mask); __m512i indices = no_gaps_indices; __m512i increment = ones; uint64_t first; uint64_t prev; first = (mask & -mask); prev = first; mask ^= first; mask >>= 1; while (mask) { const uint64_t curr = (mask & -mask); mask ^= curr; mask >>= 1; if (prev == curr) { increment = _mm512_add_epi8(increment, ones); prev = curr; } else { indices = _mm512_mask_add_epi8(indices, ~(first - 1), indices, increment); first = curr; prev = curr; increment = ones; } } indices = _mm512_mask_add_epi8(indices, ~(first - 1), indices, increment); output = _mm512_permutexvar_epi8(indices, input); } else { output = input; len = 64; } _mm512_storeu_si512((__m512i*)(dst), output); dst += len; } return dst; }