Checking UTF-8 string validity using SIMD intrinsics is the logical performance optimization to make (I have significant experience with this myself). However, in reaching first for our largest optimization hammer, have we prematurely neglected possible performant scalar solutions? In particular, the state-machine-like nature of UTF-8 validity checking strongly suggests a lookup-table solution.
So how might a non-SIMD, lookup-table (effectively 16-bit) version operate, and what might its performance be? The short answer: lookup-table-based UTF-8 validation can outperform SIMD-128-bit validation, but does not do so on average due to caching effects. First, like other UTF-8 validation schemes, we start with a decoder state. However, since these states will index into lookup tables, we will try to reduce their total number to as few as possible (the UTF-8 decode rules appear in Lemire's post). Below is the graph that maps each consumed byte (in hex) to the next decoder state of the enum.
|Complete non-degenerate decoder state graph|
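As a concrete sketch, such a reduced state set might look like the enum below. The names and the exact split of states are illustrative assumptions, not necessarily the states used in the final tables:

#include <cstdint>

// Hypothetical reduced decoder states (9 states, fitting comfortably in 4 bits).
// The extra CONT_* variants enforce the UTF-8 rules against overlong
// encodings, surrogates, and code points above U+10FFFF.
enum DecoderState : uint8_t {
    READY_NEW = 0,  // expecting the first byte of a new code point
    CONT_1,         // one generic continuation byte (0x80..0xBF) remaining
    CONT_2,         // two continuation bytes remaining
    CONT_3,         // three continuation bytes remaining
    CONT_2_E0,      // after 0xE0: next byte must be 0xA0..0xBF (no overlongs)
    CONT_2_ED,      // after 0xED: next byte must be 0x80..0x9F (no surrogates)
    CONT_3_F0,      // after 0xF0: next byte must be 0x90..0xBF (no overlongs)
    CONT_3_F4,      // after 0xF4: next byte must be 0x80..0x8F (max U+10FFFF)
    INVALID         // sticky error state
};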
Once we have our states, we can begin with an initial naive lookup-table implementation: a single table that takes the current state and the current byte of the array and maps them to the next decoder state. This table is simple to construct, taking a brute-force 2^(4+8) operations. Below is a graphical illustration of execution, with the decoding state propagating from one lookup to the next.
|Flow of decoder state passing through lookup tables|
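As a sketch, the brute-force construction might look like this, assuming a nextState(state, byte) helper that encodes the transition graph above (the helper name and layout are assumptions, not the actual code):

// Hypothetical brute-force build: 16 states (4 bits) x 256 bytes = 2^(4+8)
// entries, laid out so the index matches *src | (state << 8) in the loop below.
uint8_t tableDecode[16 * 256];
for (unsigned state = 0; state < 16; ++state)
    for (unsigned byte = 0; byte < 256; ++byte)
        tableDecode[byte | (state << 8)] = nextState(state, byte);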
The validation loop itself is then just:

uint8_t runState = READY_NEW;
while (len-- > 0) {
    runState = tableDecode[*src++ | (runState << 8)];
}

So how does this naive implementation perform? The answer is: not well at all! It beats a branch-and-test implementation but isn't anywhere close to the original SIMD-128 version published by Lemire. The problem with this naive lookup-table approach is the latency of a memory load. Even for data in L1 cache, the latency of an indexed lookup is at least 5 cycles, so the fastest this naive approach will ever run is 5 cycles per byte. While modern hardware can issue multiple instructions in a single cycle (otherwise known as superscalar execution), the data dependency here, where each lookup must complete before the next can begin, prevents this technique from providing any additional performance. To go beyond this naive implementation we need to reduce instruction dependencies and decode more than one byte at a time.
|Naive Lookup Implementation will take at least 5 cycles per byte|
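To put that bound in perspective: at a nominal 3 GHz clock (just an assumed example figure), 5 cycles per byte caps throughput at roughly 3×10^9 / 5 = 600 MB/s, no matter how wide the core's issue width is.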
In full detail, the code of the improved validation loop, which decodes two bytes per lookup and joins partial results in a tree, looks something like this:
const uint16_t* asShort = reinterpret_cast<const uint16_t*>(src);
const uint16_t* asShortEnd = &asShort[len / sizeof(uint16_t)];
uint8_t runCon = runState;
while (asShort != asShortEnd) {
    // Map each 16-bit word (two bytes) to the id of its state-transition
    // function; note this tableDecode is a different, 2^16-entry table from
    // the naive one. These four loads are independent and can issue in parallel.
    uint8_t decoded0 = tableDecode[*asShort++];
    uint8_t decoded1 = tableDecode[*asShort++];
    uint8_t decoded2 = tableDecode[*asShort++];
    uint8_t decoded3 = tableDecode[*asShort++];
    // Compose pairs of transitions, then the pair of pairs: a tree of joins
    // rather than a serial chain, to shorten the dependency path.
    uint8_t prejoin0 = tableJoin[(decoded0 << 6) | decoded1];
    uint8_t prejoin1 = tableJoin[(decoded2 << 6) | decoded3];
    uint8_t toCon = tableJoin[(prejoin0 << 6) | prejoin1];
    // Only this last step depends on the previous iteration's state.
    runCon = tableConcate[runCon | (toCon << 4)];
}
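The heavy lifting is in the tables themselves. One way to build them is sketched below, under stated assumptions: a transition function is represented as a map from the 16 decoder states to next states, each distinct function gets a small id (the loop above relies on these ids fitting in 6 bits), and joining is just function composition. nextState() is the same hypothetical single-byte helper as before; this is an illustration of the idea, not the author's actual build code:

#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Fn = std::array<uint8_t, 16>;  // next state for each of the 16 states

extern uint8_t nextState(uint8_t state, uint8_t byte);  // single-byte rule

std::map<Fn, uint8_t> fnIds;                 // transition function -> id
std::vector<Fn> fns;                         // id -> transition function
std::vector<uint8_t> tableDecode(1 << 16);   // 16-bit word -> transition id
std::vector<uint8_t> tableJoin(1 << 12);     // (idA << 6) | idB -> composed id
std::vector<uint8_t> tableConcate(1 << 10);  // state | (id << 4) -> next state

uint8_t idOf(const Fn& f) {
    auto [it, inserted] = fnIds.emplace(f, static_cast<uint8_t>(fns.size()));
    if (inserted) fns.push_back(f);
    return it->second;
}

void buildTables() {
    // Each 16-bit word acts on the whole state space as one function
    // (low byte applied first, assuming the little-endian read in the loop).
    for (unsigned w = 0; w < (1u << 16); ++w) {
        Fn f;
        for (uint8_t s = 0; s < 16; ++s)
            f[s] = nextState(nextState(s, w & 0xFF), uint8_t(w >> 8));
        tableDecode[w] = idOf(f);
    }
    // Compose ids: tableJoin[(a << 6) | b] applies a first, then b. For
    // brevity this assumes the id set stays within 64 ids and is closed
    // under composition; a robust build would iterate to a fixpoint.
    for (size_t a = 0; a < fns.size(); ++a)
        for (size_t b = 0; b < fns.size(); ++b) {
            Fn f;
            for (uint8_t s = 0; s < 16; ++s) f[s] = fns[b][fns[a][s]];
            tableJoin[(a << 6) | b] = idOf(f);
        }
    // Finally, apply a composed id to the running 4-bit state.
    for (size_t a = 0; a < fns.size(); ++a)
        for (uint8_t s = 0; s < 16; ++s)
            tableConcate[s | (a << 4)] = fns[a][s];
}

With this structure, only the final tableConcate lookup in each iteration depends on the previous iteration's state; the other seven lookups are independent. The loop therefore carries roughly one 5-cycle load per 8 bytes instead of one per byte, which is what buys back the latency lost in the naive version.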
The results largely correspond to expectations. The lookup-table-driven algorithm requires a hot cache to perform anywhere close to the SIMD-128 version, so its performance is only comparable for large strings or highly repeated samples. As it stands, the performance can be summarized as follows: in the limit of long strings and repeated usage, the lookup-table version is still at least 6% slower, and 12% slower on average, than SIMD-128-bit. While slightly slower on average, it is quite remarkable that an algorithm that works on effectively 16 bits at a time can be competitive with a 128-bit SIMD algorithm. Even more surprising is that it can, under certain conditions, exceed the speed of code that makes use of these powerful "vector" instructions. Here are some suggested reasons why the table-based approach achieves such performance:
- Superscalar machines perform SIMD-like operations automatically by issuing multiple instructions per cycle.
- Lookup tables perform computations in a manner that, in effect, creates new instructions custom-fit to the algorithm's specific purpose.
- Caching behavior causes lookup-table algorithms to speed up on specific data values (an effect working similarly to branch prediction).
Despite my having a less-than-two-year-old CPU (i7-6700HQ) with AVX2 support, the newer AVX version published by Lemire is actually always significantly slower (~30%) than his original SIMD-128-bit version on my machine. The nature of this unexpected pessimization is not clear. My full-table-lookup UTF-8 validity checking algorithm can be found here:
ScreenToGif was used to create the animations.