ARM and Intel have different performance characteristics: a case study in random number generation

By Daniel Lemire

In my previous post, I reviewed a new fast random number generator called wyhash. I commented that I expected it to do well on x64 processors (Intel and AMD), but not so well on ARM processors.

Let us review wyhash again:

uint64_t wyhash64_x;

uint64_t wyhash64() {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (tmp >> 64) ^ tmp;
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  uint64_t m2 = (tmp >> 64) ^ tmp;
  return m2;
}

(Source code)

It is only two multiplications (plus a few cheap operations like add and XOR), but these are full multiplications producing a 128-bit output.

Let us compare it with a similar but conventional generator (splitmix) developed by Steele et al. and part of the Java library:

uint64_t splitmix64_x;

uint64_t splitmix64(void) {
  splitmix64_x += 0x9E3779B97F4A7C15;
  uint64_t z = splitmix64_x;
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EB;
  return z ^ (z >> 31);
}

We again have two multiplications, but more of the cheap operations (three shifts and three XORs). So you would expect splitmix to be slower. And it is, on x64 processors.

Let me reuse my benchmark where I simply sum up 524288 random integers and record how long it takes…
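A minimal sketch of what such a benchmark could look like is given below. It is not the code behind the numbers in this post: the helper name `sum_random` is hypothetical, and it measures CPU time with the standard `clock()` function, which is coarse but portable.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

// splitmix64 as given in the post; the state starts at zero here.
uint64_t splitmix64_x = 0;

uint64_t splitmix64(void) {
    splitmix64_x += 0x9E3779B97F4A7C15;
    uint64_t z = splitmix64_x;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EB;
    return z ^ (z >> 31);
}

// Hypothetical helper (an assumption, not the original benchmark):
// sums `count` outputs of `rng` and stores the elapsed CPU time,
// in milliseconds, in *elapsed_ms.
uint64_t sum_random(uint64_t (*rng)(void), size_t count, double *elapsed_ms) {
    clock_t start = clock();
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += rng(); // wraps around modulo 2^64, which is fine here
    }
    clock_t end = clock();
    *elapsed_ms = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
    return sum;
}
```

Calling `sum_random(splitmix64, 524288, &ms)` and then using the returned sum (for example, printing it) keeps the compiler from discarding the loop. A serious benchmark would repeat the measurement several times and use a finer timer.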

            Skylake x64    Skylark ARM
wyhash      0.5 ms         8.8 ms
splitmix    2.6 ms         2.6 ms

According to my tests, on the x64 processor, wyhash is about five times faster than splitmix. When I switch to my ARM server, wyhash becomes more than three times slower than splitmix.

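The difference is that, on an ARM processor, computing the most significant 64 bits of the product of two 64-bit integers requires a separate and expensive instruction, whereas an x64 processor produces the full 128-bit product with a single multiplication instruction.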
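To make the point concrete, here is a small sketch that isolates the two halves of the full product. The function names are hypothetical, and `__uint128_t` is a GCC/Clang extension, as in the wyhash code. On 64-bit ARM, compilers typically emit `mul` for the low half and a separate `umulh` for the high half, whereas on x64 one `mul` yields both halves; it is worth checking the generated assembly on your own compiler.

```c
#include <stdint.h>

// High 64 bits of the 128-bit product a * b.
// Relies on the __uint128_t extension (GCC and Clang).
uint64_t mul_high(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

// Low 64 bits of the same product: an ordinary wrapping multiplication.
uint64_t mul_low(uint64_t a, uint64_t b) {
    return a * b;
}

// One wyhash-style mixing step: XOR the high and low halves together.
// This matches the expression (tmp >> 64) ^ tmp in wyhash64.
uint64_t mul_fold(uint64_t a, uint64_t b) {
    return mul_high(a, b) ^ mul_low(a, b);
}
```

Splitmix only ever needs the equivalent of `mul_low`, which is why it avoids the extra instruction on ARM.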

Note: I have about half a million integers, so if you double my numbers (in milliseconds), you will get a rough estimate of the number of nanoseconds per integer generated.
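The arithmetic behind this rule of thumb can be written out as a one-line conversion. The helper name `ns_per_value` is made up for illustration.

```c
#include <stddef.h>

// Converts a total running time in milliseconds into nanoseconds
// per generated value: 1 ms is 1e6 ns, divided by the value count.
double ns_per_value(double elapsed_ms, size_t count) {
    return elapsed_ms * 1e6 / (double)count;
}
```

With 524288 values, a total of 0.5 ms works out to a little under 1 ns per integer, which is close to what doubling the millisecond figure gives.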