In my previous post, I reviewed a new fast random number generator called wyhash. I commented that I expected it to do well on x64 processors (Intel and AMD), but not so well on ARM processors.

Let us review wyhash again:

    uint64_t wyhash64_x;

    uint64_t wyhash64() {
      wyhash64_x += 0x60bee2bee120fc15;
      __uint128_t tmp;
      tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
      uint64_t m1 = (tmp >> 64) ^ tmp;
      tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
      uint64_t m2 = (tmp >> 64) ^ tmp;
      return m2;
    }

It is only two multiplications (plus a few cheap operations like add and XOR), but these are full multiplications producing a 128-bit output.

Let us compare it with a similar but conventional generator (splitmix) developed by Steele et al. and part of the Java library:

    uint64_t splitmix64_x;

    uint64_t splitmix64(void) {
      splitmix64_x += 0x9E3779B97F4A7C15;
      uint64_t z = splitmix64_x;
      z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9;
      z = (z ^ (z >> 27)) * 0x94D049BB133111EB;
      return z ^ (z >> 31);
    }

We still have two multiplications, this time ordinary ones that keep only the low 64 bits, but there are more shifts and XORs. So you would expect splitmix to be slower. And it is, on x64 processors.

Let me reuse my benchmark where I simply sum up 524288 random integers and record how long it takes…

| | Skylake x64 | Skylark ARM |
|---|---|---|
| wyhash | 0.5 ms | 8.8 ms |
| splitmix | 2.6 ms | 2.6 ms |
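The benchmark code itself is not reproduced here. As a rough illustration only, a measurement of this kind could look like the sketch below. The helper `time_sum_ms` is hypothetical, the use of `clock()` is my assumption rather than the timing method behind the numbers above, and splitmix64 is repeated only so that the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

// splitmix64 state and generator, repeated from the post so that this
// sketch is self-contained.
uint64_t splitmix64_x;

uint64_t splitmix64(void) {
  splitmix64_x += 0x9E3779B97F4A7C15;
  uint64_t z = splitmix64_x;
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EB;
  return z ^ (z >> 31);
}

// Hypothetical helper (not from the original benchmark): sums `count`
// outputs of `rng`, stores the sum in *sum_out so that the compiler cannot
// discard the loop, and returns the elapsed CPU time in milliseconds as
// reported by clock().
double time_sum_ms(uint64_t (*rng)(void), size_t count, uint64_t *sum_out) {
  clock_t start = clock();
  uint64_t sum = 0;
  for (size_t i = 0; i < count; i++) {
    sum += rng();
  }
  clock_t end = clock();
  *sum_out = sum;
  return (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

A call such as `time_sum_ms(splitmix64, 524288, &sum)` would time one pass. Calling the generator through a function pointer may prevent inlining and inflate the timings, so a careful benchmark would likely call each generator directly and repeat the measurement several times.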

According to my tests, on the x64 processor, wyhash is about five times faster than splitmix. When I switch to my ARM server, wyhash becomes more than three times slower than splitmix (8.8 ms versus 2.6 ms).

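The difference is that computing the most significant 64 bits of the product of two 64-bit integers requires a separate and expensive instruction on an ARM processor, whereas an x64 processor delivers the full 128-bit product with a single multiplication instruction.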
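To make this concrete, here is a small sketch that separates the two halves of the full product. The function names are hypothetical, and `__uint128_t` is the same GCC/Clang extension used in the wyhash code. On 64-bit ARM, compilers typically emit a `mul` for the low half and a separate `umulh` for the high half, while on x64 a single `mul` produces both halves at once.

```c
#include <stdint.h>

// Low 64 bits of a * b: an ordinary 64-bit multiplication.
uint64_t mul_lo(uint64_t a, uint64_t b) {
  return a * b;
}

// High 64 bits of the full 128-bit product a * b.
uint64_t mul_hi(uint64_t a, uint64_t b) {
  return (uint64_t)(((__uint128_t)a * b) >> 64);
}

// One wyhash-style mixing step: XOR the high half into the low half,
// which is what (tmp >> 64) ^ tmp computes once truncated to 64 bits.
uint64_t mul_fold(uint64_t a, uint64_t b) {
  return mul_hi(a, b) ^ mul_lo(a, b);
}
```

With this split, each wyhash call needs two values of `mul_fold`, so it pays for the high-half computation twice, whereas splitmix only ever needs the low half.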

Note: I have about half a million integers, so if you double my numbers (in milliseconds), you will get a rough estimate of the number of nanoseconds per integer generated.
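The arithmetic behind this rule of thumb is simple: one millisecond is a million nanoseconds, and a million divided by 524288 is about 1.9. A minimal sketch follows, with a hypothetical helper name of my choosing:

```c
#include <stddef.h>

// Hypothetical helper: converts a total running time in milliseconds into
// nanoseconds per generated integer (1 ms = 1e6 ns).
double ns_per_integer(double total_ms, size_t count) {
  return total_ms * 1e6 / (double)count;
}
```

For example, `ns_per_integer(0.5, 524288)` is about 0.95, close to the estimate of 1 obtained by doubling 0.5.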