Intel’s latest processors come with powerful new instructions from the AVX-512 family. These instructions operate over 512-bit registers. They use more power than regular (64-bit) instructions. Thus, on some Intel processors, the processor core that is using AVX-512 might run at a lower frequency, to keep the processor from overheating.
Can we measure this effect?
In a recent post, I used a benchmark provided by Vlad Krasnov from Cloudflare, on a Xeon Gold 5120 processor. In the test provided by Krasnov, the use of AVX-512 actually made things faster.
So I went back to an earlier benchmark I designed myself. It is a CPU-intensive Mandelbrot computation, with a few bogus AVX-512 instructions thrown in (about 32,000). The idea is that if AVX-512 instructions cause frequency throttling, the whole computation will be slowed down. I use two types of AVX-512 instructions: light (additions) and heavy (multiplications).
I measured no AVX-512 throttling on the Skylake X server I own… but what about the Xeon Gold 5120 processor?
I run the benchmark ten times and measure the wall-clock time using the Linux/bash time command. I sleep 2 seconds after each sequence of ten tests. A complete script is provided. In practice, I just run the benchmark.sh script after typing make and I record the user timings of each test.
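For readers who prefer not to parse the output of time by hand, the same procedure can be sketched in Python. This is not the script provided with the benchmark: the function name time_command and its parameters are made up for illustration, and it assumes a Unix-like system where os.times() reports the CPU time of child processes.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, runs=10, pause=2.0):
    """Run cmd `runs` times and return a list of (wall, user) seconds per run."""
    timings = []
    for _ in range(runs):
        before = os.times()
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        wall = time.perf_counter() - start
        after = os.times()
        # children_user accumulates the user CPU time of finished child processes,
        # so the difference is the user time of the run we just completed.
        timings.append((wall, after.children_user - before.children_user))
    # Sleep once after the whole sequence, as in the procedure described above.
    time.sleep(pause)
    return timings


if __name__ == "__main__":
    # Placeholder command: replace it with the path to the benchmark binary.
    for wall, user in time_command([sys.executable, "-c", "pass"], runs=3, pause=0.0):
        print(f"wall {wall:.3f} s, user {user:.3f} s")
```

Recording both the wall-clock and the user time makes it easy to check that the two agree, which they should for a single-threaded, CPU-bound program.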
Because there are run-to-run variations, I repeat the whole process several times. Here are my raw numbers:
| Trial | No AVX-512 | Light AVX-512 | Heavy AVX-512 |
|---|---|---|---|
| 1 | 13.23 s | 13.64 s | 13.51 s |
| 2 | 13.15 s | 13.49 s | 13.53 s |
| 3 | 13.10 s | 13.59 s | 13.51 s |
| 4 | 13.07 s | 13.54 s | 13.59 s |
| 5 | 13.08 s | 13.52 s | 13.51 s |
So the No-AVX-512 program runs in about 13.1 s. The AVX-512 ones run in about 13.5 s. Thus AVX-512 incurs a 3% penalty. I can’t measure a difference between light and heavy AVX-512 instructions.
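The 3% figure can be checked by averaging each column of the table. Here is a minimal Python sketch; the timings are copied from the five trials, and the variable names are only for illustration.

```python
from statistics import mean

# Timings in seconds, copied from the five trials in the table.
timings = {
    "No AVX-512": [13.23, 13.15, 13.10, 13.07, 13.08],
    "Light AVX-512": [13.64, 13.49, 13.59, 13.54, 13.52],
    "Heavy AVX-512": [13.51, 13.53, 13.51, 13.59, 13.51],
}

means = {name: mean(values) for name, values in timings.items()}
baseline = means["No AVX-512"]

# Relative slowdown of each variant compared with the no-AVX-512 baseline.
penalties = {name: m / baseline - 1.0 for name, m in means.items()}

for name in timings:
    print(f"{name}: mean {means[name]:.2f} s, penalty {penalties[name]:.1%}")
```

Both AVX-512 variants come out a little over 3% slower than the baseline, and the gap between light and heavy is smaller than the run-to-run variation within a column.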
It is the first time I have been able to measure a slowdown that can be attributed, presumably, to AVX-512 throttling. It is quite exciting.
Is that a lot? It is hard for me to get terribly depressed at the fact that a benchmark I specifically designed to make AVX-512 look bad sees a 3% performance degradation on one core. Real code is not going to use AVX-512 in such a manner: the AVX-512 instructions will do useful work. It is not super difficult to recoup a 3% difference.
Of course, maybe my adversarial benchmark is not sufficiently bad. That may well be: please provide me with your examples.