The dangers of AVX-512 throttling: a 3% impact

By Daniel Lemire

Intel’s latest processors come with powerful new instructions from the AVX-512 family. These instructions operate over 512-bit registers. They use more power than regular (64-bit) instructions. Thus, on some Intel processors, the processor core that is using AVX-512 might run at a lower frequency, to keep the processor from overheating.

Can we measure this effect?

In a recent post, I used a benchmark provided by Vlad Krasnov from Cloudflare, on a Xeon Gold 5120 processor. In the test provided by Krasnov, the use of AVX-512 actually made things faster.

So I just went back to an earlier benchmark I designed myself. It is a CPU-intensive Mandelbrot computation, with very few bogus AVX-512 instructions thrown in (about 32,000). The idea is that if AVX-512 causes frequency throttling, the whole computation will be slowed. I use two types of AVX-512 instructions: light (additions) and heavy (multiplications).

I measured no AVX-512 throttling on the Skylake X server I own… but what about the Xeon Gold 5120 processor?

I run the benchmark ten times and measure the wall-clock time using the Linux/bash time command. I sleep 2 seconds after each sequence of ten tests. A complete script is provided. In practice, I just run the benchmark.sh script after typing make and I record the user timings of each test.
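As a rough illustration, the measurement loop described above could be sketched in Python as follows. This is not the benchmark.sh script itself: the helper names are made up for the example, it reads the user CPU time from the operating system rather than parsing the output of time, and it relies on the Unix-only resource module.

```python
import resource
import subprocess
import time


def user_seconds(cmd):
    """Run cmd to completion and return the user CPU time it consumed, in seconds."""
    # RUSAGE_CHILDREN accumulates the usage of child processes that have been waited for.
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return after - before


def run_sequence(cmd, repeats=10, pause=2.0):
    """Time `repeats` back-to-back runs of cmd, then sleep for `pause` seconds."""
    timings = [user_seconds(cmd) for _ in range(repeats)]
    time.sleep(pause)
    return timings
```

Calling run_sequence with the command line of one benchmark variant (for example a hypothetical ["./mandelbrot_avx512"]) returns ten user timings and then pauses for two seconds before the next sequence starts.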

Because there are run-to-run variations, I repeat the whole process several times. Here are my raw numbers:

trial   No AVX-512   Light AVX-512   Heavy AVX-512
1       13.23 s      13.64 s         13.51 s
2       13.15 s      13.49 s         13.53 s
3       13.10 s      13.59 s         13.51 s
4       13.07 s      13.54 s         13.59 s
5       13.08 s      13.52 s         13.51 s

So the No-AVX-512 program runs in about 13.1 s. The AVX-512 ones run in about 13.5 s. Thus AVX-512 incurs a 3% penalty. I can’t measure a difference between light and heavy AVX-512 instructions.
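For readers who want to check the arithmetic, here is a small sketch that averages the five trials and expresses each AVX-512 variant as a percentage slowdown over the no-AVX-512 baseline. The numbers are copied from the table above; the variable and function names are only illustrative.

```python
from statistics import mean

# User timings in seconds, copied from the table above (trials 1 to 5).
timings = {
    "No AVX-512": [13.23, 13.15, 13.10, 13.07, 13.08],
    "Light AVX-512": [13.64, 13.49, 13.59, 13.54, 13.52],
    "Heavy AVX-512": [13.51, 13.53, 13.51, 13.59, 13.51],
}

baseline = mean(timings["No AVX-512"])


def penalty_percent(label):
    """Slowdown of the given variant relative to the no-AVX-512 mean, in percent."""
    return 100.0 * (mean(timings[label]) / baseline - 1.0)


for label, values in timings.items():
    print(f"{label}: mean {mean(values):.2f} s, penalty {penalty_percent(label):.1f}%")
```

The means come out at roughly 13.13 s, 13.56 s and 13.53 s, so the light and heavy variants are about 3.3% and 3.1% slower than the baseline. The gap between the two AVX-512 variants is small next to the spread between trials.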

It is the first time ever that I am able to measure a negative difference that can be attributed, presumably, to AVX-512 throttling. It is quite exciting.

Is that a lot? It is hard for me to get terribly depressed at the fact that a benchmark I specifically designed to make AVX-512 look bad sees a 3% performance degradation on one core. Real code is not going to use AVX-512 in such a manner: the AVX-512 instructions will do useful work. It is not super difficult to recoup a 3% difference.

Of course, maybe my adversarial benchmark is not sufficiently bad. That may well be: please provide me with your examples.