We are still limited by our current hardware. There are numerous areas where it is just not good enough: neural networks and virtual reality, to name a few. There are plenty of devices where battery life is crucial, and we must count every single CPU tick. Even when we’re talking about clouds, microservices, and lambdas, there are enormous data centers behind them consuming vast amounts of electricity.
Even a boring test routine may quietly start to take 5 hours to run. And this is tricky: program performance doesn’t matter, until suddenly it does.
A modern way to squeeze performance out of silicon is to make hardware more and more sophisticated.
We are heading into times when computers are tiny, cores are many, and memory is the bottleneck (praying for behavior to be defined).
billions of transistors
tries to predict the future
tracks data and instruction dependencies
executes commands in parallel and out of order
programmable (has its own microcode)
aware of memory and CPU virtualization techniques
has some high-level instructions (cryptographic hash functions)
there is even something like a neural network to predict branches¹
several critical vulnerabilities have been found so far
billions of capacitors
billions of capacitors, really — that’s it
DRAM (dynamic random-access memory) uses memory cells consisting of one capacitor and one transistor to store each bit. It is the cheapest and the highest in density, so it is used for the main memory in computers².
Memory loads and stores are often the bottleneck.
One memory reference costs around 100 nanoseconds³. It means that a 1 GHz CPU can spend 100 or more ticks waiting for a value from memory. So caching values in memory instead of recalculating them every time may be literally 100 times slower. Say what?!
Even though the CPU is smart enough to fetch data beforehand, in many cases programmers are still in charge of the memory layout and access patterns.
So we have billions of lines of code on the one hand and hardware architectural diversity on the other. Many architectures introduce opinionated methods to achieve maximal performance or minimal energy consumption. The way to bridge this gap is to delegate most of the work to an optimizing compiler.
Modern compilers can apply a lot of modifications to the code in order to improve performance. It is useful for a programmer to know what the compiler can and cannot do.
~ Agner Fog “Optimizing software in C++” ⁴
The source code we write is read by machines and by our fellow programmers. Let’s focus on code readability while compilers turn our code into chains of machine-optimal instructions.
Unfortunately, nothing is automagical. C and C++ optimizing compilers are far from ideal, so it is vital to understand what is under the compiler’s hood. After all, it is a tool we use on a daily basis.
I would like to make a quick recap here, although you absolutely have to watch the whole video.
I tried to repeat every example from Chandler’s presentation myself using clang -emit-llvm, and most code samples here were produced directly with LLVM. Some listings, though, were shamelessly copy-pasted from Chandler Carruth’s slides.
Understanding performance means understanding optimizers.
~ Chandler Carruth “Opening Keynote Meeting C++ 2015”
Even though the following listings were made with LLVM (clang), the same principles apply to GCC and, to some extent, to every modern optimizing compiler.
Compilers heavily rely on the static single assignment (SSA) form, which requires that each variable be assigned exactly once and that every variable be defined before it is used⁷.
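A minimal SSA-form listing in this spirit (a sketch; the function and value names are chosen to match the discussion below, not taken from the original article):

```llvm
define i32 @g(i32 %a) {
  %b = add i32 %a, 1
  ret i32 %b
}

define i32 @f(i32 %a) {
  %b = call i32 @g(i32 %a)
  %c = add i32 %b, %a
  ret i32 %c
}
```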
The code above should be very straightforward:
functions — @g and @f
types — i32
values (variables) — %a, %b, …
instructions — add, call, ret
The br instruction branches on a flag and jumps to the corresponding label. It drives the whole control flow.
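A sketch of such a branchy function (the names %c, %d and the labels are assumptions chosen to match the discussion below):

```llvm
define i32 @f(i32 %a) {
entry:
  %c = add i32 %a, 1
  %cond = icmp eq i32 %a, 0
  br i1 %cond, label %then, label %exit

then:
  %d = add i32 %c, %a
  br label %exit

exit:
  %result = phi i32 [ %c, %entry ], [ %d, %then ]
  ret i32 %result
}
```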
Again: each variable is assigned exactly once, and every variable is defined before it is used.
But how can we get a value that is defined only in a conditional branch, like %d in the above example?
The answer is the phi instruction, which merges branch values into one based on where you came from.
If we came from entry, then %result = %c; if we came from then, then %result = %d.
The LLVM IR is that simple. There are lots of other instructions, but nothing besides this.
Based on this representation only, you can already try to guess what’s happening behind the curtains.
A compiler’s frontend produces ad-hoc IR code as if there were an unlimited amount of memory; it only cares about producing patterns “familiar” to the optimizer.
Here is an example:
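The original example isn’t reproduced here; a hypothetical stand-in that matches the later discussion (an argument-count check plus an access to argv[1]) could be:

```c
#include <stdio.h>

/* hypothetical stand-in: in the original this logic lives in main() */
int run(int argc, char **argv) {
    if (argc != 2)
        return 1;       /* wrong usage */
    puts(argv[1]);      /* print the single argument */
    return 0;
}
```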
That is what we get if we run clang -cc1 -emit-llvm -O0 example.c -o example.ll (no optimization). The LLVM frontend places everything in memory, so there are chains of alloca, store, and load:
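At -O0, an argc-checking main compiles roughly as follows (a sketch in clang’s style; the exact temporaries and attributes will differ):

```llvm
declare i32 @puts(i8*)

define i32 @main(i32 %argc, i8** %argv) {
entry:
  %retval = alloca i32, align 4
  %argc.addr = alloca i32, align 4
  %argv.addr = alloca i8**, align 8
  store i32 0, i32* %retval, align 4
  store i32 %argc, i32* %argc.addr, align 4
  store i8** %argv, i8*** %argv.addr, align 8
  %0 = load i32, i32* %argc.addr, align 4
  %cmp = icmp ne i32 %0, 2
  br i1 %cmp, label %if.then, label %if.end

if.then:
  store i32 1, i32* %retval, align 4
  br label %return

if.end:
  %1 = load i8**, i8*** %argv.addr, align 8
  %arrayidx = getelementptr inbounds i8*, i8** %1, i64 1
  %2 = load i8*, i8** %arrayidx, align 8
  %call = call i32 @puts(i8* %2)
  store i32 0, i32* %retval, align 4
  br label %return

return:
  %3 = load i32, i32* %retval, align 4
  ret i32 %3
}
```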
At the cleanup stage, the compiler tries to replace memory-related instructions with SSA-form values. Later it will decide whether to allocate memory or use registers. Some variables (often the majority) will be eliminated altogether.
This is what happens when we enable optimization (-O1):
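A sketch of what clang -O1 produces for the argc-checking example (details will differ):

```llvm
declare i32 @puts(i8*)

define i32 @main(i32 %argc, i8** %argv) {
entry:
  %cmp = icmp eq i32 %argc, 2
  br i1 %cmp, label %if.end, label %return

if.end:
  %arrayidx = getelementptr inbounds i8*, i8** %argv, i64 1
  %0 = load i8*, i8** %arrayidx, align 8
  %call = call i32 @puts(i8* %0)
  br label %return

return:
  %retval = phi i32 [ 1, %entry ], [ 0, %if.end ]
  ret i32 %retval
}
```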
Of course, this code is not just cleaned, but also optimized. Now you may see how LLVM IR repeats the C code it was made from.
The icmp instruction compares integers, pointers, or vectors and returns a boolean. The first argument is a keyword that indicates the kind of comparison. Here icmp eq means compare for equality. Other examples are ult (unsigned less than), sge (signed greater or equal), etc.
The getelementptr instruction handles all pointer arithmetic in LLVM, no matter whether it is an array element or a struct field being addressed. The last argument of getelementptr is the element’s index.
There are many possible ways to write the same code using control flow statements:
An optimizer rewrites control flow to the canonical form, rather than trying to recognize every possible variant in a source code.
So every loop becomes a canonical loop, every if-then statement a canonical if-then, and so on.
Let’s get back to the first example:
Here is the corresponding LLVM IR listing, slightly modified for better readability.
Canonical if-then code:
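A sketch of the canonical shape (details elided):

```llvm
define i32 @main(i32 %argc, i8** %argv) {
entry:
  %cmp = icmp eq i32 %argc, 2
  br i1 %cmp, label %if.end, label %return

if.end:
  ; ... print the argument ...
  br label %return

return:
  %retval = phi i32 [ 1, %entry ], [ 0, %if.end ]
  ret i32 %retval
}
```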
First, there is the code that checks the number of arguments if (argc != 2).
Before, there was icmp ne (compare if not equal). Now we see icmp eq (compare if equal). This way the optimizer doesn’t need to handle the icmp ne instruction, even though it is supported by the LLVM IR.
Another difference is that the if.then label and its code path are gone. Now the return label handles the same logic with a phi node.
This way, the program’s control flow becomes visible to the optimizer, and it can make more accurate decisions.
I think that is what sets C++ aside from every other language I’ve ever worked with is the ability to build abstractions which collapse.
~ Chandler Carruth “Opening Keynote Meeting C++ 2015.”
From the optimizer’s point of view, there are three main abstractions:
1. Functions, calls, and the call graph
2. Memory, loads, and stores
3. Loops
A function is a fundamental human abstraction, but a processor doesn’t need it. Here the optimizer has a chance to reshape instructions in a more CPU-friendly manner and to remove lots of them for good.
LLVM tries to partition the call graph into clusters of functions to operate in a broader context. Leaves of this graph are natural candidates for inlining, as they don’t call any other functions. Inlining, and thus removing, leaves creates new leaves that can also be removed; then repeat. The decision doesn’t have to be binary: partial inlining and a bunch of crazy shuffling are possible.
To inline or not — this decision is the most critical for producing fast code.
Luckily, many functions were made to be just shortcuts to other functions. These should be easily removed.
But there may be hard cases where the inline-or-not decision depends on an argument value or a branch condition. Such cases are postponed until LLVM finds more information. For example, you may want to pick a sort algorithm based on the size of a vector; the optimizer will climb the SSA form tree, eager to figure out the actual size to make this decision. Unfortunately, cross-function optimizations may require enormous resources to get right, and there may be an infinite number of options to choose from. There are still cases that seem trivial at first but where optimizers produce not the prettiest code.
HHVM (HipHop Virtual Machine) is a PHP virtual machine with a JIT. It was created to replace HipHop (a PHP-to-C++ source transpiler) at Facebook.
When all of Facebook’s code first ran on HHVM, after 12 months of development, it was 7x slower⁸.
I suggest you look at the presentation slides, but one thing I would like to highlight:
When code makes unrelated code faster or slower, suspect caching.
~ Keith Adams “PHP on the Metal” slides
The HHVM team made an impressive investigation to find the culprit: an aggressively inlined memcpy. The code of the memcpy function was 11KB on their platform and caused cache thrashing: useful data was evicted from the CPU cache by this code all over the program.
Their solution was to implement a simpler version of memcpy without several code paths depending on the size of the copied data, the CPU model, etc. The simpler function’s code fit into two cache lines instead.
The new memcpy performed much worse in various memory micro-benchmarks, but it significantly improved overall application performance.
By the way, the inline keyword has nothing in common with the actual process of inlining.
As was said, a compiler’s frontend generates code as if memory were infinite.
For the following function:
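The function itself isn’t reproduced here; a tiny stand-in with a couple of locals shows the same effect:

```c
/* hypothetical stand-in: at -O0 every local below gets its own stack slot */
int compute(int a, int b) {
    int sum = a + b;
    int prod = a * b;
    return sum + prod;
}
```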
LLVM produces this unoptimized code:
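The unoptimized output for such a function looks roughly like this (a sketch; every local gets a stack slot):

```llvm
define i32 @compute(i32 %a, i32 %b) {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %sum = alloca i32, align 4
  %prod = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32, i32* %a.addr, align 4
  %1 = load i32, i32* %b.addr, align 4
  %add = add nsw i32 %0, %1
  store i32 %add, i32* %sum, align 4
  %2 = load i32, i32* %a.addr, align 4
  %3 = load i32, i32* %b.addr, align 4
  %mul = mul nsw i32 %2, %3
  store i32 %mul, i32* %prod, align 4
  %4 = load i32, i32* %sum, align 4
  %5 = load i32, i32* %prod, align 4
  %add1 = add nsw i32 %4, %5
  ret i32 %add1
}
```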
Every single variable is allocated, stored, and immediately retrieved from memory.
While optimized code looks very reasonable:
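And a sketch of the optimized version of the same function:

```llvm
define i32 @compute(i32 %a, i32 %b) {
entry:
  %add = add nsw i32 %a, %b
  %mul = mul nsw i32 %a, %b
  %add1 = add nsw i32 %add, %mul
  ret i32 %add1
}
```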
The SSA form shines here. The optimizer looks at memory loads and stores and tries to find out what the actual values are. In the above example, it does so easily.
Now, let’s try a slightly more complex example:
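The listing isn’t reproduced here; a sketch of this kind of code (the member and method names are assumptions):

```cpp
// a sketch: a struct with two i64-sized fields and a by-value argument
struct Point {
    long x, y;

    // by-value argument: conceptually (Point* this, Point arg)
    Point plus(Point arg) {
        Point result;
        result.x = x + arg.x;
        result.y = y + arg.y;
        return result;
    }
};
```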
Note that the function’s signature was changed. The new signature allows a compiler to put arguments in registers.
Before: (Point * this, Point arg)
After: (Point * this, i64 arg.x, i64 arg.y)
Data flow becomes more complex in this example. We access several memory locations, we store intermediate results on the stack, and we are dealing with the implicit this pointer. The way LLVM tries to reason about what’s happening is to partition memory: it attempts to reconstruct individual neighboring loads and stores back into structs and arrays.
Let’s see how the optimizer handles this.
LLVM IR with -O1:
The insertvalue instruction above inserts a value into a member field of an aggregate (struct or array) value. It works by element index, similar to getelementptr.
All right. Now we see no alloca; the intermediate struct is gone, and values can be stored in CPU registers.
In the opposite case, RAM access would be literally 100 times slower.
The fastest code is the one that doesn’t run.
A loop is where a program spends most of its time, so loops are natural main targets for an optimizer. This is where a lot of research is going on in the compiler world.
Let’s take a look at the following loop:
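The loop isn’t shown here; a sketch of its assumed shape:

```cpp
#include <cstddef>
#include <vector>

// a sketch of the sum loop under discussion (assumed shape)
int sum(const std::vector<int> &v) {
    int x = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        x += v[i];
    return x;
}
```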
In total, there are about 160 lines of unoptimized LLVM IR code (keep in mind the tons of memory allocations, the handful of STL function calls, and the C++ runtime code). There is a lot going on in this small loop.
Here is a small fragment of unoptimized IR:
LLVM turns these 160 lines into nearly 170 lines of entirely different IR code. We will get to it, step by step.
Assume all the abovementioned optimization techniques have run, collapsing the memory and function abstractions. At this point, the IR code of the sum function may look like this:
Again, there’s a canonical form. It is crucial for the optimizer to distinguish a loop from any other control flow statement.
Now we have loop.ph — the so-called loop pre-header — and loop.exit nodes. Their purpose is to be the one and only loop entry and exit. The loop label has loop.ph and itself as predecessors, and loop.exit has the loop as its only predecessor. The function’s exit label is different, as we may get there skipping the whole loop.
loop.exit has this unusual phi node with only one argument (line 23):
%x.lcssa = phi i32 [ %x.next, %loop ]
This is the SSA form’s way of saying that x.next is used outside of the loop, so the value from within the loop can be taken there.
This is the canonical loop form. After recognizing it, several optimization techniques are applied.
First of all, the optimizer tries to decide whether there needs to be a loop at all. For instance, if the optimizer can figure out the size of an array, loop unrolling will take place. This is something every C++ compiler is actually good at, as the SSA form naturally facilitates constant propagation.
The second class of optimizations moves redundant operations out of the loop, and the SSA form makes this easy. There is only one entry point, loop.ph, and loop.exit is the only exit. The optimizer keeps track of every phi node, hence it knows what is modified inside and outside the loop. There is no need for a programmer to cache vector.size() or similar calls in a local variable — this is no Python. The optimizer will move all unnecessary code out of the loop’s body.
Another technique is related to branch conditions inside a loop. Generally, the optimizer tends not only to move a condition outside of the loop but to create separate specialized loops for each case.
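For illustration, here is a sketch of a loop-unswitching candidate (the names are assumptions):

```cpp
#include <cstddef>

// `scale` never changes inside the loop, so the optimizer can emit two
// specialized loops, one per branch, and pick between them once up front
void apply(long *data, std::size_t n, bool scale) {
    for (std::size_t i = 0; i < n; ++i) {
        if (scale)
            data[i] *= 2;
        else
            data[i] += 1;
    }
}
```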
After the abovementioned techniques have run, the loop code becomes cleaner and smaller, and that’s when it gets optimized for maximum CPU utilization.
Modern processors have highly efficient single instruction, multiple data (SIMD) pipelines. Hence, the optimizer performs loop vectorization, a particular application of SIMD. In our example with the sum function, this simply means: execute the same add instruction on several elements per iteration.
The optimized code may look somewhat scary. Before you try to get through it, here’s a short intro.
There are three different loops in this code:
loop.vec32 is the main vectorized loop. Inside this loop we see add nsw <4 x i32> — add two vectors of 4 32-bit integers each (result is also a <4 x i32> vector). Moreover, this loop is unrolled, so it munches 32 x 32-bit integers per iteration. Of course, an array must have 32 or more properly aligned elements.
loop.vec8 is a smaller vectorized loop that works with two <4 x i32> vectors — 8 x 32-bit integers.
loop.scalar is our original loop. All it does is sum 32-bit integers one by one with add nsw i32. This loop is the fallback when the array is smaller than 8 elements.
It is hard to follow the array size arithmetic: LLVM replaced costly division instructions, like div or mod, with lshr/and/or tricks, which work because the right operand is a power of two. Apart from the size tricks, the code should be pretty straightforward.
An array of length 129 will hit loop.vec32, then loop.scalar, missing the loop.vec8 branch, while one with 130 elements will hit all of the branches. A small array of 7 or fewer elements will only go through loop.scalar.
A compiler will only optimize for the average case, under the “do no harm” slogan. You should know the specific code paths and the high-level data flow that need to be tuned.
In many cases, it is not that hard to eliminate a bottleneck; the real challenge is to find its cause. Non-sampling and built-in profilers may interfere with the original code and mask performance problems. A sampling profiler loses important information and can skew results due to precision errors.
The more you know, the better. Understanding your tools and environment is essential. How does a profiler work? Which compiler flags will affect an investigation, and what exactly do they do? How to reduce hardware and OS noise? What’s happening inside that long syscall? Are these instructions efficient on this CPU model? Am I measuring the performance of an empty loop? Chandler Carruth has another wonderful keynote on benchmarking and using a profiler⁹ (I can’t help recommending this guy; it’s his bread and butter to make our C and C++ code run faster).
Understanding assembly code doesn’t hurt. Well, it hurts less over time. Agner Fog’s site¹⁰ is a great starting point for learning hardware and low-level optimization techniques.
Medians, averages, and 95th percentiles often mean little to nothing; on the contrary, they can wrap all the useful signal in noise.¹¹ Interpreting benchmark results may be tricky. Learn you some statistics and theory of experiment for great good.
Unless there are very CPU-intensive calculations going on, memory loads and stores become the bottleneck.
The problem is complex, and there are many aspects to look at.
Modern CPUs tend to hide from us the fact that memory is slow. Understanding how the hardware works is crucial to getting the maximum out of it.
“What every programmer should know about memory” by Ulrich Drepper¹² is a must-read if you care about performance.
Sequential access is so crucial that sometimes naive sort algorithms may regularly outperform the most intricately optimized ones on particular datasets. Matrix multiplication may perform several times better depending on the row/column access order.
The CryptoNote cryptocurrency algorithm tries to achieve protection from hardware miners (ASICs) by using a large chunk of memory that must be accessed randomly. Random memory access hurts parallel execution in many ways.
Some processors made us believe that there’s such a thing as unaligned access. Sometimes we are even lucky enough not to pay a performance penalty (as long as we don’t cross CPU cache line boundaries). But even though simple reads and writes may go smoothly on x86-based CPUs, atomics and SIMD instructions require strict data alignment, and unaligned memory access will trigger a hardware fault on most ARM processors.
Serialization/deserialization code is a common source of misaligned data.
Again, there are many aspects to look at to get the optimal memory layout.
The simple rule of thumb (from the performance point of view): what is often accessed together should be kept together.
Let’s consider the following example. Imagine we want to look up some elements based on their bit mask. Something like this:
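The listing isn’t reproduced here; a sketch with assumed field names:

```cpp
#include <cstdint>
#include <vector>

// a sketch of the lookup (field names and sizes are assumptions)
struct Foo {
    int mask;
    std::int64_t value1, value2, value3, value4;  // ~40 bytes per element
};

std::vector<const Foo *> lookup(const std::vector<Foo> &foos, int needle) {
    std::vector<const Foo *> result;
    for (std::size_t i = 0; i < foos.size(); ++i)
        if (foos[i].mask & needle)        // drags the whole Foo through caches
            result.push_back(&foos[i]);
    return result;
}
```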
Here we pump every element (32+ bytes) through the CPU caches, evicting useful data, just to throw it away on the next iteration. But we can utilize the cache much better by shuffling memory regions around.
Let’s remove int mask from the Foo struct, and move all masks to a separate array:
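A sketch of the rearranged layout (the names are assumptions; foomasks matches the expression used below):

```cpp
#include <cstdint>
#include <vector>

// masks moved into their own dense array
struct Foo {
    std::int64_t value1, value2, value3, value4;
};

struct FooStore {
    std::vector<int> foomasks;  // hot: scanned on every lookup
    std::vector<Foo> foos;      // cold: touched only on a hit
};

std::vector<const Foo *> lookup(const FooStore &s, int needle) {
    std::vector<const Foo *> result;
    for (std::size_t i = 0; i < s.foomasks.size(); ++i)
        if (s.foomasks[i] & needle)  // dense scan, friendly to vectorization
            result.push_back(&s.foos[i]);
    return result;
}
```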
This way, CPU caches are not wasted on the (in our case) useless value1, value2, …, and a compiler will turn the loop body foomasks[i] & needle into vectorized code that munches several masks at once. Such an optimization may show substantial improvement (provided the item count is high and the mask selectivity is low).
But now we have two arrays to nurse and carry around. Code readability definitely suffers, and a handful of nasty bugs become possible.
If you are going to write a big chunk of memory for later use, you may want to bypass the CPU cache while writing. In that case, the processor issues only write instructions without prior memory reads, and, more importantly, the caches will not be polluted with data you don’t need right now.
The _mm_stream_siXXX intrinsics family produces stores using a non-temporal memory hint.
Processors prefetch on their own most of the time, but we have the upper hand here by knowing our specific use cases and needs. Just keep in mind that it takes time to get the data, so there is no use prefetching it right before you need it. We have to prefetch while there is still some other work to finish.
One way to prefetch data is to use a compiler built-in, like __builtin_prefetch on GCC or _mm_prefetch from the xmmintrin.h header.
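A sketch of explicit prefetching during a pointer-chasing traversal (an assumed example, not from the original article; the useful prefetch distance must be tuned per workload):

```cpp
// walking a linked list with an explicit prefetch a couple of nodes ahead
struct Node {
    long value;
    Node *next;
};

long sum_list(const Node *head) {
    long total = 0;
    for (const Node *n = head; n != nullptr; n = n->next) {
        if (n->next)
            __builtin_prefetch(n->next->next);  // start fetching a future node
        total += n->value;
    }
    return total;
}
```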
Another way would be to start reading the desired data from another thread.
Of course, it’s all easier said than done. Prefetching data upfront may cause cache thrashing and lead to the very opposite of the expected results.
Memory access and branch prediction topics are highly correlated. One thing that may come in handy is to give the compiler a hint about a particular branch’s probability. This can be done either with the famous likely/unlikely macros (GCC’s __builtin_expect) or, better, with profile-guided optimization.
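A sketch of the classic macros and a usage example (the macro names are conventional, not from the original article):

```c
/* the classic macros on top of GCC/clang's __builtin_expect */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse_digit(int c) {
    if (unlikely(c < '0' || c > '9'))
        return -1;        /* cold path: kept out of the hot code layout */
    return c - '0';       /* hot path */
}
```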
You may run a program under your typical workload and create a profile of that execution. This profile can then be used as additional input to the optimizer. The data may include taken-branch counts, frequencies of function calls, and other information that allows the compiler to make more accurate decisions for the provided workload.
From my experience, however, it doesn’t give a huge performance improvement. Such optimizations should be considered a last resort for a small overall boost.
Often there is no single bottleneck, and it is unclear where even to start. In such cases, hardware CPU performance counters provide priceless information: the number of mispredicted branches, the number of cache misses for code and data, memory writes, etc. So instead of reducing some unknown function’s latency, you may look into improving these numbers.
Sometimes manual optimizations, like loop unrolling or replacing branches with pointer arithmetic, make a significant local improvement but may surprisingly hurt overall performance.
There’s always a tradeoff between code readability and performance. Care to make your code more human-readable. Don’t try to impress the compiler and your fellow humans with “100 Amazing Bit Twiddling Hacks.” All the sophisticated tricks you know will be outdated at some point, hardware behavior will change, and the codebase will live its own life.
Let’s take our sum function from the loops example and call it with a constant value:
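A sketch of such a call (the sum function’s shape is an assumption; the constant contents sum to 10, matching the result discussed below):

```cpp
#include <vector>

// the sum function from the loops example (assumed shape), fed a constant
int sum(const std::vector<int> &v) {
    int x = 0;
    for (int e : v)
        x += e;
    return x;
}

int constant_sum() {
    return sum({1, 2, 3, 4});  // clang at -O2 can fold the whole call
}
```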
LLVM IR (-O2):
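A sketch of what remains (the mangled name is an assumption):

```llvm
define i32 @_Z12constant_sumv() {
entry:
  ret i32 10
}
```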
The whole program was executed at compile time. All that std::__XZZF_you_re_sick_of_this_cr*p_::__iterator__Stable_v1_XYZ_operator is gone. The vector’s heap allocations through std::allocator calls are gone; return 10 is all that’s left.
I don’t think this example is entirely fair (the SSA form makes constant propagation a breeze), but it is very impressive nonetheless.
This code may give the optimizer a hard time. Now the decision “whether to inline foo” depends on a memory access (a data dependency) and on the bar function (a code dependency).
Unfortunately, for the optimizer there is no such thing as a const pointer.
Imagine situations like this:
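For illustration, a sketch of why a pointer to const guarantees nothing (the names are assumptions):

```cpp
// `p` is a pointer to const, yet *p may still change through the alias `q`;
// const promises nothing to the optimizer here
int observe(const int *p, int *q) {
    int before = *p;
    *q = before + 1;     // if q aliases p, *p just changed
    return *p - before;  // cannot be folded to 0 without alias analysis
}
```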
…or a pointer-to-pointer cast, or array tricks, or integer overflows, or buffer overflows. Not to mention the dull const_cast, there are endless ways to alias the same memory region in C and C++, either intentionally or accidentally.
The second thought is parallelism. If there is no shared data, there is nothing to lock. Modern multi-core CPUs are capable of highly parallel execution, and there is no need to spend precious cycles on locks (we prefer to mine cryptocurrencies instead).
In many cases, returning arguments instead of passing them by reference enables a compiler to see a data flow and to produce more efficient code.
This is another way to make things more complicated for an optimizer:
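A sketch contrasting the two styles (the names are assumptions):

```cpp
#include <string>

// out-parameter style: the optimizer must reason about aliasing and the
// prior contents of `out`, and so must the reader
void make_greeting(const std::string &name, std::string *out) {
    *out = "Hello, " + name + "!";
}

// return-by-value style: the data flow is explicit; move semantics/RVO apply
std::string make_greeting(const std::string &name) {
    return "Hello, " + name + "!";
}
```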
Now the optimizer has to think about this pointer and reason about possible side effects. Humans also struggle to read such code.
Pure functions are much easier to read, for both humans and compilers.
C++11 and newer standards provide move semantics, and since C++17 return value optimization (RVO) is guaranteed for prvalues.
Both C and C++ have too many cases where the behavior of our programs relies on compiler implementation details. Modern standards try to remove these dark corners and minimize the possible damage from UB.
Also, newer compilers ship with supplementary tools: linters, checkers, formatters, etc.
Try to spot an error in this manually unrolled loop:
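The original listing is not reproduced here; a reconstruction in the same spirit (line numbers kept because the text below refers to them):

```c
 1  /* reconstruction: a manually unrolled loop with a copy-paste error */
 2  void sums(const int *a, const int *b, int n,
 3            int *sum_a, int *sum_b)
 4  {
 5      *sum_a = 0;
 6      *sum_b = 0;
 7      for (int i = 0; i + 4 <= n; i += 4) {
 8          *sum_b += b[i];
 9          *sum_a += a[i];
10          *sum_b += b[i + 1];
11          *sum_a += a[i + 1];
12          *sum_b += b[i + 2];
13          *sum_a += a[i + 2];
14          *sum_b += a[i + 3];
15          *sum_a += a[i + 3];
16      }
17  }
```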
The above example is based on a great article by the PVS-Studio static analyzer team.¹³
Line 14: there is an a left over from copy-paste where a b should be.
Again, loop unrolling is something C and C++ compilers do well without any help from a programmer.
Of course, this is true only for the happy path. When an exception is thrown, the CPU has to execute frame finalizers and jump around the stack, causing a bunch of cache misses, TLB shufflings, etc.
There may be other reasons not to use SEH in C++, but performance is not among them anymore.
You wouldn’t simulate pattern matching with exceptions anyway, would you?
It is not a performance tip, but since you are here:
The most important thing I have done as a programmer in recent years is to aggressively pursue static code analysis.
~ John Carmack from “In-Depth: Static Code Analysis” on Gamasutra¹⁴
Modern tools are unbelievably useful. In my experience, even a false-positive alarm often points at unnecessarily complicated code. Nowadays, a compiler itself is a great analyzer. All you need is to turn the warnings on (-Wall and friends) at the cost of making your code a little bit cleaner. Win-win.
Optimizing compilers are so difficult to get right that we dare say that no optimizing compiler is completely error-free!
Our whole industry is not perfect. Heck, we’re all terrible.
Obligatory xkcd reference:
An optimizing compiler may surely have bugs; it may, for instance, throw out security-related code because it sees no explicit use for it. A compiler may even introduce a new vulnerability itself.
Someday a powerful AI will come to clean up our bugs; until then, we should keep our code stupidly simple to minimize the chances of encountering compiler bugs.
Software systems complexity keeps growing
Hardware becomes more complex and diversified
CPUs are fast and sophisticated; DRAM is slow and cheap
We have to delegate most low-level CPU optimizations to a compiler and focus on a program’s data flow and on producing human-readable code
Understanding performance also means understanding optimizing compilers
Modern optimizing compilers are very capable beasts, but still far from ideal
Optimizing memory layout and access patterns pays handsomely
Local optimizations may show amazing results in micro-benchmarks while crippling overall application performance
There’s some Makefile Voodoo on GitHub. If you ever wanted to translate a piece of C/C++ into LLVM IR and assembler listings, just put a ‘.c’ or a ‘.cpp’ file in that directory and run make. It will produce ‘.ll’ and ‘.asm’ listings for each optimization level: -O0..-O3 and -Os.
On Mac with Xcode and command-line tools installed it should work instantly. Probably.