We’re coming up on the tenth anniversary of LMDB, and I’ve been thinking
back to the bad old days of cache tuning that we struggled with, before
LMDB’s advent. As noted in the LMDB design doc, tuning caches used to
be a major pain point in administering OpenLDAP. It was a constant juggling
act, with at least three different layers of caching in effect, each with
different space and time characteristics. While we’ve
been free of this burden for nearly 10 years now, we gained a lot of hard-won
knowledge about caching in the years before that. Most of that knowledge
is no longer necessary for operating OpenLDAP, but there’s still a lot
of software in use today that tries to manage its own caches, and the
lessons we learned are still valuable in those other contexts.
Who/What is LRU?
The slapd-ldbm, slapd-bdb, and slapd-hdb backends all used their own
application-level entry caches to optimize read performance. These caches
were essential because the underlying DB libraries (gdbm, bdb, ndbm, etc.)
performed quite slowly on their own. The same cache management policy was
used on each, LRU: Least Recently Used. This is one of the simplest to
understand and simplest to implement, and it’s widely documented.
The typical implementation in C uses a doubly-linked list. Cached
elements are added at one end of the list when they’re used, and dropped
from the other end of the list when the cache size limit is reached. Any
cached elements that are referenced again are pulled out of their spot
in the middle of the list and added back onto the end.
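A minimal sketch of that doubly-linked-list scheme, with hypothetical names (`lru_list`, `lru_touch`, and so on) rather than the actual slapd code, and with all locking omitted so the pointer surgery is visible:

```c
#include <stddef.h>

/* Illustrative strict-LRU list. In a multi-threaded server, every
 * pointer update below would need mutex protection. */
typedef struct lru_node {
    struct lru_node *prev, *next;
    int key;                      /* stand-in for the cached entry */
} lru_node;

typedef struct {
    lru_node *head, *tail;        /* head = most recent, tail = next victim */
    size_t count, limit;
} lru_list;

/* Unlink a node from wherever it sits in the list. */
void lru_unlink(lru_list *l, lru_node *n) {
    if (n->prev) n->prev->next = n->next; else l->head = n->next;
    if (n->next) n->next->prev = n->prev; else l->tail = n->prev;
    n->prev = n->next = NULL;
    l->count--;
}

/* Push a node onto the most-recently-used end. */
void lru_push(lru_list *l, lru_node *n) {
    n->prev = NULL;
    n->next = l->head;
    if (l->head) l->head->prev = n; else l->tail = n;
    l->head = n;
    l->count++;
}

/* On every cache hit the node moves to the MRU end: a write to the
 * list even though the caller only performed a read. */
void lru_touch(lru_list *l, lru_node *n) {
    lru_unlink(l, n);
    lru_push(l, n);
}

/* On insert, evict from the LRU end once the limit is reached.
 * Returns the evicted node (or NULL) for the caller to recycle. */
lru_node *lru_insert(lru_list *l, lru_node *n) {
    lru_node *victim = NULL;
    if (l->count >= l->limit) {
        victim = l->tail;
        lru_unlink(l, victim);
    }
    lru_push(l, n);
    return victim;
}
```

Note how `lru_touch` does two full rounds of pointer updates on every hit; that write traffic is exactly what makes the structure a concurrency bottleneck, as described next.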
It turns out that this data structure performs very poorly with higher
levels of concurrency; the head and tail pointers of the list must be
mutex protected, as well as the pointers within each cache element. And
every read operation in the server actually expands to a write operation
in the cache, pulling an entry out of its old position and tacking it on
to its new position.
One of the major improvements we made to slapd-bdb/hdb caching was to
replace this LRU mechanism with CLOCK in 2006.
Logically it is the same strategy as LRU, but the implementation uses
a circularly linked list. Instead of popping entries out of their position
and tacking them onto the end, there is a clock-hand pointer that points
to the current head/tail, and a Referenced flag on each entry. When an
entry is referenced, its flag is simply set; when room is needed, the hand
advances around the circle, clearing flags as it goes, and evicts the first
entry whose flag is already clear. This approach drastically reduces the
locking overhead of reading and updating the cache, removing a major
bottleneck in multi-threaded operation.
Important Lesson #1: If you’re using plain LRU in your code, stop.
Use CLOCK instead.
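A single-threaded sketch of CLOCK, again with illustrative names rather than OpenLDAP's actual code, using a fixed array as the circular list (a real implementation would locate entries via a hash table rather than the linear scan used here):

```c
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 4

typedef struct {
    int  key;        /* stand-in for the cached entry; -1 = empty */
    bool referenced; /* set on every hit, cleared as the hand sweeps */
} clock_slot;

typedef struct {
    clock_slot slot[NSLOTS];
    size_t hand;     /* next candidate for eviction */
} clock_cache;

void clock_init(clock_cache *c) {
    for (size_t i = 0; i < NSLOTS; i++) {
        c->slot[i].key = -1;
        c->slot[i].referenced = false;
    }
    c->hand = 0;
}

/* Returns true on a hit. A hit only sets a per-entry flag -- no list
 * surgery. On a miss, the hand sweeps past referenced slots, clearing
 * their flags ("second chance"), and reuses the first unreferenced one. */
bool clock_access(clock_cache *c, int key) {
    for (size_t i = 0; i < NSLOTS; i++) {
        if (c->slot[i].key == key) {
            c->slot[i].referenced = true;   /* the whole cost of a hit */
            return true;
        }
    }
    while (c->slot[c->hand].referenced) {   /* second-chance sweep */
        c->slot[c->hand].referenced = false;
        c->hand = (c->hand + 1) % NSLOTS;
    }
    c->slot[c->hand].key = key;             /* evict and replace */
    c->slot[c->hand].referenced = true;
    c->hand = (c->hand + 1) % NSLOTS;
    return false;
}
```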
Cold Hard Cache
Aside from classical LRU being terrible for multi-threaded performance,
it has some pathological cases where it cannot offer any performance
benefit at all. In particular, if the cache has a maximum size of N
entries, and a sequence of operations arrive that touch M > N entries,
the cache effectiveness goes to zero.
For example, given a cache with max size 4, and a request that accesses 5
entries A, B, C, D, and E, the cache will grow as:
A
A B
A B C
A B C D
B C D E
At every step an entry is added to the cache. Since there are no repeats
in the sequence, none of the cached entries are reused while executing
the sequence. At the last step, because the cache is full, entry A is
evicted to make room for entry E to be added.
If the operation accessing this sequence of 5 entries is repeated, the cache
contents will become
C D E A
D E A B
E A B C
A B C D
B C D E
None of the cache’s contents will ever be reused to service the request,
because the next entry being requested is always the one that was just
evicted a moment before. In these situations the cache is just a waste
of time and memory.
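To make the waste concrete, here is a small simulation (hypothetical code, not from slapd) of a 4-slot strict-LRU cache servicing repeated scans of 5 entries. The hit count stays at zero no matter how many times the scan repeats:

```c
#include <string.h>

#define CAP 4

/* slots[0] is LRU, slots[n-1] is MRU; returns 1 on hit, 0 on miss. */
int lru_access(int *slots, int *n, int key) {
    for (int i = 0; i < *n; i++) {
        if (slots[i] == key) {               /* hit: move to MRU end */
            memmove(&slots[i], &slots[i + 1],
                    (size_t)(*n - i - 1) * sizeof *slots);
            slots[*n - 1] = key;
            return 1;
        }
    }
    if (*n == CAP) {                         /* miss: evict the LRU slot */
        memmove(&slots[0], &slots[1], (CAP - 1) * sizeof *slots);
        (*n)--;
    }
    slots[(*n)++] = key;
    return 0;
}

/* Repeatedly scan 5 entries (A..E as 0..4) and count cache hits. */
int lru_scan_hits(int rounds) {
    int slots[CAP], n = 0, hits = 0;
    for (int r = 0; r < rounds; r++)
        for (int key = 0; key < 5; key++)
            hits += lru_access(slots, &n, key);
    return hits;
}
```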
This is a major weakness in LRU and its related cache management schemes.
One of the solutions proposed to this problem is the Clock-Pro algorithm.
We experimented with this back in 2007 but with poor results.
One of the difficulties in working with these experimental algorithms is
that they tend to be developed in the context of managing virtual memory
pages in a demand-paged operating system kernel. A kernel has many low
level high performance facilities at its disposal that have no equivalent
in user-level application space, or whose usage is more expensive in
user-land. In the case of Clock-Pro, it may have worked in kernel
space but adapting it to user level was a very poor fit.
Eventually a simpler solution was arrived at: just stop adding entries
to the cache if the number of entries accessed in the request exceeds
the maximum size of the cache. This allowed the cache effectiveness to
approach 100% (every entry in the cache gets reused) and was a huge
performance boost for these pathological operations. To illustrate
with the same cache and operations as in the previous example, the
cache grows almost the same as before:
A
A B
A B C
A B C D
A B C D
But when we process entry E, we discard it instead of caching it. Then
when the sequence is repeated, entries A, B, C, and D are already in
the cache and are processed at basically zero cost. We simply pay the
cost of fetching entry E again. In this example, the repeated operation
runs 5x faster than the uncached operation.
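A sketch of this policy under the same assumptions (illustrative code, not the actual slapd-bdb logic): once a single request has touched more entries than the cache can hold, further entries from that request are served but not cached:

```c
#define CAP 4

/* Repeatedly scan 5 entries (A..E as 0..4) against a 4-slot cache,
 * skipping cache insertion once the request overflows the cache.
 * Returns the total hit count. */
int scan_hits(int rounds) {
    int slots[CAP], n = 0, hits = 0;
    for (int r = 0; r < rounds; r++) {
        int touched = 0;                 /* entries touched by this request */
        for (int key = 0; key < 5; key++) {
            int hit = 0;
            touched++;
            for (int i = 0; i < n; i++)
                if (slots[i] == key) { hit = 1; break; }
            if (hit) { hits++; continue; }
            if (touched > CAP) continue; /* request overflows cache: stop adding */
            if (n < CAP) slots[n++] = key;
        }
    }
    return hits;
}
```

After the first round populates A through D, every later round hits on 4 of its 5 accesses, matching the 5x speedup described above when cached accesses are essentially free.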
This simple optimization is one that works because we have higher level
knowledge about the request being performed. It’s an advantage that we
gave up, in adopting LMDB and leaving all cache management to the OS.
(Advantages like these are reasons why most DB engineers claim that DBs
can manage caches better than OSs can.) In practice, memory management
in systems like Linux has advanced beyond simple LRU, with features
adopted from Clock-Pro and other designs.
So at least on Linux, cache performance hasn’t been an issue with LMDB.
It’s noticeably worse on Windows, as is memory management in general
on Windows, but obviously Windows is not a high performance platform
to begin with.
Important Lesson #2: Exotic solutions aren’t always all they’re
cracked up to be; take a step back and you may see a simpler option.
Despite the weaknesses of LRU in some rare corner cases, it still
performs well in general. This is especially true for hierarchically
structured data, due to the nature of these structures. For example,
given a directory tree like this:
dc=com
  dc=example
    ou=people
      cn=Alice
      cn=Bob
      cn=Carol
A lookup for cn=Carol,ou=people,dc=example,dc=com will touch these
entries and cache them:
dc=com
dc=example
ou=people
cn=Carol
Assuming the cache size limit is 4, a subsequent lookup for
cn=Bob,ou=people,dc=example,dc=com will touch these entries and
cache them:

dc=com
dc=example
ou=people
cn=Bob
The LRU policy will naturally evict the cn=Carol entry, while preserving
the other entries. So with tree structured data you will always get the
maximum possible effectiveness from an LRU-based cache. This same principle
applies to the B+trees used internally in LMDB. Because tree traversals
always start at a tree root, the tree root will always be hot in the cache,
and frequently used branches will also stay hot in the cache, with only
the leaves being cold. So the natural access pattern of tree-oriented
data structures inherently maximizes the effectiveness of system caches,
without any additional effort required. This is one reason why OpenLDAP is
still the world’s fastest, most efficient distributed database today, while
requiring no cache tuning.
Important Lesson #3: Getting your data model right is more than
half the battle.
A lot of intensive work is still ongoing in designing optimal caching
algorithms. Even if you have a great design, realizing it in a good
implementation can still be a difficult challenge. (Looking at the
commit history of the old caching code in OpenLDAP gives a hint of just
how much effort it takes to get it right.) For locally stored data the best
cache management strategy these days is “let the OS do it for you.”