a distributed memory object caching system

By Dormando

Caching beyond RAM: Riding the cliff - Dormando (February 04, 2019)

Key value stores are no longer confined to caching small objects over a local network. memcached is today deployed in many different environments, including geographically diverse datacenters, data warehouses and machine learning training systems. These new layouts have items sized kilobytes to megabytes, giving value to stitching devices together or tiering devices of different speeds.

Our previous post introduced Extstore, an extension to memcached which allows moving cache data onto flash storage. Working with Intel and Packet, we tested extstore on high speed flash SSD’s and Optane NVMe devices. These tests left many interesting questions unanswered however:

  • how can we manage expectations of latency at various throughputs?
  • how does extstore perform as the number of devices increases?

In this post we will attempt to answer these questions, running tests under both higher scrutiny and worse conditions than before. memcached can process over 50 million keys per second on a 48 core machine using only RAM and heavy batching. While impressive, real world scenarios are filled with overhead: small individual requests, syscalls, context switching, and so on. To avert potential benchmarketing, we’ll attempt to replicate these nuances when testing across RAM, flash and Optane devices.

Multi-device Experiments

Tail Latency can cause a minority of slow backend requests to impact a majority of user traffic. While response time can be improved with caching systems, services with a high response cardinality can instead be further impacted. Low hit rates can mean the time taken to query a cache service is just a time tax on those missed requests.

Interesting research now exists for shared cache pools behind services with many backends. This covers some but not all use cases, as backends have increasingly large responses that can overwhelm RAM-backed caches. The geographic expansion of applications can also drive cost as every service has to be deployed in every location.

To attempt to solve these issues by expanding storage space code was tested in two configurations. For the benchmark we used “Just a Bunch of Devices” (JBOD). All devices create one large pool of pages which spreads read and writes evenly. JBOD support was released in memcached 1.5.10.

Another approach we did not benchmark was tiered storage. With extstore, stored data are organized into logical buckets. These buckets can be grouped onto specific devices, allowing users to deploy hardware with a mix of small/fast and large/cheap NVMe devices. Even networked block devices could be used.

The draft can be followed here. To be detailed in future posts.

Test Setup

Testing used tag 4 of mc-crusher, under the “test-optane” script.

The test hardware was a server with dual 16 core Intel Xeon (32 total cores), 192G RAM, 4TB flash SSD, and 3x 750G optane drives.

CPU and memory were bound to NUMA node 0, perfoming like a single socket server. This should be close to what users deploy with, and will allow us to focus our findings.

Other testing we’ve done focused on throughput with sampled request latency. Heavy batching was used over few TCP connections to allow fetch rates of millions of keys per second. This test focused on a worst case scenario in both single device and JBOD configurations. Instead of few connections blasting pipelined requests, we have hundreds of connections periodically sending sets of individual GET commands.

A RAM-only test exists as a baseline: showing our worst case configuration without touching extstore at all. It’s important to compare the percentiles and scatter plots of RAM vs the drives in question, to determine the impact of moving cache to disk.

  • Item values are all 512 bytes in size. With buffered IO performance was the same up to 4k, but smaller values were used to reduce start time for each sub-test.
  • Tests were scaled by a target request rate, increasing by 50k gets per second.
  • GET requests were pipelined together in stacks of 50 from the bench client to the server, which saves syscalls mainly for the benchmark client.
  • Every key fetch must be individually requested from extstore. If allowed to batch extstore is able to process many IO requests with fewer syscalls.
  • Memcached will make at least one network syscall for each GET response.

This lands us in a middle ground, with a bias toward the worst case. Since most users have some level of batching via multi-key fetches, this benchmark should be realistic.

We plot the latency percentiles as the target request rate increases. We also have a point cloud of every sample taken over time. This shows outliers and how consistent the performance is during each test.

It’s important to stress how using percentile averages to measure performance is problematic. We use them in the first graph to identify trends. We then complement with the latency summaries and a scatter point graph of the full duration of the test. Without this, it’s possible to have tests which look fine in the line graph, but in reality have severe outliers or drop all traffic for several seconds.

Latency Percentile: - Mouse over your production query rate to see a detailed breakdown

Results

  • Partway through the graph average latencies take a huge jump: This is the peak performance rate of that device configuration. For example, between 500k and 550k the SSD is no longer increasing throughput, and requests are queueing for longer.
  • When moving to multiple devices, the latency cliff comes later but is worse than single device. Three Optane drives perform almost the same as two. This is due to lock contention on the filesystem buffer cache.
  • As you dial the percentile back, 90 and below, multiple devices continue to outperform single devices. Meaning we get more outliers as contention increases.
  • The benchmark does not have perfect scheduling. A lot of outliers (especially under the RAM test) can happen due to periodic dogpiling of requests. This can mean the 95th percentile results are closer to what users will see.

Conclusions

  • If your entire request rate is below the latency cliffs, and the latency is acceptable, you can reduce RAM very aggressively. The Optane drive is much closer to RAM’s performance under this worst-case scenario, but the SSD performs admirably as well.
  • If your request rate is above the latency cliff, adding RAM is a straightforward fix. Most workloads heavily bias towards recency, which means moving a relatively small percentage to RAM can greatly reduce pressure on extstore.
  • NUMA can help give access to more RAM and CPU, but single socket machines give the most consistent latency.
  • Future work to improve internal batching and support for direct IO could have a huge impact on performance.

The extra tests we show here demonstrate a baseline of a worst case scenario for the performance of a single machine with one or more devices. With request rates around 500,000 per second with under a millisecond average latency, most workloads fit comfortably. While expanding disk space works well, further development is needed to improve throughput with multiple high performance devices.