In 2013, we introduced EVCache a distributed in-memory caching solution based on memcached that offers low-latency, high-reliability caching and storage. It is well integrated with AWS and EC2, a Netflix OSS project, and in many occasions it is termed as “the hidden microservice”. Since then, EVCache has become a fundamental tier-0 service storing petabytes of data and hundred of billions of items, performing trillions of operations per day, has the ability to persist the data to the disk, and has a footprint of thousands of servers in three AWS regions.
With the advent of Netflix global cloud architecture we are able to serve requests for a Netflix customer from any AWS region where we are deployed. The diagram below shows the logical structure of our multi-region deployment and the default routing of member traffic to AWS region.
As we started moving towards the Global Cloud, we had a three times increase in the data that needed to be replicated and cached in each region. We also needed to move this data swiftly and securely across all regions. Supporting these features came at a considerable increase in cost and complexities. Our motivation was to provide a global caching solution which was not only fast but was also cost effective.
Storing large amounts of data in volatile memory (RAM) is expensive. Modern disk technologies based on SSD are providing fast access to data but at a much lower cost when compared to RAM. Hence, we wanted to move part of the data out of memory without sacrificing availability or performance. The cost to store 1 TB of data on SSD is much lower than storing the same amount in RAM.
We observed during experimentation that RAM random read latencies were rarely higher than 1 microsecond whereas typical SSD random read speeds are between 100–500 microseconds. For EVCache our typical SLA (Service Level Agreement) is around 1 millisecond with a default timeout of 20 milliseconds while serving around 100K RPS. During our testing using the storage optimized EC2 instances (I3.2xlarge) we noticed that we were able to perform over 200K IOPS of 1K byte items thus meeting our throughput goals with latency rarely exceeding 1 millisecond. This meant that by using SSD (NVMe) we were able to meet our SLA and throughput requirements at a significantly lower cost.
EVCache Moneta was our first venture at using SSD to store data. The approach we chose there was to store all the data on SSD (RocksDB) and the active/hot data in RAM (Memcached). This approach reduced the size of most Moneta based clusters over 60% as compared to their corresponding RAM-only clusters. It worked well for Personalization & Recommendation use cases where the personalization compute systems periodically compute the recommendations for every user and use EVCache Moneta to store the data. This enabled us to achieve a significant reduction in cost for personalization storage clusters.
However, we were unable to move some of the large online and customer facing clusters as we hit performance and throughput issues while overwriting existing data (due to compactions) on RocksDB. We would also exceed the desired SLA at times. As we were working towards solving these issues, Memcached External Storage (extstore), which had taken a different approach in using NVMe based storage devices, was announced.
Memcached provides an external storage shim called extstore, that supports storing of data on SSD (I2) and NVMe (I3). extstore is efficient in terms of cost & storage device utilization without compromising the speed and throughput. All the metadata (key & other metadata) is stored in RAM whereas the actual data is stored on flash.
With extstore we are able to use the storage device completely and more efficiently which we could not do achieve with Moneta. On Moneta based systems we could use at most 50% of the disc capacity as we had to ensure an old item could be deleted (FIFO compaction) only after it was written again. This meant we had could end up with a copy of new and old data for every item thus having a need for 50% disc utilization. Since we did not have need for storing duplicate records in extstore we are able to reduced the cost of extstore based EVCache clusters significantly. At this point, most of the EVCache clusters are scaled to meet network demands rather than storage demands. This has been quite a remarkable achievement.
By moving from Moneta based clusters to extstore we are also able to take full advantage of the asynchronous metadump command (lru_crawler), which allows us to iterate through all of the keys in an instance. We use this to warm up a new cluster when we deploy a new version of memcached or scale the clusters up or down. By taking advantage of this command we can also take snapshot of the data at regular intervals or whenever we need. This ensures data in EVCache is durable and highly available in case of a disaster.
The performance is also consistent compared to Moneta and rarely exceeds our SLA. Below is a log of disc access via iosnoop for read operations from one of the production cluster which is used to store users personalized recommendations.
Below is a histogram plot of the read latencies from the log above. The majority of reads are around 100 microseconds or less.
Below is the average read latency of one of cache comparing Moneta (blue) vs extstore (read). extstore latencies are consistently lower than Moneta for similar load across both the instances.
With extstore we are able to handle all types of workloads whether it is a read heavy, write heavy or balanced. We are also able to handle data sets ranging from gigabytes to petabytes while maintaining consistent performance.
It has been quite a journey to move from Moneta to extstore and as of now we have moved all our production clusters running Moneta to extstore. We have also been able to move some of the large RAM based memcached clusters to considerably smaller extstore clusters. The new architecture for EVCache Server running extstore is allowing us to continue to innovate in ways that matter. There’s still much to do and If you want to help solve this or similar big problems in cloud architecture, join us.