Courtney Wang (u/wangofchung)
Reddit’s engineering team and product complexity have seen significant growth over the last three years. Facilitating that growth has taken a lot of behind-the-scenes evolution of Reddit’s backend infrastructure. One major component has been adopting a service-oriented architecture, and a significant facet of that has been evolving service-to-service discovery and communication.
As the number of services has grown, so has the complexity in how they interact with each other and legacy systems. Instead of debugging function and module calls within a monolithic application, engineers now need insight into RPCs among multiple services. Beyond common problems like exception handling and bad input, engineers also have to consider client request behaviors and defend appropriately with retry-handling, circuit-breaking and granular route control.
Recently, we rolled out Envoy as our service-to-service L4/L7 proxy as part of our efforts to address these new and ever-growing needs for developing and maintaining stable production services. In this blog post, we’ll provide insight into Reddit’s service communication beginnings, why and how we chose Envoy, and how we approached and managed deployment of the tool given our infrastructure constraints.
Mesh Up: Reddit’s Service Discovery Foundation
Ever since we started building out new services and splitting features out from our monolith, Reddit’s backend infrastructure has been operating within a very basic service mesh powered by Airbnb’s SmartStack. Much of the reasoning behind SmartStack’s design choices and use cases mapped to our own architecture, making it a natural fit for our transition into the new world of service-oriented infrastructure. A brief overview of this architecture is important as the groundwork was pivotal in how we approached our eventual migration to Envoy and greatly reduced the difficulty of the task itself.
Today, the majority of our services run on AWS EC2 instances in AutoScaling Groups. As service instances come and go throughout the day, registration happens with a SmartStack component called Nerve. Nerve is a Ruby process that runs as a sidecar on each of these instances and registers the instance into a central Zookeeper cluster. The service application is responsible for constructing a useful health-check to determine registration status. Most of our services are instrumented with a common framework called Baseplate that provides shared tooling like a health-check interface, which makes interfacing with Nerve straightforward and abstracted to the service developer.
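The registration behavior described above can be sketched roughly as follows. This is an illustration only: Nerve itself is a Ruby process that talks to Zookeeper, whereas this Python sketch uses a plain dictionary as a stand-in registry. The names `Registry` and `reconcile_registration`, the service name, and the addresses are all hypothetical.

```python
from typing import Callable, Dict, Set

# Hypothetical in-memory stand-in for the Zookeeper registry:
# service name -> set of "host:port" strings for healthy instances.
Registry = Dict[str, Set[str]]


def reconcile_registration(
    registry: Registry,
    service: str,
    endpoint: str,
    health_check: Callable[[], bool],
) -> bool:
    """Register the endpoint if its health check passes, deregister it otherwise."""
    try:
        healthy = health_check()
    except Exception:
        # A health check that crashes is treated the same as one that fails.
        healthy = False

    members = registry.setdefault(service, set())
    if healthy:
        members.add(endpoint)
    else:
        members.discard(endpoint)
    return healthy


# One healthy and one unhealthy instance of the same (made-up) service.
registry: Registry = {}
reconcile_registration(registry, "example-service", "10.0.0.5:9090", lambda: True)
reconcile_registration(registry, "example-service", "10.0.0.6:9090", lambda: False)
```

In a real sidecar this reconciliation would run on a timer, so an instance drops out of the registry shortly after its health check starts failing.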
The other component to a basic service mesh is discovery, the process that makes new downstream services available to be called by upstream client services. Like Nerve, Synapse is a per-instance Ruby process that manages service endpoint discovery. Synapse reads the Zookeeper registry that Nerve populates and writes endpoint entries to a local HAProxy configuration file. HAProxy runs as a sidecar process, handling proxying and load-balancing of downstream service traffic.
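To make the discovery step concrete, here is a minimal sketch of turning a set of registered endpoints into an HAProxy `listen` section. It is not Synapse's actual output format: the function name, the local port, the server naming scheme, and the chosen directives (`mode tcp`, `balance roundrobin`, `check`) are assumptions picked for illustration.

```python
from typing import Iterable


def render_haproxy_listen(service: str, local_port: int, endpoints: Iterable[str]) -> str:
    """Render one HAProxy `listen` section that load-balances TCP across endpoints."""
    lines = [
        f"listen {service}",
        # Clients on this host reach the service via a local port (hypothetical value).
        f"    bind 127.0.0.1:{local_port}",
        "    mode tcp",
        "    balance roundrobin",
    ]
    # Sort so the rendered file is stable regardless of registry iteration order.
    for endpoint in sorted(endpoints):
        host, port = endpoint.rsplit(":", 1)
        lines.append(f"    server {service}_{host}_{port} {host}:{port} check")
    return "\n".join(lines) + "\n"


haproxy_section = render_haproxy_listen(
    "example-service", 3001, {"10.0.0.6:9090", "10.0.0.5:9090"}
)
print(haproxy_section)
```

Sorting the endpoints keeps the generated file identical between runs when nothing has changed, which makes it easy to skip unnecessary proxy reloads.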
While much of our service architecture has been constantly changing, Reddit’s SmartStack deployment has remained relatively unchanged and operationally stable throughout its three-year lifetime. The service registry provided by Nerve has been used for other internal tooling, such as checking for unhealthy AWS instances and monitoring bootstrapping of new hosts. Despite this stability and familiarity, our evolving systems started pushing against the limits of what’s immediately available with SmartStack, so we decided to evaluate the service mesh landscape and determine if it’d be beneficial to replace it with new technologies.
One Mesh More: Moving To Envoy
As our service architecture has evolved over the last few years, there were several places in SmartStack where we started to feel pain, primarily around the lack of control over the individual components. Nerve and Synapse only accept static configurations, so any service registration updates, such as adding or removing a service, required Puppet configuration changes and updates across a service fleet. Synapse’s configuration writer for HAProxy provided only basic routing definitions, and we had minimal levels of observability for the traffic going through HAProxy since it didn’t understand our primary internal protocol, Thrift.
Service-to-service behavior was also increasing in complexity, and we were noticing developers trying to enforce complicated communication behaviors in application code. Every application was starting to manage its own retry-handling, timeouts, and circuit-breaking to downstream services in its own code, and it got to the point where a service had four or five upstream clients with different behaviors against the same endpoint. We knew this would not scale and wanted to explore the possibility of managing that in a shared communication layer like our network proxy.
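As an illustration of the kind of logic each client ends up re-implementing, here is a minimal circuit breaker with retries. It is a generic sketch, not code from any Reddit service or from Baseplate: the class and function names, the thresholds, and the half-open policy are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("downstream marked unhealthy")
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(breaker: CircuitBreaker, func: Callable[[], T], attempts: int = 3) -> T:
    """Retry failed calls, but stop immediately once the breaker is open."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")
```

When four or five clients each carry their own variant of this, with different thresholds and retry counts, the downstream sees inconsistent behavior. That is the argument for moving the policy into a shared proxy layer.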
Now that we had defined a destination that we wanted to reach with our service mesh, we needed to figure out the road that would fit Reddit best. We evaluated many proxy and mesh options, focusing on the following characteristics when weighing tradeoffs:
- Performance: Avoid adding a performance bottleneck at all costs. Any performance losses at the proxy level need to be offset by considerable feature gains. The two biggest considerations here were resource utilization and latency impact. Our mesh approach accounts for a sidecar proxy on every host, so we wanted the solution to be one that we were comfortable running on every host and at every hop in the network.
- Features: The biggest differentiator among the options was the possibility of L7 Thrift support in the proxy. Thrift is our main inter-service RPC protocol and without first-class support for the behavior control we want in a service mesh, it wouldn’t make sense to switch to something that would just be providing the same basic TCP load balancing we’re getting out of HAProxy. We’ll address this in the next section.
- Integrations and Extensibility: Being able to contribute or request integrations and possibly extend out-of-the-box functionality was also a core requirement. The network proxy needed to be able to evolve with Reddit’s service needs and developer feature requests.
Envoy and its ecosystem fit all of these requirements with tradeoffs that we decided would be worth the migration. While we’d have to dramatically change our service discovery stack and get comfortable running an entirely new traffic proxy in production, the low footprint and extensibility potential, especially with future Thrift filters, were big opportunities for our infrastructure evolution. With a well-defined feature-set goal and implementation choice made, the hard part was done. Next came the harder part of actually deploying and using it in production.
HAProxy Replaced: First Steps with Envoy
As mentioned before, HAProxy functioned as the service traffic proxy in Reddit’s SmartStack deployment, and within that deployment it could only manage traffic at L4. Much of what differentiated Envoy’s advanced feature set for us was its L7 control. At the time, there was already mature functionality for HTTP and gRPC, but Reddit services primarily communicate using Thrift. The first thing we needed to get started on was ensuring that Envoy had a Thrift story. At this time, Turbine Labs had just announced their Envoy support and plans to develop products for Envoy, and we noticed that they were actively contributing to Envoy. We reached out to them about the possibility of contracting development of Thrift support in Envoy, and over the next four months, the partnership brought Thrift support into Envoy proper, contributing a basic Thrift-aware proxy, routing, request/response metrics, and rate-limiting.
In parallel, internally we were focused primarily on setting the foundation for deploying and running Envoy with our production services. With internal service communication depending on this layer, we had to be extra careful in trying to change even just one part of the system, let alone the entire thing. After a good deal of consideration, we settled upon replacing HAProxy with Envoy with basic TCP proxying support as the minimum viable first step while Thrift capabilities were being built out. Operationally, we would still use Nerve and Synapse to handle service registration and discovery, which meant changing fewer moving pieces at once. The tradeoff here was that we wouldn’t be using Envoy’s preferred mode of operation, using dynamic discovery services, out of the box. To deal with this, we wrote a custom ConfigGenerator plugin in Synapse for writing Envoy configurations and also additional configuration scripts and tooling for Synapse itself to manage Envoy using its hot-restart mechanism.
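A rough sketch of what a static, TCP-only Envoy configuration for one downstream service might look like when generated from a list of discovered endpoints. This is not the Synapse plugin described above (which is written in Ruby). The field names follow Envoy's v3 bootstrap schema, which postdates this rollout, so the configuration used at the time would have looked different. The function name, port, timeout, and addresses are hypothetical.

```python
import json
from typing import Any, Dict, Iterable


def envoy_tcp_config(service: str, local_port: int, endpoints: Iterable[str]) -> Dict[str, Any]:
    """Build a static Envoy config: one local TCP listener proxying to one cluster."""
    lb_endpoints = []
    for endpoint in sorted(endpoints):
        host, port = endpoint.rsplit(":", 1)
        lb_endpoints.append(
            {"endpoint": {"address": {"socket_address": {"address": host, "port_value": int(port)}}}}
        )

    cluster = {
        "name": service,
        "connect_timeout": "1s",  # assumed value for illustration
        "type": "STATIC",         # endpoints are baked into the file, not discovered by Envoy
        "lb_policy": "ROUND_ROBIN",
        "load_assignment": {
            "cluster_name": service,
            "endpoints": [{"lb_endpoints": lb_endpoints}],
        },
    }
    listener = {
        "name": f"{service}_listener",
        "address": {"socket_address": {"address": "127.0.0.1", "port_value": local_port}},
        "filter_chains": [
            {
                "filters": [
                    {
                        "name": "envoy.filters.network.tcp_proxy",
                        "typed_config": {
                            "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
                            "stat_prefix": service,
                            "cluster": service,
                        },
                    }
                ]
            }
        ],
    }
    return {"static_resources": {"listeners": [listener], "clusters": [cluster]}}


envoy_config = envoy_tcp_config("example-service", 4001, ["10.0.0.5:9090", "10.0.0.6:9090"])
print(json.dumps(envoy_config, indent=2))
```

Because the endpoints are static, every change in the registry means writing a new file and restarting Envoy, which is why a hot-restart mechanism matters in this mode of operation.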
This effort allowed us to keep most of our service discovery system intact while still operationalizing Envoy in production, which in turn gave us a good deal of flexibility with the migration itself. We could roll out the proxy change to each service one at a time and control rollout of Envoy usage on a per-instance basis by changing application configurations to use Envoy-specific endpoints for downstream communication. During the migration process, we ran both HAProxy and Envoy in parallel, listening on different ports, with Synapse writing configurations for both proxies. This enabled us to perform per-service migrations for the client, observe behavior, and roll back immediately with a straightforward configuration change if anything broke. We could also audit Envoy configurations against HAProxy’s to verify the correctness of our Synapse configuration generator. Over a period of two months, we deployed Envoy to individual services and then flipped service configurations to send downstream requests through Envoy instead of HAProxy, monitoring and ensuring no behavioral or performance regressions along the way.
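The audit step mentioned above, checking that both proxies were given the same set of backends, could look something like the sketch below. It is an assumption about how such a check might be written, not the tooling actually used. It expects HAProxy `server <name> <host:port> ...` lines and an Envoy config shaped like a v3 `static_resources` document, and the function names are hypothetical.

```python
from typing import Any, Dict, Set


def haproxy_endpoints(config_text: str) -> Set[str]:
    """Collect host:port values from `server <name> <host:port> ...` lines."""
    found: Set[str] = set()
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "server":
            found.add(parts[2])
    return found


def envoy_endpoints(config: Dict[str, Any]) -> Set[str]:
    """Collect host:port values from every static cluster in an Envoy config."""
    found: Set[str] = set()
    for cluster in config.get("static_resources", {}).get("clusters", []):
        for group in cluster.get("load_assignment", {}).get("endpoints", []):
            for lb_endpoint in group.get("lb_endpoints", []):
                sock = lb_endpoint["endpoint"]["address"]["socket_address"]
                found.add(f'{sock["address"]}:{sock["port_value"]}')
    return found


def audit(haproxy_text: str, envoy_config: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Report endpoints that appear in only one of the two proxy configurations."""
    from_haproxy = haproxy_endpoints(haproxy_text)
    from_envoy = envoy_endpoints(envoy_config)
    return {
        "missing_in_envoy": from_haproxy - from_envoy,
        "extra_in_envoy": from_envoy - from_haproxy,
    }
```

An empty result in both sets means the two generators agree on the backend list. Anything else points at a bug in one of the configuration writers rather than in the proxies themselves.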
Migration hasn’t been totally without bumps and bruises, primarily around our custom Synapse and Envoy integrations and management of static Envoy configurations. Envoy’s network connection handling, especially across hot restarts, differed enough from HAProxy to cause unexpected errors in our application connection management code as well. None of these issues have been show-stopping, and as a result production traffic has been smoothly running through Envoy for nearly four months.
Performance-wise, Envoy has had no measurable impact on our service latencies compared to HAProxy. HAProxy is still running as a sidecar to facilitate quick emergency rollbacks during the holiday season while engineering resources are a bit thin, so we haven’t yet been able to measure resource utilization impact on our hosts. Envoy has also provided more observability at the network layer for us, especially after we enabled the Thrift filter on a few internal services. Thanks to the filter instrumentation, Envoy has started to provide request and response metrics that we didn’t have access to before without any application code changes. These small but impactful improvements and the overall operational stability have given us the confidence to continue pursuing our larger service-mesh roadmap for 2019, with Envoy as the engine to power it.
What Comes Next?
With Envoy rolled out at the proxy level and managing L4 traffic, we’ll spend the first part of 2019 solidifying operational tooling in order to start really leveraging Envoy’s control capabilities. We plan to deploy an implementation of Envoy’s discovery service API backed by a centralized configuration store to bring dynamic configuration to our Envoy deployment and provide an interface for developers to manage those configurations on a per-service basis. We’re also starting to leverage Kubernetes more for internal services, so that part of our Envoy deployment will see much more development starting early next year. A primary area of interest there is in using Envoy’s route management tooling to facilitate service migrations into Kubernetes.
Finally, while Envoy is currently used for most of our backend service communication, we still leverage AWS ELBs for some external ingress points and also use HAProxy as the main external load-balancing and routing layer for the core Reddit backend application. Running Envoy at the edge will provide substantially more observability and service routing control. We’re hoping that being able to do complex request-management operations such as shadowing inbound traffic and traffic-shifting at the edge will pay dividends as we continue splitting the monolith into smaller, more-manageable services.
Rolling Envoy out to production has been a massive undertaking this year for Reddit and couldn’t have been done without a ton of internal and external support. We’d like to thank the Envoy community for building and maintaining a tremendous piece of software and also Turbine Labs (especially Mark McBride, TR Jordan, and Stephan Zuercher) for building out the foundation for Envoy’s Thrift capabilities. Envoy also couldn’t have happened at Reddit without the internal support of the entire engineering team, especially u/alienth, u/cshoesnoo, u/NomDeSnoo, and u/spladug. Stay tuned as we continue to develop Reddit’s story around network observability and control. Interested in being a part of it? We’re hiring.