How we build services fast: A look at the Grubhub service framework

By Dylan Drop


When developing around a service-oriented architecture, the engineers at Grubhub had to roll a lot of code for every technology, provider, and infrastructure choice we made. Datadog requires a client library invocation any time you want to increment a metric. Splunk must be configured for every new service to ensure it includes appropriate tracking and instance information. Moreover, should we want to leave ourselves open for changes to these integrations, we could incur the cost of adapting our 100+ services to a new integration pattern. The situation can easily spiral out of control, and we wanted to avoid that before it affected our business.

At some tech companies, especially early-stage ones, tech teams aren’t eager to enforce many standards around their code, and teams are given free rein over how they integrate with their chosen technologies. While they might restrict their organization to a few language, database, and cloud provider choices, more is left open to exploration. Libraries for integrating the chosen language with a given database or cloud provider might be up for grabs; creating new libraries to integrate with a specific function of a given cloud provider could also be on the table. This has real benefits: new technologies can be evaluated quickly, and up-front development costs stay small.

While allowing teams or contributors to make independent choices on technology can be appealing, companies often find themselves dealing with penalties as a result of this policy. Teams must spend more time writing boilerplate to integrate into other services and tools. They will spend more time testing this code and fixing issues from edge cases in the integrations. Developers who jump from one codebase to another will need to reorient themselves to get the lay of the land before being able to make changes. Finally, this can become a devops nightmare, as there are usually more systems and languages to learn how to support.

Grubhub came up with an alternative: we developed a Java-based service framework akin to Dropwizard. The framework allows us to develop homogenized services that automatically hook into all the vendor integrations we support for data storage, messaging, logging, and more, while requiring minimal boilerplate. Because the services are homogeneous, we can better leverage our engineers’ past experience to get them oriented quickly, and they can comfortably ship new features across any domain in our service architecture.

The framework abstracts away the underlying technology so that we can focus on feature development. For example, we perform RPCs using JSON/HTTP, but the framework provides a Java wrapper that obviates the need to know the underlying serialization format. If I wanted to create a service to accept a certain RPC (we’ll name it handlePing), I’d write some code that looks like the following:

public interface MyService {
   @RpcMethod
   PingResponse handlePing(@RpcParam("request") PingRequest request);
}

With a few lines of setup, my service is now ready to accept the handlePing RPC. PingRequest and PingResponse are POJOs (plain old Java objects) defined elsewhere; the framework takes care of marshalling the POJOs to and from JSON and handles any transport-specific logic. I simply add the annotations we’ve defined in our framework, RpcMethod and RpcParam, and the framework does the rest.

Invoking the RPC is simple:
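The original post showed the client snippet as an image. As a rough, self-contained sketch of the idea (RpcSketch, createClient, and the echoing handler are illustrative stand-ins, not the real framework API), a dynamic proxy can make an RPC look like a plain Java method call:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class RpcSketch {
    // Stand-in POJOs mirroring the PingRequest/PingResponse from the post.
    record PingRequest(String payload) {}
    record PingResponse(String echo) {}

    interface MyService {
        PingResponse handlePing(PingRequest request);
    }

    // Hypothetical client factory. A real framework would marshal the POJO
    // to JSON here and perform the HTTP call; this toy handler just echoes.
    static <T> T createClient(Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            PingRequest req = (PingRequest) args[0];
            return new PingResponse(req.payload());
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    public static void main(String[] args) {
        // From the caller's perspective, this is just a Java method call.
        MyService client = createClient(MyService.class);
        PingResponse resp = client.handlePing(new PingRequest("ping"));
        System.out.println(resp.echo()); // prints "ping"
    }
}
```

The proxy is where the serialization format and transport stay hidden from callers, which is what makes them swappable later.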


Because the serialization format and transport are not dictated by code, I could easily switch them out if I so choose. If I were to do so, developers working on my service wouldn’t have to be well-versed in the details of the new serialization format and transport, as the framework masks all of that for them. All they have to be familiar with is Java and the semantics we’ve defined. So long as the framework integrates with the underlying RPC solution in a consistent way, I can worry less about RPC framework intricacies that can vary from one to another.

When adding metrics and logging to your own RPC or API calls, if you’re not using a framework like Grubhub’s, you’ll often be forced to write a library that satisfies your internal technology constraints as well as whatever providers you use for metrics and logging. This can result in a lot of headaches. As an example, when logging, you might not have a consistent way of tracking a set of requests, so you may not be able to tie together related logs when debugging an issue. Metrics can be even more troublesome: when different developers instrument their code with metrics, they may come up with different interpretations of how those metrics should be calculated and different naming schemes. As a result, these metrics become harder to find and interpret.

The Grubhub framework integrates these RPCs with our logging management provider (Splunk), producing helpful logs on every RPC:

INFO [2018-02-28 17:23:16,998] RPC-SERVER.request: {"req": {"jsonrpc":"2.0", "id":719737, "caller":"InstanceInfo(service=MYSERVICE, version=1.5.1820)", "method":"handlePing"}}, request-id=106669e0-1cac-11e8-ae4d-f9210915970a, tracking-id=598523ae-c958-4729-b15e-576965e32c1e

Notice that these logs provide service tracing for debugging purposes via the request ID and tracking ID shown. The framework adds a new request ID for each request cycle for RPC requests and API requests, and a tracking ID for grouping together a longer life cycle of events. Should I need to track a series of service calls for debugging purposes, I can query Splunk with this request ID to piece together the history of this request. Furthermore, the RPCs automatically integrate with Datadog to provide a slew of helpful statistics:

  • rpc.MyService_handlePing.request_count
  • rpc.MyService_handlePing.response_time.min (also max, p50, p75, and p99)
  • rpc.client.MyService_handlePing.request_count
  • rpc.client.MyService_handlePing.response_time.min (also max, p50, p75, and p99)

… and many more.
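The two-ID scheme above can be sketched in plain Java (the helper and log format here are illustrative only, not the framework's actual code): one tracking ID is minted for the whole life cycle, and each hop gets its own request ID.

```java
import java.util.UUID;

public class TracingSketch {
    // Illustrative helper: formats a log line carrying the two IDs
    // the framework attaches to every RPC and API request.
    static String log(String method, String requestId, String trackingId) {
        return "RPC-SERVER.request: method=" + method
                + ", request-id=" + requestId
                + ", tracking-id=" + trackingId;
    }

    public static void main(String[] args) {
        // One tracking ID spans the whole life cycle of related events...
        String trackingId = UUID.randomUUID().toString();
        // ...while each call in the chain mints a fresh request ID.
        for (String method : new String[] {"handlePing", "handlePong"}) {
            String requestId = UUID.randomUUID().toString();
            System.out.println(log(method, requestId, trackingId));
        }
    }
}
```

Querying the log store for the shared tracking ID then reassembles the full history of the life cycle, while a request ID isolates one hop.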

Our framework provides similar functionality for REST APIs, expediting developer productivity. These integrations are handled seamlessly for developers without any intervention on their part. As a developer, I’m glad to have this framework at my disposal, as frequently I’ll encounter fencepost errors and other minor issues when instrumenting these integrations myself, which distracts from working on the core application logic.

Messaging can incur all the same setup issues around writing boilerplate, logging, and metrics, but it can sometimes be even more of a hassle. Many solutions differ on what they expose in terms of retries, timeouts, and more. If your organization is not careful, integrations with certain messaging providers will become vendor-specific and hard to migrate away from.

In our service framework, we’ve abstracted away the messaging implementation in a way that can accommodate many different messaging solutions. In a similar fashion to RPCs, I can define a message handler class like so:

public class MyMessageHandler implements MessageHandler<MyMessage> {
   @Override
   public void handleMessage(@NonNull MyMessage message) {
      // …
   }
}

As was the case with the RPC library, the semantics allow me to code without an understanding of the underlying messaging provider. In the framework’s codebase, there are several supported adapters to messaging providers (such as SQS), but little of this has to be exposed in the codebase of the service utilizing the framework. Should I feel the need to switch to a different messaging solution, I would only need to make a new integration with the framework that conforms to the current API, rather than n new integrations across our n services.
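The adapter idea can be sketched as follows. ProviderAdapter and InMemoryAdapter are hypothetical names invented for this example (the post does not show the framework's actual SPI): the point is that services only ever see the MessageHandler contract, while each vendor gets exactly one adapter behind it.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

public class MessagingSketch {
    // Framework-facing contract, mirroring the MessageHandler from the post.
    interface MessageHandler<T> { void handleMessage(T message); }

    // Hypothetical provider SPI: one adapter per vendor (SQS, etc.).
    interface ProviderAdapter {
        void subscribe(Consumer<String> delivery);
        void publish(String raw);
    }

    // Toy in-memory adapter standing in for a real vendor integration.
    static class InMemoryAdapter implements ProviderAdapter {
        private final Queue<String> queue = new ArrayDeque<>();
        private Consumer<String> delivery;

        public void subscribe(Consumer<String> d) { this.delivery = d; drain(); }
        public void publish(String raw) { queue.add(raw); drain(); }

        // Deliver queued messages once a subscriber is attached.
        private void drain() {
            if (delivery == null) return;
            String msg;
            while ((msg = queue.poll()) != null) delivery.accept(msg);
        }
    }

    public static void main(String[] args) {
        MessageHandler<String> handler = m -> System.out.println("handled: " + m);
        ProviderAdapter adapter = new InMemoryAdapter();
        // Wire the vendor-specific adapter to the vendor-agnostic handler.
        adapter.subscribe(handler::handleMessage);
        adapter.publish("hello"); // prints "handled: hello"
    }
}
```

Swapping vendors then means writing one new ProviderAdapter implementation, while every MessageHandler in every service stays untouched.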

In addition to the capabilities discussed, our framework also facilitates common service features like leader election, circuit breaking, and more, making our services vendor-agnostic and easy to adapt. Should we desire to change our metrics provider or logging platform, it would require next to no code changes across our services, and would not require teaching all of our developers new, specialized knowledge.

With the framework implementing these service features, boilerplate is reduced to almost nothing, and developers are mostly concerned with feature development. The code is simple and readable even for those who aren’t familiar with the underlying technology, which expedites each developer’s ability to jump in and contribute to a service they have not yet worked on. Troubleshooting bugs becomes much easier, too.

The consistency of the framework makes our SRE team’s lives a bit nicer. It minimizes specialized deployments, which minimizes the level of effort for the SRE team — spinning up new services becomes almost effortless.

Obviously, the main restriction of our framework is that it ties us to the Java/JVM world. Personally, I don’t necessarily consider this a downside, as a good number of popular technologies have first-class support and robust client libraries for Java. But even if the JVM isn’t your style, you can still build a framework for Golang, Ruby, Python, or whatever suits your team’s preference. Whichever language you choose, writing a library that abstracts away the implementation details and trivia of your service infrastructure can help you get off the ground faster and on to the important stuff.

Most importantly, the results speak for themselves: the framework has facilitated the development and release of more than 100 services over the past three years, an efficiency in producing high-quality services that likely would not have been possible otherwise. As engineers, we want the highest-quality output for the least time spent, and our service framework allows us to achieve exactly that.