Choreography vs Orchestration in the land of serverless

Choreography and Orchestration are two modes of interaction in a microservices architecture.

In orchestration, there is a controller (the ‘orchestrator’) that controls the interaction between services. It dictates the control flow of the business logic and is responsible for making sure that everything happens on cue. This follows the request-response paradigm.

In choreography, every service works independently. There are no hard dependencies between them, and they are loosely coupled only through shared events. Each service listens for events that it’s interested in and does its own thing. This follows the event-driven paradigm.

As always, neither is necessarily better than the other. Depending on the context, one might be more appropriate than the other. And since Lambda itself is inherently event-driven, the choreography approach has become very popular in the serverless community. I’m a huge fan of this approach and have built many event-driven systems using services such as EventBridge, SNS and Kinesis.

However, in this post, I want to talk about when it’s not a good idea and when you should consider the orchestration approach instead.

The TL;DR is that, when it comes to implementing workflows, you should prefer orchestration within the bounded context of a microservice, but prefer choreography between bounded contexts.

Imagine you’re building a food ordering service where customers can order takeaways from their favourite restaurants. A typical order flow might involve the following five steps.

We can model these five steps as events:

  • order_placed
  • restaurant_notified
  • order_accepted
  • user_notified
  • order_completed

With these events, we can implement the order flow using an event-driven approach.

  1. A customer places an order.
  2. place-order function publishes an order_placed event.
  3. notify-restaurant function is triggered by the order_placed event.
  4. notify-restaurant function sends a message to the restaurant via SNS.
  5. notify-restaurant function publishes a restaurant_notified event.
  6. The restaurant receives the new order notification in its mobile app.
  7. The restaurant clicks Accept Order in the app, which calls the orders API.
  8. accept-order function publishes an order_accepted event.
  9. notify-user function is triggered by the order_accepted event.
  10. notify-user function sends an order confirmation email to the customer.
  11. notify-user function publishes a user_notified event.
  12. The customer sees the order confirmation and is eagerly waiting for the food to arrive.
  13. The restaurant delivers the food to the customer.
  14. The restaurant clicks Complete Order in the app to confirm that the order has been delivered. This calls the orders API.
  15. complete-order function publishes an order_completed event.
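The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EventBus` class is a hypothetical in-memory stand-in for a real bus such as EventBridge or SNS, and the function bodies are reduced to the events they publish. The point it tries to show is that no function calls another directly; each one only reacts to an event and publishes its own.

```python
from collections import defaultdict, deque


class EventBus:
    """Hypothetical in-memory stand-in for a real event bus (EventBridge, SNS, ...)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # event name -> list of handlers
        self.published = []                   # every event name, in publish order
        self._queue = deque()

    def subscribe(self, event_name, handler):
        self.subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        self.published.append(event_name)
        self._queue.append((event_name, payload))

    def drain(self):
        # Deliver queued events until no handler publishes anything new.
        while self._queue:
            event_name, payload = self._queue.popleft()
            for handler in self.subscribers[event_name]:
                handler(self, payload)


def place_order(bus, order):
    bus.publish("order_placed", order)


def notify_restaurant(bus, order):
    # A real function would message the restaurant (e.g. via SNS) before publishing.
    bus.publish("restaurant_notified", order)


def accept_order(bus, order):
    # Called through the orders API when the restaurant clicks Accept Order.
    bus.publish("order_accepted", order)


def notify_user(bus, order):
    # A real function would email the customer before publishing.
    bus.publish("user_notified", order)


def complete_order(bus, order):
    # Called through the orders API when the restaurant clicks Complete Order.
    bus.publish("order_completed", order)


def run_order_flow(order):
    bus = EventBus()
    # The only coupling between functions is the event names they share.
    bus.subscribe("order_placed", notify_restaurant)
    bus.subscribe("order_accepted", notify_user)

    # The three API calls stand in for the customer and restaurant actions.
    for api_call in (place_order, accept_order, complete_order):
        api_call(bus, order)
        bus.drain()
    return bus.published


events = run_order_flow({"order_id": "o-1"})
```

Notice that the order flow itself appears nowhere in this code. It only emerges from the two `subscribe` calls and the order in which the API is called.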

Every function acted completely independently. None of them had any notion of the overall order flow; each one only cared about:

  • What events they are interested in.
  • What they should do.
  • What events they should publish when they complete their task.

This choreographed approach has several advantages:

  • Each step of the flow can be changed independently.
  • Each step of the flow can be scaled independently.
  • No single point of failure.
  • Other systems can build on these events, e.g. a promo-code service might be interested in the order_completed event and send out discount vouchers to the customer.
  • The events are useful artefacts on their own, and can be fed into a data lake to generate business intelligence reports.

It also has some drawbacks:

  • End-to-end monitoring and reporting are difficult.
  • Difficult to implement timeouts.
  • The order flow is not explicitly modelled and exists only as an emergent property of what the system does. As such, it’s only captured in the mental model of someone who understands the system end-to-end.

From a business point of view, it also raises the question: “are these really separate processes? Or are they different steps within one process?”

For business-critical workflows like this, wouldn’t you want someone or some team to take ownership of and be responsible for it? When something goes wrong and you lose millions by the hour, do you want a room full of people looking at each other because no-one understands the process end-to-end?

And if only a few people in the company understand how this critical flow works, then it creates an existential risk to the business if these people ever leave the company.

To implement the orchestration approach, I will probably use something like Step Functions and model the order flow as a state machine.
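As a rough sketch of what such a state machine could look like, here is an Amazon States Language definition written as a Python dict. The state names, the Lambda function names and the `await-delivery` function are assumptions made for illustration. The `.waitForTaskToken` integration pauses a state until the task token is returned, which is one way to wait for the restaurant to accept or complete the order.

```python
import json

# Function names below are assumptions; "await-delivery" in particular is hypothetical.
ORDER_FLOW = {
    "Comment": "Sketch of the food order flow",
    "StartAt": "Notify Restaurant",
    "States": {
        "Notify Restaurant": {
            "Type": "Task",
            # Pauses until the token is returned, e.g. when accept-order
            # reports success for this token back to Step Functions.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "notify-restaurant",
                "Payload": {"order.$": "$.order", "taskToken.$": "$$.Task.Token"},
            },
            "ResultPath": "$.acceptance",
            "Next": "Notify User",
        },
        "Notify User": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "notify-user",
                "Payload": {"order.$": "$.order"},
            },
            "ResultPath": "$.notification",
            "Next": "Await Delivery",
        },
        "Await Delivery": {
            "Type": "Task",
            # Pauses until the restaurant clicks Complete Order.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "await-delivery",
                "Payload": {"order.$": "$.order", "taskToken.$": "$$.Task.Token"},
            },
            "ResultPath": "$.delivery",
            "End": True,
        },
    },
}


def find_problems(definition):
    """Return structural problems; an empty list means every transition resolves."""
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append("StartAt points at a missing state")
    for name, state in states.items():
        target = state.get("Next")
        if target is not None and target not in states:
            problems.append(f"{name} -> unknown state {target}")
        if target is None and not state.get("End"):
            problems.append(f"{name} has neither Next nor End")
    return problems


def happy_path(definition):
    """Walk the Next pointers from StartAt to the terminal state."""
    states, path, current = definition["States"], [], definition["StartAt"]
    while current is not None:
        path.append(current)
        current = states[current].get("Next")
    return path


definition_json = json.dumps(ORDER_FLOW, indent=2)  # what you would deploy
```

Unlike the choreographed version, the whole flow is now readable in one place, and `happy_path(ORDER_FLOW)` lists the steps in order.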

It’s also worth remembering that, although we no longer need to use events to trigger the next step of the order flow, those events are still useful artefacts on their own. So we should publish those same events from Task states in the state machine. For example, after the Notify User state notifies the user via SES, the Task should also publish the user_notified event.
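A minimal sketch of such a Task, assuming it is backed by a Lambda-style handler. The `send_email` and `publish_event` callables are hypothetical wrappers injected from outside; in production they might wrap SES and EventBridge, but here they are plain functions so the example runs anywhere. The event shape (`order`, `order_id`, `customer_email`) is also an assumption.

```python
def make_notify_user_handler(send_email, publish_event):
    """Build the handler for the Notify User state.

    send_email and publish_event are injected so the handler stays
    independent of any particular email or event service.
    """

    def handler(event, context=None):
        order = event["order"]
        # 1. Do the work this state is responsible for.
        send_email(
            to=order["customer_email"],
            subject=f"Order {order['order_id']} confirmed",
            body="The restaurant has accepted your order.",
        )
        # 2. Publish the same event the choreographed version would have published,
        #    so that other services can still build on it.
        publish_event("user_notified", {"order_id": order["order_id"]})
        return {"order_id": order["order_id"], "status": "user_notified"}

    return handler


# Usage with in-memory fakes standing in for the real services.
sent, published = [], []
handler = make_notify_user_handler(
    send_email=lambda **email: sent.append(email),
    publish_event=lambda name, detail: published.append((name, detail)),
)
result = handler({"order": {"order_id": "o-1", "customer_email": "customer@example.com"}})
```

The state machine decides what happens next, while the published event remains available to anyone outside the workflow.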

This means we can still decouple the order flow from other business units that wish to build features on top of events related to an order. The aforementioned promo-code service can still rely on the order_completed event as before.

The orchestration approach has several advantages:

  • End-to-end monitoring and reporting are trivial since Step Functions gives you built-in visualization and audit histories.
  • Easy to implement timeouts, e.g. for a restaurant to accept an order, or for the total duration of the order.
  • Business logic is in one place, and it’s easy to maintain and manage.
  • The order flow is modelled and source controlled. You can literally see it in the Step Functions console.
  • The order flow is modelled and source controlled. Yes, it’s that important that it should count as two pros!

It also has some drawbacks:

  • Have to learn yet another AWS service.
  • At $25 per million state transitions (which counts Start and End by the way), Step Functions is a pricey service.
  • If Step Functions is down, then no orders can be processed. Although the same might be said about Lambda, EventBridge, or any services that are critical to the working of this order flow.
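To put the pricing point in perspective, here is a back-of-the-envelope calculation. The $25 per million transitions comes from the list above; the seven transitions per order and the one million orders per month are assumptions picked for illustration, and any free tier is ignored.

```python
from decimal import Decimal

PRICE_PER_MILLION_TRANSITIONS = Decimal("25")  # USD, the figure quoted above


def monthly_cost(orders_per_month, transitions_per_order):
    """Estimated monthly Step Functions bill in USD for state transitions only."""
    transitions = orders_per_month * transitions_per_order
    return PRICE_PER_MILLION_TRANSITIONS * transitions / Decimal(1_000_000)


# Assumed: 7 transitions per order (Start and End included), 1 million orders a month.
estimate = monthly_cost(orders_per_month=1_000_000, transitions_per_order=7)
```

Under those assumptions the estimate comes to $175 a month, and it grows linearly with both the number of orders and the number of states in the flow.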

Within a bounded context, I have a specific set of responsibilities that are aligned with a business area. And there are hopefully few enough components that they can all fit inside my head at the same time. Since they all work together to achieve some specific business capability such as processing payments, they form a highly cohesive unit. And since I own everything within this microservice’s bounded context, I’m free to change and reorganize things so long as I don’t break my contract with external services.

I often see workflows within a bounded context being choreographed through messages in SQS/SNS/EventBridge.

Generally speaking, I think that’s a bad idea.

I love using events to integrate different services together in a loosely-coupled way. But I think it’s a bad idea when it’s done inside the same bounded context because the workflow doesn’t exist as a standalone concept that is explicitly captured and source controlled.

In these choreographed workflows, the workflow only exists as the sum of loosely connected functions. As we discussed above with the food delivery example, this makes them very difficult to reason about and debug. And there’s no easy way to implement even simple things like workflow-level timeouts, or even task-level timeouts for that matter (e.g. time out the order if the restaurant doesn’t accept or reject the order within 10 minutes).
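By contrast, here is a sketch of how both kinds of timeout could be declared in a Step Functions definition, again written as a Python dict. The 10-minute task timeout is the example above; the 2-hour workflow timeout, the state names and the function name are assumptions.

```python
TEN_MINUTES = 10 * 60
TWO_HOURS = 2 * 60 * 60  # assumed cap for the whole order

ORDER_FLOW_WITH_TIMEOUTS = {
    "StartAt": "Wait For Restaurant",
    # Workflow-level timeout: the whole execution fails if it runs longer than this.
    "TimeoutSeconds": TWO_HOURS,
    "States": {
        "Wait For Restaurant": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "notify-restaurant",
                "Payload": {"order.$": "$.order", "taskToken.$": "$$.Task.Token"},
            },
            # Task-level timeout: give the restaurant 10 minutes to respond.
            "TimeoutSeconds": TEN_MINUTES,
            "Catch": [{"ErrorEquals": ["States.Timeout"], "Next": "Cancel Order"}],
            "Next": "Order Accepted",
        },
        "Order Accepted": {"Type": "Succeed"},
        "Cancel Order": {
            "Type": "Fail",
            "Error": "OrderTimedOut",
            "Cause": "The restaurant did not respond within 10 minutes",
        },
    },
}


def timeout_route(definition, state_name):
    """Return the state a timeout in state_name is routed to, or None."""
    for catcher in definition["States"][state_name].get("Catch", []):
        if "States.Timeout" in catcher["ErrorEquals"]:
            return catcher["Next"]
    return None
```

Both timeouts are a couple of declarative lines here. In the choreographed version, you would have to build and operate that timer logic yourself.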

If this is what you have today, you should consider moving these workflows to Step Functions instead.

But, between bounded contexts, I’ll publish and subscribe to events through SNS/EventBridge/Kinesis, etc. This is so that different parts of the larger system stay loosely coupled, build only on each other’s events, and can evolve and fail independently.

Orchestration and choreography don’t have to be mutually exclusive. Whenever I’m introducing state changes inside a state machine (such as changing the status of an order from pending to processed), I’ll publish those state changes as events. Other services can listen and react to these state changes, which brings choreography back into the picture.

Let me leave you with my rule of thumb when it comes to implementing business workflows: use orchestration within the bounded context of a microservice, but use choreography between bounded contexts.


Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.
