Written By: James Novino
The Order Management System (OMS) at Jet is responsible for a number of business functions:
- Order Initialization and Verifications
- Charging / Credits / Money Management
- Order Fulfillment Integrations
- Order History
- Concessions (Refunds, Returns, etc. )
These flows are asynchronous processes that are currently executed in a Service Oriented Architecture (SOA) style. This general architecture consists of a collection of microservices orchestrating tasks. I have written some posts about the microservice patterns that we use:
- F# Microservice Patterns @ Jet.com
- Scaling Microservices @ Jet.com
- Abstracting IO using F#
- Observability Using Abstracted IO
These posts give some insight into how OMS, as well as the other Jet teams, builds, deploys, and manages its systems.
Traditionally, the features that OMS is responsible for have been developed and run on microservices using a combination of pub/sub, event sourcing, HTTP calls, and a few other technologies. However, as Jet grew and the requirements of the business expanded, the complexity of the system and architecture also increased until we were running dozens of services. As the number of services increased, it became harder to maintain and improve the system. This led to longer feature development cycles because the logic of nearly all features was distributed across multiple services, which made each feature hard to maintain and debug.
This is the case for most microservice-based architectures, as they are often tightly coupled to things like input/output, SLAs, and individual service responsibilities, all of which make it harder to adapt to changing needs. For our system at Jet, each “business flow” was orchestrated across a set of microservices that handled specific actions for that flow (e.g., creating the order, updating order history, sending transactional emails).
In our microservice based architecture, each service is implemented using the same boilerplate:
|> decode (DomainEvent -> Input option)
|> handle (Input -> Async<Output>)
|> interpret (Output -> Async<unit>)
The decode function transforms the event consumed from the incoming streams into the Input type, which is then fed to the handle function. The handle function performs any checks, gathers any additional data required, and passes its Output to the interpret function, which performs the side effects.
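As a rough illustration of how the three stages fit together, here is a minimal sketch. The `DomainEvent`, `Input`, and `Output` shapes, the `sent` list, and the `processEvent` helper are all assumptions made up for this example; the real types used in our services are not shown in this post.

```fsharp
// Hypothetical types, invented for illustration only.
type DomainEvent = { Stream : string; Payload : string }
type Input = OrderCreated of orderId : string
type Output = SendEmail of orderId : string

// decode: DomainEvent -> Input option (events the service does not care about become None)
let decode (e : DomainEvent) : Input option =
    if e.Stream = "orders" then Some (OrderCreated e.Payload) else None

// handle: Input -> Async<Output> (checks and data gathering would happen here)
let handle (input : Input) : Async<Output> =
    async {
        match input with
        | OrderCreated orderId -> return SendEmail orderId
    }

// interpret: Output -> Async<unit> (the only stage that performs side effects).
// The side effect is simulated by recording the order id in a list.
let sent = System.Collections.Generic.List<string>()

let interpret (output : Output) : Async<unit> =
    async {
        match output with
        | SendEmail orderId -> sent.Add orderId
    }

// Wire the three stages together for a single consumed event.
let processEvent (e : DomainEvent) : Async<unit> =
    async {
        match decode e with
        | Some input ->
            let! output = handle input
            do! interpret output
        | None -> return ()
    }

processEvent { Stream = "orders"; Payload = "order-1" } |> Async.RunSynchronously
processEvent { Stream = "other"; Payload = "ignored" } |> Async.RunSynchronously
```

After both calls, `sent` holds only `"order-1"`, because the second event is dropped by `decode` before it reaches `handle`.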
This process is inherently complex and requires a lot of boilerplate to effectively model and build. This boilerplate provided the template for building services using our decode -> handle -> interpret pipeline but still required that each microservice incorporate scalability, performance tuning, idempotency, error handling & retries, logs, metrics, dashboards, etc. into the design. The complexity of building/maintaining a microservice based architecture ends up having negative effects on the overall system and team as the system grows.
This is why, in January of 2017, we set out to model and create a new system/platform capable of representing all the complex business flows we currently have in our microservices. The workflow system that was designed and built is now known as OMS 2.0. It exists to improve the development, scalability, and extensibility of the Jet order processing system.
The core design of the workflow system is based on a set of guarantees that we attempted to provide in our old microservice based architecture:
- Idempotence — The ability to de-duplicate events (triggers) based on a unique identifier.
- Consistency — We provide support for multiple different backing stores as our state management layer. This means that we also provide configurable consistency models based on our backing store. The implementation of our state management layer will always mimic a strong consistency model since we need to be able to read our own writes.
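The idempotence guarantee above can be sketched as de-duplication keyed on the trigger's unique identifier. This is a minimal sketch under stated assumptions: the `Trigger` record, the in-memory `seen` set, and `tryAccept` are hypothetical names, and a real implementation would use a durable state store that supports reading its own writes.

```fsharp
open System.Collections.Generic

// Hypothetical trigger shape; the post only says that triggers carry a unique identifier.
type Trigger = { Id : string; Body : string }

// In-memory stand-in for the state management layer (an assumption for this sketch).
let seen = HashSet<string>()

// Returns true if the trigger is new and should be processed,
// false if a trigger with the same Id was already accepted.
let tryAccept (t : Trigger) : bool =
    seen.Add(t.Id) // HashSet.Add returns false when the element is already present

let results =
    [ { Id = "a"; Body = "x" }
      { Id = "a"; Body = "x" }
      { Id = "b"; Body = "y" } ]
    |> List.map tryAccept
// results = [true; false; true]
```

The second trigger is rejected because it reuses the identifier `"a"`, so processing it again would be a duplicate.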
These guarantees ultimately led to the development of a system that provides a number of additional capabilities compared with our prior microservice-based architecture. These capabilities are some of the key differences between the two. The key capabilities of the workflow system are:
- Deep Workflows — Workflows are represented as DAGs (Directed Acyclic Graphs) but can be as nested as you want them to be.
- Event Sourcing — State changes are all fully event sourced into the journal without developers needing to understand the event sourcing semantics. The system is designed in a way that encourages thinking in terms of “events”. Note: The OMS system has always had event sourcing as a key component, but with this new system it’s baked directly into the platform.
- Simple Implementation — Business functionality is implemented as a workflow definition and the corresponding step implementations. This forces modularization of the system and forces developers to think upfront and think through the business flow, before jumping into the implementation.
- Reusability — Designed with reusability in mind but leaves enough flexibility to the developers so they can design flows as they see fit. When new workflows need to be introduced into existing flows, developers can either create new flows from existing steps or reuse at the workflow level. For example, in order to handle digital SKUs (Apple Care) within Jet, we added new workflows to handle warranty, while allowing the rest of the order workflow to remain unchanged. This allows us to iterate rapidly and deliver new functionality into the system at great pace.
- Verification/Regression — Verifying step behavior via tooling allows regression tests to be run quickly. We take full advantage of this capability at Jet, and our workflow implementations have over 80% code coverage without having to write dedicated tests. New tests can be easily generated with the click of a button.
- Idempotency — Provides idempotency guarantees for a workflow. Idempotency is achieved through a combination of unique identifiers configured from the workflow DSL.
- System Scalability — The system is elastic in nature. The system can be easily scaled to improve the throughput of business flows. In our current usage, the scale of the system is limited by the number of partitioned consumer channels used to communicate between the core services.
- Workflow Versioning — The system maintains the workflow version throughout the execution of that workflow. That is, a workflow definition doesn’t change for an instance once the instance has started. This allows us to deploy changes to workflows without having to be concerned about executions that are in flight. Since workflows are a representation of business flows, this allows us to iterate on business flows independently.
- Low-Level Concerns are Handled Once — Scalability, Performance, Idempotency, Retries, Error-Handling, etc. are all handled at the platform level rather than in each microservice.
- Metrics/Monitoring — System-level metrics are easily controlled, and business flow execution is easily tracked, traced, and monitored.
- State Management — A single source-of-truth for all state changes known as the Journal. The Journal acts as a standard log that powers the system. It also acts as a debugging tool to fully understand the history of a given workflow. Having traceability over state changes is a big difference compared to the event sourced microservices we have historically used.
- Support for Deferred Workflows — The ability to configure a workflow definition to allow only one workflow to execute at a time. This is very challenging to achieve in a microservice-based architecture.
- Manual Review — Workflows that are either blacklisted or have some unexpected failure are written for manual review and can be reviewed and resubmitted through the Visualizer (the front end for the workflow system; more on this in subsequent posts).
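To make one of these capabilities concrete, here is a minimal sketch of workflow versioning, where an instance captures its definition at start time. Every name here (`VersionedWorkflow`, `RunningInstance`, `registry`, `deploy`, `start`) is hypothetical and invented for this example; it is not the platform's actual API.

```fsharp
// Hypothetical shapes; names are illustrative only.
type VersionedWorkflow = { Name : string; Version : int; Steps : string list }
type RunningInstance = { InstanceId : string; Definition : VersionedWorkflow }

// Deployed definitions keyed by workflow name; the latest deployment wins for new starts.
let mutable registry : Map<string, VersionedWorkflow> = Map.empty

let deploy (d : VersionedWorkflow) : unit =
    registry <- registry.Add(d.Name, d)

// Starting an instance captures the definition as it exists right now.
// Records are immutable, so later deployments cannot alter this copy.
let start (instanceId : string) (name : string) : RunningInstance =
    { InstanceId = instanceId; Definition = registry.[name] }

deploy { Name = "order"; Version = 1; Steps = [ "charge"; "ship" ] }
let inFlight = start "i-1" "order"

deploy { Name = "order"; Version = 2; Steps = [ "verify"; "charge"; "ship" ] }
let fresh = start "i-2" "order"
// inFlight.Definition.Version = 1, fresh.Definition.Version = 2
```

The in-flight instance keeps running version 1 even after version 2 is deployed, which is the property that makes deploying workflow changes safe.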
The new workflow system was largely inspired by “Life Beyond Distributed Transactions” by Pat Helland and was designed as a two-layer architecture:
- Infrastructure Layer — handles concerns such as scalability, idempotency & correctness, error-handling/retries, logging, metrics, etc. The goal was to solve these concerns once and not for every service or use-case.
- Workflow Layer — Scale-agnostic and deals with actual business implementation, this is the workflow DSL and the corresponding step implementations.
Note: More details on these layers and how they map to the concepts discussed in the paper will be expanded in future posts
The system architecture is essentially a highly abstracted version of the decode -> handle -> interpret pipeline we use in our microservices. The workflow system takes this pipeline and separates concerns even further by drawing service boundaries between each of these operations:
- Workflow Triggers (decode)
- Workflow Executor (handle)
- Side Effect Executor (interpret)
Because the service boundaries in the above architecture are clearly defined, the execution of workflows can be highly parallelized, with multiple instances of each service that can all be scaled independently. Note: Not everything is described in fine detail here, but the above gives a fair template of the core architecture.
I’ll discuss the details of how the DSL works in the second post of this series. For now, the important part to note is that a workflow must contain three things:
- Trigger: How should we trigger this workflow, i.e. Kafka message, service bus message, EventStore stream, REST, etc.
- Metadata: This is what controls meta-parameters of the workflow like retries, concurrency locking, aggregates, etc.
- Steps: What are the critical behaviors of the workflow or what should the steps be? In what order should we execute the steps, should there be conditional steps, can we execute more than one step in parallel, etc.
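The three parts above could be modeled along the following lines. This is only a sketch: the real DSL is the topic of the next post, so every type, field, and step name below is an assumption made for illustration.

```fsharp
// All names below are hypothetical; they are not the actual workflow DSL.
type Trigger =
    | KafkaMessage of topic : string
    | ServiceBusMessage of queue : string
    | EventStoreStream of stream : string
    | Rest of route : string

// Meta-parameters such as retries and concurrency locking.
type Metadata = { MaxRetries : int; LockPerAggregate : bool }

// Steps can run in sequence (list order), in parallel, or conditionally.
type Step =
    | Run of name : string
    | Parallel of Step list
    | When of condition : string * thenStep : Step

type Workflow = { Trigger : Trigger; Metadata : Metadata; Steps : Step list }

let orderCreated : Workflow =
    { Trigger = KafkaMessage "orders"
      Metadata = { MaxRetries = 3; LockPerAggregate = true }
      Steps =
        [ Run "verifyOrder"
          Parallel [ Run "updateHistory"; Run "sendEmail" ]
          When ("isDigitalSku", Run "createWarranty") ] }

// Count the leaf steps, walking nested steps recursively.
let rec countSteps (step : Step) : int =
    match step with
    | Run _ -> 1
    | Parallel steps -> steps |> List.sumBy countSteps
    | When (_, s) -> countSteps s

let total = orderCreated.Steps |> List.sumBy countSteps
// total = 4
```

Modeling steps as a recursive union is one way to express ordering, parallelism, and conditional steps in a single definition; the actual DSL may do this differently.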
How are side effects represented in this DSL? The answer is that they are implemented in the step definitions. Each step has to return the StepEvaluation.Result type, which contains three parameters:
- State: The current state that is passed between steps
- Input/Output: Steps have the ability to pass an output, which is supplied to the following step as its input.
- Side-Effects: A list of side-effects that need to be executed by the step.
These parameters are used to orchestrate the step within the workflow engine. The details of what these parameters do and how this process works are the topic of a later post.
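A step result with these three parts might look roughly like the sketch below. The actual StepEvaluation.Result type is not shown in this post, so `StepResult`, `SideEffect`, `OrderState`, and `chargeStep` are all hypothetical names chosen for the example.

```fsharp
// Hypothetical model of the three parts described above.
type SideEffect =
    | SendEmail of toAddress : string
    | PublishEvent of topic : string

type StepResult<'State, 'Output> =
    { State : 'State                 // state passed between steps
      Output : 'Output               // becomes the next step's input
      SideEffects : SideEffect list } // described here, executed elsewhere

type OrderState = { OrderId : string; Charged : bool }

// A step written as a pure function: it describes its side effects
// instead of performing them, so it can be re-run safely.
let chargeStep (state : OrderState) (amount : decimal) : StepResult<OrderState, string> =
    { State = { state with Charged = true }
      Output = "receipt-" + state.OrderId
      SideEffects =
        if amount > 0M then [ PublishEvent "order-charged" ] else [] }

let result = chargeStep { OrderId = "o-1"; Charged = false } 25.0M
// result.State.Charged = true
// result.Output = "receipt-o-1"
// result.SideEffects = [PublishEvent "order-charged"]
```

Returning side effects as data, rather than executing them inside the step, matches the split between the Workflow Executor and the Side Effect Executor described earlier.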
The workflow system also includes a tool called the Visualizer, which lets users get insights about workflows. The Visualizer can show workflow details for both in-flight and historical workflows by showing the details in the workflow journal. Shown below is an example of how the Visualizer supports inspecting any single execution of the workflow.
The full capabilities of the Visualizer are the topic of another post; however, some key highlights are described below:
- Verification & Regression Tests: Ability to verify one or more steps from a previous execution of a workflow.
- Manual Review: Ability to inspect and resubmit failed workflows or side-effects
- Self-Documenting: Ability to review state, input, and output for every workflow and every step at any point in the execution of the workflow.
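The verification idea above can be sketched as replaying journaled inputs through the current step implementation and comparing against the journaled outputs. This assumes steps are deterministic functions of their input; the `JournalEntry` shape, `findRegressions`, and the `addTax` step are hypothetical and exist only for this example.

```fsharp
// Hypothetical journal entry: what a step received and what it produced.
type JournalEntry = { Step : string; Input : int; RecordedOutput : int }

// Replays each recorded input through the current implementation and returns
// the entries whose output no longer matches what the journal recorded.
let findRegressions (impl : string -> (int -> int)) (journal : JournalEntry list) : JournalEntry list =
    journal |> List.filter (fun e -> impl e.Step e.Input <> e.RecordedOutput)

// Current step implementations, looked up by step name (illustrative only).
let currentImpl (stepName : string) : int -> int =
    match stepName with
    | "addTax" -> fun subtotal -> subtotal + 8
    | _ -> id

let journal =
    [ { Step = "addTax"; Input = 100; RecordedOutput = 108 }
      { Step = "addTax"; Input = 50; RecordedOutput = 60 } ]

let regressions = findRegressions currentImpl journal
// regressions contains only the second entry (50 + 8 = 58, not the recorded 60)
```

Because the journal already holds real inputs and outputs, every past execution can serve as a regression test without anyone writing one by hand.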
Below are some of the stats from the production instance we have been running for a little over a year now. Most of these workflows are used in the standard order processing flows:
The following features are things that have been discussed as logical extensions of the workflow engine.
- Support for Lambdas (or similar) serverless functions. These functions would act as step implementations for the workflow, which would be leveraged by the Orchestrator.
- Support for .NET Core and Linux containers
We are aware that there are alternatives for workflow orchestration and design.
We chose to build the OMS 2.0 platform, rather than trying to adapt something more out-of-the-box, for a few reasons:
- Ability to maintain a separate data store to hold workflow events to recover from failures
- State tracking and management, ability to replay or visualize the state at any point in the execution
- UI for visualization of flows and in-process workflows
- Integration with our existing infrastructure, and integration with our existing technology stack (Microsoft Azure + F#)
- Extensibility in adding new features or functionality depending on business needs
- Scalability, the ability to scale workflow execution across multiple VMs, in several regions.
Some of the alternatives had some of this functionality but lacked support for things like extensibility and state tracking/management, which ultimately led us to decide that we needed to build our own system.
The migration from a distributed microservice-based architecture to a workflow-based one has had dramatic effects on our development, support, and design overhead. The ability to design and rationalize complex business flows as a DSL and then implement single-responsibility steps has had profound effects on our ability to innovate and build complex systems. The other benefits that the workflow engine provides, like tooling, troubleshooting, and scalability, are topics of future posts.
If you like the challenges of building distributed systems and are interested in solving complex problems, check out our job openings.