Suppose we have a web app, named MyApp. It could be on Django, Spring, or Ruby on Rails, but it started out as a single, small application. All the code is in the same repository, everything is deployed in a single artifact, and all its tables are in the same database. As the app grows and attracts more users, it gets more data. It also gets more developers, more tables in the database, and gets hosted on more machines. The codebase starts to snowball. As we get more successful, we try to scale.
How we scale depends on which issues become the most painful first. For example, suppose we have hundreds of tables in the same database, and any table can be joined against any other table. Performance problems mount, and it becomes confusing to troubleshoot or do any kind of query optimization. A natural solution would involve namespacing and isolation. We can separate tables into different databases, and slap a programming interface around them. Let’s call this group MyAggregate. We might host the database for MyAggregate on a separate machine to further isolate resources. This allows us to provision dedicated CPU, memory, and storage for MyAggregate.
Suppose we try to stretch our dollar further and notice that, while the work done around MyAggregate is slow and memory-intensive, it can be done asynchronously. An ingenious engineer suggests that we can have the main application publish tasks to a queue to offload the work. We would then spawn a process from the same artifact to consume these tasks. We might also move the consumer processes to separate specialized hosts.
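The queue-based offload above can be sketched as follows. This is a minimal in-process stand-in for a real message broker (the task name and payload fields are hypothetical); in production, the producer and consumer would run as separate processes built from the same artifact:

```python
import json
import queue
import threading

# Stand-in for a real broker (e.g., RabbitMQ or SQS).
task_queue = queue.Queue()

def publish_aggregate_task(payload):
    # The web app enqueues the slow, memory-intensive work instead of doing it inline.
    task_queue.put(json.dumps({"task": "recompute_aggregate", "payload": payload}))

def consume_forever(results):
    # The consumer process pulls tasks and does the heavy lifting asynchronously.
    while True:
        message = task_queue.get()
        if message is None:  # sentinel used here to stop the worker
            break
        task = json.loads(message)
        results.append(task["payload"]["id"])  # placeholder for the real work
        task_queue.task_done()

results = []
worker = threading.Thread(target=consume_forever, args=(results,))
worker.start()
publish_aggregate_task({"id": 42})
task_queue.put(None)
worker.join()
print(results)  # [42]
```

The key property is that the producer only depends on the message format, not on when or where the work is executed, which is what lets the consumer later move to its own hosts.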
A few days later, the same engineer realizes that, if they are only modifying code that affects MyAggregate but not the rest of the app, they could deploy just the consumer. Shortly after, the engineers working on MyAggregate create their own CI/CD process. In exchange, they promise to maintain a stable programming interface. Tests are written against this interface as a guarantee.
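Such a guarantee can start as a small contract test pinned to the promised interface. The class, method, and fields below are hypothetical stand-ins for the real MyAggregate API:

```python
import inspect

class MyAggregateAPI:
    """The stable programming interface that walls off MyAggregate."""

    def get_summary(self, entity_id: int) -> dict:
        # Implementation details stay hidden behind this interface.
        return {"id": entity_id, "total": 0}

def test_interface_is_stable():
    # Downstream teams rely on this exact signature; failing this test
    # signals a breaking change that requires coordination.
    sig = inspect.signature(MyAggregateAPI.get_summary)
    assert list(sig.parameters) == ["self", "entity_id"]
    result = MyAggregateAPI().get_summary(7)
    assert {"id", "total"} <= result.keys()

test_interface_is_stable()
```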
At this point, we have a single artifact, multiple databases, multiple disparate hosts, multiple deployment processes, and an interface that walls off MyAggregate from the rest of MyApp. We could even consider creating separate artifacts. Perhaps someone on the team really cares about the size of the artifact. Sometimes they really just need to start the consumer process for a quick test, and downloading the entire artifact can take too long.
Is this still a monolith? Is it a microservice architecture? Is it a modular monolith?
There are occasional claims that microservices improve scalability and performance. While these benefits are worth evaluating in a retrospective, they are usually not the main impetus for breaking up a monolith. They are a nice bonus: while refactoring and breaking up a monolith, we can often unlock performance gains by deleting technical debt, rewriting old software, rewiring network layouts and queues, splitting up a monolithic data store, or reallocating resources for improved isolation.
In most cases, breaking up a monolith using plain RPC worsens performance. It adds a node in the critical path that can flake or block on I/O. Even with HTTP/2 and gRPC, serialization and validation can add enough overhead to warrant custom APIs that handle batched requests.
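As a rough, hypothetical illustration of the overhead argument: if each request carries a fixed cost of serialization and validation, a batched endpoint amortizes that cost over many items (the numbers below are made up):

```python
PER_REQUEST_OVERHEAD_MS = 5  # made-up fixed cost: serialization, validation, routing

def handle_one(item):
    # Stand-in for the real per-item work.
    return item * 2

def handle_batch(items):
    # One round of request overhead amortized over the whole batch.
    return [handle_one(item) for item in items]

def overhead_unbatched(n_items):
    # n separate RPCs: pay the fixed cost once per item.
    return n_items * PER_REQUEST_OVERHEAD_MS

def overhead_batched(n_items):
    # One batched RPC: pay the fixed cost once.
    return PER_REQUEST_OVERHEAD_MS

print(overhead_unbatched(100), overhead_batched(100))  # 500 5
```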
The real scalability is with respect to the number of employees, so that specialized teams in the engineering and product organizations can work on their areas of expertise. This reduces cognitive load and communication overhead because each specialized team can depend on stable upstream APIs and focus on providing a stable API for downstream consumers. In other words, when we think about the kind of scalability that motivates a monolith decomposition, we are thinking about the kind that can accelerate the pace of change while rapidly hiring and growing an organization.
Will this scale?
—10 Tricks to Appear Smart During Meetings, The Cooper Review
So, if you have 10 engineers and 100 microservices, you are doing it wrong.
The real test of “how micro” and “how many” is this: can specialized teams independently build and deploy software that is within their scope of responsibility?
Until this condition is reached, one should continue decomposing the monolith, and stop when it is just right. Then, as the organization grows and continues specializing even further, the process starts over.
An important observation is that, if an organization does not have specialized teams, it is a good idea to use a monolith. If most contributors have workflows that cut across multiple functions of a single product, forcing microservices too early will slow down the development process. Migrating from one to the other involves analyzing how much context one would need to work on a component: the required context needs to fit inside one’s head.
Using this test, an organization can avoid ending up in a situation where a developer has to work across multiple microservices, concerning themselves with differences in versions of third-party libraries, and dealing with the separate deployment of each microservice for every change. Likewise, it also avoids the common anti-pattern where one goes from a single monolith to a large number of anemic CRUD services overnight.
For example, consider an organization that has about 60 engineers and no middle managers. It goes through a significant reorganization and creates a specialized team of 5 engineers that is only responsible for maintaining a shared messaging component that can be used across all products. This team probably gets a dedicated manager, and while there are company-wide KPIs, this team is primarily evaluated (and incentivized) based on the reliability and performance of the messaging component.
These metrics can include:
- the latency of each send request,
- loss rate or delivery rate, and
- whether it can send millions of emails, notifications, and text messages for a marketing campaign in under 2 hours.
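A minimal sketch of how the first two metrics might be computed from raw delivery records (the data and field layout are made up):

```python
import statistics

# Synthetic delivery records: (latency in milliseconds, delivered successfully?)
records = [(120, True), (95, True), (300, False), (110, True), (85, True)]

latencies = [latency for latency, _ in records]
delivered = sum(1 for _, ok in records if ok)

delivery_rate = delivered / len(records)
loss_rate = 1 - delivery_rate
median_latency = statistics.median(latencies)

print(f"delivery rate {delivery_rate:.0%}, loss rate {loss_rate:.0%}, "
      f"median latency {median_latency} ms")
```

In practice these would come from structured logs or a metrics pipeline rather than an in-memory list, but the definitions stay the same.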
The time is now ripe for this team to design a messaging API and break this component out of the monolith so that they can develop, deploy, monitor, and support it independently. This freshly minted team could even negotiate different SLAs for their online APIs and their offline batch APIs. It is important to note that, as part of their daily workflows, they do not have to figure out how to run other services (upstream or downstream) on their laptops. The CI/CD process might not even start other services, though a careful product owner would test significant changes in a staging environment before deploying to production.
If this team is also responsible for maintaining the communication preferences of their users, they can also choose to combine that function into the same messaging (micro)service and extend its API, instead of creating a new one.
Is it micro? I don’t know, but they sure are more productive than before, and are having drinks at Happy Hour on Fridays instead of begging the unfortunate on-call for permission to deploy the monolith after normal hours.
It turns out that the most important aspects of a successful transition to microservices are the delivery infrastructure and DevX. Since our goal is to accelerate the pace of change, these are much more important than choosing between aRPC/gRPC/mRPC/0rpc, or which microservice framework to use. Moreover, each organization has different habits and expectations, and for this reason, you cannot go from a monolith to hundreds of microservices overnight. It takes dedicated effort to iron out the kinks for the first few decomposed services and figure out an acceptable testing strategy. This involves the development of tooling to
- validate stability and compatibility of APIs,
- help developers start and stop services locally,
- help developers find metrics and logs in all environments,
- send notifications and alerts to the right people with context when the health of a service is threatened, and
- enable operators and developers to troubleshoot and fix problems with each service.
This is the real work that improves scalability.
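The first item on that list, validating API stability, can begin as something very simple: a check that no field is removed or retyped between published schema versions. The schemas below are hypothetical:

```python
def breaking_changes(old_schema, new_schema):
    """Return the fields removed or retyped between two versions of an API schema."""
    problems = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != old_type:
            problems.append(f"retyped field: {field}")
    return problems

old = {"id": "int", "email": "str", "status": "str"}
new = {"id": "int", "email": "str", "status": "int", "created_at": "str"}

print(breaking_changes(old, new))  # ['retyped field: status']
```

Note that adding a field (`created_at`) is not flagged: additions are usually backward-compatible, while removals and type changes break consumers.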
One of the most important decisions will be around the management of 3rd-party dependencies and internal libraries. There are many difficult questions to answer, including:
- Should all artifacts share the same versions of all 3rd-party libraries? Would internal libraries have to work with a range of 3rd-party libraries? How will this be tested?
- What is the process for upgrading the minor version of a 3rd-party library? A major version? Who is responsible for it?
- Should all artifacts use the HEAD of all internal libraries? Should internal libraries use semantic versioning instead?
- Who shall be responsible for maintaining internal libraries? Who shall upgrade the version used in different artifacts? How many major versions would have to be supported?
- How will version conflicts of internal libraries or 3rd-party libraries be resolved?
- What is the process for ensuring that we judiciously upgrade to the latest version of internal libraries or 3rd-party libraries?
- How will one specify the bill of materials for building each artifact?
- How can we control cyclic dependencies in internal libraries?
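One low-effort way to surface some of these problems early is to diff the pinned versions across artifacts; the manifests and version numbers below are hypothetical:

```python
from collections import defaultdict

# Hypothetical pinned 3rd-party versions for each artifact.
manifests = {
    "myapp": {"requests": "2.31.0", "sqlalchemy": "1.4.49"},
    "myaggregate-consumer": {"requests": "2.28.1", "sqlalchemy": "1.4.49"},
}

def version_drift(manifests):
    # Group pins by library, then keep libraries pinned to more than one version.
    pins_by_library = defaultdict(dict)
    for artifact, deps in manifests.items():
        for library, version in deps.items():
            pins_by_library[library][artifact] = version
    return {lib: pins for lib, pins in pins_by_library.items()
            if len(set(pins.values())) > 1}

drift = version_drift(manifests)
print(drift)  # only "requests" drifts between the two artifacts
```

A report like this does not answer the policy questions above, but it makes the current state visible so the organization can decide deliberately rather than by accident.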
There are many differing opinions, and there are many ways to set up an organization to make this work. However, it is important to recognize that this will be a problem if managed poorly, and will become a source of friction and contention. Allowing versions of 3rd-party libraries to differ between artifacts can make it easier for different teams to upgrade an important library on separate schedules, but this requires any affected internal library to support a range of major versions.
It could be an enticing option to “pin everything”, but without clear responsibilities and an upgrade process, projects can languish in a pool of outdated and potentially vulnerable dependencies for a long time.