This October, I gave a talk at HashiConf 2018 where I shared 5 key lessons we learned at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies. In this blog post, I’ll share with you the video and slides from the talk, as well as a condensed, written version of the 5 key lessons.
Although the industry is full of cutting-edge buzzwords—Kubernetes, microservices, service meshes, immutable infrastructure, big data, data lakes, etc—the reality is that when you’re knee deep in building infrastructure, it doesn’t feel cutting edge.
To me, DevOps feels more like this:
Building production-grade infrastructure is hard. And stressful. And time consuming. Very time consuming.
Here’s roughly how long you should expect your next infrastructure project to take, based on empirical data we’ve gathered while working with hundreds of different companies:
DevOps projects always take way longer than you expect. Always. Why is that?
Well, the first reason is Yak Shaving, as perfectly illustrated in this clip from Malcolm in the Middle:
The second reason is that building production-grade infrastructure (as in, the type of infrastructure you’d bet your company on) involves a thousand little details. The vast majority of developers don’t know what those details are, so when you’re estimating a project, you usually forget about number of critical—and time consuming—details.
To avoid this issue, every time you go to work on a new piece of infrastructure, go through the following checklist:
Not every single piece of infrastructure needs every single item on the list, but you should consciously and explicitly document which items you’ve implemented, which ones you’ve decided to skip, and why.
As of 2018, here are the primary tools we use at Gruntwork to build and manage infrastructure:
- Terraform: We use Terraform to provision all the basic infrastructure, including networking, load balancers, databases, users, permissions, and all of our servers.
- Packer: We use Packer to define and build the virtual machine images we run on top of our servers.
- Docker: Some of our servers form clusters where we run applications as Docker containers. The main Docker cluster tools we use are Kubernetes, ECS, and Fargate.
Now, all of these tools are useful, but that’s not the real lesson here. The real lesson is that tools, by themselves, are not enough. You also need to change your team’s behavior.
In particular, the best tools in the world will not help your team one bit if your team isn’t bought into using those tools or if your team doesn’t have enough time to learn use those tools. Therefore, the key takeaway is that using infrastructure as code is an investment: there’s an up-front cost to get going, but if you invest wisely, you’ll earn big dividends over the long-term.
Infrastructure as code newbies often define all of their infrastructure for all of their environments (dev, stage, prod, etc) in a single file or single set of files that are deployed as a unit. This is a Bad Idea.
Here are just a few of the downsides:
- Slow: If all your infrastructure is defined in one place, running any command will take a long time. We’ve seen companies where
terraform plantakes 5–6 minutes to run!
- Insecure: If all your infrastructure is managed together, then to change anything you need permissions to access everything. That means that almost every user has to be an admin, which is another Bad Idea.
- Risky: If all your eggs are in one basket, then a mistake anywhere could break everything. You might be making a minor change to a frontend app in dev, but due to a typo or running the wrong command, you delete the production DB.
- Hard to understand: The more code you have in one place, the harder it is for any one person to understand it all. But if it’s all bundled together, the parts you don’t understand could hurt you.
- Hard to test: Testing infrastructure code is hard; testing a large amount of infrastructure code is nearly impossible. We’ll come back to this point later.
- Hard to review: The output of commands such as
terraform planbecomes useless, as no one bothers to look through thousands of lines of plan output. Moreover, code reviews become useless:
In short, you should build your code out of small, standalone, reusable, composable modules. This is not a new or controversial insight. You’ve heard it a thousand times before, albeit in slightly different domains:
“Do one thing and do it well” —Unix Philosophy
“The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.”—Clean Code
If your infrastructure code does not have automated tests, it’s broken. You just don’t know it yet. That said, testing infrastructure code is hard. You don’t really have “localhost” (e.g., you can’t deploy an AWS VPC on your laptop) and you don’t really have “unit tests” (e.g., you can’t isolate your Terraform code from the “outside” world as all Terraform does is talk to the outside world).
Therefore, to properly test your infrastructure code, you typically have to deploy it to a real environment, run real infrastructure, validate that it does what it should, and then tear it all down (for this style of testing, see Terratest, an open source library that includes tools for testing Terraform, Packer, and Docker code, working with AWS, GCP, and Kubernetes APIs, executing shell commands locally and on remote servers over SSH, and much more). What this means is that, with infrastructure testing, you have to slightly redefine terms:
- Unit tests deploy and test one or a small number of closely related modules from one type of infrastructure (e.g., test the modules for a single database).
- Integration tests deploy and test multiple modules from different types of infrastructure to validate they work together correctly (e.g., test the modules of a database with the modules from a web service).
- End-to-end (e2e) tests deploy and test your entire architecture.
Note that the diagram is a pyramid, where we have lots of unit tests, a smaller number of integration tests, and a very small number of e2e tests. Why? Because of how long each type of test takes:
Cycle time with infrastructure tests is slow, especially as you go up the pyramid, so you’ll want to catch as many bugs as you can as low in the pyramid as you can. That means you should:
- Build small, simple, standalone modules (remember Lesson 3?) and write lots of unit tests for them to build your confidence that they work properly.
- Combine these small, simple, battle-tested building blocks to create more complicated infrastructure that you test with a smaller number of integration and e2e tests tests.
Let’s now put everything in this talk together. Here’s how you’ll be building and managing infrastructure from now on:
- Go through the Production-Grade Infrastructure Checklist to make sure you’re building the right thing.
- Define your infrastructure as code using tools such as Terraform, Packer, and Docker. Make sure your team has the time to master these tools (see DevOps Resources).
- Build your code out of small, standalone, composable modules (or use the off-the-shelf modules in the Infrastructure as Code Library).
- Write automated tests for your modules using Terratest.
- Submit a pull request to get your code reviewed.
- Release a new version of your code.
- Promote that new version of your code from environment to environment.