By Rob Zuber, CTO at CircleCI.
Editor's note: This is a follow-up to the post How CircleCI Processes 4.5 Million Builds Per Month.
CircleCI is a continuous integration and delivery platform that enables you to automate your development process quickly, safely, and at scale. Engineers around the world at companies of all sizes trust us to run their tests and deploy their software. We’ve earned that trust by using a solid stack of software allowing our users and their teams to continuously deliver value to their users.
As CTO at CircleCI, I help make the big technical decisions and keep our teams happy and out of trouble. Before this, I was CTO of Copious, where I learned a lot of important lessons about tech in service of building a consumer marketplace. I like snowboarding, Funkadelic, and cappuccino.
In the last year, we have seen tremendous growth on our engineering teams. This growth forced us to rethink our engineering growth paths as well as what we needed in engineering managers. One key result was an update to our engineering competency matrix, which we published. This has helped us in hiring for our open positions and creating better career paths for our current engineers. It also increases transparency in expectations. Reasonable and clear expectations, team alignment, and transparency are key values of our engineering teams, as they consist of engineers working remotely at locations distributed around the world.
We use Pingboard to find out who is “in” or “out” of the office, Zoom for video conferencing and screen sharing with those who are “in”, Slack for synchronous and asynchronous communication, and JIRA to organize efforts across the teams. Additionally, Slack-based integrations such as Hubot, PagerDuty, Looker, Amplitude, Datadog, and Rollbar bring information and data into our primary communication tool. Giphy’s Slack integration has also proven invaluable.
Most of CircleCI is written in Clojure, and it has been this way almost since the beginning. Early development included Rails, but by the time CircleCI was released to the public, it was written entirely in Clojure. Clojure is still at our platform’s core. Having a common language across much of our stack lets our engineers move between layers without much overhead.
Being fans of Clojure is not reason enough to build out the entire stack in that language. When we launched our 2.0 platform, the build agent was written in Go because it allows us to inject a multi-platform static binary into an environment where we can’t rely on lists of dependencies. Go is also used for CLI tools. Here, fast start-up and static dependency compilation outweigh our affinity for Clojure. Clojure remains our language of choice, but as we continue to pull microservices from our monolith (over a dozen at this point), we are committed to using the right set of tools for the job and we evaluate that decision for each new service.
We are in the process of adopting Next.js as our React framework and using Storybook to help build our React components in isolation. This new part of our frontend is written in TypeScript, and we use Emotion for CSS/styling. For delivering data, we use GraphQL and Apollo. Jest, Percy, and Cypress are used for testing.
Two Pools of Machines
Our backend consists of two major pools of machines. One pool hosts the systems that run our site, manage jobs, and send notifications. These services are deployed within Docker containers orchestrated in Kubernetes. Given its ecosystem and toolchain, Kubernetes was an obvious choice for our fairly statically defined processes: in our internal stack, the types of jobs and the number of each that we need change relatively slowly.
The other pool of machines is for running our users’ jobs. Because we cannot predict demand, the types of jobs our users need to run, or the resources each of those jobs requires, we found that Nomad excelled over Kubernetes in this area. Our users’ jobs are changing constantly. The fast, flexible, built-in scheduler that comes with Nomad distributes our users’ jobs across our second pool of machines, reserved specifically for scheduling purposes.
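To make the scheduling problem concrete, here is a minimal sketch of placing jobs with unpredictable resource needs onto a pool of machines. It is a simple best-fit heuristic written for illustration, not Nomad's actual algorithm, and the `Node`, `Job`, and `place` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A machine in the job pool (hypothetical shape)."""
    name: str
    cpu_free: int  # free CPU, in MHz
    mem_free: int  # free memory, in MB
    jobs: list = field(default_factory=list)


@dataclass
class Job:
    """A user's job and the resources it asks for (hypothetical shape)."""
    job_id: str
    cpu: int
    mem: int


def place(job, nodes):
    """Place a job on the fitting node with the least leftover capacity.

    Returns the chosen node, or None when no node can hold the job.
    """
    candidates = [
        n for n in nodes if n.cpu_free >= job.cpu and n.mem_free >= job.mem
    ]
    if not candidates:
        return None
    # Best fit: prefer the node that would be left with the least spare room.
    best = min(
        candidates,
        key=lambda n: (n.cpu_free - job.cpu) + (n.mem_free - job.mem),
    )
    best.cpu_free -= job.cpu
    best.mem_free -= job.mem
    best.jobs.append(job.job_id)
    return best
```

Packing small jobs tightly keeps larger machines free for the big jobs that may arrive next, which is one reasonable answer to demand that cannot be predicted.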
We did evaluate both Kubernetes and Nomad to do All These Things, but neither tool was optimized for such an all-inclusive job. Additionally, we treat Nomad’s scheduling role as more a piece of our software stack than as a part of the management or ops layer. So we use Kubernetes to manage the Nomad servers.
We’re also using Helm to make it easier to deploy new services into Kubernetes. We create a chart (i.e. package) for each service. This lets us easily roll back new software and gives us an audit trail of what was installed or upgraded.
Previously, we had run all of our infrastructure on AWS. At first, it was simple because our architecture was simple. As our architecture grew in complexity, our AWS infrastructure grew to include a complex stack of VPCs, Security Groups, and everything else AWS offers to help partition and restrict resources. We are also running across multiple regions. We have adopted Terraform to help us manage this complexity in a scaling team.
Once we launched CircleCI Enterprise (our on-prem offering), we began to support different deployment models. During this time, we also began to package our code in Docker containers. This allowed us to start using cloud-agnostic Kubernetes to manage resources and distributions and it reduced our cloud vendor lock-in.
We now push a part of our workload to GCP. If you use our machine executor to run a job, it will run in GCP. This executor type allocates a full VM for tasks that need it. GCP is well-suited for running small, short-lived VMs. We’ve also wrapped GCP in a VM service that preallocates machines, then tears everything down once you’re finished. Using an entire VM means you have full control over a much faster machine. We’re pretty happy with this architecture since it smooths out future forays into other platforms: we can just drop in the Go build agent and be on our merry way.
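The preallocate-then-tear-down idea can be sketched as a small warm pool. This is an illustration under assumptions, not the VM service itself: `WarmPool`, `boot_vm`, and `destroy_vm` are hypothetical names, and the two callables stand in for whatever cloud API calls create and delete machines.

```python
from collections import deque


class WarmPool:
    """Keeps a target number of pre-booted VMs ready to hand out."""

    def __init__(self, target, boot_vm, destroy_vm):
        self.target = target
        self._boot_vm = boot_vm        # callable returning a new VM handle
        self._destroy_vm = destroy_vm  # callable that tears a VM down
        self._ready = deque()
        self.refill()

    def refill(self):
        """Boot VMs until the ready queue is back at the target size."""
        while len(self._ready) < self.target:
            self._ready.append(self._boot_vm())

    def acquire(self):
        """Hand out a ready VM, booting one on demand if the pool is empty."""
        vm = self._ready.popleft() if self._ready else self._boot_vm()
        self.refill()
        return vm

    def release(self, vm):
        """VMs are never reused: tear the machine down when the job ends."""
        self._destroy_vm(vm)
```

Because the boot cost is paid before a job asks for a machine, the job starts on an already running VM, and destroying the VM afterwards means no state leaks between jobs.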
Communication with the Frontend
To get the frontend to communicate with the backend, we use a dedicated tier of API hosts. We manage these API hosts with Kubernetes as well, but in a separate cluster to increase isolation. A number of our APIs are public, meaning that we use the same interfaces that are available to our users. By dogfooding our APIs, we’ve been able to keep them clean and spot and fix errors before our users discover them.
When you interact with our web application, all of your requests are hitting the API hosts. We handle the majority of our authentication via OAuth from GitHub or Bitbucket. We provide programmatic access to everything exposed in the UI through an API token that you can generate once you have authenticated.
Data! Data! Data!
We use MongoDB as our primary datastore. Mongo's approach to replica sets enables some fantastic patterns for operations like maintenance, backups, and ETL. We’re happy to see progress being made in WiredTiger, and our operations have greatly improved, but we’re still suffering from a legacy of our early mistakes in schema enforcement on a dataset that is too large to clean efficiently.
As we pull microservices from our monolith, we are taking the opportunity to build them with their own datastores using PostgreSQL. We also use Redis to cache data we’d never store permanently, and to rate-limit our requests to partners’ APIs (like GitHub).
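As an illustration of rate-limiting calls to a partner API, here is a minimal fixed-window counter. It is a sketch only: an in-memory dict stands in for Redis, and the `FixedWindowLimiter` class, its parameters, and the limits used are assumptions rather than our actual configuration.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self._clock = clock
        # (key, window index) -> calls seen; stands in for Redis counters.
        self._counts = {}

    def allow(self, key):
        """Record one call for `key` and report whether it is within limits."""
        bucket = (key, int(self._clock() // self.window))
        # Drop counters from earlier windows, as a key expiry would.
        self._counts = {
            b: c for b, c in self._counts.items() if b[1] >= bucket[1]
        }
        count = self._counts.get(bucket, 0) + 1
        self._counts[bucket] = count
        return count <= self.limit
```

A caller would check `limiter.allow("github")` before each outbound request and queue or delay the request when it returns `False`.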
When we’re dealing with large blobs of immutable data (logs, artifacts, and test results), we store them in Amazon S3. We handle any side-effects of S3’s eventual consistency model within our own code. This ensures that we deal with user requests correctly while writes are in progress.
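One common way to cope with a store where a fresh write may not be readable yet is to retry the read with a growing delay. The sketch below shows that pattern in isolation; it is not a description of our actual code, and `read_with_retry` and the `fetch` callable are hypothetical.

```python
import time


def read_with_retry(fetch, key, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch(key) until it returns data, backing off between attempts.

    `fetch` is a hypothetical callable that returns bytes, or None when the
    object is not visible yet. Returns None if every attempt comes up empty.
    """
    for attempt in range(attempts):
        data = fetch(key)
        if data is not None:
            return data
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then double it each time.
            sleep(base_delay * 2 ** attempt)
    return None
```

Since the blobs are immutable, a successful read can never return stale content, so retrying until the object appears is enough.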
A Build is Born
When we process a webhook from GitHub/Bitbucket telling us that a user pushed some new code, we use the information to create a new pipeline representation with associated workflows and jobs in our datastores. We then pass the definition of the work to be performed to Nomad, which is responsible for allocating hardware to run the jobs.
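The pipeline, workflow, and job relationship can be sketched as nested records. The shapes below are simplified assumptions for illustration: the field names, the `payload` and `config` formats, and `pipeline_from_push` are hypothetical, not our real schema.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    status: str = "queued"


@dataclass
class Workflow:
    name: str
    jobs: list = field(default_factory=list)


@dataclass
class Pipeline:
    repo: str
    revision: str
    workflows: list = field(default_factory=list)


def pipeline_from_push(payload, config):
    """Build a Pipeline from a push payload and a parsed project config.

    Both arguments use hypothetical, simplified shapes:
      payload = {"repo": "...", "revision": "..."}
      config  = {"workflow name": ["job name", ...]}
    """
    workflows = [
        Workflow(name, [Job(job_name) for job_name in job_names])
        for name, job_names in config.items()
    ]
    return Pipeline(payload["repo"], payload["revision"], workflows)
```

In this sketch every job starts as `queued`; handing the job definitions to the scheduler would be the next step.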
Running the Build
The gritty details of processing a build are executed by the creatively named build agent. It parses configuration, executes commands, and synthesizes actions that create artifacts and test results. Most builds run in a Docker container, or set of containers, which is defined by the user for a completely tailored build environment.
The build agent streams the results of its work over gRPC to the output processor, a secure façade that understands how to write to all our internal systems. In order to get this live streaming data to your browser, we use WebSockets managed by Pusher. We also use this channel to deliver state change notifications to the browser, e.g. when a build completes. Finally, we rely on Redis's amazing performance to stash bits of output as we collate it for permanent S3 storage.
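The stash-then-collate step can be sketched as a small buffer that gathers chunks and writes a single blob once a step finishes. This is an illustration under assumptions: a dict of lists stands in for Redis, a plain dict stands in for S3, and `OutputCollator` and its methods are hypothetical names.

```python
from collections import defaultdict


class OutputCollator:
    """Buffers streamed output chunks, then writes one blob per step."""

    def __init__(self, store):
        self._store = store               # dict-like; stands in for S3
        self._chunks = defaultdict(list)  # stands in for Redis

    def append(self, step_id, seq, text):
        """Stash one chunk; chunks may arrive out of order."""
        self._chunks[step_id].append((seq, text))

    def finish(self, step_id):
        """Join the step's chunks in sequence order and store the result."""
        chunks = sorted(self._chunks.pop(step_id, []))
        blob = "".join(text for _, text in chunks)
        self._store[step_id] = blob
        return blob
```

Writing one blob per step, rather than one object per chunk, keeps the permanent store to a few large immutable objects while the fast buffer absorbs the many small writes.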
A Hubot Postscript
We have added very little to the CoffeeScript Hubot application – just enough to allow it to talk to our Hubot workers. The Hubot workers implement our operational management functionality and expose it to Hubot so we can get chat integration for free. We’ve also tailored the authentication and authorization code of Hubot to meet the needs of roles within our team.
For larger tasks, we’ve got an internal CLI written in Go that talks to the same API as Hubot, giving access to the same functionality we have in Slack, with the addition of scripting, piping, and all of our favorite Unix tools. When the Hubot worker recognizes the CLI is in use, it logs the commands to Slack to maintain visibility of operational changes.
Analytics & Monitoring
Our primary source of monitoring and alerting is Datadog. We’ve got prebuilt dashboards for every scenario and integration with PagerDuty to route any alerts. We’ve definitely scaled past the point where managing dashboards is easy, but we haven’t had time to invest in using features like Anomaly Detection. We’ve started using Honeycomb for some targeted debugging of complex production issues and we are liking what we’ve seen. We capture any unhandled exceptions with Rollbar and, if we realize one will keep happening, we quickly convert the metrics to point back to Datadog, to keep Rollbar as clean as possible.
We use Segment to consolidate all of our trackers, the most important of which goes to Amplitude to analyze user patterns. However, if we need a more consolidated view, we push all of our data to our own data warehouse running PostgreSQL; this is available for analytics and dashboard creation through Looker.
At CircleCI, we get to practice what we preach. Instead of long dry spells between releases, we push several changes per day to keep our feedback loops short and our codebase clean. We’re small enough that we can move quickly but large enough that our teams have the resources that they need.
This is our stack today. As our users demand solutions for more complex problems, we’ll adopt new tools and languages to deal with emerging tech. We are excited about the future, but while we wait for that future to unfold, there is no reason you should be waiting for good code. Start building on CircleCI today and ship your code faster. We’re also looking for people who are interested in collaboration and learning and who want to join us in shaping the future of software engineering, supporting our internal teams as well as the thousands of organizations using our product. Come work with us and help us ship our own code faster!