Lida Li & Saurabh Joshi | Pinterest engineers, Cloud Management Platform
Pinterest runs its infrastructure on AWS. Every day, thousands of EC2 instances are launched, stopped and terminated because of auto scaling, new service launches and cluster rotation. To serve as a single source of truth for both hardware and software configuration at the host level, we built Soundwave, a configuration management database (CMDB). Soundwave plays a critical part in supporting resource management, service automation, capacity planning, security, finance auditing, and ad hoc queries. We’ve open sourced the core components of our CMDB as Soundwave to help others automatically track current and historic EC2 instances along with their metadata. Here we’ll cover how a CMDB is helpful and the architecture of Soundwave.
Configuration management database
A configuration management database is typically queried to answer questions like, “How many instances of a specific type are running in our infrastructure?”, “What is the IAM role of a given instance?” and “What is the version of OpenSSH used on a given instance?”
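As a rough illustration of these questions, the sketch below answers them against a small in-memory list of instance records. The field names (`instance_id`, `instance_type`, `state`, `iam_role`, `packages`) and the sample data are hypothetical and are not Soundwave's actual schema.

```python
from collections import Counter

# Hypothetical instance records; field names and values are illustrative only.
INSTANCES = [
    {"instance_id": "i-0001", "instance_type": "c5.xlarge", "state": "running",
     "iam_role": "web-role", "packages": {"openssh": "7.4"}},
    {"instance_id": "i-0002", "instance_type": "c5.xlarge", "state": "running",
     "iam_role": "batch-role", "packages": {"openssh": "7.6"}},
    {"instance_id": "i-0003", "instance_type": "m5.large", "state": "terminated",
     "iam_role": "web-role", "packages": {"openssh": "7.4"}},
]


def running_count_by_type(instances):
    """How many instances of each type are currently running?"""
    return Counter(
        inst["instance_type"] for inst in instances if inst["state"] == "running"
    )


def find_instance(instances, instance_id):
    """Return the record for one instance ID, or None if it is unknown."""
    for inst in instances:
        if inst["instance_id"] == instance_id:
            return inst
    return None


counts = running_count_by_type(INSTANCES)
role = find_instance(INSTANCES, "i-0002")["iam_role"]
# Terminated instances stay queryable because their records are kept.
ssh_version = find_instance(INSTANCES, "i-0003")["packages"]["openssh"]
print(counts["c5.xlarge"], role, ssh_version)
```

In a real deployment these lookups would be search queries against the data store, not Python loops; the point here is only the shape of the questions.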
We built Soundwave as an internal data store to work alongside the AWS console and EC2 API, and to help with:
1. Direct querying of machine information from automation systems
2. Persisting metadata and instance information for terminated instances, so we can track their usage history and diagnose issues on them
3. Extending the EC2 schema and supporting queries beyond simple filtering
Additional benefits include:
1. Tracking EC2 instances in a dedicated Elasticsearch store gives us better query performance than the EC2 API. Getting information about all running instances takes roughly 5 seconds in Soundwave, but more than a minute from the EC2 API.
2. It creates a cloud-agnostic abstraction layer for configuration management, which makes it easier to support hybrid cloud scenarios.
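To make the abstraction-layer idea concrete, here is a minimal sketch of a provider-neutral host record and a mapper from an EC2-shaped instance dict. `HostRecord`, its fields and `from_ec2` are hypothetical names chosen for illustration; only the input keys (`InstanceId`, `InstanceType`, `State.Name`, `Placement.AvailabilityZone`, `IamInstanceProfile.Arn`) follow the shape of an EC2 `DescribeInstances` response.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostRecord:
    """Provider-neutral view of a host (hypothetical schema)."""
    provider: str
    host_id: str
    machine_type: str
    state: str
    zone: Optional[str]
    instance_profile_arn: Optional[str]


def from_ec2(instance: dict) -> HostRecord:
    """Map one instance dict, shaped like a DescribeInstances entry."""
    placement = instance.get("Placement") or {}
    profile = instance.get("IamInstanceProfile") or {}
    return HostRecord(
        provider="aws",
        host_id=instance["InstanceId"],
        machine_type=instance["InstanceType"],
        state=instance["State"]["Name"],
        zone=placement.get("AvailabilityZone"),
        instance_profile_arn=profile.get("Arn"),
    )


sample = {
    "InstanceId": "i-0001",
    "InstanceType": "c5.xlarge",
    "State": {"Name": "running"},
    "Placement": {"AvailabilityZone": "us-east-1a"},
    # IamInstanceProfile may be absent, e.g. shortly after launch.
}
record = from_ec2(sample)
print(record)
```

Consumers that read only `HostRecord` would not need to change if a second mapper, for another provider, produced the same type.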
Soundwave: under the hood
Soundwave has three major components, shown in the system architecture diagram below:
- A Java-based worker system that synchronizes instance data with EC2 and pushes the latest data into the Elasticsearch store. Data ingestion is done in two parts:
  - An AWS Lambda function listens to CloudWatch events triggered by instance state changes and pushes the notifications to an Amazon SQS queue. A fleet of Java workers subscribed to this SQS queue fetches the notification data and writes it to the Elasticsearch store.
  - Background reconciliation jobs run periodically to ensure the data in the Elasticsearch store is in sync with the EC2 data. This helps capture instance changes that don’t trigger a CloudWatch event.
- A RESTful API layer to provide access to instance data via search.
- A UI dashboard for end users to perform ad hoc search queries using Lucene syntax.
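The two-part ingestion path described above can be sketched with in-memory stand-ins. This is not Soundwave's code (the real workers are Java): `queue.Queue` stands in for SQS, a plain dict stands in for the Elasticsearch store, and all function names are hypothetical. The event's `detail` keys (`instance-id`, `state`) follow the shape of an EC2 instance state-change notification. Marking instances that disappear from EC2 as terminated during reconciliation is an assumption made for the sketch.

```python
import json
import queue

sqs_stand_in = queue.Queue()  # stands in for the Amazon SQS queue
store = {}                    # stands in for Elasticsearch, keyed by instance ID


def lambda_handler(event, context=None):
    """Event path, step 1: forward a state-change event to the queue."""
    sqs_stand_in.put(json.dumps(event["detail"]))


def worker_drain(describe):
    """Event path, step 2: consume notifications and upsert records.

    `describe` is a callable returning instance attributes for an ID,
    playing the role of a DescribeInstances lookup.
    """
    while not sqs_stand_in.empty():
        detail = json.loads(sqs_stand_in.get())
        instance_id = detail["instance-id"]
        record = dict(describe(instance_id) or {})
        record["state"] = detail["state"]
        store[instance_id] = {**store.get(instance_id, {}), **record}


def reconcile(list_live_instances):
    """Periodic path: bring the store back in sync with the source of truth."""
    live = list_live_instances()
    for instance_id, record in live.items():
        store[instance_id] = {**store.get(instance_id, {}), **record}
    for instance_id, record in store.items():
        if instance_id not in live and record.get("state") != "terminated":
            record["state"] = "terminated"  # keep the record as history


# Usage: one change arrives as an event, two are only caught by reconciliation.
ec2 = {"i-0001": {"instance_type": "c5.xlarge", "state": "running"}}
lambda_handler({"detail": {"instance-id": "i-0001", "state": "running"}})
worker_drain(lambda instance_id: ec2.get(instance_id))

ec2["i-0002"] = {"instance_type": "m5.large", "state": "running"}  # missed event
del ec2["i-0001"]                                                  # missed event
reconcile(lambda: ec2)
print(store)
```

The event path keeps the store fresh within seconds, while the periodic pass bounds how long a missed event can leave the store stale.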
Eventual consistency of EC2 API
The EC2 API has an eventual consistency model. When a new instance is launched, the Java workers receive a notification about it, and a worker makes a DescribeInstances API call to get information about the newly launched instance. It’s highly likely that many of the instance attributes aren’t populated at that moment, but they will be available within a matter of minutes. Soundwave handles this gracefully in two ways:
- The Java workers check for the critical attributes expected to be returned from EC2. If those attributes aren’t available yet, the worker puts the instance notification back on the SQS queue with an exponential backoff time controlled by the visibility timeout of the message. The message is fully processed once the required attributes are available from the EC2 API.
- For other, non-critical attributes, the worker runs background reconciliation jobs that fill in the missing attributes once they’re available from EC2.
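The retry decision for critical attributes can be sketched as below. The attribute names, the base delay and the function names are assumptions for illustration, not Soundwave's actual values. The 12-hour cap reflects the maximum visibility timeout SQS allows; with real SQS, the computed delay would typically be applied through `ChangeMessageVisibility`, with the attempt number taken from the message's `ApproximateReceiveCount`.

```python
# Hypothetical set of attributes a record must have before it is indexed.
CRITICAL_ATTRIBUTES = ("instance_type", "private_ip", "vpc_id")
BASE_DELAY_SECONDS = 15                  # assumed base delay
MAX_VISIBILITY_SECONDS = 12 * 60 * 60    # SQS caps the visibility timeout at 12 hours


def missing_critical(record):
    """Return the critical attributes that are absent or empty."""
    return [name for name in CRITICAL_ATTRIBUTES if not record.get(name)]


def backoff_seconds(receive_count):
    """Exponential backoff: base * 2^(n-1) for the n-th receive, capped."""
    attempt = max(receive_count, 1)
    return min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), MAX_VISIBILITY_SECONDS)


def handle(record, receive_count):
    """Decide whether to index the record now or retry the message later."""
    if missing_critical(record):
        # Leave the message on the queue, hidden for the backoff period.
        return ("retry", backoff_seconds(receive_count))
    return ("index", 0)


partial = {"instance_type": "c5.xlarge"}
complete = {"instance_type": "c5.xlarge", "private_ip": "10.0.0.5", "vpc_id": "vpc-1"}
print(handle(partial, 1), handle(partial, 3), handle(complete, 3))
```

Non-critical attributes are deliberately left out of this check, since the background reconciliation jobs fill them in later.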
We open sourced Soundwave because we believe it’ll be useful to others who run their services on AWS EC2. We provide a Terraform file to configure AWS and a Docker Compose setup for quickly running the entire Soundwave stack in containers. To get started, check out the GitHub repository for detailed steps to set it up and try it out.
If these are the kinds of projects that excite you, join us.
Acknowledgements: Ming Hao, Suman Karumuri and Jayme Cox for original design discussion and infrastructure support, CMP and SRE teams for usage feedback and bug reporting.