Kevin Chen | Software Engineer Intern, Visibility
Brian Overstreet | Software Engineer, Visibility
In this post, we'll share the algorithms and infrastructure that we developed to build a real-time, scalable anomaly detection system for Pinterest's key operational timeseries metrics. Read on to hear about the lessons we learned and our plans for the future.
Pinterest uses an in-house metrics and dashboard system called Statsboard that allows teams to gain a real-time understanding of the state of their services. Using Statsboard, engineers can create alerts using a language that wraps common time series operations into a simple interface. These alerts include:
- Static thresholds: Thresholds that alert a user when a system or service reaches a level (ex: 95% CPU usage) indicating some sort of failure.
- Week over week differences: Seasonal alerts that signify when data points at the current time are some proportion lower (ex: 10% less) than the week before.
However, many of Pinterest’s top line growth and traffic metrics (ex: site requests, user logins) exhibit dynamic patterns that make it difficult to set rule-based alerts. Without an anomaly detection system capable of building a model to handle these dynamic metrics, users are faced with three undesirable alert situations:
- Temporal invariance: Static threshold alerts that account for drops in a seasonal metric (e.g., web traffic) during the day will result in false positives during the night, when those metrics naturally taper off. To capture the behavior of metric-generating processes that vary with time, our alert has to be parametrized by time.
- Rule edge cases: Suppose we create a static threshold that only pages from 8AM to 8PM. If an incident occurs at 8:01PM, no page would occur. Edge cases in rule-based systems can lead to blind spots and additional layers of complexity.
- Interpreting anomalies as normal: If we have an incident that leads to a site-wide metrics decrease on Wednesday, we can get a storm of week-over-week increase alerts during the following Wednesday. We have to take care not to treat anomalous behavior as normal behavior, or we'll receive false alarms when we return to normal.
Landscape of anomaly detection
There is a great deal of literature on anomaly detection out in the world, from open-source packages like Twitter's AnomalyDetection or LinkedIn's Luminol to academic works like Rob Hyndman's papers on feature-based anomaly detection. We can classify the most popular of these time series anomaly detection techniques into four broad categories (which are neither mutually exclusive nor all-encompassing):
- State space models: exponential smoothing, Holt-Winters, ARIMA
- Decomposition: classical decomposition, STL
- Deep learning: recurrent neural networks
- Dimensionality reduction: RPCA, SOM, discords, piecewise linear
In practice, we observed the best performance with decomposition models. Many machine learning models perform poorly at anomaly detection because of their tendency to overfit the training data, yet we need to be able to distinguish and extract even anomalies we haven't seen before. Decomposition models perform well at this task by explicitly removing correlated components (seasonality, trend) from the timeseries, giving us the opportunity to apply statistical tests with Gaussian assumptions on the residuals whilst preserving the anomalies' original attributes.
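To make the idea concrete, here is a minimal, hypothetical sketch of a decomposition-based detector (illustrative only, not our production code): it removes a per-phase seasonal median and a median trend from the series, then applies a three-sigma test to what remains.

```python
import numpy as np

def detect_anomalies(series, period, z=3.0):
    """Toy decomposition-based detector (illustrative sketch).

    Removes a seasonal component (median of each phase across periods)
    and a trend (overall median of the deseasoned data), then flags
    residuals that lie more than z standard deviations from zero.
    """
    series = np.asarray(series, dtype=float)
    phases = np.arange(len(series)) % period
    # Seasonal component: per-phase median is robust to a few outliers.
    seasonal = np.array(
        [np.median(series[phases == p]) for p in range(period)]
    )[phases]
    deseasoned = series - seasonal
    # Trend component: a single robust level estimate for simplicity.
    trend = np.median(deseasoned)
    residuals = deseasoned - trend
    # Three-sigma rule on the residuals.
    return np.abs(residuals) > z * residuals.std()
```

Because the seasonal and trend estimates use medians, a single large spike barely perturbs them, so the spike survives intact in the residuals where the sigma test can catch it.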
Requirements of anomaly detection for observability
To build an anomaly detection system for observability, we have to keep a number of requirements in mind:
- Minimize false positives: False positives page sleepy and angry engineers, desensitizing them to alerts and leading to decreased responsiveness and potentially missed incidents in the future. To avoid this, we only alert on the most severe anomalies, and allow users to craft custom alert rules if they need additional specificity.
- Alert within minutes: When an incident occurs, on-call engineers should be notified as soon as possible to minimize impact. We process every incoming data point in real time and determine whether or not it is an anomaly within a minute of arrival.
- Scale to millions of time series: Our observability systems deal with hundreds of millions of time series streaming data points every second. Decomposition algorithms are already efficient in comparison to most machine learning methods, but we make some alterations to eke out additional performance. We also autoscale our forecasting workers using Teletraan to respond to increased demand.
- Be robust to anomalies: When an anomaly occurs, our system should not incorporate these data points into our estimate of normal behavior. We use robust statistics and long historical data windows (up to three weeks) to avoid this.
- Be robust to missing data: At times, metrics may have missing data points. The most common implementation of decomposition (R’s STL) doesn’t support missing values, so we have to take care to allow these kinds of gaps in our decomposition method.
- Take actionability into account: Some anomalies are far more important and actionable than others. Our user-set rules can help to filter the actionable anomalies from the noise.
Next, we’ll dive into the challenges we faced deploying our models in real-time.
How do we update our model in real-time?
Much of the literature on anomaly detection deals with finding retrospective anomalies in a static dataset, i.e., fitting a model with a nonlinear optimization procedure, such as L-BFGS or Nelder-Mead, and then identifying anomalies within the training set.
However, if we want to find anomalies in real-time, training once is not enough — we have to continuously keep our model up to date to adapt to the latest behavior of our metric. Thus, we have to take one of four approaches to update our model parameters over time.
- Brute Force Updates: The simplest solution is to simply recompute our parameters on the most recent data window every time a new data point arrives. However, this can be infeasible if fitting the model to the window is too computationally complex.
- Scheduled Updates: We can cache our model parameters for a given period of time, say 24 hours, and retrain on the new data points at the end of each period. However, excessive false positive alerts can occur if the behavior of our metric changes before our scheduled update.
- Event Driven Updates: If a high prediction error for the recent set of data points has been detected, we can use this as an opportunity to recompute our model parameters. Event-driven updates are unpredictable, which can lead to operational challenges in the future.
- Online Updates: For some algorithms, it is also possible to reformulate them to work in the online setting: continuously reading in new data points and efficiently updating the parameters with each data point.
In our case, we found that the operational overhead of debugging scheduled or event driven systems outweighed their benefits. Instead, we use online updates whenever available, and brute force updates otherwise. To recast our algorithms into the online setting, we use Online Gradient Descent as an optimizer.
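As a sketch of what an online update can look like, consider a hypothetical forecaster with a level term and per-phase seasonal offsets, both nudged by online gradient descent on the squared one-step-ahead error. This is an illustration of the technique, not our production model:

```python
import numpy as np

class OnlineSeasonalForecaster:
    """Hypothetical one-step-ahead forecaster updated online.

    Model: forecast = level + seasonal[t mod period].
    Each new point triggers one gradient step on 0.5 * error**2,
    so no batch refit is ever needed.
    """

    def __init__(self, period, lr=0.05):
        self.period = period
        self.lr = lr            # online gradient descent step size
        self.level = 0.0
        self.seasonal = np.zeros(period)
        self.t = 0

    def predict(self):
        return self.level + self.seasonal[self.t % self.period]

    def update(self, y):
        # Gradient of 0.5 * (forecast - y)^2 w.r.t. both parameters
        # is simply the signed forecast error.
        err = self.predict() - y
        self.level -= self.lr * err
        self.seasonal[self.t % self.period] -= self.lr * err
        self.t += 1
        return err
```

Each data point costs O(1) work, which is what makes this formulation attractive at the scale of millions of time series.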
Seasonality and trend estimation
The most commonly used form of decomposition today, called STL, makes use of a regression procedure called Loess, which repeatedly fits low-degree polynomials to subsets of the data. Unfortunately, this procedure is not scalable for real-time analytics.
In lieu of estimating seasonality with iterated Loess, we instead use a simple ensemble of efficient seasonality estimation techniques (e.g., Fourier analysis, historical sampling). We also replace the trend component of STL with a robust statistic, namely the median, inspired by Twitter's S-H-ESD anomaly detection algorithm.
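A simplified sketch of such an ensemble might look like the following (the function names and the plain 50/50 average are assumptions for illustration, not our production code). Note that both components use NaN-aware statistics, so gaps in the data are tolerated, unlike classic STL:

```python
import numpy as np

def fourier_seasonal(series, period, n_harmonics=3):
    # Least-squares fit of a few sine/cosine harmonics plus an
    # intercept; the fit is computed only on finite (non-gap) points.
    t = np.arange(len(series))
    cols = [np.ones(len(series))]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    X = np.column_stack(cols)
    mask = np.isfinite(series)
    beta, *_ = np.linalg.lstsq(X[mask], series[mask], rcond=None)
    return X @ beta

def historical_seasonal(series, period):
    # Median of the same phase across periods; nanmedian skips gaps.
    phases = np.arange(len(series)) % period
    per_phase = np.array(
        [np.nanmedian(series[phases == p]) for p in range(period)]
    )
    return per_phase[phases]

def ensemble_forecast(series, period):
    series = np.asarray(series, dtype=float)
    seasonal = 0.5 * (fourier_seasonal(series, period)
                      + historical_seasonal(series, period))
    trend = np.nanmedian(series - seasonal)  # robust trend: a median
    return seasonal + trend
```

Both estimators are a single pass over the window (plus one small least-squares solve), which is far cheaper than the repeated Loess smoothing inside STL.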
Once we have our predictions, we can then use prediction intervals to determine whether a data point is an anomaly, based on its distance from our forecast. To do this, we can take advantage of the three-sigma rule, which states that 99.73% of values in a normal distribution lie within three standard deviations of the mean. If an incoming data point is more than three residual standard deviations away from our forecast, we can reasonably classify it as an anomaly.
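Concretely, the three-sigma band can be derived from the recent residuals, and a point is flagged when it falls outside it (a hypothetical sketch, with the bias correction via the residual mean as an assumption):

```python
import numpy as np

def prediction_interval(forecast, residuals, z=3.0):
    # Band of z residual standard deviations around the
    # (bias-corrected) forecast; z = 3 is the three-sigma rule.
    residuals = np.asarray(residuals, dtype=float)
    mu, sigma = residuals.mean(), residuals.std()
    return forecast + mu - z * sigma, forecast + mu + z * sigma

def is_anomaly(actual, forecast, residuals, z=3.0):
    lower, upper = prediction_interval(forecast, residuals, z)
    return actual < lower or actual > upper
```

The same lower/upper pair is exactly the kind of bound that can be rendered as a visual band around the metric.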
If the residual data is non-normal (e.g., when the time series contains anomalies), we can use power transformations to attempt to recover a normal dataset. If these transformations do not work, we can still resort to other measures, like sampling from the previous errors, to get an estimate of the uncertainty.
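One way to sketch that sampling fallback is to take empirical quantiles of the stored residuals, which assumes nothing about their distribution (the function and its default coverage are illustrative assumptions, not our exact production logic):

```python
import numpy as np

def empirical_interval(forecast, residuals, coverage=0.9973):
    # Interval from empirical residual quantiles; no normality assumed.
    # coverage=0.9973 mirrors the coverage of the three-sigma rule.
    residuals = np.asarray(residuals, dtype=float)
    alpha = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(residuals, [alpha, 1.0 - alpha])
    return forecast + lo, forecast + hi
```

Because the quantiles come from the observed errors themselves, a heavy-tailed or skewed residual distribution widens the band on the appropriate side instead of producing false alarms.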
Once we’ve generated these intervals, we can translate them into visual bands on the UI side for ease of interpretation and tweaking by our end users.
Now, let’s take a look at the infrastructure that supports these algorithms.
Anomaly detection architecture
We have a forecasting server that is responsible for constructing one-step-ahead forecasts for Statsboard metrics in real-time and persisting them to our time series database (TSDB).
The forecasting server consists of a set of autoscaling workers and a server with a job queue and a scheduler that submits one-ahead forecast jobs every minute. For batch jobs, we pre-merge overlapping data window requests in order to minimize network traffic and I/O. We integrate with our existing Statsboard UI and alerting infrastructure to provide anomaly detection builders, alerts, and dashboards.
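The pre-merge step can be sketched as a classic interval merge over (start, end) window requests, so each region of the TSDB is fetched only once (illustrative, not our actual implementation):

```python
def merge_windows(windows):
    """Merge overlapping (start, end) data-window requests.

    Sorting by start time lets a single pass coalesce any window
    that overlaps the one most recently merged.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```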
Anomaly detection plays an important role in obtaining visibility for metrics that exhibit complex patterns that can’t be modeled by traditional alerts. Uncovered anomalies show up in real-time dashboards and alert the relevant users when something goes wrong.
Through the use of robust, scalable, and interpretable algorithms, our anomaly detection system helps engineers recognize and react to incidents as they happen, minimizing impact to the Pinterest business and our Pinners.
Dai Nguyen built the UI for this project. Special thanks to Naoman Abbas, Humsheen Geo, Colin Probasco, and Wei Zhu of the Visibility team as well as Yegor Gorshkov of the Traffic team for design recommendations and review.