Canary analysis: Lessons learned and best practices from Google and Waze


In the report, you can click the “metrics” link to examine exactly which queries were made and what results they returned.

Handcraft the queries to your monitoring system

To get to a working canary configuration, it helps to work backward: start by designing the queries Spinnaker will run against your monitoring system. Make sure you can run those queries manually and that they return the data you expect. For Stackdriver, you can use the APIs Explorer; for Prometheus, use the Expression browser.
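For example, you can exercise a candidate query directly against the monitoring API before encoding it in a canary configuration. Here is a minimal Python sketch, assuming a Prometheus server at localhost:9090 and a hypothetical http_requests_total metric exposed by the application:

```python
# Run a candidate PromQL query over the last hour and inspect the raw
# time series -- the same data the canary analysis will eventually see.
import time

import requests

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

# Hypothetical query: the rate of 5xx responses over 5-minute windows.
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m]))'

end = time.time()
start = end - 3600  # look back one hour

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "60s"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    print(series["metric"], "->", len(series["values"]), "data points")
```

Once the query returns sensible numbers here, translating it into a canary configuration is mostly mechanical.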

Once you’re satisfied with your queries and their results, you can do the “translation” work to replicate them with a canary configuration in Spinnaker. This is where the canary reports are most useful.

Use retrospective mode

A real canary analysis can last several hours. It’s unreasonable to wait that long for each iteration when you’re developing the canary configuration. To avoid this, use retrospective mode, available in the canary stage configuration: it runs the analysis against past, existing metric data rather than waiting for new data to be gathered.

Monitor your new pipeline before you trust it

Finally, even when you’re satisfied with the first iteration of your canary configuration and you’re ready to test it in production, don’t fully rely on it right away. Implement one of the following pipeline patterns to increase your confidence in the canary configuration:

  • Put the canary analysis stage in a non-blocking branch of the pipeline (see the sketch after this list). That way, it doesn’t actually influence the result of your deployments, but you can check whether the result of the canary analysis on production workloads is what you expect.
  • Add a manual judgment stage after the canary analysis, where a person checks that the results are what you expect. This is what Waze does in the pipeline described above.
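For instance, the first pattern might look something like the following pipeline fragment, sketched here as a Python dict for the sake of annotation. The stage types and the failPipeline/continuePipeline flags are modeled on Spinnaker's pipeline JSON, but treat the exact field names and values as assumptions to verify against your Spinnaker version.

```python
# Illustrative Spinnaker pipeline fragment (Python dict for annotation).
# Shows a canary analysis stage running as a non-blocking branch:
# it records a verdict, but a failure doesn't stop the deployment.
pipeline_fragment = {
    "stages": [
        {
            "refId": "1",
            "type": "deploy",
            "name": "Deploy to production",
            "requisiteStageRefIds": [],
        },
        {
            "refId": "2",
            "type": "kayentaCanary",     # canary analysis stage
            "name": "Canary Analysis (observe only)",
            "requisiteStageRefIds": [],  # parallel branch, not a gate
            "failPipeline": False,       # a bad score doesn't fail the pipeline
            "continuePipeline": True,    # ignore the failure and keep going
        },
    ],
}
```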

After a period of time, you can remove those stages and let the canary analysis fully automate your deployment pipeline.

Canary best practices

As you develop your canary configuration, follow these practices to make your canary analyses reliable and relevant. You can see the full version of these best practices on the Spinnaker website.

Compare the canary against a baseline, not against production

Don’t compare the canary to production instances. Many differences can skew the results of the analysis: cache warmup time, heap size, load-balancing algorithms, etc.

Instead, compare the canary deployment against a baseline deployment that runs the same version and configuration as production, but is otherwise identical to the canary in terms of:

  • time of deployment
  • number of instances
  • type and amount of traffic

Comparing a canary to a baseline isolates application version and configuration as the only factors differentiating the two deployments.

Run the canary for enough time

You need at least 50 time-series data points per metric for the statistical analysis to be meaningful. In Spinnaker, a canary analysis stage can include several canary runs, and each run needs those 50 data points. Depending on the granularity of your monitoring data, this means you might have to plan for canary analyses that last several hours.
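To make the arithmetic concrete, here is a quick back-of-the-envelope calculation; the 60-second resolution and the number of runs are assumptions for illustration:

```python
# Minimum canary analysis duration, given the 50-data-point rule of thumb.
MIN_POINTS_PER_RUN = 50  # minimum time-series data points per metric per run
resolution_seconds = 60  # assumed: monitoring emits one data point per minute
canary_runs = 3          # assumed: number of runs in the canary stage

run_minutes = MIN_POINTS_PER_RUN * resolution_seconds / 60
total_hours = canary_runs * run_minutes / 60
print(f"Each run needs at least {run_minutes:.0f} minutes of data; "
      f"{canary_runs} runs take about {total_hours:.1f} hours in total.")
# With 1-minute resolution and 3 runs: 50 minutes per run, ~2.5 hours total.
```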

Carefully choose which metrics to analyze

While you can get started with a single metric, we advise that you use several metrics that represent different aspects of your application’s health.

The SRE book singles out three aspects as particularly important:

  • Latency: how long does your application take to respond to a request?
  • Errors: how many errors does your application encounter?
  • Saturation: how many additional requests can an instance of your application handle in parallel?

You can group metrics together and put different weights on those groups. This lets you increase the importance of a particular group of metrics in the statistical analysis.
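As an illustration, the grouping and weighting might look like the following fragment, sketched as a Python dict. The group names and weights are invented, and the groupWeights layout is modeled on Kayenta's canary-config format; verify the exact schema against your Spinnaker version.

```python
# Sketch of metric groups and weights in a Kayenta-style canary config.
canary_config = {
    "name": "myapp-canary-config",  # hypothetical name
    "metrics": [
        # Each metric belongs to one or more named groups.
        {"name": "5xx error rate", "groups": ["Errors"], "query": {}},      # query omitted
        {"name": "p99 latency", "groups": ["Latency"], "query": {}},        # query omitted
        {"name": "CPU utilization", "groups": ["Saturation"], "query": {}}, # query omitted
    ],
    "classifier": {
        # Weights sum to 100; a heavier group contributes more to the score.
        "groupWeights": {"Errors": 50, "Latency": 30, "Saturation": 20},
    },
}
```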

Create a standard set of reusable canary configs

Developing a canary configuration is difficult, and you can't expect everyone in your organization to master it. To help everyone, and to keep canary configurations maintainable, create a set of standard configurations that all teams can use as a starting point. This is much easier to do if all the applications expose the same set of monitoring metrics.
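One lightweight way to do this is to keep a single organization-wide template and stamp out per-application copies. Here is a minimal sketch, reusing the invented metrics from the previous example:

```python
# Stamp out per-application canary configs from one standard template.
# This only works if all applications expose the same metrics.
import copy

# Hypothetical organization-wide template (see the previous sketch).
STANDARD_CONFIG = {
    "name": "standard-canary-config",
    "metrics": [
        {"name": "5xx error rate", "groups": ["Errors"], "query": {}},
        {"name": "p99 latency", "groups": ["Latency"], "query": {}},
    ],
    "classifier": {"groupWeights": {"Errors": 60, "Latency": 40}},
}

def config_for_app(app_name: str) -> dict:
    """Return a per-application copy of the standard canary config."""
    config = copy.deepcopy(STANDARD_CONFIG)
    config["name"] = f"{app_name}-canary-config"
    return config

print(config_for_app("myapp")["name"])  # myapp-canary-config
```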

Conclusion

Coming up with a good canary config is a long, iterative process. Expect to spend time fine-tuning parameters, such as:

  • metrics to analyze
  • weights on metric groups
  • thresholds
  • the length and number of canary runs

Though Spinnaker makes implementing canary deployments much easier than building a system yourself, you won’t get it exactly right the first time. But investing in canary deployment will greatly increase your confidence in your deployment processes, lower the number of problems that impact your users, increase your velocity, and hopefully lower your stress level!

Next steps

Here are some resources to help you learn more about automated canary analysis in Spinnaker:

Acknowledgements

We would like to thank our collaborators for making this blog post and the work associated with it possible:

  • Andrew Phillips, Product Manager for Releases & Rollouts
  • Matt Duftler, Software Engineer for Spinnaker
  • David Dorbin, Technical Writer for Spinnaker