Chaos Engineering


Stage 0 is about implementing good site reliability engineering practices that lay the groundwork for Chaos Engineering. The steps outlined in this post are not strict prerequisites; they will evolve naturally alongside your Chaos Engineering practice.

  • Establish observability
  • Define the critical dependencies
  • Define the non-critical dependencies
  • Create a disaster recovery failover playbook
  • Create a critical dependency failover playbook
  • Create a non-critical dependency failover playbook
  • Publish the above and get team-wide agreement
  • Manually execute a failover exercise
  • Implementation example

Stage 1 covers the early phase of implementing Chaos Engineering, where you begin injecting failure into non-production systems and establish good practices for documenting what you learn.

  • Perform critical dependency failure tests in non-production
  • Publish test results
  • Implementation example
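
As a rough illustration of a critical dependency failure test in non-production, the sketch below wraps a dependency call so that it fails on demand, then checks that the calling code degrades gracefully. Every name here (`fetch_recommendations`, `with_failure_injection`, `get_homepage`, `DependencyError`) is hypothetical and invented for this example; a real test would target your own services and would more likely use a dedicated fault-injection tool.

```python
import random


class DependencyError(Exception):
    """Raised when the (hypothetical) dependency is unavailable."""


def fetch_recommendations(user_id):
    # Hypothetical critical dependency; stands in for a real network call.
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


def with_failure_injection(func, failure_rate, rng=random.random):
    """Wrap func so that roughly failure_rate of calls raise DependencyError."""
    def wrapper(*args, **kwargs):
        # rng() returns a float in [0, 1), so a rate of 1.0 fails every call
        # and a rate of 0.0 never fails.
        if rng() < failure_rate:
            raise DependencyError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def get_homepage(user_id, fetch):
    # System under test: it should degrade gracefully, not crash,
    # when the dependency fails.
    try:
        return {"status": "ok", "items": fetch(user_id)}
    except DependencyError:
        return {"status": "degraded", "items": []}


# Experiment: fail every call and confirm the fallback path is taken.
always_failing = with_failure_injection(fetch_recommendations, failure_rate=1.0)
result = get_homepage(42, always_failing)
print(result)
```

Recording the hypothesis ("the homepage still renders without recommendations") next to the observed result gives you something concrete to publish with the test results.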

Stage 2 helps you take your first steps into automation and testing in production.

  • Perform frequent, semi-automated tests
  • Execute a resiliency experiment in production
  • Publish test results
  • Implementation example

Stage 3 is where you implement fully automated testing in your non-production systems and begin figuring out how to automate disaster recovery failover.

  • Automate resiliency testing in non-production
  • Semi-automate disaster recovery failover
  • Implementation example

Stage 4 is a fully mature implementation of Chaos Engineering, where you begin adding ideas of your own to expand your testing plan.

  • Integrate resiliency testing in CI/CD
  • Automate resiliency and disaster recovery failover testing in production
  • Implementation example

  • Chaos Engineering Article

    This article describes some of the common tools that the Chaos Engineering community considers when starting to implement the practice in an organization. The goal is to give a high-level introduction to some frequently mentioned options and to list the strengths of each, first in a brief table and then in an annotated list.

  • Chaos Engineering Article

    Istio is a popular, open source, cloud-native service mesh. This article demonstrates how to perform a few Chaos Engineering experiments using features already available in Istio.

  • With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all…

  • Chaos Engineering is a practice that is growing in implementation and interest. What is it and why are some of the most…