Run Kubernetes Production Environment on Spot Instances With Zero Downtime: A Complete Guide | Riskified Technology

By Kfir Schneider


As a Site Reliability Engineer and production champion at Riskified, one of my key roles is to ensure the high availability of our services in order to help our company achieve its business goals. Last year, one of these goals was to reduce cloud costs.

Riskified performs frictionless machine learning fraud prevention for enterprise online retailers. We review millions of orders a day, and our services must meet strict SLAs with a highly available production environment.

In this article I will guide you through how to significantly reduce costs in your k8s clusters, by using AWS EC2 Spot Instances, and hopefully give you the confidence you need in order to use Spot Instances with highly available workloads even in your production environment.

What are Spot Instances?

EC2 Spot Instances are spare compute capacity in AWS, offered at a 60–80% discount compared to the On-Demand price. They are managed in Spot Instance pools, which are sets of EC2 Instances with the same instance type, OS and Availability Zone (AZ).
If a Spot Instance pool is no longer available, its Spot Instances can be interrupted, receiving a termination notification with a two-minute warning before being terminated.

In k8s, it makes sense to use Spot Instances on your worker nodes, due to the nature of pods’ indifference to the underlying infrastructure, and thanks to some k8s components that together protect your workloads from Spot interruptions.

Spot Instance Provision

With Spot Instances, each Instance type in each Availability Zone is a pool with its own Spot price, based on the available capacity. Amazon best practices recommend using a diversified fleet of Instances with multiple Instance types, as created by Spot Fleet or EC2 Fleet.
Unfortunately the k8s node autoscaler component (Cluster Autoscaler) does not support Spot Fleets, so we will have to choose a different strategy to run Spot Instances: AWS Auto Scaling Groups (ASG).

Auto Scaling Groups

An ASG contains a collection of Amazon EC2 Instances that are treated as one logical group. At Riskified we use kops to set up our k8s clusters, so I’ll demonstrate how to install Spot Instance ASGs with kops InstanceGroups.

We will not deep dive into kops in this article. If you use other k8s installation tools, such as EKS, kubeadm or Kubespray, you can also set your ASGs to run on Spot Instances with minor configuration adjustments, but this will not be covered here.

In order to run Spot k8s nodes with kops we will create the Spot Instance group with:

kops create ig spot-nodes-xlarge

and edit the default configuration to:
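A sketch of such an InstanceGroup spec follows; the cluster name, instance types, sizes and subnets are illustrative placeholders and should be adapted to your cluster:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster.example.com  # placeholder cluster name
  name: spot-nodes-xlarge
spec:
  machineType: m5.xlarge
  minSize: 0
  maxSize: 20
  # Multiple similarly-sized instance types in one ASG
  mixedInstancesPolicy:
    instances:
      - m5.xlarge
      - m5a.xlarge
      - m5d.xlarge
      - m4.xlarge
    onDemandBase: 0
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    lifecycle: spot   # used later by the termination handler and headroom pods
  role: Node
  subnets:
    - us-east-1a
    - us-east-1b
    - us-east-1c
```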

Following this InstanceGroup configuration, kops will create an EC2 ASG with a ‘mixedInstancesPolicy’ utilizing multiple Spot Instance types in a single group. The ‘capacity-optimized’ allocation strategy lets the ASG select the instance types with the most available capacity while scaling up, which reduces the chance of Spot interruptions.

Due to the Cluster Autoscaler’s limitations (more on that in the next section) in deciding which instance type to expand, it’s important to choose instances of the same size (vCPU and memory) for each InstanceGroup.

We can use amazon-ec2-instance-selector to help us select the relevant instance types and families with a sufficient number of vCPUs and memory. For example, to get a group of instances with 8 vCPUs and 32 GB of RAM, we can run the following command:

ec2-instance-selector --vcpus 8 --memory 32768 --gpus 0 --current-generation true -a x86_64 --deny-list '.*n.*'

Cluster Autoscaler

Cluster Autoscaler (CA) is a tool that automatically scales the k8s cluster size, by changing the desired capacity of the ASGs. It will scale the cluster up when there are pods that fail to run due to insufficient resources, and scale it down when there are nodes in the cluster that have been underutilized for an extended period of time.

In the previous section, we created a single node group, but in most cases a single group is not enough and we will need more groups, e.g. for different machine sizes, GPU nodes, or for a group with a single AZ to support persistent volumes.

When there is more than one node group, and CA identifies that it needs to scale up the cluster due to unschedulable pods, it will have to decide which group to expand. We want CA to always prefer adding Spot Instances over On-demand.

CA uses an Expander to choose which group to scale. With AWS, CA provides four different Expander strategies for selecting the node group to which new nodes will be added: random (default), most-pods, least-waste and priority.

Expanders can be selected by passing the --expander flag in the CA arguments, e.g.:
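For example, in the CA deployment’s container command (the surrounding deployment spec is omitted, and other flags will vary per cluster):

```yaml
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=priority   # use the priority Expander described below
```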


The priority Expander selects the node group that was assigned the highest priority by the user, based on values stored in a ConfigMap. This ConfigMap has to be created before the CA pod and must be named cluster-autoscaler-priority-expander (more details here). A ConfigMap example is as follows:
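A sketch of such a ConfigMap, assuming two priority tiers (a higher number means a higher priority):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*spot-nodes.*   # prefer expanding Spot node groups
    1:
      - .*               # everything else (On-Demand groups) as fall-back
```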

By setting the .*spot-nodes.* (regex for node group names) with the highest priority, we tell the CA Expander to always prefer expanding the Spot node groups. CA respects nodeSelector and requiredDuringSchedulingIgnoredDuringExecution nodeAffinity, so it will only consider node groups that satisfy those requirements for expansion.

If no Spot Instances are available, CA will fail to scale up the Spot groups and will instead scale up the lower-priority On-Demand groups. With this approach you gain a fully automatic fall-back to On-Demand.

Instance Termination Handler


Now we’ll prepare our cluster to handle Spot interruptions. We’ll use the AWS Node Termination Handler for this purpose. It runs a pod on each Spot Instance node (a DaemonSet) that detects a Spot interruption warning notice by watching the EC2 instance metadata service.

A new feature called “EC2 Instance rebalance recommendation” was recently announced by AWS. A signal notifies you when a Spot Instance is at elevated risk of interruption. This signal can arrive sooner than the two-minute Spot Instance interruption notice, giving you the opportunity to proactively rebalance your workload before the interruption notice.

If an interruption or rebalance recommendation notice is detected, the handler will trigger a node drain, which safely evicts all pods hosted on the node. When a pod is evicted using the eviction API, it is gracefully terminated, honoring the terminationGracePeriodSeconds setting in its PodSpec.

Each of the evicted pods will then be rescheduled on a different node so that all the deployments will get back to their desired capacity.

A simple aws-node-termination-handler Helm installation example is as follows:

helm repo add eks https://aws.github.io/eks-charts

helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set nodeSelector.lifecycle=spot \
  --set enableSpotInterruptionDraining="true" \
  --set enableRebalanceMonitoring="true" \
  eks/aws-node-termination-handler

Prevent service downtime

By now we have a hybrid cluster that can auto scale Spot Instances, fall-back to On-Demand if necessary, and handle graceful pod evictions when a Spot node is reclaimed.

In a production environment where lots of services have to stay live 100% of the time, draining random nodes could lead to catastrophe quite easily.

For example, if:

  • All of a deployment’s pod replicas run on a single Spot Instance pool (same machine type and AZ), which has a higher chance of being reclaimed all at once;
  • All of a deployment’s pod replicas run on nodes that are reclaimed simultaneously by AWS, and get evicted at the same time. By the time the new replicas are scheduled and ready on a different node, the service will have zero endpoints;
  • Rescheduled pods wait in a pending state for more than two minutes for new nodes to join the cluster. This could also lead to zero endpoints.

In the following sections you’ll gain a better understanding of how to prevent these scenarios from occurring.

Affinity Rules

Pod affinity and anti-affinity are rules that allow you to specify how pods should be scheduled relative to other pods. The rules are defined using custom labels on nodes and label selectors specified in pods. For example, using affinity rules, you could spread the pods of a service across nodes or AZs.

There are two types of pod affinity rules: preferred and required. Preferred specifies that the scheduler will try to enforce the rules, but there’s no guarantee. Required, on the other hand, specifies that the rule must be met before a pod can be scheduled.

In the following example we use the preferred podAntiAffinity type:
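A sketch of such a deployment, assuming 3 redis replicas and the standard well-known node labels (the weights are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            # First preference: spread replicas across Availability Zones
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: redis
                topologyKey: topology.kubernetes.io/zone
            # Then: spread across different instance types
            - weight: 90
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: redis
                topologyKey: node.kubernetes.io/instance-type
            # Finally: at least spread across separate nodes
            - weight: 80
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: redis
                topologyKey: kubernetes.io/hostname
      containers:
        - name: redis
          image: redis:6
```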

By setting different weights, the k8s scheduler will first try to spread the 3 redis replicas over different AZs (the topology.kubernetes.io/zone node label). If there is no room available in separate zones, it will try to schedule them on different instance types (the node.kubernetes.io/instance-type node label). Lastly, if no room is available in either separate AZs or instance types, it will try to spread the replicas across separate nodes (the kubernetes.io/hostname node label).

Specifying such rules for critical deployments will help us distribute pods according to the Spot Instances Pool logic and minimize the chance of multiple terminations of the same component at the same time.


PodDisruptionBudget

PodDisruptionBudget (PDB) is an API object that indicates the maximum number of concurrent disruptions that a collection of pods can tolerate. A PDB can help us limit the number of concurrent evictions and prevent a service outage.
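A minimal PDB sketch that keeps at least one redis pod available at all times:

```yaml
apiVersion: policy/v1   # use policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  minAvailable: 1       # never voluntarily disrupt the last ready redis pod
  selector:
    matchLabels:
      app: redis
```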

In the above example, we tell the eviction API to deny disruptions of redis pods if only one ready pod would remain in the cluster.
So if, for example, the redis deployment has 3 replicas on one or more nodes being drained simultaneously, k8s will first evict two pods, and will continue to the third only after one of the rescheduled pods has become ready on another node.

You can only specify one of maxUnavailable or minAvailable in a single PDB. Both can be expressed as integers or as a percentage.

To read more about PDBs, see the Kubernetes documentation on Disruptions.

Cluster Headroom


Once a Spot Instance is reclaimed and the node is being drained, k8s will try to schedule the evicted pods. Most of the time, due to the cluster size elasticity that comes with CA, the scheduler will not find enough room for all the evicted pods and some of them will wait in a pending state until CA triggers a scale-up and new nodes are ready.
These precious minutes of waiting can be avoided by implementing cluster headroom (or cluster over-provisioning).

Before we jump into the implementation, you should be familiar with k8s Pod Priority.
In short, k8s pods can have priority. If a pod cannot be scheduled, k8s can evict lower-priority pods to make scheduling of a higher-priority pending pod possible.
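As a sketch, a low PriorityClass for headroom pods could be defined like this (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning   # illustrative name
value: -1                  # lower than any real workload's priority
globalDefault: false
description: "Priority class for over-provision (headroom) pods."
```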

To implement cluster headroom, we run “dummy” over-provision pods with low priority to reserve extra room in the cluster. These pods hold the place needed by critical pods that are evicted when a node is drained. Over-provision pods are given resource request values and run the Linux “pause” process, so they reserve room in the cluster without actually consuming resources.
When over-provision pods are preempted by high-priority pods, their status changes to pending and they become the ones waiting for new nodes, instead of the critical workload.

Most of this can be done with the cluster-overprovisioner Helm chart, which adds two PriorityClasses and an over-provision deployment configured with the low PriorityClass.
The higher PriorityClass created here is the globalDefault, so all pods without a priorityClassName set will have a higher priority than the over-provision pods.

Here’s the example value file for this chart:
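A sketch of what such a values file might look like; the exact keys depend on the chart version, and the deployment name, node selector and request sizes here are assumptions to be tuned for your cluster:

```yaml
deployments:
  - name: spot            # assumed to yield the "overprovision-spot" deployment
    replicaCount: 1       # will be managed by cluster-proportional-autoscaler
    resources:
      requests:
        cpu: "2"          # how much headroom each dummy pod reserves
        memory: 4Gi
    nodeSelector:
      lifecycle: spot     # reserve headroom on Spot nodes only
```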

To ensure that the overprovision deployment replicas count auto scales based on the size of the cluster’s Spot Instances, we can deploy a very useful tool called cluster-proportional-autoscaler that lets you scale a deployment based on the cluster size.

To scale the overprovision-spot deployment, run it in your cluster (examples here) with the following arguments:

/cluster-proportional-autoscaler \
  --namespace={{ .Release.Namespace }} \
  --configmap=overprovisioning-scale \
  --target=deployment/overprovision-spot \
  --nodelabels=lifecycle=spot \
  --logtostderr=true


With a ConfigMap such as:
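Assuming the linear scaling mode of cluster-proportional-autoscaler, such a ConfigMap could look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-scale
data:
  linear: |-
    {
      "coresPerReplica": 50
    }
```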

In this example, we set the cluster-proportional-autoscaler to scale the overprovision-spot deployment to one replica for every 50 CPU cores across all Spot Instance nodes.

Both coresPerReplica and cluster-overprovisioning request settings (CPU and memory in the cluster-overprovisioner chart) should be fine-tuned based on your headroom needs.