Draining Kubernetes clusters



Cloud cost management series:
Overspending in the cloud
Managing spot instance clusters on Kubernetes with Hollowtrees
Monitor AWS spot instance terminations
Diversifying AWS auto-scaling groups
Draining Kubernetes nodes
Cluster recommender
Cloud instance type and price information as a service

Kubernetes was designed to be fault tolerant to worker node failures. If a node goes missing because of a hardware problem, a cloud infrastructure outage, or because the control plane simply stops receiving heartbeat messages from it for any reason, the Kubernetes control plane is clever enough to handle the failure. But that doesn’t mean it can solve every problem that may occur.

A common misconception is the following: “Kubernetes will re-schedule all the pods from the lost node to another one if there are enough free resources available elsewhere, so why should we care about losing a node? Everything will be re-scheduled anyway, the autoscaler will add a new node if needed, and life goes on.” To topple this misconception we’ll take a look at what disruptions mean, how the kubectl drain command works, and what it does to drain a node gracefully. The cluster autoscaler uses similar logic to scale a cluster in, and our Pipeline PaaS has a similar feature as well, automatically handling spot instance terminations gracefully via Hollowtrees.

Pod disruptions

When a pod disappears from the cluster, it can happen for one of two reasons:

  • there was some unavoidable hardware, software or user error
  • the pod was deleted voluntarily, because someone wanted to delete its deployment or wanted to take away the VM that held the pod

The Kubernetes documentation calls these two things voluntary and involuntary disruptions. When a “node goes missing” it can be considered an involuntary disruption. Involuntary disruptions are harder to deal with than voluntary ones (read below for a deeper explanation), but you can do a few things to mitigate their effects. The documentation lists a few of these preventive methods, from trivial ones like pod replication to complicated ones like multi-zone clusters. You should take a look at these and do your best to guard against involuntary disruptions, because they will surely happen if you’re running a larger cluster. But even if you do your best, problems will arise sooner or later, especially in multi-tenant clusters where not everyone using the cluster is fully aware of these things.

So what else can you do? There are some cases when we can prevent involuntary disruptions by turning them into voluntary ones, like AWS spot instance termination, and other cases where monitoring can predict failures in advance. A voluntary disruption allows the cluster to gracefully accommodate the new situation, making the transition as seamless as possible. In the next part we’ll take the kubectl drain command as an example to look into voluntary disruptions and to see how they are more graceful than involuntary ones.

The kubectl drain command

According to the Kubernetes documentation, the drain command can be used to “safely evict all of your pods from a node before you perform maintenance on the node”, and “safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified”. So if losing a node supposedly isn’t a problem, why do we need this safe eviction, and what does it mean exactly?

From a bird’s eye view, drain does two things:

1. cordons the node

This part is quite simple: cordoning means that the node will be marked unschedulable, so new pods can no longer be scheduled on it. If we know in advance that a node will be taken away from the cluster (because of maintenance – like a kernel update – or because we know the node will be scaled in), this should be the first step. We don’t want new pods scheduled on this node only to be taken away after a few seconds. A good example is when we know 2 minutes in advance that a spot instance on AWS will be terminated – new pods shouldn’t be scheduled on that node, and we can work towards gracefully rescheduling all the other pods as well. On the API level, cordoning means patching the node with node.Spec.Unschedulable=true.

2. evicts or deletes the pods

After the node is made unschedulable, the drain command will try to evict the pods that are already running on that node. If eviction is supported on the cluster (from Kubernetes version 1.7), the drain command will use the Eviction API, which takes disruption budgets into account; if it’s not supported, it will simply delete the pods on the node. Let’s look into these options next.

Deleting pods on a node

Let’s start with the simpler one, when the Eviction API cannot be used. This is how it looks in Go code:

err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{
  GracePeriodSeconds: &gracePeriodSeconds,
})

Other than the trivial things, like calling the Delete method of the K8S client, the first thing that catches the eye is GracePeriodSeconds. As always, the excellent Kubernetes documentation can help explain things:

“Because pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up).”

Cleaning up can mean a bunch of things, like completing any outstanding HTTP requests, making sure that data is flushed properly when writing a file, finishing a batch job, rolling back transactions, or saving state to external storage like S3. There is a timeout for cleaning things up: it is called the grace period. Note that deleting a pod returns asynchronously, so you should always poll the pod and wait until the deletion finishes or the grace period ends. Check the docs to learn more about the details.

If the node is disrupted involuntarily, the processes in the pods will have no chance to exit gracefully. So back to the spot instance termination example: if all we can do in those 2 minutes before the VM is terminated is cordon the node and call Delete on the pods with a grace period of about 2 minutes, we’re still better off than just letting our instance die. But Kubernetes provides even better options.

Evicting pods from a node

From Kubernetes 1.7 there is an option to use the Eviction API instead of directly deleting a pod. First let’s see the Go code again and check the differences from the above. It’s easy to see that it is a different API call, but we still have to provide pod.Namespace, pod.Name and DeleteOptions with the grace period. Additionally we have to add some meta info (EvictionKind and APIVersion), but it looks very similar at first sight.

eviction := &policyv1beta1.Eviction{
        TypeMeta: metav1.TypeMeta{
                APIVersion: policyGroupVersion,
                Kind:       EvictionKind,
        },
        ObjectMeta: metav1.ObjectMeta{
                Name:      pod.Name,
                Namespace: pod.Namespace,
        },
        DeleteOptions: &metav1.DeleteOptions{
                GracePeriodSeconds: &gracePeriodSeconds,
        },
}

So what does it add to the delete API?

Kubernetes has a resource type called PodDisruptionBudget, or pdb, that can be attached to a deployment via label selectors. According to the documentation:

A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.

The following very simple example of a PDB specifies that the number of available pods of the nginx app cannot fall below 70% at any time (see more examples here):

kubectl create pdb my-pdb --selector=app=nginx --min-available=70%
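The same PDB can also be written as a manifest – assuming the policy/v1beta1 API group that was current around Kubernetes 1.7:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 70%
  selector:
    matchLabels:
      app: nginx
```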

When the Eviction API is called, it will only allow the eviction of a pod if it wouldn’t violate a PDB. If no PDBs would be broken by the eviction, the pod is deleted gracefully, just like with a simple Delete. If the delete is not granted because a PDB doesn’t allow it, then 429 Too Many Requests is returned. See more details here.

If you’re calling the drain command and it cannot evict a pod because of a PDB, it will simply sleep for 5 seconds and retry. You can try it out by creating a basic nginx deployment with 2 replicas, adding the PDB above, finding the node where one of the pods was scheduled, and trying to drain it with this command (--v=6 is only needed to see the Too Many Requests responses):

kubectl --v=6 drain <node-name> --force

This simple logic should work in most cases, because if you’re setting values in a PDB that make sense (e.g. min 2 available when replicas are set to 3 and pod anti-affinity is set for hostnames), then the blocked state should only be temporary for the cluster – the controller will try to restore the 3 replicas and will succeed if there are free resources in the cluster. Once it restores the 3 replicas, drain can succeed. But also note that eviction and drain can cause deadlocks where drain waits forever. Usually these are misconfigurations – just like my very simple example, where neither of the 2 nginx replicas can be evicted because of the 70% threshold – but they can happen in some real-world situations as well. The Eviction API won’t start new replicas on other nodes or do any other magic; it will just keep returning Too Many Requests. To handle these cases you must intervene manually (e.g. by temporarily adding a new replica), or write your code in a way that detects them.

Special pods to delete

Let’s complicate things even more. There are some pods that cannot simply be deleted or evicted. The drain command uses 4 different filters when checking for pods to delete; a filter can temporarily reject the drain, or let the drain move on without touching the pod:

DaemonSet filter

The DaemonSet controller ignores unschedulable markings, so a deleted pod that belongs to a DaemonSet would immediately be replaced. If there are pods belonging to a DaemonSet on the node, the drain command proceeds only if the --ignore-daemonsets flag is set to true, and even in that case it won’t delete the pod – the DaemonSet controller would recreate it right away. Usually it shouldn’t cause problems if a DaemonSet pod goes away with its node (see node exporters, log collection, storage daemons, etc.), so in most cases this flag can be set.

Mirror pods filter

drain uses the Kubernetes API server to manage pods and other resources, and mirror pods are only the read-only API representations of static pods – pods that are managed by the Kubelet directly, without the API server’s involvement. Mirror pods are visible from the API server but cannot be controlled through it, so drain won’t delete these either.

Unreplicated filter

If a pod has no controller it cannot simply be deleted, because it won’t be rescheduled to a new node. It is usually not advised to have pods without controllers (not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet), but if you still have pods like this and want to write code that handles voluntary node disruptions, it is up to the implementation to delete these pods or fail. The drain command lets the user decide: when --force is set, unreplicated pods will be deleted (or evicted); if it’s not set, drain will fail.

When using Go, the k8s.io/apimachinery package has a util function that returns the controller of a pod, or nil if there’s no controller for it: metav1.GetControllerOf(&pod)

LocalStorage filter

This filter checks whether a pod uses emptyDir. If the pod uses emptyDir to store local data, it may not be safe to delete, because when a pod is removed from a node, the data in its emptyDir is deleted with it. Just like with the unreplicated filter, it is up to the implementation to decide what to do with these pods. drain provides a switch for this as well: if --delete-local-data is set, drain will proceed even if there are pods using emptyDir, deleting the pods and their local data with them.

Spot instance termination

We use drain-like logic to handle AWS spot instance terminations. We monitor spot instance terminations with Prometheus and have Hollowtrees configured to call our Kubernetes action plugin to drain the node. AWS gives notice 2 minutes in advance, and that’s usually enough time to delete the pods gracefully while also respecting PodDisruptionBudgets. Our action plugin uses very similar logic to the drain command, but it ignores DaemonSets and mirror pods, and force deletes unreplicated and emptyDir pods by default.


If you’d like to learn more about Banzai Cloud check out our other posts in the blog, the Pipeline and Hollowtrees projects on Github or follow us on our social channels.

