I organized how cordon and drain work for Nodes and verified their behavior.

1. Overview

1.1. cordon/uncordon

A Node is in one of the following two scheduling states.

State	Description
`SchedulingEnabled`	The Node is a scheduling target (new Pods can be started on it)
`SchedulingDisabled`	The Node is excluded from scheduling (new Pods cannot be started on it)

The cordon and uncordon commands switch between these states.

kubectl cordon <Node> changes the Node to SchedulingDisabled, and kubectl uncordon <Node> returns it to SchedulingEnabled.

Note that changing a Node to SchedulingDisabled with cordon does not affect Pods that were already running on that Node.
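The state that cordon toggles is stored in the Node's spec.unschedulable field (it appears as "Unschedulable" in the describe output later in this article). As a minimal sketch, a script could check it as follows. The function name node_scheduling_state is my own invention for illustration, and the sketch assumes kubectl is already configured for the cluster.

```shell
#!/usr/bin/env bash
# Print "cordoned" or "schedulable" for the given Node.
# Assumption: .spec.unschedulable is "true" only while the Node is cordoned;
# on a Node that is not cordoned, the jsonpath query prints nothing or "false".
node_scheduling_state() {
  local node="$1" flag
  flag="$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')" || return 1
  if [ "$flag" = "true" ]; then
    echo "cordoned"
  else
    echo "schedulable"
  fi
}
```

For example, node_scheduling_state <Node> could be called before and after kubectl cordon <Node> to confirm the change from a script.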

1.2. drain

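Running cordon does not affect existing Pods. It only prevents new Pods from being scheduled onto the Node.

To move existing Pods off the Node as well, use drain. kubectl drain <Node> first changes the Node to SchedulingDisabled and then evicts each Pod, and the containers in an evicted Pod receive SIGTERM. Because drain includes the cordon step, there is no need to run cordon before drain.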

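Because the containers receive SIGTERM during a drain, the application in the Pod must handle SIGTERM gracefully: when it receives SIGTERM, it should finish any in-progress work completely before exiting. SIGKILL cannot be caught by the application. It is sent if the container is still running after the Pod's termination grace period (terminationGracePeriodSeconds, 30 seconds by default).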
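To illustrate what "handling SIGTERM" can look like, here is a minimal sketch of a worker loop written as a shell script. It is not taken from any real application: process_item is a hypothetical stand-in for one unit of work, and the one-second sleep only simulates that work. The idea is that the signal handler merely sets a flag, so the item in progress always finishes before the loop exits.

```shell
#!/usr/bin/env bash
# Worker loop that finishes its current unit of work before exiting on SIGTERM.

shutdown_requested=0

# The handler only records that shutdown was requested.
on_sigterm() { shutdown_requested=1; }

# Hypothetical unit of work; replace with real processing.
process_item() {
  sleep 1
  echo "processed item $1"
}

run_worker() {
  trap on_sigterm TERM
  local i=0
  # The flag is checked only between items, so an item is never cut short.
  while [ "$shutdown_requested" -eq 0 ]; do
    i=$((i + 1))
    process_item "$i"
  done
  echo "SIGTERM received, exiting cleanly after $i item(s)"
}
```

A script like this would call run_worker as its last line and run as the container's main process. The work per item has to fit within the termination grace period, otherwise the container is killed with SIGKILL before it can finish.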

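In addition, drain fails with an error on a Node that is running certain kinds of Pods. In those cases, passing the corresponding option lets the drain proceed.

Pod that causes the error	Reason for the error	Option required for drain
Pod managed by a DaemonSet	Because it belongs to a DaemonSet, the Pod cannot be evicted and started on another Node (with the option, drain skips these Pods and leaves them in place)	`--ignore-daemonsets`
Pod that uses emptyDir	Deleting the Pod also deletes the emptyDir data (emptyDir data is local to the Pod and is removed together with it)	`--delete-emptydir-data`
Pod not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet	Nothing manages the Pod, so it is not started again on another Node after it is evicted	`--force`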
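Before draining, it can help to see which Pods on the Node fall into the categories above. The following is a rough sketch, assuming kubectl access to the cluster. The function name pods_on_node is hypothetical. It lists every Pod on a Node along with the kind of its first owner reference.

```shell
#!/usr/bin/env bash
# List all Pods scheduled on a Node with the kind of their owning controller.
# Reading the OWNER column (assumption: the first ownerReference is the controller):
#   DaemonSet -> needs --ignore-daemonsets
#   <none>    -> not managed by any controller, needs --force
pods_on_node() {
  local node="$1"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=${node}" \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
}
```

Pods created by a Deployment show ReplicaSet as their owner, since the Deployment manages them through a ReplicaSet. This listing does not show emptyDir usage, so that case still has to be checked from the Pod spec or from the drain error itself.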

PodDisruptionBudget (PDB), a resource that limits how many Pods can be evicted at a time during a drain, will be covered in a separate article.

2. Verification

2.1. Setting up the test environment

I use the EKS cluster created in the earlier YasuBlog post on creating an EKS cluster with the eksctl command.

I also start a simple Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: amazonlinux
        image: public.ecr.aws/amazonlinux/amazonlinux:latest
        command:
          - "/bin/bash"
          - "-c"
          - "sleep 3600"

$ kubectl apply -f test-deployment.yaml
deployment.apps/test-deployment created

2.2. cordon

First, the state of the Nodes and Pods before cordon. One Pod is running on each of the three Nodes.

$ kubectl get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-101-239.ap-northeast-1.compute.internal   Ready    <none>   61m   v1.21.5-eks-9017834
ip-10-0-102-44.ap-northeast-1.compute.internal    Ready    <none>   61m   v1.21.5-eks-9017834
ip-10-0-103-228.ap-northeast-1.compute.internal   Ready    <none>   60m   v1.21.5-eks-9017834
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-g8qhq   1/1     Running   0          44s   10.0.103.252   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-g9tx2   1/1     Running   0          44s   10.0.101.212   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-wp7zv   1/1     Running   0          44s   10.0.102.88    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>

Run cordon in this state.

$ kubectl cordon ip-10-0-103-228.ap-northeast-1.compute.internal
node/ip-10-0-103-228.ap-northeast-1.compute.internal cordoned
$ kubectl get node
NAME                                              STATUS                     ROLES    AGE   VERSION
ip-10-0-101-239.ap-northeast-1.compute.internal   Ready                      <none>   65m   v1.21.5-eks-9017834
ip-10-0-102-44.ap-northeast-1.compute.internal    Ready                      <none>   65m   v1.21.5-eks-9017834
ip-10-0-103-228.ap-northeast-1.compute.internal   Ready,SchedulingDisabled   <none>   65m   v1.21.5-eks-9017834
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-g8qhq   1/1     Running   0          56s   10.0.103.252   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-g9tx2   1/1     Running   0          56s   10.0.101.212   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-wp7zv   1/1     Running   0          56s   10.0.102.88    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>

The STATUS of the cordoned Node now includes SchedulingDisabled. The existing Pods did not change.

Running describe node shows that the node.kubernetes.io/unschedulable:NoSchedule Taint has been set.

$ kubectl describe node ip-10-0-103-228.ap-northeast-1.compute.internal
~omitted~

Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true

~omitted~

In this state, change the Deployment's replicas from 3 to 10.

$ kubectl scale deployment/test-deployment --replicas=10
deployment.apps/test-deployment scaled
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-4p7qw   1/1     Running   0          5s    10.0.102.14    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-g8qhq   1/1     Running   0          74s   10.0.103.252   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-g9tx2   1/1     Running   0          74s   10.0.101.212   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-hwsmw   1/1     Running   0          5s    10.0.101.177   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-ldr7h   1/1     Running   0          5s    10.0.101.65    ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-nx649   1/1     Running   0          5s    10.0.102.184   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-rgslw   1/1     Running   0          5s    10.0.102.42    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-vd8bn   1/1     Running   0          5s    10.0.102.208   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-wm77s   1/1     Running   0          5s    10.0.101.186   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-wp7zv   1/1     Running   0          74s   10.0.102.88    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>

Seven new Pods started, and none of them was placed on the cordoned Node (the Node in SchedulingDisabled).
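Reading the NODE column by eye works for ten Pods, but a per-Node count makes the result easier to confirm. The sketch below is one possible way to do it. The function name count_pods_per_node is hypothetical, and it expects one Node name per line on standard input.

```shell
#!/usr/bin/env bash
# Count Pods per Node.
# Input:  one Node name per line on stdin (blank lines are ignored).
# Output: "<count> <node>" lines, sorted by Node name.
count_pods_per_node() {
  awk 'NF { count[$1]++ } END { for (node in count) print count[node], node }' | sort -k2
}

# Possible usage against a live cluster (requires kubectl access).
# The label app=app comes from the Deployment manifest in section 2.1:
#   kubectl get pod -l app=app --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | count_pods_per_node
```

A Node that has no Pods does not appear in the output at all, so a cordoned Node that received nothing is recognizable by its absence or by an unchanged count.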

2.3. uncordon

Run uncordon to return the Node from SchedulingDisabled to SchedulingEnabled.

First, the state before uncordon. One Node is in SchedulingDisabled.

$ kubectl get node
NAME                                              STATUS                     ROLES    AGE   VERSION
ip-10-0-101-239.ap-northeast-1.compute.internal   Ready                      <none>   84m   v1.21.5-eks-9017834
ip-10-0-102-44.ap-northeast-1.compute.internal    Ready                      <none>   84m   v1.21.5-eks-9017834
ip-10-0-103-228.ap-northeast-1.compute.internal   Ready,SchedulingDisabled   <none>   83m   v1.21.5-eks-9017834
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-8l8rp   1/1     Running   0          40s   10.0.101.84    ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-9s7lm   1/1     Running   0          40s   10.0.102.176   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-t8bzh   1/1     Running   0          40s   10.0.101.113   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>

Run uncordon.

$ kubectl uncordon ip-10-0-103-228.ap-northeast-1.compute.internal
node/ip-10-0-103-228.ap-northeast-1.compute.internal uncordoned
$ kubectl get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-101-239.ap-northeast-1.compute.internal   Ready    <none>   86m   v1.21.5-eks-9017834
ip-10-0-102-44.ap-northeast-1.compute.internal    Ready    <none>   86m   v1.21.5-eks-9017834
ip-10-0-103-228.ap-northeast-1.compute.internal   Ready    <none>   85m   v1.21.5-eks-9017834

SchedulingDisabled disappeared from the Node's STATUS.

Running describe node shows that the node.kubernetes.io/unschedulable:NoSchedule Taint is gone.

$ kubectl describe node ip-10-0-103-228.ap-northeast-1.compute.internal
~omitted~

Taints:             <none>
Unschedulable:      false

~omitted~

In this state, change the Deployment's replicas from 3 to 10.

$ kubectl scale deployment/test-deployment --replicas=10
deployment.apps/test-deployment scaled
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-2nt8q   1/1     Running   0          37s     10.0.102.208   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-8l8rp   1/1     Running   0          3m18s   10.0.101.84    ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-9s7lm   1/1     Running   0          3m18s   10.0.102.176   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-fc444   1/1     Running   0          37s     10.0.103.215   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-lgmpg   1/1     Running   0          37s     10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-m8mqk   1/1     Running   0          37s     10.0.101.212   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-qnnr5   1/1     Running   0          37s     10.0.102.184   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-t8bzh   1/1     Running   0          3m18s   10.0.101.113   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-thbf5   1/1     Running   0          37s     10.0.102.42    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-zpjmx   1/1     Running   0          37s     10.0.103.104   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>

This confirms that Pods are started on the uncordoned Node again.

2.4. drain

2.4.1. When DaemonSet Pods are present

First, the state of the Nodes and Pods before drain. One Pod is running on each of the three Nodes.

$ kubectl get node
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-101-239.ap-northeast-1.compute.internal   Ready    <none>   98m   v1.21.5-eks-9017834
ip-10-0-102-44.ap-northeast-1.compute.internal    Ready    <none>   98m   v1.21.5-eks-9017834
ip-10-0-103-228.ap-northeast-1.compute.internal   Ready    <none>   98m   v1.21.5-eks-9017834
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-kgndl   1/1     Running   0          8s    10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-t7wqf   1/1     Running   0          8s    10.0.102.42    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-zscgl   1/1     Running   0          8s    10.0.101.177   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>

Run drain.

$ kubectl drain ip-10-0-103-228.ap-northeast-1.compute.internal
node/ip-10-0-103-228.ap-northeast-1.compute.internal cordoned
error: unable to drain node "ip-10-0-103-228.ap-northeast-1.compute.internal" due to error:cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/aws-node-jls99, kube-system/kube-proxy-x95dw, continuing command...
There are pending nodes to be drained:
 ip-10-0-103-228.ap-northeast-1.compute.internal
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/aws-node-jls99, kube-system/kube-proxy-x95dw

The drain failed. On EKS this happens because aws-node (the VPC CNI plugin) and kube-proxy run as DaemonSets. As the error message says, and as described in section 1.2, adding the --ignore-daemonsets option lets the drain proceed.

$ kubectl drain ip-10-0-103-228.ap-northeast-1.compute.internal --ignore-daemonsets
node/ip-10-0-103-228.ap-northeast-1.compute.internal already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-jls99, kube-system/kube-proxy-x95dw
evicting pod default/test-deployment-6bb985c8c9-kgndl
pod/test-deployment-6bb985c8c9-kgndl evicted
node/ip-10-0-103-228.ap-northeast-1.compute.internal drained
$ kubectl get node
NAME                                              STATUS                     ROLES    AGE    VERSION
ip-10-0-101-239.ap-northeast-1.compute.internal   Ready                      <none>   104m   v1.21.5-eks-9017834
ip-10-0-102-44.ap-northeast-1.compute.internal    Ready                      <none>   104m   v1.21.5-eks-9017834
ip-10-0-103-228.ap-northeast-1.compute.internal   Ready,SchedulingDisabled   <none>   103m   v1.21.5-eks-9017834

The Node was drained. It became SchedulingDisabled without running cordon separately.

The node.kubernetes.io/unschedulable:NoSchedule Taint is set.

$ kubectl describe node ip-10-0-103-228.ap-northeast-1.compute.internal
~omitted~

Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true

~omitted~

These are the Pod state transitions during the drain.

$ kubectl get pod -o wide -w
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-kgndl   1/1     Running   0          60s   10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-t7wqf   1/1     Running   0          60s   10.0.102.42    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-zscgl   1/1     Running   0          60s   10.0.101.177   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-kgndl   1/1     Terminating   0          3m32s   10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-t2xpr   0/1     Pending       0          0s      <none>         <none>                                            <none>           <none>
test-deployment-6bb985c8c9-t2xpr   0/1     Pending       0          0s      <none>         ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-t2xpr   0/1     ContainerCreating   0          0s      <none>         ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-t2xpr   1/1     Running             0          3s      10.0.102.184   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-kgndl   0/1     Terminating         0          4m2s    10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-kgndl   0/1     Terminating         0          4m6s    10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
test-deployment-6bb985c8c9-kgndl   0/1     Terminating         0          4m6s    10.0.103.153   ip-10-0-103-228.ap-northeast-1.compute.internal   <none>           <none>
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE    IP             NODE                                              NOMINATED NODE   READINESS GATES
test-deployment-6bb985c8c9-t2xpr   1/1     Running   0          93s    10.0.102.184   ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-t7wqf   1/1     Running   0          5m5s   10.0.102.42    ip-10-0-102-44.ap-northeast-1.compute.internal    <none>           <none>
test-deployment-6bb985c8c9-zscgl   1/1     Running   0          5m5s   10.0.101.177   ip-10-0-101-239.ap-northeast-1.compute.internal   <none>           <none>

This confirms that the Pod running on the drained Node terminated and a replacement started on another Node.

2.4.2. When a Pod uses emptyDir

I start a Deployment with the following manifest for this test.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-emptydir-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-emptydir
  template:
    metadata:
      labels:
        app: app-emptydir
    spec:
      containers:
      - name: amazonlinux
        image: public.ecr.aws/amazonlinux/amazonlinux:latest
        command:
          - "/bin/bash"
          - "-c"
          - "sleep 3600"
        volumeMounts:
        - mountPath: /hoge
          name: hoge-volume
      volumes:
      - name: hoge-volume
        emptyDir:
          sizeLimit: 1Gi

Running drain produced the following error.

$ kubectl drain ip-10-0-101-69.ap-northeast-1.compute.internal --ignore-daemonsets
node/ip-10-0-101-69.ap-northeast-1.compute.internal cordoned
error: unable to drain node "ip-10-0-101-69.ap-northeast-1.compute.internal" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): default/test-emptydir-deployment-79859745dd-l6fbh, continuing command...
There are pending nodes to be drained:
 ip-10-0-101-69.ap-northeast-1.compute.internal
cannot delete Pods with local storage (use --delete-emptydir-data to override): default/test-emptydir-deployment-79859745dd-l6fbh

Adding the --delete-emptydir-data option lets the drain proceed.

$ kubectl drain ip-10-0-101-69.ap-northeast-1.compute.internal --ignore-daemonsets --delete-emptydir-data
node/ip-10-0-101-69.ap-northeast-1.compute.internal already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-8597j, kube-system/kube-proxy-pw5st
evicting pod default/test-emptydir-deployment-79859745dd-l6fbh
pod/test-emptydir-deployment-79859745dd-l6fbh evicted
node/ip-10-0-101-69.ap-northeast-1.compute.internal drained

2.4.3. When a Pod is not managed by a ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet

I start a Pod with the following manifest for this test.

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: amazonlinux
      image: public.ecr.aws/amazonlinux/amazonlinux:latest
      command:
        - "/bin/bash"
        - "-c"
        - "sleep 3600"

Running drain produced the following error.

$ kubectl drain ip-10-0-103-89.ap-northeast-1.compute.internal --ignore-daemonsets --delete-emptydir-data
node/ip-10-0-103-89.ap-northeast-1.compute.internal already cordoned
error: unable to drain node "ip-10-0-103-89.ap-northeast-1.compute.internal" due to error:cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): default/test-pod, continuing command...
There are pending nodes to be drained:
 ip-10-0-103-89.ap-northeast-1.compute.internal
cannot delete Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): default/test-pod

Adding the --force option lets the drain proceed.

$ kubectl drain ip-10-0-103-89.ap-northeast-1.compute.internal --ignore-daemonsets --delete-emptydir-data --force
node/ip-10-0-103-89.ap-northeast-1.compute.internal already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/test-pod; ignoring DaemonSet-managed Pods: kube-system/aws-node-vnmfx, kube-system/kube-proxy-6x767
evicting pod kube-system/coredns-76f4967988-7slnc
evicting pod default/test-pod
pod/coredns-76f4967988-7slnc evicted
pod/test-pod evicted
node/ip-10-0-103-89.ap-northeast-1.compute.internal drained