Troubleshooting Pods
This chapter covers troubleshooting Pods, i.e. the applications deployed to Kubernetes.
No matter which error you run into, the first step is usually to get the Pod's current state and its logs:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
The Pod's events and logs are usually enough to identify the issue.

Pod stuck in Pending

A Pending state indicates that the Pod has not been scheduled yet. The Pod's events will show why it cannot be scheduled.
$ kubectl describe pod mypod
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x6 over 27s)  default-scheduler  0/4 nodes are available: 2 Insufficient cpu.
Generally this is because some resource is insufficient or otherwise prevents scheduling. An incomplete list of things that could go wrong includes:
  • The cluster does not have enough resources, e.g. CPU, memory, or GPU. Adjust the Pod's resource requests or add nodes to the cluster (see the sketch after this list).
  • The Pod requests more resources than any single node's capacity. Adjust the Pod's resource requests or add larger nodes to the cluster.
  • The Pod uses a hostPort that is already taken by another workload. Consider exposing the Pod through a Service instead.
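For the first two causes, comparing the Pod's requests against each node's allocatable capacity (e.g. via kubectl describe nodes) usually points to the fix: either lower the requests or add capacity. A minimal sketch of a trimmed request in a Pod spec (the container name and values are illustrative):
# Fragment of a Pod manifest; only the resources stanza matters here
containers:
- name: app                 # illustrative name
  image: <your-image>
  resources:
    requests:
      cpu: "500m"           # request less CPU so the scheduler can fit the Pod on an existing node
      memory: "512Mi"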

Pod stuck in Waiting or ContainerCreating

In this case, the Pod has been scheduled to a worker node but cannot start on that machine.
Again, get information from kubectl describe pod <pod-name> and check what's wrong.
$ kubectl -n kube-system describe pod nginx-pod
Events:
  Type     Reason                 Age               From               Message
  ----     ------                 ----              ----               -------
  Normal   Scheduled              1m                default-scheduler  Successfully assigned nginx-pod to node1
  Normal   SuccessfulMountVolume  1m                kubelet, gpu13     MountVolume.SetUp succeeded for volume "config-volume"
  Normal   SuccessfulMountVolume  1m                kubelet, gpu13     MountVolume.SetUp succeeded for volume "coredns-token-sxdmc"
  Warning  FailedSync             2s (x4 over 46s)  kubelet, gpu13     Error syncing pod
  Normal   SandboxChanged         1s (x4 over 46s)  kubelet, gpu13     Pod sandbox changed, it will be killed and re-created.
So the sandbox for this Pod cannot start. Check the kubelet logs on that node for the detailed reason:
$ journalctl -u kubelet
...
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649912 29801 cni.go:294] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649941 29801 cni.go:243] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: W0314 04:22:04.891337 29801 cni.go:258] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "c4fd616cde0e7052c240173541b8543f746e75c17744872aa04fe06f52b5141c"
Mar 14 04:22:05 node1 kubelet[29801]: E0314 04:22:05.965801 29801 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "nginx-pod" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Now we know the "cni0" bridge has been configured with an unexpected IP address. The simplest fix is to delete the "cni0" bridge; the network plugin will recreate it when needed:
$ ip link set cni0 down
$ brctl delbr cni0    # or: ip link delete cni0 type bridge, if the bridge cannot be brought down
The above is one example of a network configuration problem; many other things can go wrong. An incomplete list includes:
  • Failed to pull the image, e.g.
    • the image name is wrong
    • the registry is not accessible
    • the image hasn't been pushed to the registry
    • the docker registry secret is wrong or not configured for a private image
    • the pull times out because the image is large (raising the kubelet's --image-pull-progress-deadline and --runtime-request-timeout may help)
  • Network setup error for the Pod's sandbox, e.g.
    • the Pod's network namespace can't be set up because of a CNI configuration error (see the check after this list)
    • no IP address can be allocated because the podCIDR is exhausted
  • Failed to start the container, e.g.
    • the command or args are misconfigured
    • the image contains the wrong binary
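For the sandbox/network class of failures, a quick check on the affected node is whether a CNI configuration exists and is readable (the path below is the kubelet's default CNI configuration directory; adjust it if your cluster uses a different one):
# On the node the Pod was scheduled to
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conf*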

Pod stuck in ImagePullBackOff

ImagePullBackOff means the image could not be pulled even after several retries. It is usually caused by a wrong image name or an incorrect docker registry secret. In this case, docker pull <image> can be used to verify whether the image reference is correct (see the example after the events below).
$ kubectl describe pod mypod
...
Events:
  Type     Reason                 Age                From                                Message
  ----     ------                 ----               ----                                -------
  Normal   Scheduled              36s                default-scheduler                   Successfully assigned sh to k8s-agentpool1-38622806-0
  Normal   SuccessfulMountVolume  35s                kubelet, k8s-agentpool1-38622806-0  MountVolume.SetUp succeeded for volume "default-token-n4pn6"
  Normal   Pulling                17s (x2 over 33s)  kubelet, k8s-agentpool1-38622806-0  pulling image "a1pine"
  Warning  Failed                 14s (x2 over 29s)  kubelet, k8s-agentpool1-38622806-0  Failed to pull image "a1pine": rpc error: code = Unknown desc = Error response from daemon: repository a1pine not found: does not exist or no pull access
  Warning  Failed                 14s (x2 over 29s)  kubelet, k8s-agentpool1-38622806-0  Error: ErrImagePull
  Normal   SandboxChanged         4s (x7 over 28s)   kubelet, k8s-agentpool1-38622806-0  Pod sandbox changed, it will be killed and re-created.
  Normal   BackOff                4s (x5 over 25s)   kubelet, k8s-agentpool1-38622806-0  Back-off pulling image "a1pine"
  Warning  Failed                 1s (x6 over 25s)   kubelet, k8s-agentpool1-38622806-0  Error: ImagePullBackOff
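To verify the image reference itself, try pulling it manually on a node. A sketch assuming Docker is the container runtime (with containerd, crictl pull serves the same purpose):
# Run on a node (or any machine with access to the registry)
docker pull a1pine    # fails, as in the events above: the repository does not exist
docker pull alpine    # succeeds once the typo in the image name is fixed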
For private images, a docker registry secret should be created:
kubectl create secret docker-registry my-secret --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
and then referenced from the Pod's spec:
spec:
  containers:
  - name: private-reg-container
    image: <your-private-image>
  imagePullSecrets:
  - name: my-secret

Pod stuck in CrashLoopBackOff

In this case, the Pod has started and then exited abnormally (its restartCount will be > 0). Take a look at the container logs:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
If your container has previously crashed, you can access the previous container’s crash log with:
kubectl logs --previous <pod-name>
The container logs may reveal the reason for the crash, e.g.:
  • Container process exited
  • Health check failed
  • OOMKilled
$ kubectl describe pod mypod
...
Containers:
  sh:
    Container ID:   docker://3f7a2ee0e7e0e16c22090a25f9b6e42b5c06ec049405bc34d3aa183060eb4906
    Image:          alpine
    Image ID:       docker-pullable://alpine@sha256:7b848083f93822dd21b0a2f14a110bd99f6efb4b838d499df6d04a49d0debf8b
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    2
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    2
    Ready:          False
    Restart Count:  3
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:     100m
      memory:  500M
...
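When the container is OOMKilled, as in the output above, the usual fix is to raise the memory limit or reduce the application's memory usage. A minimal sketch of the resources stanza (the values are illustrative; size them to the workload and node capacity):
resources:
  requests:
    cpu: "100m"
    memory: "500M"
  limits:
    cpu: "1"
    memory: "2G"    # raised from the 1G limit shown above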
You can also run commands inside the container with kubectl exec to investigate further:
kubectl exec cassandra -- cat /var/log/cassandra/system.log
If none of these approaches work, SSH to the Pod's host and check the kubelet or docker logs there. The host running the Pod can be found by running:
# Query Node
kubectl get pod <pod-name> -o wide

# SSH to Node
ssh <username>@<node-name>

Pod stuck in Error

In this case, the Pod has been scheduled but failed to start. Again, get information from kubectl describe pod <pod-name> and check what's wrong. Common reasons include:
  • referencing a non-existent ConfigMap, Secret, or PV (a quick check is shown after this list)
  • exceeding resource limits (e.g. a LimitRange)
  • violating a PodSecurityPolicy
  • missing authorization for cluster resources (e.g. with RBAC enabled, a RoleBinding must be created for the service account)
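For the first cause, it is worth confirming that every object the Pod references actually exists in its namespace, for example (the names are placeholders):
kubectl get configmap <configmap-name> -n <namespace>
kubectl get secret <secret-name> -n <namespace>
kubectl get pvc <pvc-name> -n <namespace>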

Pod stuck in Terminating or Unknown

Since v1.5, kube-controller-manager no longer deletes Pods just because a Node becomes unready. Instead, such Pods are marked Terminating or Unknown. If you are sure these Pods are no longer wanted, there are three ways to delete them permanently:
  • Delete the node from the cluster, e.g. kubectl delete node <node-name>. If you are running with a cloud provider, the node should be removed automatically after the VM is deleted from the cloud provider.
  • Recover the node. After the kubelet restarts, it reconciles Pod status with kube-apiserver and restarts or deletes those Pods.
  • Force delete the Pods, e.g. kubectl delete pods <pod> --grace-period=0 --force. This is not recommended unless you know what you are doing; for Pods belonging to a StatefulSet, force deletion may result in data loss or split-brain problems.
When the kubelet itself runs inside a Docker container, an UnmountVolume.TearDown failed error may show up in its logs:
{"log":"I0926 19:59:07.162477 54420 kubelet.go:1894] SyncLoop (DELETE, \"api\"): \"billcenter-737844550-26z3w_meipu(30f3ffec-a29f-11e7-b693-246e9607517c)\"\n","stream":"stderr","time":"2017-09-26T11:59:07.162748656Z"}
{"log":"I0926 19:59:39.977126 54420 reconciler.go:186] operationExecutor.UnmountVolume started for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") \n","stream":"stderr","time":"2017-09-26T11:59:39.977438174Z"}
{"log":"E0926 19:59:39.977461 54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}
In this case, the kubelet should be started with the --containerized option and its container should be run with the following volume mounts:
# Take calico plugin as an example
-v /:/rootfs:ro,shared \
-v /sys:/sys:ro \
-v /dev:/dev:rw \
-v /var/log:/var/log:rw \
-v /run/calico/:/run/calico/:rw \
-v /run/docker/:/run/docker/:rw \
-v /run/docker.sock:/run/docker.sock:rw \
-v /usr/lib/os-release:/etc/os-release \
-v /usr/share/ca-certificates/:/etc/ssl/certs \
-v /var/lib/docker/:/var/lib/docker:rw,shared \
-v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
-v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
-v /etc/kubernetes/config/:/etc/kubernetes/config/ \
-v /etc/cni/net.d/:/etc/cni/net.d/ \
-v /opt/cni/bin/:/opt/cni/bin/ \
Pods in Terminating state should be removed after the kubelet recovers. Sometimes, however, the Pods are not deleted automatically and even force deletion (kubectl delete pods <pod> --grace-period=0 --force) doesn't work. In that case, a finalizer is probably the cause, and removing it with kubectl edit can mitigate the problem:
"finalizers": [
  "foregroundDeletion"
]
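Instead of editing the object interactively, the same change can be applied with kubectl patch; this is a common shortcut, but double-check the Pod name and namespace before clearing its finalizers:
kubectl patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'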

Pod is running but not doing what it should do

If the pod has been running but not behaving as you expected, there may be errors in your pod description. Often a section of the pod description is nested incorrectly, or a key name is typed incorrectly, and so the key is ignored.
Try recreating the Pod with the --validate option:
kubectl delete pod mypod
kubectl create --validate -f mypod.yaml
or check whether the Pod that was created matches what you intended by reading its spec back from the apiserver:
kubectl get pod mypod -o yaml
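To spot the mistake, it can also help to diff the spec stored on the apiserver against your local manifest; fields added by the system (status, defaults, uid, and so on) will show up in the diff and can be ignored:
# Save the server-side view and compare it with the local file
kubectl get pod mypod -o yaml > mypod-apiserver.yaml
diff mypod.yaml mypod-apiserver.yaml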

Static Pod not recreated after manifest changed

The kubelet watches the /etc/kubernetes/manifests directory (configured by the kubelet's --pod-manifest-path option) with inotify. It is possible for the kubelet to miss some events, in which case the static Pod is not recreated automatically. Restarting the kubelet should solve the problem.
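On a systemd-based node this is simply (assuming the kubelet runs as a systemd unit named kubelet):
sudo systemctl restart kubelet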
