Pod Troubleshooting
This chapter discusses methods for troubleshooting Pod issues.
Generally, regardless of the Pod's error state, the following commands can always be run to check its status:
kubectl get pod <pod-name> -o yaml to check whether the Pod's configuration is correct
kubectl describe pod <pod-name> to review the Pod's events
kubectl logs <pod-name> [-c <container-name>] to review the container logs
These events and logs typically assist in diagnosing issues with the Pod.
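Events can also be listed directly, which is handy when you want to filter them for a single Pod; the field selectors below are supported for Event objects, but double-check against your cluster version:
# List events that reference a specific Pod (replace <pod-name> and the namespace)
kubectl get events --namespace default --field-selector involvedObject.kind=Pod,involvedObject.name=<pod-name>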
Pods Perpetually in 'Pending' State
The 'Pending' state indicates that the Pod has not yet been scheduled onto any Node. Run kubectl describe pod <pod-name> to check the Pod's events and find out why it has not been scheduled. For example:
$ kubectl describe pod mypod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12s (x6 over 27s) default-scheduler 0/4 nodes are available: 2 Insufficient cpu.
Potential causes include:
Insufficient resources: no Node in the cluster has enough CPU, memory, GPU, or ephemeral storage left to satisfy the Pod's requests. The solution is to delete unused Pods in the cluster or add new Nodes (see the capacity check after this list).
The requested HostPort is already in use. It is generally recommended to use a Service to expose the port externally instead.
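For the insufficient-resources case, a quick way to see how much of each Node's capacity is already requested is to inspect the Nodes directly (the grep window below is just a convenience for trimming the output):
# Show each Node's "Allocated resources" section (requests/limits already placed on it)
kubectl describe nodes | grep -A 8 "Allocated resources"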
Pods Perpetually in 'Waiting' or 'ContainerCreating' State
Start by looking at the current Pod's events using kubectl describe pod <pod-name>:
$ kubectl -n kube-system describe pod nginx-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned nginx-pod to node1
Normal SuccessfulMountVolume 1m kubelet, gpu13 MountVolume.SetUp succeeded for volume "config-volume"
Normal SuccessfulMountVolume 1m kubelet, gpu13 MountVolume.SetUp succeeded for volume "coredns-token-sxdmc"
Warning FailedSync 2s (x4 over 46s) kubelet, gpu13 Error syncing pod
Normal SandboxChanged 1s (x4 over 46s) kubelet, gpu13 Pod sandbox changed, it will be killed and re-created.
As can be seen, the Pod's sandbox container fails to start. To find the specific reason, check the kubelet logs on that Node:
$ journalctl -u kubelet
...
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649912 29801 cni.go:294] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649941 29801 cni.go:243] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: W0314 04:22:04.891337 29801 cni.go:258] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "c4fd616cde0e7052c240173541b8543f746e75c17744872aa04fe06f52b5141c"
Mar 14 04:22:05 node1 kubelet[29801]: E0314 04:22:05.965801 29801 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "nginx-pod" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
From the logs, the issue turns out to be the cni0 bridge being configured with an IP address from the wrong network segment. Deleting the bridge (the network plugin will recreate it automatically) fixes the issue:
$ ip link set cni0 down
$ brctl delbr cni0
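If brctl is not installed on the Node, the same bridge can be removed with iproute2 instead (an equivalent alternative, assuming nothing else is still using cni0):
$ ip link set cni0 down
$ ip link delete cni0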
Other possible causes include:
Image pull failure, for example:
Misconfigured images
Kubelet cannot access the image registry (in mainland China, a special workaround is needed to access gcr.io)
Misconfigured keys for a private image
The image is too large, causing the pull to time out (you can increase the kubelet's --image-pull-progress-deadline and --runtime-request-timeout options as appropriate)
CNI network errors, which usually require checking and adjusting the CNI network plugin's configuration (see the quick checks after this list), for example:
Unable to configure the Pod network
Unable to assign an IP address to the Pod
The container cannot start; check whether the correct image was built and the correct container arguments were configured
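For the CNI-related causes, a first check is to look at the plugin configuration and binaries on the affected Node; the paths below are the defaults also used later in this chapter, so adjust them if your plugin installs elsewhere:
# Inspect the CNI configuration and installed plugin binaries on the Node
ls /etc/cni/net.d/
cat /etc/cni/net.d/*
ls /opt/cni/bin/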
Pods in 'ImagePullBackOff' State
Pods in this state usually indicate that the image name is misconfigured or that the key for a private image is misconfigured. In such cases, run docker pull <image> on the Node to test whether the image can be pulled correctly.
$ kubectl describe pod mypod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 36s default-scheduler Successfully assigned sh to k8s-agentpool1-38622806-0
Normal SuccessfulMountVolume 35s kubelet, k8s-agentpool1-38622806-0 MountVolume.SetUp succeeded for volume "default-token-n4pn6"
Normal Pulling 17s (x2 over 33s) kubelet, k8s-agentpool1-38622806-0 pulling image "a1pine"
Warning Failed 14s (x2 over 29s) kubelet, k8s-agentpool1-38622806-0 Failed to pull image "a1pine": rpc error: code = Unknown desc = Error response from daemon: repository a1pine not found: does not exist or no pull access
Warning Failed 14s (x2 over 29s) kubelet, k8s-agentpool1-38622806-0 Error: ErrImagePull
Normal SandboxChanged 4s (x7 over 28s) kubelet, k8s-agentpool1-38622806-0 Pod sandbox changed, it will be killed and re-created.
Normal BackOff 4s (x5 over 25s) kubelet, k8s-agentpool1-38622806-0 Back-off pulling image "a1pine"
Warning Failed 1s (x6 over 25s) kubelet, k8s-agentpool1-38622806-0 Error: ImagePullBackOff
If the image is private, a docker-registry type Secret needs to be created first:
kubectl create secret docker-registry my-secret --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
Then reference this Secret in the Pod spec:
spec:
  containers:
  - name: private-reg-container
    image: <your-private-image>
  imagePullSecrets:
  - name: my-secret
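For reference, a minimal complete Pod manifest wired to this Secret might look as follows; the Pod name and image reference are placeholders to replace with your own:
apiVersion: v1
kind: Pod
metadata:
  name: private-reg-pod
spec:
  containers:
  - name: private-reg-container
    image: <your-private-image>   # replace with your private registry image
  imagePullSecrets:
  - name: my-secret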
Pods Keep Crashing (CrashLoopBackOff State)
The CrashLoopBackOff state means that the container did start but then exited abnormally. At this point, the Pod's Restart Count is usually greater than 0, and the first step is to check the container logs:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
kubectl logs --previous <pod-name>
These may offer some insight into why the container exited, for example:
Container process exiting
Health check failure
OOMKilled (Out of Memory)
$ kubectl describe pod mypod
...
Containers:
sh:
Container ID: docker://3f7a2ee0e7e0e16c22090a25f9b6e42b5c06ec049405bc34d3aa183060eb4906
Image: alpine
Image ID: docker-pullable://alpine@sha256:7b848083f93822dd21b0a2f14a110bd99f6efb4b838d499df6d04a49d0debf8b
Port: <none>
Host Port: <none>
State: Terminated
Reason: OOMKilled
Exit Code: 2
Last State: Terminated
Reason: OOMKilled
Exit Code: 2
Ready: False
Restart Count: 3
Limits:
cpu: 1
memory: 1G
Requests:
cpu: 100m
memory: 500M
...
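For the OOMKilled case shown above, one remedy is to raise the container's memory limit (or reduce the application's memory usage). A sketch of the relevant part of the Pod spec, with the new limit chosen purely for illustration:
resources:
  requests:
    cpu: 100m
    memory: 500M
  limits:
    cpu: 1
    memory: 2G   # raised from 1G; size this to your application and Node capacity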
If clues are still lacking, you can investigate further by executing commands inside the container:
kubectl exec cassandra -- cat /var/log/cassandra/system.log
If that still turns up nothing, SSH to the Node where the Pod is running and dig deeper into the kubelet or Docker logs:
# Query Node
kubectl get pod <pod-name> -o wide
# SSH to Node
ssh <username>@<node-name>
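Once on the Node, the kubelet and container runtime logs are the usual next stop (this assumes both run as systemd services; adjust the unit names if not):
# Inspect kubelet and Docker logs on the Node
journalctl -u kubelet --since "1 hour ago"
journalctl -u docker --since "1 hour ago"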
Pods in 'Error' State
The 'Error' state typically indicates that an error occurred during Pod startup. Common causes include:
Dependencies such as ConfigMap, Secret, or PV do not exist
Requested resources exceed the limitations set by the administrator, such as exceeding LimitRange, etc.
Violation of the cluster's security policy, such as PodSecurityPolicy, etc.
The Pod does not have permission to operate on resources within the cluster; for instance, with RBAC enabled, a role binding must be configured for its ServiceAccount
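As an illustration of the last point, a RoleBinding for a ServiceAccount can be created imperatively; all names and the namespace below are placeholders:
# Grant a ServiceAccount read-only access within its namespace (placeholder names)
kubectl create rolebinding my-sa-view \
  --clusterrole=view \
  --serviceaccount=default:my-serviceaccount \
  --namespace=default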
Pods in 'Terminating' or 'Unknown' State
From v1.5 onwards, Kubernetes no longer deletes Pods on its own when a Node fails; instead, it marks them as 'Terminating' or 'Unknown'. There are three ways to delete Pods in these states:
Remove the affected Node from the cluster. On public clouds, kube-controller-manager automatically deletes the corresponding Node object after the VM is deleted. For clusters deployed on physical machines, administrators need to delete the Node manually (kubectl delete node <node-name>).
Node recovery. The kubelet communicates with kube-apiserver to determine the expected state of these Pods and then decides whether to delete them or keep them running.
Forced deletion by the user. The user can run kubectl delete pods <pod> --grace-period=0 --force to forcefully delete the Pod. Unless it is clear that the Pod has indeed stopped (for example, the VM or physical machine hosting the Node has been shut down), this method is not recommended. In particular, for Pods managed by a StatefulSet, forced deletion can easily lead to split-brain or data loss.
If the kubelet itself runs as a Docker container, you may find errors like the following in its logs:
{"log":"I0926 19:59:07.162477 54420 kubelet.go:1894] SyncLoop (DELETE, \"api\"): \"billcenter-737844550-26z3w_meipu(30f3ffec-a29f-11e7-b693-246e9607517c)\"\n","stream":"stderr","time":"2017-09-26T11:59:07.162748656Z"}
{"log":"I0926 19:59:39.977126 54420 reconciler.go:186] operationExecutor.UnmountVolume started for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") \n","stream":"stderr","time":"2017-09-26T11:59:39.977438174Z"}
{"log":"E0926 19:59:39.977461 54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}
For this scenario, set the --containerized parameter on the kubelet and pass in the following volumes:
# Example using calico network plugin
-v /:/rootfs:ro,shared \
-v /sys:/sys:ro \
-v /dev:/dev:rw \
-v /var/log:/var/log:rw \
-v /run/calico/:/run/calico/:rw \
-v /run/docker/:/run/docker/:rw \
-v /run/docker.sock:/run/docker.sock:rw \
-v /usr/lib/os-release:/etc/os-release \
-v /usr/share/ca-certificates/:/etc/ssl/certs \
-v /var/lib/docker/:/var/lib/docker:rw,shared \
-v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
-v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
-v /etc/kubernetes/config/:/etc/kubernetes/config/ \
-v /etc/cni/net.d/:/etc/cni/net.d/ \
-v /opt/cni/bin/:/opt/cni/bin/ \
Pods in the Terminating state are usually deleted automatically once the kubelet resumes normal operation. Sometimes, however, they cannot be deleted, and even forcing deletion with kubectl delete pods <pod> --grace-period=0 --force does not work. This is usually caused by finalizers; removing the finalizers with kubectl edit resolves the issue:
"finalizers": [
"foregroundDeletion"
]
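Instead of editing interactively, the finalizers can also be cleared with a patch; make sure you understand why the finalizer is there before removing it:
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'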
Pod troubleshooting diagram

(From A visual guide on troubleshooting Kubernetes deployments)