Pod Troubleshooting
This chapter discusses methods for troubleshooting Pod issues.
Generally, regardless of the error state of the Pod, the following commands can be executed to check the Pod's status:
kubectl get pod <pod-name> -o yaml to check if the Pod's configuration is correct
kubectl describe pod <pod-name> to review the Pod's events
kubectl logs <pod-name> [-c <container-name>] to review the container logs
These events and logs typically assist in diagnosing issues with the Pod.
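As a quick sketch, assuming a Pod named my-pod with a container named app (both names are placeholders):

```sh
kubectl get pod my-pod -o yaml         # inspect the Pod's full configuration
kubectl describe pod my-pod            # review scheduling and lifecycle events
kubectl logs my-pod -c app             # logs of the "app" container
kubectl logs --previous my-pod -c app  # logs of the previous instance, if it restarted
```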
Pods Perpetually in 'Pending' State
The 'Pending' state indicates that the Pod has not yet been scheduled onto any Node. Execute kubectl describe pod <pod-name> to check the current Pod's events and discern why it has not been scheduled.
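For instance, a minimal sketch of surfacing those events (the exact output depends on your cluster):

```sh
kubectl describe pod <pod-name>
# or list the Pod's events directly, most recent last
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp
```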
Potential causes include:
Insufficient resources: none of the Nodes in the cluster satisfy the CPU, memory, GPU, or ephemeral storage requested by the Pod. The solution is to delete unused Pods in the cluster or add new Nodes (see the sketch after this list).
The requested HostPort is already in use. It is generally recommended to use a Service to expose the port externally instead.
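A hedged sketch for checking both causes, assuming access to the Nodes via kubectl (the grep pattern is illustrative):

```sh
# How much CPU/memory each Node has already committed (look for "Allocated resources")
kubectl describe nodes | grep -A 8 "Allocated resources"
# If the Pod requests a HostPort, check which Pods already run on a candidate Node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```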
Pods Perpetually in 'Waiting' or 'ContainerCreating' State
Start by looking at the current Pod's events using kubectl describe pod <pod-name>.
If the events show that the Pod's sandbox container cannot start normally, the specific reason requires checking the Kubelet logs on the Node where the Pod is scheduled.
In one such case, the Kubelet logs revealed that the cni0 bridge was configured with an IP address from a different network segment; deleting the bridge (the network plugin recreates it automatically) fixed the issue.
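A hedged sketch of that procedure, run on the affected Node (assumes a systemd-managed kubelet and a flannel-style cni0 bridge; adjust for your environment):

```sh
# Inspect recent kubelet logs for sandbox/CNI errors
journalctl -u kubelet --no-pager | tail -n 100
# Compare the bridge's address with the Node's Pod CIDR
ip addr show cni0
# Delete the misconfigured bridge; the network plugin will recreate it
ip link set cni0 down
ip link delete cni0 type bridge
```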
Other possible causes include:
Image pull failure, for example:
Misconfigured image name
The Kubelet cannot access the image registry (a special workaround is needed in mainland China to access gcr.io)
Misconfigured credentials for a private image
The image is too large, causing the pull to time out (you can appropriately increase the kubelet's --image-pull-progress-deadline and --runtime-request-timeout options)
CNI network error; check and, if necessary, adjust the CNI network plugin's configuration (see the sketch after this list), for example:
Unable to configure the Pod network
Unable to assign an IP address to the Pod
The container cannot start; check whether the correct image has been packaged and whether the correct container parameters have been configured
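A hedged sketch of basic CNI checks on the Node (the paths are the conventional defaults and may differ in your deployment):

```sh
# CNI configuration files the kubelet will use
ls /etc/cni/net.d/
# CNI plugin binaries
ls /opt/cni/bin/
# Verify the network plugin's Pods (if deployed as a DaemonSet) are healthy
kubectl -n kube-system get pods -o wide
```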
Pods in 'ImagePullBackOff' State
Pods in this state typically indicate a configuration error in the image name or a misconfiguration of the private registry's pull Secret. In such cases, use docker pull <image>
to test whether the image can be pulled correctly.
If the image is private, a docker-registry type Secret needs to be created first:
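A minimal sketch, with the registry address, credentials, and the Secret name regsecret as placeholders:

```sh
kubectl create secret docker-registry regsecret \
  --docker-server=<your-registry-server> \
  --docker-username=<your-username> \
  --docker-password=<your-password> \
  --docker-email=<your-email>
```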
Then link this Secret in the container:
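For example, a Pod spec referencing the Secret above might look like this (all names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
  - name: app
    image: <your-private-image>
  imagePullSecrets:
  - name: regsecret
```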
Pods Keep Crashing (CrashLoopBackOff State)
The CrashLoopBackOff state means that the container did indeed start, but then exited abnormally. At this point, the Pod's restart count is typically greater than 0, and you may want to start by checking the container logs:
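For instance (--previous shows the logs of the last terminated instance, which is usually what you need when the container keeps restarting):

```sh
kubectl logs <pod-name> [-c <container-name>]
kubectl logs --previous <pod-name> [-c <container-name>]
```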
From here, there may be some insights as to why the container exited, such as:
Container process exiting
Health check failure
OOMKilled (Out of Memory)
If clues are still lacking, you can further investigate the reasons for exiting by executing commands inside the container:
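A sketch, assuming the container stays up long enough to accept an exec and writes logs to a path such as /var/log/app.log (the path is purely illustrative):

```sh
kubectl exec <pod-name> [-c <container-name>] -- ls /var/log
kubectl exec <pod-name> [-c <container-name>] -- tail -n 100 /var/log/app.log
```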
If there are still no hints, SSH into the Node where the Pod is located and dig further into the Kubelet or Docker logs:
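For example, assuming a systemd-managed kubelet and the Docker runtime on that Node:

```sh
# Find the Node the Pod is running on
kubectl get pod <pod-name> -o wide
# Log in to that Node
ssh <username>@<node-name>
# Inspect kubelet and Docker logs there
journalctl -u kubelet --no-pager | tail -n 200
journalctl -u docker --no-pager | tail -n 200
```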
Pods in 'Error' State
Typically, the 'Error' state indicates that an error occurred during the Pod startup process. Common causes include:
Dependencies such as ConfigMap, Secret, or PV do not exist
Requested resources exceed the limitations set by the administrator, such as exceeding LimitRange, etc.
Violation of the cluster's security policy, such as PodSecurityPolicy, etc.
The container does not have permission to operate on resources within the cluster; for instance, after enabling RBAC, role bindings need to be configured for the ServiceAccount (see the sketch below)
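A few illustrative checks for these causes (resource names, namespace, and ServiceAccount are placeholders):

```sh
# Verify that referenced dependencies exist
kubectl get configmap <configmap-name> -n <namespace>
kubectl get secret <secret-name> -n <namespace>
kubectl get pv
# Inspect namespace-level resource constraints
kubectl get limitrange --all-namespaces
# Check whether a ServiceAccount may perform an operation under RBAC
kubectl auth can-i list pods -n <namespace> \
  --as=system:serviceaccount:<namespace>:<serviceaccount-name>
```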
Pods in 'Terminating' or 'Unknown' State
From v1.5 onwards, Kubernetes no longer deletes Pods on its own when their Node becomes unreachable; instead, it marks them as 'Terminating' or 'Unknown'. There are three methods to delete Pods in these states:
Remove the Node in question from the cluster. When using a public cloud, kube-controller-manager automatically deletes the corresponding Node object after the VM is deleted. For clusters deployed on physical machines, administrators need to delete the Node manually (kubectl delete node <node-name>).
Node recovery. The Kubelet will communicate with kube-apiserver to determine the expected state of these Pods, and then decide to delete or continue running them.
Forced deletion by the user. The user can execute kubectl delete pods <pod> --grace-period=0 --force to forcefully delete the Pod. Unless it is clear that the Pod is indeed stopped (such as when the VM or physical machine hosting the Node has been shut down), this method is not recommended. In particular, for Pods managed by a StatefulSet, forced deletion can easily lead to problems such as split-brain or data loss.
If the Kubelet runs as a Docker container, you may find the following error in the kubelet logs:
For this scenario, set the --containerized parameter for the kubelet container and pass in the following volumes:
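The exact set of mounts depends on the deployment and network plugin; the snippet below is only a rough, illustrative sketch of the kind of bind mounts such a setup typically needs, not a definitive list:

```sh
# Illustrative: run the kubelet as a privileged Docker container with shared mounts
docker run -d --privileged --net=host --pid=host \
  --name kubelet \
  -v /:/rootfs:ro,shared \
  -v /sys:/sys:ro \
  -v /dev:/dev:rw \
  -v /var/log:/var/log:rw \
  -v /var/run:/var/run:rw \
  -v /var/lib/docker/:/var/lib/docker:rw,shared \
  -v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
  <kubelet-image> \
  kubelet --containerized --kubeconfig=/etc/kubernetes/kubeconfig
```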
Pods in the Terminating state are usually deleted automatically after the Kubelet resumes normal operation. However, there are sometimes cases where they cannot be deleted, and forced deletion with kubectl delete pods <pod> --grace-period=0 --force does not work either. In that case, the cause is generally finalizers, and deleting the finalizers via kubectl edit can resolve the issue.
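For instance, the finalizers can also be cleared with a single kubectl patch (equivalent in effect to removing them via kubectl edit; use with the same caution as forced deletion):

```sh
kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'
```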
Pod troubleshooting diagram (from "A visual guide on troubleshooting Kubernetes deployments")
References