Cluster Troubleshooting
This section introduces methods for troubleshooting abnormal cluster conditions, covering the core Kubernetes components as well as essential extensions (such as kube-dns). For network-related issues, please refer to the network troubleshooting guide.
Overview
Investigating abnormal cluster states generally begins with examining the status of Node and Kubernetes services, identifying the faulty service, and then seeking a solution. There could be many reasons for abnormal cluster states, including:
Shutdown of virtual or physical machines
Network partitions
Failure of Kubernetes services to start properly
Loss of data or unavailability of persistent storage (commonly on public or private clouds)
Operational mistakes (like configuration errors)
Considering different components, the reasons might include:
Failure to start kube-apiserver, leading to:
The cluster becoming inaccessible
Existing Pods and services continuing to run normally (except those that depend on the Kubernetes API)
Anomalies in the etcd cluster, leading to:
kube-apiserver being unable to read or write cluster state, causing errors when accessing the Kubernetes API
kubelet failing to update its status periodically
Malfunctioning kube-controller-manager/kube-scheduler, leading to:
Replication controllers, node controllers, cloud service controllers, and so on not working, which means Deployments and Services do not function properly and new Nodes cannot register with the cluster
Newly created Pods cannot be scheduled (always in Pending state)
The Node itself crashing or the kubelet failing to start, leading to:
Pods on the Node not operating as expected
Pods that are already running being unable to terminate properly
Network partitions leading to communication anomalies between Kubelet and the control plane, as well as between Pods
To maintain the health of the cluster, consider the following when deploying a cluster:
Enable the cloud platform's automatic VM restart feature
Configure etcd as a highly available multi-node cluster, use persistent storage (such as AWS EBS), and back up its data regularly
Configure high availability for the control plane, e.g. load balancing across multiple kube-apiserver instances and running multiple replicas of kube-controller-manager, kube-scheduler, kube-dns, etc.
Prefer using replication controllers and Services rather than directly managing Pods
Deploy multiple Kubernetes clusters across regions
Checking Node Status
Generally, start by checking the status of the Nodes and confirming that each Node is in the Ready state.
If a Node is NotReady, run the kubectl describe node <node-name> command to examine its recent events. These events are usually helpful for troubleshooting issues on the Node.
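For example, a quick check might look like the following (replace <node-name> with the Node you are investigating):

```sh
# List all Nodes and their readiness status
kubectl get nodes

# Inspect a problematic Node; the Conditions and Events sections usually
# point to the underlying cause (kubelet down, disk pressure, etc.)
kubectl describe node <node-name>
```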
SSH Login to Node
When troubleshooting Kubernetes issues, you usually need to SSH onto the affected Node to check the status and logs of kubelet, docker, iptables, and so on. On a cloud platform, you can bind a public IP to the corresponding VM; for bare-metal deployments, you can reach the Node via port mapping on the router. A simpler method is to use an SSH Pod (remember to set nodeName to your Node's name):
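The original manifest is not reproduced here; the following is a minimal sketch of what such an ssh.yaml could look like. It assumes an image that runs an sshd server on port 22 (my-sshd-image is a placeholder, not a real image), pins the Pod to the target Node via nodeName, and exposes it through a LoadBalancer Service:

```sh
# Sketch only: write ssh.yaml and create the Pod and Service.
cat <<'EOF' > ssh.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssh
  labels:
    app: ssh
spec:
  nodeName: <node-name>            # the Node you want to log into
  containers:
  - name: sshd
    image: my-sshd-image:latest    # placeholder: use any image that starts sshd on port 22
    ports:
    - containerPort: 22
---
apiVersion: v1
kind: Service
metadata:
  name: ssh
spec:
  type: LoadBalancer
  selector:
    app: ssh
  ports:
  - port: 22
    targetPort: 22
EOF

kubectl apply -f ssh.yaml

# Wait for the cloud provider to assign an external IP
kubectl get service ssh
```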
Next, you can log into the Node through the external IP of the ssh service, for example ssh user@52.52.52.52.
When you are done, don't forget to delete the SSH service with kubectl delete -f ssh.yaml.
Viewing Logs
Generally, there are two deployment methods for the main components of Kubernetes:
Using systemd (or a similar init system) to manage the control plane services
Using static Pods to manage and run the control plane services
When systemd (or a similar init system) is used to manage the control plane services, you must first SSH into the machine to view the logs. For example, you can query journald:
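(The unit names below are assumptions; they depend on how the services were installed on your distribution.)

```sh
# Assumes systemd units named after each component
journalctl -l -u kube-apiserver
journalctl -l -u kube-controller-manager
journalctl -l -u kube-scheduler
journalctl -l -u kubelet
journalctl -l -u kube-proxy
```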
Or view the log files directly:
/var/log/kube-apiserver.log
/var/log/kube-scheduler.log
/var/log/kube-controller-manager.log
/var/log/kubelet.log
/var/log/kube-proxy.log
kube-apiserver logs
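When the control plane runs as static Pods (for example in a kubeadm-based cluster, where the Pods are labeled component=<name>), the logs can be fetched with kubectl. The label selector below is an assumption based on that setup:

```sh
# Assumes a kubeadm-style static Pod labeled component=kube-apiserver
PODNAME=$(kubectl -n kube-system get pod -l component=kube-apiserver \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system logs $PODNAME --tail 100
```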
kube-controller-manager logs
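Likewise, assuming the static Pod carries the component=kube-controller-manager label:

```sh
PODNAME=$(kubectl -n kube-system get pod -l component=kube-controller-manager \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system logs $PODNAME --tail 100
```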
kube-scheduler logs
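And for the scheduler, assuming a component=kube-scheduler label:

```sh
PODNAME=$(kubectl -n kube-system get pod -l component=kube-scheduler \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system logs $PODNAME --tail 100
```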
kube-dns logs
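kube-dns normally runs as a Deployment whose Pods carry the k8s-app=kube-dns label (CoreDNS reuses the same label). A sketch, assuming that labeling:

```sh
PODNAME=$(kubectl -n kube-system get pod -l k8s-app=kube-dns \
  -o jsonpath='{.items[0].metadata.name}')
# The classic kube-dns Pod has several containers (kubedns, dnsmasq, sidecar);
# pick the one you need, or drop -c for a single-container CoreDNS Pod.
kubectl -n kube-system logs $PODNAME -c kubedns
```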
kubelet logs
To view kubelet logs, you need to SSH into the Node first.
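Assuming the kubelet runs as a systemd service named kubelet:

```sh
journalctl -l -u kubelet
```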
kube-proxy logs
kube-proxy is usually deployed as a DaemonSet, so its logs can be viewed directly with kubectl:
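(The k8s-app=kube-proxy label below is an assumption, matching the label used by kubeadm deployments.)

```sh
kubectl -n kube-system get pod -l k8s-app=kube-proxy
PODNAME=$(kubectl -n kube-system get pod -l k8s-app=kube-proxy \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system logs $PODNAME
```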