Network Troubleshooting
This chapter covers common network problems and how to troubleshoot them, including Pod access failures, Service access failures, and network policy issues.
When we talk about Kubernetes networking, it usually falls into one of the following three scenarios:
Pod accessing the network outside the container
Accessing the Pod network from outside the container
Inter-Pod access
Of course, each of the above scenarios also includes local access and cross-host access. In most cases, Pods are accessed indirectly through Services.
Locating a network problem basically starts from these scenarios: pin down the specific point where the network fails, then look for a solution. There are many possible causes of network failures; common ones include:
Misconfiguration of the CNI network plugin, so that Pods on different hosts cannot reach each other. For example:
IP segment conflicts with the existing network
The used plugin employs a protocol that is not supported by the underlying network
Forgetting to enable IP forwarding, etc.
```sh
sysctl net.ipv4.ip_forward
sysctl net.bridge.bridge-nf-call-iptables
```
Missing Pod network routes. For instance:
kubenet requires the network to have routes from each node's podCIDR to that node's IP address. If these routes are not configured properly, Pod network communication breaks.
On public cloud platforms, kube-controller-manager automatically configures routes for all Nodes, but misconfiguration (such as authentication or authorization failures, or exceeded quotas) may prevent the routes from being created.
Service NodePort and health probe port conflict
In clusters before version 1.10.4, there may be instances where the NodePort and health probe ports of different Services overlap (this issue has been fixed in kubernetes#64468).
Security groups, firewalls, or security policies within the host or cloud platform could be blocking the Pod network. For example,
Non-Kubernetes managed iptables rules could be blocking the Pod network
Public cloud platform security groups blocking the Pod network (note that the Pod network may not be in the same network segment as the Node network)
Switch or router ACL blocking the Pod network
Flannel Pods Constantly in Init:CrashLoopBackOff State
Deploying the Flannel network plugin is effortless, requiring only one command:
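For example, applying the upstream manifest (the URL below is illustrative and depends on the Flannel version and repository location you use):

```sh
# Manifest URL is illustrative; pick the one matching your Flannel version
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```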
However, after deployment, the Flannel Pod might encounter an initialization failure error.
Looking at the logs of the failing Pod reveals the cause; you can fetch them as shown below.
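(The label below follows the upstream kube-flannel manifest and may differ in your deployment; the Pod name is a placeholder.)

```sh
# Find the failing Flannel Pod on the affected node
kubectl -n kube-system get pod -l app=flannel -o wide
# Dump its logs (use -c <container> to pick a specific container)
kubectl -n kube-system logs <flannel-pod-name> --all-containers
```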
This is generally due to SELinux being enabled. It can be resolved by disabling SELinux. There are two ways to do this:
Modify the /etc/selinux/config file and set SELINUX=disabled
Temporarily disable it with the following command (the change is lost after a reboot): setenforce 0
Pod Unable to Allocate IP
The Pod stays in the ContainerCreating state, and examining its events reveals that the network plugin cannot assign it an IP:
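For example (with <pod-name> as a placeholder):

```sh
# The Events section at the bottom of the output shows the CNI failure
kubectl describe pod <pod-name>
```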
Checking the network plugin's IP allocation status shows that all IP addresses have indeed been allocated, while the number of Pods actually in the Running state is much smaller:
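For example, with flannel's host-local IPAM (the path below is the flannel default and may differ for other plugins; <node-name> is a placeholder):

```sh
# IPs recorded as allocated by the host-local IPAM plugin
ls /var/lib/cni/networks/cbr0 | wc -l
# Pods actually running on this node
kubectl get pod -o wide --all-namespaces | grep <node-name> | wc -l
```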
There are two possible reasons for this:
It could be an issue with the network plugin itself whereby the IP is not released after the Pod is stopped.
Pods may be recreated faster than the kubelet can recycle their networks through the CNI plugin (during garbage collection, the kubelet first calls CNI to clean up a stopped Pod's network before deleting the Pod).
For the first problem, it is best to contact the plugin developer for a fix or a workaround. If you are familiar with how the network plugin works, you can also release the unused IP addresses manually by following the steps below (a sketch of the whole procedure follows the list):
Stop the Kubelet
Locate the files where the IPAM plugin stores the assigned IP addresses, such as /var/lib/cni/networks/cbr0 (flannel) or /var/run/azure-vnet-ipam.json (Azure CNI)
Query the IPs currently used by containers, for example with kubectl get pod -o wide --all-namespaces | grep <node-name>
Compare the two lists, delete the unused IP addresses from the IPAM files, and manually delete the related virtual network interfaces and network namespaces (if any)
Restart the Kubelet
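A minimal sketch of these steps for the flannel/host-local IPAM case; the paths and names are illustrative and depend on your CNI configuration:

```sh
systemctl stop kubelet
# IPs recorded by the IPAM plugin
ls /var/lib/cni/networks/cbr0
# IPs actually in use by Pods on this node
kubectl get pod -o wide --all-namespaces | grep <node-name>
# Release an IP that no running Pod uses (repeat for each leaked IP)
rm /var/lib/cni/networks/cbr0/<unused-ip>
systemctl start kubelet
```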
For the second issue, you can configure faster garbage collection for the Kubelet.
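For example, using the kubelet's container garbage-collection flags (the flags are real kubelet options, but the values below are illustrative; newer releases prefer eviction-based settings):

```sh
# Tune the values to your environment
kubelet ... \
  --minimum-container-ttl-duration=15s \
  --maximum-dead-containers-per-container=1 \
  --maximum-dead-containers=100
```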
Pod Unable to Resolve DNS
If the Docker version installed on the Node is higher than 1.12, Docker changes the default iptables FORWARD policy to DROP, which breaks Pod network access. The solution is to run iptables -P FORWARD ACCEPT on each Node, for example:
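One common way is to reset the policy every time Docker starts, for instance via a systemd drop-in (the drop-in file name is illustrative):

```sh
mkdir -p /etc/systemd/system/docker.service.d
cat <<EOF > /etc/systemd/system/docker.service.d/forward-accept.conf
[Service]
ExecStartPost=/sbin/iptables -P FORWARD ACCEPT
EOF
systemctl daemon-reload
systemctl restart docker
```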
If you are using the flannel/weave network plugins, upgrading to the latest version can also solve this problem.
Aside from this, there are many other reasons causing DNS resolution failure:
(1) DNS resolution failure may also be caused by kube-dns service anomalies. The following command can be used to check if kube-dns is running normally:
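Both kube-dns and CoreDNS carry the k8s-app=kube-dns label in the standard deployments:

```sh
kubectl -n kube-system get pod -l k8s-app=kube-dns
```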
If kube-dns is in the CrashLoopBackOff state, you can refer to Kube-dns/Dashboard CrashLoopBackOff Troubleshooting to view specific troubleshooting methods.
(2) If the kube-dns Pod is in a normal Running state, you need to check further if the kube-dns service has been correctly configured:
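Both the Service and its Endpoints object should exist:

```sh
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns
```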
If the kube-dns service is absent or the endpoints list is empty, it indicates that the kube-dns service configuration is erroneous. You can recreate the kube-dns service, such as:
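A representative kube-dns Service manifest; the clusterIP must match the DNS IP configured for your cluster (for example the kubelet's --cluster-dns setting), and 10.96.0.10 below is only the common kubeadm default:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "KubeDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.96.0.10   # must match the cluster DNS IP
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
```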
(3) If you have recently upgraded CoreDNS and are using the proxy plugin of CoreDNS, please note that versions 1.5.0 and above require replacing the proxy plugin with the forward plugin. For example:
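In the Corefile, the change is simply:

```
# CoreDNS < 1.5.0
proxy . /etc/resolv.conf
# CoreDNS >= 1.5.0
forward . /etc/resolv.conf
```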
(4) If the kube-dns Pod and Service are both functioning properly, then it is necessary to check whether kube-proxy has correctly configured load balancing iptables rules for kube-dns. The specific troubleshooting methods can be referred to in the section "Service cannot be accessed" below.
Slow DNS resolution
Due to a kernel bug, there is a race condition in the connection tracking (conntrack) module that causes slow DNS resolution. The community is tracking the issue at https://github.com/kubernetes/kubernetes/issues/56903.
Temporary workaround: configure the single-request-reopen option for containers so that concurrent DNS requests do not share the same five-tuple:
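One common way is a postStart hook that appends the option to /etc/resolv.conf (the Pod and image names below are illustrative; note this glibc resolver option is ignored by musl-based images such as Alpine):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo            # illustrative Pod name
spec:
  containers:
  - name: app
    image: nginx        # illustrative image
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - "echo 'options single-request-reopen' >> /etc/resolv.conf"
```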
Alternatively, configure dnsConfig for Pods:
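For example (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-dnsconfig  # illustrative Pod name
spec:
  containers:
  - name: app
    image: nginx        # illustrative image
  dnsConfig:
    options:
    - name: single-request-reopen
```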
Note: the single-request-reopen option has no effect on Alpine. Use another base image, such as Debian, or apply the fix below.
Permanent fix: upgrade the kernel and make sure it includes the following two patches:
netfilter: nf_nat: skip nat clash resolution for same-origin entries (included since kernel v5.0)
netfilter: nf_conntrack: resolve clash for matching conntracks (included since kernel v4.19)
For Azure, this issue has been fixed in v4.15.0-1030.31/v4.18.0-1006.6 (patch1, patch2).
Other possible reasons and fixes include:
Having both Kube-dns and CoreDNS present at the same time can cause issues, so only keep one.
Slow DNS resolution may occur if the resource limits for kube-dns or CoreDNS are too low. In this case, increase the resource limits.
Configure the DNS option use-vc to force DNS queries to use the TCP protocol.
Run a DNS caching service on every node and point all containers' DNS nameserver to that local cache.
It is recommended to deploy the NodeLocal DNSCache extension to address this problem and improve DNS resolution performance. See https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns/nodelocaldns for deployment steps.
For more methods of customizing DNS configuration, please refer to Customizing DNS Service.
Service cannot be accessed
When access to a Service's ClusterIP fails, first confirm that the Service has corresponding Endpoints:
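(Replace <service-name> with the name of your Service.)

```sh
kubectl get endpoints <service-name>
```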
If the list is empty, it may be due to an incorrect LabelSelector configuration for this Service. You can use the following method to confirm:
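Compare the Service's selector with the labels of the Pods that are supposed to back it (the label key and value below are placeholders):

```sh
# The selector configured on the Service
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
# Pods that actually carry those labels
kubectl get pods -l <key>=<value> -o wide
```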
If the Endpoints are normal, you can further check:
Whether the Service's targetPort matches the Pod's containerPort.
Whether direct access to podIP:containerPort works normally.
Furthermore, even if all of the above configurations are correct and error-free, there may be other reasons causing issues with accessing the Service, such as:
The containers inside Pods may not be running properly or not listening on the specified containerPort.
CNI network or host routing abnormalities can also cause similar problems.
The kube-proxy service may not be running or may not have configured the corresponding iptables rules. For example, under normal circumstances a Service named hostnames has matching iptables rules configured by kube-proxy, which you can inspect as shown below.
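(In iptables mode, kube-proxy renders each Service into KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* chains; hostnames here is just the example Service name.)

```sh
iptables-save | grep hostnames
```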
Pod cannot access itself through Service
This is usually caused by misconfigured hairpin mode, which is set through the Kubelet's --hairpin-mode option. Valid values are "promiscuous-bridge", "hairpin-veth", and "none"; the default is "promiscuous-bridge".
For the hairpin-veth mode, you can confirm if it takes effect with the following command:
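(The bridge name cbr0 below is the kubenet/flannel default and may differ for other plugins; every interface should print 1.)

```sh
for intf in /sys/devices/virtual/net/cbr0/brif/*; do cat $intf/hairpin_mode; done
```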
And for the promiscuous-bridge mode, you can confirm if it takes effect with the following command:
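(Again, cbr0 is a placeholder for your bridge; the output should contain the PROMISC flag.)

```sh
ifconfig cbr0 | grep PROMISC
```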
Unable to access Kubernetes API
Many extension services need to access the Kubernetes API to query the required data (such as kube-dns, Operator, etc.). Usually, when unable to access the Kubernetes API, you can first verify that the Kubernetes API is functioning properly using the following command:
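One way, sketched below, is to run a temporary Pod with curl (the image name is illustrative) and query the API with the Pod's ServiceAccount token:

```sh
# Launch a throwaway Pod with an interactive shell
kubectl run curl -it --rm --restart=Never --image=curlimages/curl --command -- sh

# Inside the container:
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/version
```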
If a timeout error occurs, further confirmation is needed to ensure that the Service named kubernetes and its list of endpoints are normal:
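(The kubernetes Service lives in the default namespace.)

```sh
kubectl get service kubernetes
kubectl get endpoints kubernetes
```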
Then you can directly access the endpoints to check if kube-apiserver can be accessed normally. If it cannot be accessed, it usually means that kube-apiserver is not started properly or there are firewall rules blocking the access.
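For example, using an address and port taken from the endpoints listed above:

```sh
curl -k https://<apiserver-ip>:<port>/version
```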
However, if a 403 - Forbidden error occurs, it indicates that the Kubernetes cluster has enabled access authorization control (such as RBAC). In this case, you need to create roles and role bindings for the ServiceAccount used by the Pods to authorize access to the required resources. For example, CoreDNS needs to create the following ServiceAccount and role binding:
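An abridged version of the RBAC objects shipped with the standard CoreDNS deployment manifests (newer CoreDNS releases may additionally need access to EndpointSlices):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: coredns
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:coredns
rules:
- apiGroups: [""]
  resources: ["endpoints", "services", "pods", "namespaces"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:coredns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
- kind: ServiceAccount
  name: coredns
  namespace: kube-system
```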
Kernel Problems
In addition to the above issues, there may also be errors in accessing services or timeouts caused by kernel problems, such as:
SNAT port allocation failures because --random-fully is not set, resulting in service access timeouts. Note that Kubernetes currently does not set the --random-fully option for SNAT; if you run into this issue, you can refer to here for how to configure it.
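For reference, a fully randomized SNAT rule looks like the following (the Pod CIDR 10.244.0.0/16 is illustrative; in practice the option has to be added to the MASQUERADE rules created by kube-proxy or the CNI plugin):

```sh
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE --random-fully
```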