Monitoring
The Kubernetes community offers a range of tools for monitoring the status of containers and clusters, and Prometheus adds alerting on top of them.
cAdvisor collects container and node resource usage statistics on a single node. It is built into the Kubelet and exposed externally through the Kubelet's /metrics/cadvisor endpoint.
metrics-server provides resource monitoring data for the entire cluster, but note that:
The Metrics API can only query current metric data and does not save historical data.
The Metrics API URI is /apis/metrics.k8s.io/ and the API is maintained at k8s.io/metrics.
metrics-server must be deployed to use this API, and metrics-server obtains its data by calling the Kubelet Summary API.
kube-state-metrics provides metrics for Kubernetes resource objects (such as DaemonSet, Deployments, etc.).
Prometheus is another monitoring system and time-series database, which also provides alerting.
Node Problem Detector monitors issues with the Node itself, such as hardware, kernel, or runtime problems.
Heapster (deprecated) provided resource monitoring across the entire cluster and supported persisting data to backends such as InfluxDB.
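To make the Metrics API mentioned above more concrete, here is a minimal Python sketch that converts the quantity strings in a node metrics response into plain numbers. The sample payload is hand-written and only mimics the shape of a NodeMetricsList (the v1beta1 version, node name, and values are assumptions), the helper names are hypothetical, and only a few quantity suffixes are handled.

```python
import json

# Hand-written sample mimicking a NodeMetricsList response (illustrative values).
SAMPLE = """
{
  "kind": "NodeMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "items": [
    {"metadata": {"name": "node-1"},
     "usage": {"cpu": "250m", "memory": "2048Ki"}}
  ]
}
"""

# Only a subset of Kubernetes quantity suffixes is handled in this sketch.
CPU_SUFFIXES = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
MEM_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_cpu(quantity: str) -> float:
    """Return CPU usage in cores."""
    for suffix, factor in CPU_SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Return memory usage in bytes."""
    for suffix, factor in MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def summarize(payload: str) -> dict:
    """Map each node name to a (cpu cores, memory bytes) tuple."""
    items = json.loads(payload)["items"]
    return {
        item["metadata"]["name"]: (
            parse_cpu(item["usage"]["cpu"]),
            parse_memory(item["usage"]["memory"]),
        )
        for item in items
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because the API only returns current values, anything that needs history has to store these numbers itself or rely on a system such as Prometheus.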
cAdvisor is a container monitoring tool from Google and is also the container resource collector built into the Kubelet. It automatically collects CPU, memory, network, and file system usage statistics for containers on the local machine and exposes cAdvisor's native API externally (the default port is set by --cadvisor-port=4194).
Starting from v1.7, the Kubelet metrics API no longer includes cAdvisor metrics; they are served from a separate endpoint instead:
Kubelet metrics: http://127.0.0.1:8001/api/v1/proxy/nodes/<node-name>/metrics
Cadvisor metrics: http://127.0.0.1:8001/api/v1/proxy/nodes/<node-name>/metrics/cadvisor
Tools such as Prometheus must therefore use the new endpoint to obtain this data, as in the following Prometheus configuration, which automatically configures scraping of the cAdvisor metrics API:
Note: The port that cAdvisor listens on will be removed in v1.12, so all external tools should use the Kubelet Metrics API instead.
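As a rough illustration of the two endpoints, the following Python sketch builds the proxy URLs listed above for a node and sums one metric from a Prometheus text-format payload. The sample payload and helper names are illustrative assumptions, and the parser only handles simple `name{labels} value [timestamp]` lines; it is not a full implementation of the exposition format.

```python
# Address used by the URLs above; 8001 is the default kubectl proxy port.
PROXY_BASE = "http://127.0.0.1:8001"

# Hand-written sample in Prometheus text format (values are made up).
SAMPLE = """\
# HELP container_memory_usage_bytes Current memory usage in bytes.
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{id="/a",name="web 1"} 1000 1690000000000
container_memory_usage_bytes{id="/b",name="db"} 2500
container_cpu_usage_seconds_total{id="/a"} 12.5
"""


def metrics_urls(node_name: str) -> dict:
    """Return the Kubelet and cAdvisor metrics URLs for one node."""
    base = f"{PROXY_BASE}/api/v1/proxy/nodes/{node_name}"
    return {"kubelet": f"{base}/metrics", "cadvisor": f"{base}/metrics/cadvisor"}


def parse_sample(line: str) -> tuple:
    """Split one sample line into (metric name, value), ignoring labels."""
    if "}" in line:
        # Everything after the last closing brace is "value [timestamp]".
        head, _, tail = line.rpartition("}")
        name = head.split("{", 1)[0]
    else:
        name, _, tail = line.partition(" ")
    return name, float(tail.split()[0])


def sum_metric(text: str, metric: str) -> float:
    """Sum all samples of one metric, skipping comments and blank lines."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = parse_sample(line)
        if name == metric:
            total += value
    return total


if __name__ == "__main__":
    print(metrics_urls("node-1")["cadvisor"])
    print(sum_metric(SAMPLE, "container_memory_usage_bytes"))
```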
InfluxDB is an open-source distributed database for time series, events, and metrics. Grafana is a dashboard that can use InfluxDB as a data source and provides powerful charting capabilities. The two are often combined to visualize monitoring data.
The Kubelet's built-in cAdvisor only provides container resource usage statistics for a single machine, whereas Heapster provides cluster-wide resource monitoring and supports persisting data to backends such as InfluxDB and Google Cloud Monitoring. Note:
Heapster is recommended only for Kubernetes v1.7.X or older clusters.
Starting from Kubernetes v1.8, resource usage metrics (such as CPU and memory usage of containers) are obtained through the Metrics API, and HPA also queries necessary data from the metrics-server.
Heapster has been deprecated in v1.11, and it is recommended to deploy metrics-server instead of Heapster for versions v1.8 and above.
Heapster first queries all Node information from the Kubernetes apiserver, then collects node and container resource usage from the API provided by the kubelet, and also exposes data in Prometheus format through its /metrics API. Data collected by Heapster can be pushed to various persistent storage backends, such as InfluxDB, Google Cloud Monitoring, and OpenTSDB.
After Kubernetes has been deployed successfully, for example via cluster/kube-up.sh, services such as the dashboard, DNS, and monitoring are typically deployed by default:
If these services have not been deployed automatically, they can be deployed by following the instructions in kubernetes/heapster:
Note that to access these services, the apiserver certificate must first be imported into the browser for authentication. Access can also be simplified by using kubectl proxy (no certificate import needed):
Then, open http://<master-ip>:8080/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana to access Grafana.
Prometheus is another monitoring system and time-series database, and it also provides alerting. It offers a powerful query language and an HTTP interface, and it can serve as a data source for Grafana.
Using Prometheus to monitor Kubernetes requires proper data source configuration; a simple example is prometheus.yml.
It is recommended to use Prometheus Operator or the Prometheus Chart to deploy and manage Prometheus, for example:
Access Prometheus via port forwarding, for example: kubectl --namespace monitoring port-forward service/kube-prometheus-prometheus :9090
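Once a port-forward like the one above is running, Prometheus can be queried over its HTTP API. The Python sketch below builds an instant-query URL for the /api/v1/query endpoint and extracts values from an instant-vector response. The base URL is an assumption (with `:9090` kubectl picks a random local port and prints it), the helper names are hypothetical, and scalar or range results are not handled.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_vector(response_body: str) -> list:
    """Return (labels, value) pairs from an instant-vector response."""
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error')}")
    # Each result carries a [timestamp, "value"] pair; the value is a string.
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]


def run_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Send the query; needs a reachable Prometheus at base_url."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return instant_vector(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed local address; replace the port with the one kubectl reports.
    print(query_url("http://127.0.0.1:9090", "up"))
```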
If the exporter-kubelets feature is not working properly, for example reporting a "server returned HTTP status 401 Unauthorized" error, webhook authentication needs to be configured for the Kubelet:
If you see K8SControllerManagerDown and K8SSchedulerDown alerts, it means that kube-controller-manager and kube-scheduler are running as Pods in the cluster and the labels of the monitoring services deployed by Prometheus do not match theirs. The problem can be solved by modifying the service labels, for example:
Query the admin password for Grafana
Then, access the Grafana interface via port forwarding
Add a Prometheus-type Data Source and enter the Prometheus server address http://prometheus-prometheus-server.monitoring.
Note: Prometheus Operator does not support service discovery through the prometheus.io/scrape annotation and requires you to define a ServiceMonitor to fetch service metrics.
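As a rough sketch of what such a definition contains, the Python snippet below builds a minimal ServiceMonitor manifest as a dictionary and prints it as JSON, which kubectl also accepts. The resource name, namespace, labels, and port name are made-up placeholders, and the helper function is hypothetical; a real setup may also need labels on the ServiceMonitor itself so that the Prometheus resource selects it.

```python
import json


def service_monitor(name, namespace, match_labels, port_name, interval="30s"):
    """Build a minimal ServiceMonitor manifest as a plain dictionary."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects Services whose labels match; port_name is a named Service port.
            "selector": {"matchLabels": dict(match_labels)},
            "endpoints": [{"port": port_name, "interval": interval}],
        },
    }


if __name__ == "__main__":
    # All values below are illustrative placeholders.
    manifest = service_monitor("example-app", "monitoring", {"app": "example-app"}, "web")
    print(json.dumps(manifest, indent=2))
```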
Kubernetes nodes may experience various hardware, kernel, or runtime issues that can lead to service anomalies. Node Problem Detector (NPD) is a service designed to monitor these anomalies. NPD runs as a DaemonSet on each Node and, when an anomaly occurs, updates the NodeCondition (such as KernelDeadlock, DockerHung, or BadDisk) or emits a Node Event (such as OOM Kill).
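To show how such conditions can be consumed, here is a minimal Python sketch that scans the status.conditions list of a Node object and reports the unhealthy ones. The sample node is hand-written for illustration, the helper name is hypothetical, and the sketch assumes that Ready is healthy when its status is "True" while any other condition signals a problem when "True".

```python
# Hand-written fragment of a Node object (illustrative values only).
SAMPLE_NODE = {
    "metadata": {"name": "node-1"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True", "reason": "KubeletReady"},
            {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
            {"type": "KernelDeadlock", "status": "True", "reason": "DockerHung"},
        ]
    },
}


def problem_conditions(node: dict) -> list:
    """Return (type, reason) for every condition that indicates a problem."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            unhealthy = status != "True"   # Ready must be True to be healthy
        else:
            unhealthy = status == "True"   # other conditions flag a problem when True
        if unhealthy:
            problems.append((ctype, cond.get("reason", "")))
    return problems


if __name__ == "__main__":
    print(problem_conditions(SAMPLE_NODE))
```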
Refer to kubernetes/node-problem-detector to deploy NPD, or you can use Helm for deployment:
Nodes in Kubernetes clusters typically enable automatic security updates, which helps to minimize losses due to system vulnerabilities. However, updates involving the kernel generally require a system reboot to take effect. At this point, manual or automatic methods are needed to reboot nodes.
Kured (KUbernetes REboot Daemon) is a daemon for this purpose. It:
Watches for the /var/run/reboot-required sentinel file, which signals that a node needs a reboot
Uses a lock held in a DaemonSet annotation so that only one node reboots at a time
Drains each node before rebooting and resumes scheduling on it afterwards
Defers reboots while Prometheus alerts are active; alerts matching --alert-filter-regexp (e.g., --alert-filter-regexp=^(RebootRequired|AnotherBenignAlert|...$) are ignored
Sends Slack notifications
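The decision described in the list above can be modelled roughly as follows. This Python sketch is not kured's actual code: it assumes that a reboot is wanted when the sentinel file exists, that any active alert blocks the reboot, and that alerts whose names match the filter regular expression are ignored. The function names and the example pattern are illustrative.

```python
import os
import re

SENTINEL = "/var/run/reboot-required"  # sentinel path named in the list above


def blocking_alerts(active_alerts, filter_regexp=None):
    """Return the active alert names that are not ignored by the filter."""
    if filter_regexp is None:
        return list(active_alerts)
    pattern = re.compile(filter_regexp)
    return [name for name in active_alerts if not pattern.search(name)]


def should_reboot(active_alerts, filter_regexp=None, sentinel=SENTINEL):
    """Reboot only if the sentinel exists and no unfiltered alert is active."""
    if not os.path.exists(sentinel):
        return False
    return not blocking_alerts(active_alerts, filter_regexp)


if __name__ == "__main__":
    # Illustrative pattern and alert names, not taken from a real cluster.
    benign = r"^(RebootRequired|AnotherBenignAlert)$"
    print(blocking_alerts(["RebootRequired", "KubeNodeNotReady"], benign))
    print(should_reboot(["RebootRequired"], benign))
```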
Deployment method
In addition to the monitoring tools above, many other open source and commercial systems can assist with monitoring, such as sysdig and Weave Scope.
sysdig is a container troubleshooting tool that offers both open source and commercial versions. For regular troubleshooting, the open source version suffices.
Aside from sysdig itself, there are two auxiliary tools:
csysdig: Installed automatically with sysdig; provides an interactive terminal interface
sysdig-inspect: Provides a graphical interface for trace files saved by sysdig (e.g., sudo sysdig -w filename.scap); it is not real-time
Install sysdig
Usage examples
Weave Scope is another visual container monitoring and troubleshooting tool. Unlike sysdig, it does not have a powerful command-line tool but does offer a straightforward and user-friendly interactive interface that automatically outlines the entire cluster's topology and can be extended by plugins. From its official website description, its features include
Weave Scope consists of two components, App and Probe:
Probe is responsible for collecting container and host information and sending it to the App
App processes this information, generates corresponding reports, and displays them in an interactive interface
Install Weave Scope
After installation, the interactive interface can be accessed through the weave-scope-app service.
You can also view real-time status and metric data of all containers in the Pod by clicking on the Pod:
Now, let's move on to the rephrased version to make it more accessible to a broad audience as a popular science article.
The Kubernetes community is like a vibrant ecosystem with a toolbox that helps you peek into the health and state of your containerized applications and clusters. Plus, thanks to Prometheus, you can even get a virtual tap on the shoulder with alerts if anything goes awry.
Here's the lowdown on the tools you can strap to your Kubernetes utility belt:
cAdvisor is your on-site inspector, built-in with the Kubelet, keeping tabs on resource consumption for containers and nodes, and chatting up the world with its metrics API.
The metrics-server is the cluster's main data cruncher, but remember, it's all about the here and now—no dwelling on the past with historical data.
If you’re curious about the state of your Kubernetes resources, kube-state-metrics is your go-to for up-to-the-moment metrics.
Prometheus is like the Swiss Army knife in the toolbox—an observant monitoring system and a time-series database, coupled with an alarm bell to alert you.
**[Node Problem Detector](