Monitoring
The Kubernetes community offers a series of tools for monitoring the status of containers and clusters, and, with the help of Prometheus, alerting functionality is provided as well.
cAdvisor, built into the Kubelet, collects container and node resource usage statistics within a single node and exposes them externally through the Kubelet's /metrics/cadvisor endpoint.
InfluxDB is an open-source distributed time series, event, and metrics database; Grafana is a dashboard for InfluxDB, offering powerful chart display capabilities. They are often used in combination to visualize monitoring data graphically.
The Metrics API provides resource monitoring data for the entire cluster, but note that:
The Metrics API can only query current metric data and does not save historical data
The Metrics API URI is /apis/metrics.k8s.io/
and is maintained in the k8s.io/metrics repository
metrics-server must be deployed to use this API; metrics-server obtains its data by invoking the Kubelet Summary API
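Once metrics-server is running, the Metrics API can be queried directly through the apiserver. A quick sketch (v1beta1 is the group version served by metrics-server; these commands require a live cluster):

```sh
# Query node metrics through the aggregated Metrics API
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

# Query pod metrics in a namespace
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods"

# The same data backs kubectl top
kubectl top node
kubectl top pod -n kube-system
```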
kube-state-metrics provides metrics for Kubernetes resource objects (such as DaemonSets, Deployments, etc.).
Prometheus is another monitoring system and time-series database, which also provides alerting functionality.
Node Problem Detector monitors issues with the Node itself, such as hardware, kernel, or runtime problems.
Heapster (deprecated) provided resource monitoring across the entire cluster and supported persisting data into backends such as InfluxDB.
cAdvisor is a container monitoring tool from Google and is also the Kubelet's built-in container resource collection tool. It automatically collects CPU, memory, network, and filesystem usage statistics for the containers on the local machine and exposes cAdvisor's native API externally (on the port set by --cadvisor-port, 4194 by default).
Starting from v1.7, the Kubelet metrics API no longer includes the cAdvisor metrics; instead, they are provided through an independent endpoint:
Kubelet metrics: http://127.0.0.1:8001/api/v1/proxy/nodes/<node-name>/metrics
Cadvisor metrics: http://127.0.0.1:8001/api/v1/proxy/nodes/<node-name>/metrics/cadvisor
Thus, tools such as Prometheus must use the new metrics endpoints to obtain this data, as in the following Prometheus configuration, which automatically discovers and scrapes the cadvisor metrics API:
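The original configuration is not included here; the following is a sketch based on the kubernetes-cadvisor job from the official Prometheus example configuration (the service-account paths assume Prometheus runs in-cluster):

```yaml
- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Keep node labels as Prometheus labels
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    # Scrape through the apiserver proxy rather than each node directly
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
```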
Note: the port served by cAdvisor will be removed in v1.12; it is recommended that all external tools use the Kubelet Metrics API instead.
Heapster is recommended only for Kubernetes v1.7.X or older clusters.
Starting from Kubernetes v1.8, resource usage metrics (such as CPU and memory usage of containers) are obtained through the Metrics API, and HPA also queries necessary data from the metrics-server.
Heapster first queries all Node information from the Kubernetes apiserver, then collects node and container resource usage from the API provided by each kubelet, and also exposes data in Prometheus format via its /metrics API. The collected data can be pushed to various persistent backends, such as InfluxDB, Google Cloud Monitoring, and OpenTSDB.
After a Kubernetes cluster is deployed, services such as the Dashboard, DNS, and monitoring are typically also deployed by default, for example via cluster/kube-up.sh:
Note that to access these services, the apiserver certificate must first be imported into the browser for authentication. Access can also be simplified by using kubectl proxy (no certificate import needed):
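A minimal sketch of this flow (port 8001 is kubectl proxy's default; the monitoring-grafana service name matches the add-on mentioned below):

```sh
# Start a local, already-authenticated proxy to the apiserver
kubectl proxy --port=8001 &

# Grafana is then reachable through the apiserver proxy path
curl http://localhost:8001/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana/
```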
Then open http://<master-ip>:8080/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana to access Grafana.
Access Prometheus via port forwarding, like kubectl --namespace monitoring port-forward service/kube-prometheus-prometheus :9090
If the exporter-kubelets targets are not working properly, for example reporting a server returned HTTP status 401 Unauthorized error, webhook authentication needs to be configured for the Kubelet:
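A sketch of the relevant Kubelet settings (these are the upstream Kubelet flags; how you set them depends on how your Kubelets are managed):

```sh
# Enable token-based webhook authentication and webhook authorization on each
# Kubelet, so Prometheus can scrape /metrics with a ServiceAccount bearer token.
kubelet \
  --authentication-token-webhook=true \
  --authorization-mode=Webhook
```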
If you see K8SControllerManagerDown and K8SSchedulerDown alerts, it means that kube-controller-manager and kube-scheduler are running as Pods in the cluster, but the labels of the monitoring Services deployed by Prometheus do not match theirs. The problem can be solved by modifying the Service selectors to match the Pods' labels, for example:
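For example, assuming the exporter Service names below were created by the chart (names are illustrative; check your own objects and labels first):

```sh
# Inspect the labels actually carried by the control-plane Pods
kubectl -n kube-system get pods --show-labels | grep -E 'controller-manager|scheduler'

# Point each monitoring Service's selector at those labels
kubectl -n kube-system set selector svc kube-prometheus-exporter-kube-controller-manager \
  'component=kube-controller-manager'
kubectl -n kube-system set selector svc kube-prometheus-exporter-kube-scheduler \
  'component=kube-scheduler'
```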
Query the admin password for Grafana
Then, access the Grafana interface via port forwarding
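A sketch of both steps, assuming the chart created a secret named grafana in the monitoring namespace with an admin-password key (secret and service names vary between chart versions; list the secrets first if unsure):

```sh
# Read the generated admin password
kubectl -n monitoring get secret grafana \
  -o jsonpath='{.data.admin-password}' | base64 --decode; echo

# Forward a local port to the Grafana service, then open http://localhost:3000
kubectl -n monitoring port-forward svc/grafana 3000:80
```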
Add a data source of type Prometheus and fill in the address http://prometheus-prometheus-server.monitoring.
Kubernetes nodes may experience various hardware, kernel, or runtime issues that could lead to service anomalies. Node Problem Detector (NPD) is a service designed to monitor these anomalies. NPD runs as a DaemonSet on each Node, updating the NodeCondition (such as KernelDeadlock, DockerHung, BadDisk, etc.) or emitting Node Events (such as OOM kills) when anomalies occur.
Nodes in Kubernetes clusters typically enable automatic security updates, which helps to minimize losses due to system vulnerabilities. However, updates involving the kernel generally require a system reboot to take effect. At this point, manual or automatic methods are needed to reboot nodes.
Watches for the /var/run/reboot-required signal file and reboots nodes when it is present
Reboots only one node at a time, coordinated via a DaemonSet annotation lock
Drains (cordons and evicts) nodes before rebooting and uncordons them again afterwards
Cancels reboots when matching Prometheus alerts are firing (e.g., --alert-filter-regexp=^(RebootRequired|AnotherBenignAlert|...$)
Slack notifications
Deployment method
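The manifest URL below is illustrative: kured publishes per-release manifests on its GitHub releases page, and the version shown here is an assumption, so substitute the release you actually want:

```sh
# Deploy the kured DaemonSet (replace 1.2.0 with the desired release)
kubectl apply -f https://github.com/weaveworks/kured/releases/download/1.2.0/kured-1.2.0-dockerhub.yaml
```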
In addition to the above monitoring tools, many other open-source and commercial systems can assist with monitoring, such as:
sysdig is a container troubleshooting tool that offers both open source and commercial versions. For regular troubleshooting, the open source version suffices.
Aside from sysdig itself, there are two auxiliary tools:
csysdig: Automatically installed with sysdig, provides a command-line interface
Install sysdig
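For example, on a Linux host the official install script can be used (this downloads and runs a script as root, so review it first):

```sh
curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | sudo bash
```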
Usage examples
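A few common invocations using sysdig's built-in chisels (the container name below is a placeholder):

```sh
# Top processes by CPU usage
sudo sysdig -c topprocs_cpu

# Top processes by network I/O
sudo sysdig -c topprocs_net

# Same view, scoped to a single container (-pc adds container context)
sudo sysdig -pc -c topprocs_cpu container.name=my-container

# Capture events to a trace file for later inspection
sudo sysdig -w trace.scap
```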
Weave Scope is another visual container monitoring and troubleshooting tool. Unlike sysdig, it does not have a powerful command-line tool but does offer a straightforward and user-friendly interactive interface that automatically outlines the entire cluster's topology and can be extended by plugins. From its official website description, its features include
Probe is responsible for collecting container and host information and sending it to the App
App processes this information, generates corresponding reports, and displays them in an interactive interface
Install Weave Scope
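The command below is the installation method documented by Weave at the time of writing; it renders a manifest matched to your cluster version:

```sh
kubectl apply -f "https://cloud.weave.works/k8s/scope.yaml?k8s-version=$(kubectl version | base64 | tr -d '\n')"
```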
After installation, the interactive interface can be accessed through the weave-scope-app
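For example, with a port forward to the app Pod (the weave namespace and the weave-scope-component=app label match the default manifest):

```sh
kubectl -n weave port-forward \
  "$(kubectl -n weave get pod -l weave-scope-component=app -o jsonpath='{.items[0].metadata.name}')" \
  4040
# Then open http://localhost:4040
```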
You can also view real-time status and metric data of all containers in the Pod by clicking on the Pod:
The Kubelet's built-in cAdvisor only provides single-node container resource usage statistics, whereas Heapster provided resource monitoring for the whole cluster and supported persisting data into backends such as InfluxDB, Google Cloud Monitoring, or OpenTSDB. Note:
Heapster was deprecated in v1.11; for v1.8 and above it is recommended to deploy metrics-server instead of Heapster.
If these services have not been deployed automatically, they can be deployed manually by following the official documentation for each component.
Prometheus is another monitoring system and time-series database, and it provides alerting functionality as well. It offers a powerful query language and HTTP API, and also supports exporting data to Grafana.
Using Prometheus to monitor Kubernetes requires a properly configured data source (scrape configuration).
It is recommended to use Prometheus Operator or kube-prometheus to deploy and manage Prometheus, for example:
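For example, with the Helm charts published by CoreOS at the time this guide was written (the repository URL and chart names are era-specific and may have moved since):

```sh
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-kubernetes/charts

# Deploy the operator itself, then the kube-prometheus stack it manages
helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring
```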
Note: Prometheus Operator does not support service discovery via the prometheus.io/scrape annotation; you must define ServiceMonitor resources to scrape service metrics.
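A minimal ServiceMonitor sketch (the app name, namespace, and port name are placeholders; spec.selector must match the target Service's labels, and endpoints[].port its named port):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus   # must match the Prometheus resource's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app              # labels of the Service to scrape
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics            # named port on the Service
      interval: 30s
```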
Refer to the official Node Problem Detector documentation to deploy NPD, or deploy it with Helm:
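For example, using the chart from the (now archived) stable Helm repository; the chart name and namespace here are assumptions, so verify them against your chart source:

```sh
helm install stable/node-problem-detector --name npd --namespace kube-system
```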
Kured (the KUbernetes REboot Daemon) is such a daemon for coordinating safe node reboots.
sysdig-inspect: provides a graphical interface for inspecting trace files saved by sysdig (e.g., with sudo sysdig -w filename.scap); it is not real-time.
Weave Scope consists of two components, the Probe and the App.