Kubernetes provides a robust platform for deploying and managing containers at large scale. It is also a complex system that requires extensive monitoring to identify and debug production issues. Kubernetes monitoring tools can help you monitor your Kubernetes deployments in production effectively.
This article discusses how to monitor Kubernetes effectively and describes popular tools that can help you monitor your Kubernetes system.
This is part of our series of articles on cloud monitoring.
When running Kubernetes, it is not enough to have monitoring at the cluster level. You must set up monitoring at every layer of the Kubernetes system, from the physical nodes through pods and clusters to the control plane.
The following metrics are critical for assessing the health of your Kubernetes infrastructure.
CPU usage: helps you understand the resource consumption of the host as well as of individual processes. You can also monitor I/O processes to identify delays in access to remote storage.
Disk usage: for certain Kubernetes services, such as etcd (which stores cluster configuration) and datastores, a shortage of disk space is a severe problem. It can lead to write failures, which in turn can cause data corruption and service failure.
Pod resources: it is important to know the resources a pod needs to run. This lets you configure the Kubernetes scheduler effectively and ensure pods are placed on nodes that have ample available resources. Design your deployment so that even if several nodes fail, the remaining ones can still run the required number of pods.
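The node-failure guidance in the last item can be checked with simple arithmetic. The following is a minimal sketch, assuming every pod has the same CPU and memory requests; the node names, sizes, and function names are hypothetical, and this is not how the Kubernetes scheduler itself places pods.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_millicores: int  # allocatable CPU, in millicores
    memory_mib: int      # allocatable memory, in MiB


def pods_that_fit(node, pod_cpu, pod_mem):
    """Number of pods with the given requests that fit on one node."""
    return min(node.cpu_millicores // pod_cpu, node.memory_mib // pod_mem)


def survives_failures(nodes, pod_cpu, pod_mem, required_pods, failures):
    """True if the required pods still fit after losing `failures` nodes.

    Worst case is assumed: the nodes with the most capacity fail first.
    """
    capacities = sorted(pods_that_fit(n, pod_cpu, pod_mem) for n in nodes)
    remaining = capacities[: max(len(capacities) - failures, 0)]
    return sum(remaining) >= required_pods


# Hypothetical three-node cluster.
nodes = [
    Node("node-a", 4000, 8192),
    Node("node-b", 4000, 8192),
    Node("node-c", 2000, 4096),
]

for failures in (1, 2):
    ok = survives_failures(nodes, pod_cpu=500, pod_mem=1024,
                           required_pods=10, failures=failures)
    print(f"{failures} node(s) lost: required pods still fit = {ok}")
```

In this example the cluster tolerates the loss of one large node but not two, which suggests either adding nodes or lowering per-pod requests.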
All services (including etcd) that make up the Kubernetes control plane or run on worker nodes are critical to the state of the application. Your monitoring system must be able to detect an error in any of these components and, when an error occurs, automatically correct it or send an alert.
You must also monitor Kubernetes resources directly. Kubernetes provides valuable metrics on resource state and usage, and lets you monitor applications directly. You can generally trust Kubernetes to reach the desired state defined in the cluster configuration, but if it doesn't, teams must receive an alert and intervene in the provisioning process to solve the problem.
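The desired-state check described above can be sketched as a comparison between what was requested and what is actually running. The example below is a hedged illustration with hand-written data; in a real cluster the two numbers would typically come from a Deployment's `spec.replicas` and `status.readyReplicas` fields, and the dictionary keys and function name here are assumptions made for the example.

```python
def find_degraded(deployments):
    """Return one alert message per deployment with fewer ready replicas than desired."""
    alerts = []
    for d in deployments:
        desired = d["desired_replicas"]
        ready = d["ready_replicas"]
        if ready < desired:
            alerts.append(f"{d['name']}: {ready}/{desired} replicas ready")
    return alerts


# Hypothetical snapshot of three deployments.
snapshot = [
    {"name": "frontend", "desired_replicas": 3, "ready_replicas": 3},
    {"name": "checkout", "desired_replicas": 4, "ready_replicas": 2},
    {"name": "search", "desired_replicas": 2, "ready_replicas": 0},
]

for message in find_degraded(snapshot):
    print("ALERT", message)
```

A production rule would usually also require the mismatch to persist for some time before alerting, so that normal rollouts do not page anyone.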
Top 5 Kubernetes Monitoring Tools
The following are some of the most commonly used tools to monitor Kubernetes in production.
cAdvisor is a utility built into the kubelet, so it is available in almost all Kubernetes systems. It provides information about the resource usage of running containers.
Its main features are:
Auto discovery: automatically discovers all containers on a node and collects statistics such as memory, CPU, network, and file system usage.
Overall machine usage: provides resource data for the physical host by evaluating the "root" container.
Export data: integrates with storage plugins such as Elasticsearch, and can export monitoring data directly.
Web UI: shows live metrics for all containers on the physical machine.
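CPU statistics of this kind are commonly reported as a cumulative counter of CPU seconds consumed (cAdvisor's Prometheus output includes `container_cpu_usage_seconds_total`, for example), so a usage figure has to be derived from two samples. The sketch below shows that calculation under simplifying assumptions; the function name and sample values are hypothetical.

```python
def cpu_cores_used(earlier, later):
    """Average CPU cores used between two samples.

    Each sample is a (timestamp_seconds, cumulative_cpu_seconds) pair.
    """
    (t0, c0), (t1, c1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if c1 < c0:
        # The counter went backwards, which suggests the container restarted.
        # Assume it restarted from zero, so the later value is the increase.
        c0 = 0.0
    return (c1 - c0) / (t1 - t0)


# 15 CPU-seconds consumed over a 30-second window is half a core on average.
print(cpu_cores_used((100.0, 50.0), (130.0, 65.0)))
```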
Prometheus is a robust, enterprise-grade monitoring tool for containerized environments. It is customizable and provides comprehensive metrics with little impact on the performance of monitored containers.
The main features of Prometheus are:
Multidimensional data model: time series data identified by a metric name and key/value pairs, collected via a pull model over HTTP
A flexible query language, PromQL, that can help find insights in metrics data
Deployed as independent nodes with no dependency on distributed storage
Can discover monitoring targets using Kubernetes service discovery
Provides a UI and integrates with visualization tools like Grafana
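To make the data model concrete, the sketch below parses one sample line in the Prometheus text exposition format into its metric name, labels, and value. It is a simplified illustration, not a full parser: it ignores comment lines and timestamps and does not unescape label values. The regular expressions and function name are written for this example.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {m["key"]: m["val"] for m in LABEL_RE.finditer(match["labels"] or "")}
    return match["name"], labels, float(match["value"])


name, labels, value = parse_sample('http_requests_total{method="post",code="200"} 1027')
print(name, labels, value)
```

Each distinct combination of metric name and label values is its own time series, which is what lets a PromQL query filter or aggregate by any label.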
Elastic Stack (formerly called ELK stack) is a set of open source products that allow users to search, analyze, and visualize data in any format from any source in real time. It consists of three main components: Elasticsearch, Logstash, and Kibana.
Main features of Elastic Stack:
Visualization: Kibana provides visualizations of the data in the Elasticsearch index, based on Elasticsearch queries. You can use Elasticsearch aggregations to extract and process data to create graphs showing important trends.
Scalability and resilience: operates in a distributed environment, in a clustered topology that grows with demand.
Management tools: provides UIs and APIs giving you full control over data, users, and cluster management.
Alerts: uses the Elasticsearch query language to identify important changes in data and alert operations teams.
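The aggregation and alerting features above can be illustrated with a search body built as a plain Python dictionary. This is a sketch under stated assumptions: the field names `log.level` and `@timestamp`, the aggregation name, and the threshold are hypothetical and depend on how your logs are indexed, and the response is hand-written in the shape Elasticsearch returns for a date histogram rather than fetched from a live cluster.

```python
import json


def error_histogram_query(window="now-1h", interval="5m"):
    """Build a search body that counts error-level log entries per time bucket."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"log.level": "error"}},            # hypothetical field
                    {"range": {"@timestamp": {"gte": window}}},  # hypothetical field
                ]
            }
        },
        "aggs": {
            "errors_over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval}
            }
        },
    }


def buckets_over_threshold(response, threshold):
    """Return (bucket label, count) pairs whose error count exceeds the threshold."""
    buckets = response["aggregations"]["errors_over_time"]["buckets"]
    return [(b["key_as_string"], b["doc_count"]) for b in buckets
            if b["doc_count"] > threshold]


# Hand-written sample response, not real data.
sample_response = {
    "aggregations": {
        "errors_over_time": {
            "buckets": [
                {"key_as_string": "2024-01-01T10:00:00.000Z", "doc_count": 3},
                {"key_as_string": "2024-01-01T10:05:00.000Z", "doc_count": 42},
            ]
        }
    }
}

print(json.dumps(error_histogram_query(), indent=2))
print(buckets_over_threshold(sample_response, threshold=10))
```

The same bucket counts could feed a Kibana chart of errors over time or drive an alert when a bucket exceeds the threshold.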
Sensu Go is a telemetry and service health solution for multi-cloud monitoring. It provides in-depth understanding of how servers, containers, services, applications, and connected devices operate in public or private clouds. Sensu can run in parallel with Prometheus or natively without Prometheus.
Main features of Sensu Go:
Custom health checks and metrics: runs custom scripts or Nagios-like plugins, and collects system metrics such as CPU, memory, and disk usage.
Cloud endpoint management: provides agents that automatically identify and register virtual machines, public cloud computing instances, containers, and services running on them.
Assists with self-healing: can be used to trigger a service restart, run custom scripts, or submit Ansible jobs upon failure. Can also initiate corrective action through third-party APIs.
Monitoring as code: pre-configured templates let you define a code-free workflow with all the flexibility of monitoring-as-code solutions.
Smart alerts: supports incident response by sending alerts via email, Slack, SMS, and other channels. Integrates with PagerDuty, ServiceNow, and Jira, and provides deduplication to reduce alert fatigue, along with event filters and contact routing.
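A custom check of the kind mentioned in the first item is usually a small script that prints a one-line status and reports the result through its exit code, following the Nagios plugin convention (0 for OK, 1 for warning, 2 for critical). The sketch below shows a disk usage check along those lines; the thresholds and function names are assumptions chosen for the example.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def disk_status(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a Nagios-style status code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Measure disk usage for a path and return (status code, message)."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    status = disk_status(used_percent)
    return status, f"disk {LABELS[status]}: {used_percent:.1f}% used on {path}"


status, message = check_disk("/")
print(message)
# A standalone check script would end by passing the status to sys.exit(),
# so that the monitoring agent can read the result from the exit code.
```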
Fluentd provides a unified logging layer for cloud native environments, separating data sources from backend systems. This layer allows you to collect logs in real time as they are created.
Main features of Fluentd include:
JSON data structure: unified processing for log data across the container environment, including filters, buffers, and storage.
Extensible architecture: a plug-in system allows you to connect multiple data sources and outputs to extend Fluentd's capabilities.
Lightweight: a Fluentd instance requires only 30-40 MB of RAM and can handle 13,000 events per second per core.
Reliable: supports both memory and file-based buffering to prevent data loss between nodes, and can support high availability and failover.
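A JSON-based logging layer works best when applications already emit structured logs. As a hedged illustration, the sketch below uses Python's standard `logging` module to write one JSON object per line to standard output, a format that a log collector configured with a JSON parser can ingest without custom parsing rules. The formatter class and the field names are assumptions made for this example, not a Fluentd API.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payments")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("charge failed for order %s", "A17")
```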
Kubernetes Monitoring with NetApp Cloud Insights
NetApp Cloud Insights is an infrastructure monitoring tool that gives you visibility into your complete infrastructure. With Cloud Insights, you can monitor, troubleshoot and optimize all your resources including your public clouds and your private data centers.
Cloud Insights helps you find problems fast, before they impact your business. It lets you optimize usage so you can defer spend and do more with limited budgets, detect ransomware attacks before it is too late, and easily report on data access for security compliance auditing.
In particular, NetApp Cloud Insights helps you understand your Kubernetes architecture through topology visualization and monitor the health of Kubernetes clusters.