
5 Kubernetes Monitoring Best Practices

What are Kubernetes Monitoring Best Practices?

Running containers in Kubernetes has many advantages, but it can be difficult to monitor the performance and availability of a constantly changing infrastructure. Kubernetes monitoring best practices are techniques you can use to make monitoring practical and manageable in a cloud native environment.

This is part of our series of articles on cloud monitoring.

In this article, you will learn about the following best practices:

  1. Don’t Track Individual Containers
  2. Track the API Gateway
  3. Track Kube-System Patterns
  4. Monitor Each Layer Separately
  5. Always Alert on High Disk Usage

In addition, you’ll learn about:

  • Kubernetes Instrumenting Strategies
  • Kubernetes Monitoring with NetApp Cloud Insights

Kubernetes Instrumenting Strategies

Before you start monitoring, consider how to instrument your Kubernetes environment, to ensure it records the metrics you’ll need to monitor ongoing operations.

There are three main strategies to collect system events and metrics in Kubernetes:

  • Inject an instrumentation library into your containers
  • Run the instrumentation library as a sidecar container in the same pod—this is especially useful for serverless scenarios
  • Run a monitoring solution as an Extended Berkeley Packet Filter (eBPF) probe


The first two strategies are problematic for several reasons:

  • Resource usage: the library is loaded once for each running container, which significantly increases RAM usage. The added overhead reduces how many containers each node can run, which wastes resources and can lead to stability issues.
  • Limited visibility: because the library runs inside or alongside a single container, it can only collect data that is visible from that container. This narrows what the monitoring solution can see and can make some issues difficult to debug.

eBPF probes can overcome both of these issues because they run inside the Linux kernel, where they can observe system calls made by every container on the node. In this way, you can gather the information you need to troubleshoot, analyze root causes, and monitor performance in your Kubernetes environment.

Other processes running in user space can combine the eBPF data with other data sources (for example, Prometheus, JMX, or system logs) and report it to the monitoring backend. An eBPF probe typically uses less RAM than embedded monitoring instrumentation, and has little impact on CPU usage or other processes.
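The RAM argument above can be sketched with simple arithmetic. The function names and every number below are hypothetical placeholders, chosen only to show how the two models scale; they are not measurements of any real library or probe.

```python
def embedded_ram_mib(containers_per_node: int,
                     per_container_overhead_mib: float) -> float:
    """RAM used when every container loads its own copy of the library."""
    return containers_per_node * per_container_overhead_mib


def node_level_ram_mib(per_node_overhead_mib: float) -> float:
    """RAM used by one node-level collector, independent of container count."""
    return per_node_overhead_mib


# Hypothetical overheads: 30 MiB per embedded library, 150 MiB per node probe.
for containers in (10, 50, 100):
    embedded = embedded_ram_mib(containers, per_container_overhead_mib=30.0)
    node_level = node_level_ram_mib(per_node_overhead_mib=150.0)
    print(f"{containers:>3} containers: "
          f"embedded={embedded:.0f} MiB, node-level={node_level:.0f} MiB")
```

Under these assumed figures the embedded cost grows with the number of containers on the node, while the node-level cost stays flat.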


5 Kubernetes Monitoring Best Practices

1. Don’t Track Individual Containers

Kubernetes resources change constantly, and deployed replicas are assumed to be interchangeable, so monitoring individual containers can be very noisy.

Because metrics for a single container can change from hour to hour, it is more useful to look at patterns over long periods of time for groups of containers. For example, when a rollout creates a new ReplicaSet, the metrics tracked for the ReplicaSet are reset. cAdvisor, which is built into the kubelet, exposes CPU, memory, and network usage for every container, and you can aggregate these metrics across a group of containers.
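As a minimal sketch of looking at groups rather than individual containers, the snippet below rolls per-container samples up by workload. The sample layout (`workload`, `container`, `cpu_cores`, `memory_mib`) and the container names are assumptions made for illustration, not a real cAdvisor output format.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_workload(samples):
    """Group per-container samples by workload and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["workload"]].append(sample)

    summary = {}
    for workload, members in groups.items():
        summary[workload] = {
            "containers": len(members),
            "avg_cpu_cores": mean(m["cpu_cores"] for m in members),
            "max_memory_mib": max(m["memory_mib"] for m in members),
        }
    return summary


# Hypothetical samples; container names change on every rollout,
# but the workload name stays stable.
samples = [
    {"workload": "checkout", "container": "checkout-7d9f-abc",
     "cpu_cores": 0.20, "memory_mib": 310},
    {"workload": "checkout", "container": "checkout-7d9f-def",
     "cpu_cores": 0.40, "memory_mib": 290},
    {"workload": "search", "container": "search-5c4b-xyz",
     "cpu_cores": 0.90, "memory_mib": 512},
]

for workload, stats in aggregate_by_workload(samples).items():
    print(workload, stats)
```

Keying the summary on the workload name keeps the time series continuous even when the underlying containers are replaced.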

2. Track the API Gateway

Detailed resource metrics (CPU, load, memory, etc.) are important to track, but they are not closely correlated with problems that directly impact users. API indicators such as call errors, request rates, and timeouts are better KPIs, because they help you quickly determine whether there is a user-facing or application-facing problem with your microservices.

The easiest way to get service-level metrics is to route traffic through a service load balancer, preferably an ingress controller such as NGINX or a service mesh gateway such as Istio. This lets you detect anomalies in REST API requests automatically and in a standardized way across all Kubernetes services. You can raise alerts at any level of the API lifecycle, and also use alerts to automate infrastructure changes.
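The indicators named above (request rate, call errors, timeouts) could be derived from request records roughly as follows. This is a sketch only: the `Request` record, the 5-second timeout, and the alert thresholds are assumptions for illustration, not values taken from NGINX, Istio, or any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    duration_s: float  # time taken to serve the request, in seconds


def api_indicators(requests, window_s, timeout_s=5.0):
    """Compute request rate, error ratio, and timeout ratio for one window."""
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "error_ratio": 0.0, "timeout_ratio": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    timeouts = sum(1 for r in requests if r.duration_s >= timeout_s)
    return {
        "request_rate": total / window_s,
        "error_ratio": errors / total,
        "timeout_ratio": timeouts / total,
    }


def should_alert(indicators, max_error_ratio=0.05, max_timeout_ratio=0.01):
    """Return True when either ratio exceeds its (assumed) threshold."""
    return (indicators["error_ratio"] > max_error_ratio
            or indicators["timeout_ratio"] > max_timeout_ratio)


# Ten hypothetical requests observed over a 10-second window.
window = [Request(200, 0.1)] * 8 + [Request(503, 0.2), Request(200, 6.0)]
indicators = api_indicators(window, window_s=10)
print(indicators, "alert:", should_alert(indicators))
```

Because the same three ratios can be computed for every service behind the load balancer, one set of alert rules can cover all of them.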

3. Track Kube-System Patterns

You should monitor the kube-system namespace as closely as possible. Problems inside the Kubernetes cluster itself are typically the most difficult to solve, and should be detected early. These can include DNS bottlenecks, network congestion, or worst of all, etcd problems. In particular, it is important to monitor the control plane (master) nodes, including CPU usage, memory, and disk space.
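One way to catch kube-system problems early is to scan the pod list for pods that are not running or that restart repeatedly. The sketch below works on the parsed JSON of a pod list, reading the standard `metadata.name`, `status.phase`, and `status.containerStatuses[].restartCount` fields. The pod names and the restart threshold are hypothetical.

```python
def unhealthy_pods(pod_list, max_restarts=3):
    """Return (name, reason) pairs for pods that look unhealthy."""
    findings = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            findings.append((name, f"phase={phase}"))
            continue
        restarts = sum(cs.get("restartCount", 0)
                       for cs in status.get("containerStatuses", []))
        if restarts > max_restarts:
            findings.append((name, f"restarts={restarts}"))
    return findings


# Hypothetical pod list shaped like a Kubernetes PodList object.
sample = {"items": [
    {"metadata": {"name": "coredns-a"},
     "status": {"phase": "Running",
                "containerStatuses": [{"restartCount": 0}]}},
    {"metadata": {"name": "etcd-b"},
     "status": {"phase": "Running",
                "containerStatuses": [{"restartCount": 7}]}},
    {"metadata": {"name": "kube-proxy-c"},
     "status": {"phase": "Pending"}},
]}

for name, reason in unhealthy_pods(sample):
    print(name, reason)
```

In practice the input could come from passing the output of `kubectl get pods -n kube-system -o json` to `json.loads`.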

4. Monitor Each Layer Separately

It is important to track errors, crashes, and performance issues at every layer of the Kubernetes environment. Complex issues may require debugging at multiple levels, and engineers need to have access to metrics for every component. For example, when debugging an issue, a developer may need to:

  • Ensure a container is running
  • Troubleshoot issues with a pod
  • Collect ongoing metrics from the controller manager

5. Always Alert on High Disk Usage

High disk utilization is a common problem on any system, and Kubernetes nodes are no exception. If you are using StatefulSet resources or volumes that are statically attached to nodes, there is no quick fix.

Disk utilization warnings are almost always severe and usually indicate a problem with the application. Make sure to keep track of all disk volumes, including the root file system, and set alerts for around 80% utilization. Over time, look for patterns in high disk usage and change your deployment to address the root cause.
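A minimal sketch of the 80% rule, using Python's standard `shutil.disk_usage`. The function names and the list of paths are assumptions for illustration; a real setup would more likely express this as an alert rule in the monitoring backend.

```python
import shutil


def usage_percent(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_paths(paths, threshold_percent=80.0):
    """Return (path, percent) for every path at or above the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        percent = usage_percent(usage.total, usage.used)
        if percent >= threshold_percent:
            alerts.append((path, round(percent, 1)))
    return alerts


# Check the root file system; add mount points for other volumes as needed.
for path, percent in check_paths(["/"]):
    print(f"ALERT: {path} is {percent}% full")
```

Keeping the threshold as a parameter makes it easy to use a lower warning level for volumes that fill quickly.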

Kubernetes Monitoring with NetApp Cloud Insights

NetApp Cloud Insights is an infrastructure monitoring tool that gives you visibility into your complete infrastructure. With Cloud Insights, you can monitor, troubleshoot and optimize all your resources including your public clouds and your private data centers.

Cloud Insights helps you find problems fast, before they impact your business. You can optimize usage to defer spend and do more with limited budgets, detect ransomware attacks before it is too late, and easily report on data access for security compliance auditing.

In particular, NetApp Cloud Insights helps you understand your Kubernetes architecture through topology visualization, including the relationships between persistent volume claims and the storage infrastructure they use, and lets you monitor the health of your Kubernetes clusters.

Start a 30-day free trial of NetApp Cloud Insights. No credit card required.

