Cloud Infrastructure Monitoring Best Practices
Monitoring is a skill, not a full-time job. In today’s world of cloud-based architectures implemented through DevOps projects, developers, site reliability engineers (SREs), and operations staff need to define an effective cloud monitoring strategy together. Such a strategy should focus on identifying when service level objectives (SLOs) are not being met and the user’s experience is likely to suffer as a result.
An example NetApp Cloud Insights dashboard showing key service level indicators (SLIs).
Best practices for cloud monitoring encompass four major areas, or signals:
- Latency, the time it takes to service a request
- Saturation, how loaded a resource is
- Traffic, the demand for a resource
- Errors, the rate at which requests fail
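As a rough illustration of how these four signals can be derived from raw request data, here is a minimal Python sketch. It is not tied to any monitoring product: the `Request` record, the `summarize_signals` function, and the choice of a nearest-rank 95th percentile for latency are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One served request; the field names are illustrative assumptions."""
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request failed


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_signals(requests, window_seconds, in_flight, capacity):
    """Return the four signals for one observation window."""
    durations = sorted(r.duration_ms for r in requests)
    failures = sum(1 for r in requests if not r.ok)
    return {
        # Latency: how long requests take (95th percentile here).
        "latency_p95_ms": percentile(durations, 0.95) if durations else 0.0,
        # Traffic: demand on the resource, in requests per second.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: fraction of requests that failed.
        "error_rate": failures / len(requests) if requests else 0.0,
        # Saturation: how loaded the resource is relative to its capacity.
        "saturation": in_flight / capacity,
    }
```

A percentile is used for latency rather than a mean because a handful of very slow requests can hide behind a healthy-looking average.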
It’s difficult to monitor these signals with single-element managers. Relying solely on element managers would mean manually identifying the relationships between individual resources and then working out how strongly each resource is correlated with, and therefore likely to contribute to, a service objective breach.
To avoid such a tedious—and potentially noncompliant—process, you need to adopt a cloud monitoring tool that not only reports the metrics of a single resource, but also shows you how it’s connected to other resources in your overall infrastructure.
An effective cloud-monitoring strategy relies heavily on setting an alerting policy that recognizes and filters out false positives. For example, an alert that goes off when the CPU reaches 90% saturation will probably set off a flood of alarm bells if your CPU regularly reaches 90% saturation as part of a normal load sequence, such as a nightly backup. But if the alerting mechanism that’s built into your cloud monitoring tool recognizes that your system reaches 90% saturation during those backups, the barrage of false alarms is prevented.
Further, when configuring your cloud monitoring service, you should set up an alerting mechanism that can detect a breach that would otherwise go unnoticed. If left unchecked, such a breach can have a negative impact on the end user.
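The nightly backup example can be sketched in a few lines of Python. This is only one simple way to suppress an expected spike, assuming a fixed backup window; the window times, the `should_alert` function, and the constant names are hypothetical and do not describe how any particular tool implements this.

```python
from datetime import datetime, time

# Hypothetical values: a nightly backup that runs from 01:00 to 03:00,
# and the 90% CPU saturation threshold from the example above.
BACKUP_START = time(1, 0)
BACKUP_END = time(3, 0)
CPU_THRESHOLD = 0.90


def in_backup_window(ts: datetime) -> bool:
    """True if the sample was taken during the expected backup window."""
    return BACKUP_START <= ts.time() < BACKUP_END


def should_alert(cpu_utilization: float, ts: datetime) -> bool:
    """Alert on high CPU unless the spike falls inside the backup window."""
    if cpu_utilization < CPU_THRESHOLD:
        return False
    return not in_backup_window(ts)
```

A fixed window is the crudest form of this idea. A tool that learns the normal load pattern from history would not need the window to be configured by hand.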
We built NetApp® Cloud Insights specifically to address such monitoring needs. Cloud Insights is a SaaS cloud-monitoring tool that gives you actionable knowledge of your infrastructure, including real-time data visualizations of the availability, performance, and usage of your entire IT infrastructure. Cloud Insights can also give you insight into public clouds—specifically, AWS, Azure, and Google Cloud—as well as on-premises multivendor resources. Cloud Insights supports more than 100 data collectors.
Dashboards Create Opportunities for Granular SLI Monitoring
With Cloud Insights, you can easily create custom dashboards to answer straightforward questions that are, nevertheless, difficult to answer. Questions like:
- Where is latency unacceptable?
- What resources are saturated?
- Where is traffic driving high latency, and on which systems?
A key advantage of Cloud Insights is that it automatically discovers service paths, allowing you to see the relationships between resources and events to better understand cause and effect. This means that if you’re experiencing high latency, for example, you can easily see which resources are probably correlated and break them down into responsible resources and affected resources.
The following Expert View report is an example of a latency policy breach on VM win2K16serv35. It shows an 87% correlation to the host ocise-esx-…netapp.com. By selecting that host, you can overlay its CPU utilization graph, and the overlay immediately exposes a close correlation. You can also see a high likelihood that two other VMs are affected by this latency breach.
Finding the root cause: a sample report showing top correlated and degraded resources.
Getting these insights by using only VM and data storage element managers would have been difficult and time consuming.
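To give a feel for the manual work this replaces, the sketch below ranks candidate resources by how closely their metrics track a VM's latency. It assumes a plain Pearson correlation, which is a stand-in chosen for this example and not a description of how Cloud Insights computes its correlation percentages. The function names and the sample data are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / sqrt(var_x * var_y)


def rank_correlated(target, candidates):
    """Sort candidate resources by the strength of their correlation."""
    scored = [(name, pearson(target, series))
              for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Made-up sample data, for illustration only.
vm_latency_ms = [12, 15, 40, 42, 14, 13]
candidates = {
    "host_cpu_pct": [30, 35, 88, 91, 33, 31],
    "datastore_iops": [500, 480, 510, 495, 505, 490],
}
ranking = rank_correlated(vm_latency_ms, candidates)
```

With this sample data the host CPU series ranks first. Doing the same by hand across every host, VM, and datastore in an environment is the tedious part that automatic service path discovery removes.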
Setting Up Alerts
Using Cloud Insights, you can create alerts that detect when a resource has exceeded a specific service level indicator. More importantly, you can create alerts based on the relationship between multiple indicators. For example, instead of setting alerts that ping when a single threshold is met (and suffering the consequent avalanche of alarms), you can specify the severity of an alert and when it’s triggered, in conjunction with other variables, like the amount of time that the threshold must be exceeded.
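The idea of requiring a threshold to be exceeded for some amount of time can be sketched as follows. This is a generic illustration, not the Cloud Insights alert engine: the `DurationRule` class, its fields, and the use of consecutive samples as a proxy for elapsed time are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DurationRule:
    """Hypothetical rule that fires only after a sustained breach."""
    threshold: float      # value the metric must exceed
    min_consecutive: int  # number of samples in a row that must breach
    severity: str         # for example "warning" or "critical"


def evaluate(rule, samples):
    """Return the rule's severity if the breach is sustained, else None."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach; reset it on a healthy sample.
        streak = streak + 1 if value > rule.threshold else 0
        if streak >= rule.min_consecutive:
            return rule.severity
    return None
```

With a rule such as `DurationRule(threshold=50.0, min_consecutive=3, severity="critical")`, two isolated spikes produce no alert, while three breaching samples in a row do.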
To create an alert, you simply specify the alert name, the object type, and any annotations. Here is an example alert creation dialog box.
Adding a policy alert dialog box.
Unique to the Cloud Insights alerting system is its ability to easily specify multiple thresholds. You can specify as many thresholds as needed for a given object type. You can set an alert to take effect if any one of the thresholds is crossed, or you can specify that all thresholds must be crossed in order to trigger an alert.
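The difference between the "any" and "all" modes can be shown with a short Python sketch. The `Threshold` class, the `policy_fires` function, and the metric names are hypothetical and are not part of any Cloud Insights API.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical threshold on one named metric."""
    metric: str
    limit: float


def policy_fires(thresholds, metrics, mode="any"):
    """Check current metric values against a set of thresholds.

    mode="any" fires when at least one threshold is crossed;
    mode="all" fires only when every threshold is crossed.
    A metric with no current value is treated as not crossed.
    """
    crossed = [
        t.metric in metrics and metrics[t.metric] > t.limit
        for t in thresholds
    ]
    if mode == "any":
        return any(crossed)
    if mode == "all":
        return bool(crossed) and all(crossed)
    raise ValueError(f"unknown mode: {mode!r}")
```

Requiring all thresholds to be crossed, for example high latency together with high CPU utilization, is one way to cut down on alerts that a single noisy metric would otherwise raise.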
In this video, which I recorded at re:Invent last year, I show examples of using Cloud Insights to monitor, troubleshoot, and optimize your entire infrastructure.
To summarize, Cloud Insights helps you define an effective monitoring strategy for both your multivendor on-premises and cloud infrastructure. Cloud Insights goes beyond simple element managers by showing you the relationships between resources. It also allows you to set complex alerts that minimize false positives and maximize your ability to find problems before they affect your users.
To get started with Cloud Insights now, watch the on-demand webinar, then visit Cloud Central and register for our 14-day free trial.