
Storage QoS in NetApp Cloud Volumes ONTAP

Your enterprise depends on its applications running at peak performance. Delivering that level of service consistently can be a challenge, whether it’s at the network level or for storage using Cloud Volumes ONTAP. Managing NetApp storage quality of service (QoS) is one solution to this challenge.

This article will explain how high IOPS or MB/s throughput on a single workload can affect other workloads using the same Cloud Volumes ONTAP instance. Then we will show how to resolve this issue by assigning a high-utilization volume to a QoS policy group and reducing its effect on other services.

What is Quality of Service?

Quality of service (QoS) is the management of a service’s overall resources, including the reservation and prioritization of resources within that service, so that resources are distributed as required and no single user or system degrades the service for others.

There are two different ways that servers and services use storage: server storage and network storage.

Server Storage

Computer systems have always been designed around the principle that the internal disks are there to be utilized, not just for system and user files but also to effectively extend the amount of memory a system has. At first, simply storing chunks of memory on the disk when memory was low was sufficient, but now complex algorithms determine which chunks of memory can be paged out to disk to make room for caching data from disk in memory. This is normal and simply how servers work.

Therefore, the performance of most computer systems will be limited by how fast they can access their storage.

Network Storage

When a computer system is connected to network-based storage it will most commonly be via a Storage Area Network (SAN) or Network-Attached Storage (NAS).

SAN storage appears to the server as a local disk, accessed at the block level and most commonly via iSCSI (Internet Small Computer Systems Interface). Several servers may also be using this "local" disk, and it is the OS or application’s responsibility to prevent conflicts. NAS storage is accessed at the file level, normally by NFS (Network File System) or CIFS/SMB (Common Internet File System/Server Message Block).

A server connected to network storage will behave as if it were the only computer connected and will attempt to use all of the storage performance available, creating a high workload on the Cloud Volumes ONTAP instance with no regard for other systems also using that instance. For example, this would happen if several servers were connected to a storage volume on a Cloud Volumes ONTAP instance via an NFS share and all were writing files and generating high workloads concurrently. The server that can write the fastest, whether because of a faster network or more available processing, can reduce the write speed of the other servers (the victims) to increase its own, and can even effectively lock other workloads out.

This behavior is called bullying: one storage workload bullies other storage workloads. Production workloads on a Cloud Volumes ONTAP instance can be impacted by, for example, development workloads whose many small writes generate high IO on the shared instance, or by a backup workload moving large chunks of data at high MB/s throughput, which hits during off-peak hours and can be difficult to investigate.

Workload Management in Cloud Volumes ONTAP Using NetApp QoS

QoS can be used to manage any bully workload and restore performant access for the victim workloads. On Cloud Volumes ONTAP we create a QoS policy group and attach storage resources to it; a storage resource can be an SVM, a LUN, a storage volume, or a file.
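
As a sketch only (pg-example, vol1, and /vol/vol1/lun1 are placeholder names, and the policy group pg-example is assumed to exist already), a policy group is attached to a volume or a LUN like this:

cluster1::> volume modify -vserver vser1 -volume vol1 -qos-policy-group pg-example
cluster1::> lun modify -vserver vser1 -path /vol/vol1/lun1 -qos-policy-group pg-example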

On Cloud Volumes ONTAP, QoS is configured by setting ceiling or floor performance levels, which are the maximum or minimum performance thresholds a workload can have. Performance can be specified either in IOPS (IO operations per second) or as throughput, such as KB/s or MB/s.
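
For illustration only (the policy names and values below are placeholders), a ceiling is set with the -max-throughput parameter and a floor with -min-throughput; note that floors are only available on certain ONTAP versions and platforms, so check your release before relying on them:

cluster1::> qos policy-group create -policy-group pg-ceiling -vserver vser1 -max-throughput 3000iops
cluster1::> qos policy-group create -policy-group pg-ceiling-mb -vserver vser1 -max-throughput 100MB/s
cluster1::> qos policy-group create -policy-group pg-floor -vserver vser1 -min-throughput 1000iops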

Performance thresholds can be "Adaptive", which sets the threshold as a ratio of the resource size. Assume we have a database application using a storage volume with a ceiling set to 100 IOPS per GB; if the volume size is 4GB, the ceiling is 400 IOPS. If we later increase the volume size to 8GB, the ceiling automatically increases to 800 IOPS, which the database may require to utilize the extra storage without reduced performance.
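
A hedged sketch of the adaptive case described above (the policy group apg-db, the volume db_vol, and the expected-IOPS value are illustrative assumptions, and some releases express adaptive limits per TB rather than per GB):

cluster1::> qos adaptive-policy-group create -policy-group apg-db -vserver vser1 -expected-iops 50iops/GB -peak-iops 100iops/GB
cluster1::> volume modify -vserver vser1 -volume db_vol -qos-adaptive-policy-group apg-db

With -peak-iops set to 100iops/GB, a 4GB volume gets a 400 IOPS ceiling that grows to 800 IOPS automatically when the volume is resized to 8GB.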

QoS policy groups can be defined as shared or non-shared, which specifies whether the threshold applies to the combined workload of all storage objects assigned to the policy group or to each storage object individually. For example, setting a shared ceiling of 200KB/s on a policy group containing several volumes restricts their combined throughput to 200KB/s, while a non-shared ceiling of 200KB/s restricts each volume to 200KB/s individually. Four volumes in a non-shared group could therefore drive up to 800KB/s in total, but no more than 200KB/s each.
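
The -is-shared parameter controls this behavior. As a sketch (vol_a and vol_b are placeholder volumes), the same 200KB/s ceiling behaves differently depending on the setting:

cluster1::> qos policy-group create -policy-group pg-shared -vserver vser1 -max-throughput 200KB/s -is-shared true
cluster1::> qos policy-group create -policy-group pg-individual -vserver vser1 -max-throughput 200KB/s -is-shared false
cluster1::> volume modify -vserver vser1 -volume vol_a -qos-policy-group pg-individual
cluster1::> volume modify -vserver vser1 -volume vol_b -qos-policy-group pg-individual

Assigned to pg-shared, vol_a and vol_b would share a combined 200KB/s; assigned to pg-individual as shown, each volume gets its own 200KB/s ceiling.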

Find a Bully

Depending on your requirements, QoS policies may be set for all volumes when they are provisioned. Since that could leave a Cloud Volumes ONTAP instance underutilized, assume instead that we have not yet applied policies to any volumes and we discover an application is being restricted by storage. You can read more about the process of identifying performance issues in the “ONTAP Performance Management Power Guide,” available from the ONTAP 9 documentation center.

A simple way to identify a server producing a high workload is to run the "statistics top client show" command from the CLI. From the results below, it is clear that Server1 is the bully, generating far more operations (Ops) than the other servers.

cluster1::> statistics top client show
                                               *Total
        Client Vserver           Node Protocol    Ops
-------------- ------- -------------- -------- ------
Server1        vser1   cloud-toaster1 nfs       10000
Server4        vser1   cloud-toaster1 nfs         800
Server3        vser1   cloud-toaster1 nfs         654
Server5        vser1   cloud-toaster1 nfs         452
Server7        vser1   cloud-toaster1 nfs         209
Server9        vser1   cloud-toaster1 nfs         170
Server2        vser1   cloud-toaster1 nfs         109
Server8        vser1   cloud-toaster1 nfs          92
Server10       vser1   cloud-toaster1 nfs           0
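
Before applying a policy, it can also help to confirm which volume that client traffic lands on. Commands such as the following are useful for this (shown as a sketch; availability and output depend on your ONTAP version):

cluster1::> statistics top file show
cluster1::> qos statistics volume performance show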


Manage the Bully

In this simple example, server1 is generating its load against a single volume called db_data, shared via NFS. The problem can be resolved by creating a non-shared policy group for the volume with a ceiling of 1000 IOPS. This caps the db_data volume at 1000 IOPS, so server1 can no longer monopolize the instance and the remaining performance stays available to other workloads. To do this, follow the steps shown below:

1. Create a policy group pg-db_data on SVM vser1 for the volume

cluster1::> qos policy-group create -policy-group pg-db_data -vserver vser1 -max-throughput 1000iops -is-shared false

2. Apply the policy group pg-db_data to the volume db_data

cluster1::> volume modify -vserver vser1 -volume db_data -qos-policy-group pg-db_data

We are now managing the bully down to 1000 IOPS.
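
To check that the ceiling is actually being enforced, QoS performance counters can be monitored; as a sketch (output columns vary by ONTAP version), the row for pg-db_data should now show IOPS capped at around 1000, typically with any extra latency attributed to the QoS limit:

cluster1::> qos statistics performance show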

Summary

QoS policies can be used on Cloud Volumes ONTAP to manage resources and ensure that no bully workload can restrict or stop other workloads using the service. This keeps the performance of servers accessing Cloud Volumes ONTAP from being impacted by other servers that share its resources.

Aviv Degani, Cloud Solutions Architecture Manager, NetApp
