AWS High Availability Best Practices: Placement Groups, Single Vs. Multi-AZ, and More

Organizations run applications that require different levels of availability and have different SLA objectives. How critical an application is considered depends on its requirements for throughput, responsiveness, and recovery time after a failure. The same considerations apply when forming AWS high availability best practices.

Depending on the deployment’s specific requirements, distributing compute and storage across AWS Availability Zones in combination with Placement Groups is one way to meet these availability targets. An optimized combination of those options, along with a Cloud Volumes ONTAP HA deployment, can meet the requirements presented by each layer.

In this article we review these AWS high availability best practices, along with use cases for single-AZ deployments, multi-AZ deployments, and Placement Groups. We’ll also look at the added benefits that Cloud Volumes ONTAP HA can bring as a solution at the storage level.

AWS Regions and Availability Zones

Amazon spreads the data center hardware that supports its services over geographically isolated areas called AWS regions. There are AWS regions in North America, Europe, Asia, and South America, with more regions planned. Each region is independent and isolated from the others. Resources are not replicated from one region to another unless you explicitly configure it, which comes at an additional charge. This isolation provides fault tolerance and stability: if a whole region fails, the services provided by other regions are not affected.

Each region is composed of multiple isolated Availability Zones (AZs). At the time of writing, there are 66 Availability Zones within 21 geographic regions. Each Availability Zone is a fully isolated partition of the AWS regional infrastructure. A single AZ may be housed in more than one data center facility, with each data center having redundant power, networking, and connectivity, plus redundant network links between data centers. Each AZ is separated from every other AZ in the same region by a considerable geographical distance. The Availability Zones communicate with each other over low-latency, high-bandwidth, redundant metro fiber links. This is all built to eliminate the AZ as a single point of failure and increase reliability.

For instance, for important workloads such as enterprise databases, whether hosted on Amazon EC2 instances or on Amazon native database services (such as Amazon RDS), a multi-AZ distribution model gives you high availability if a major failure takes down an entire Availability Zone. Critical production applications that can’t afford even a moderate amount of downtime benefit from this model and have to treat this type of general failure as a real possibility. The same goes for the upper tiers of the application. Having the underlying databases in an HA multi-AZ configuration won’t help much if the web tier is hosted in only one AZ. In a single-AZ deployment, if the AZ goes down, everything goes down and the actual recovery time rises sharply, not to mention the data loss that can occur in the meantime.
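
The point about tiers can be illustrated with a small simulation. This is a minimal sketch, not an AWS API call: the zone names, instance IDs, and helper functions are all hypothetical, and the model simply treats an application as "up" when every tier still has at least one instance after a zone is lost.

```python
from collections import defaultdict


def distribute(instance_ids, zones):
    """Assign instances to zones round-robin; returns {zone: [instance_id, ...]}."""
    placement = defaultdict(list)
    for index, instance_id in enumerate(instance_ids):
        placement[zones[index % len(zones)]].append(instance_id)
    return dict(placement)


def surviving_tiers(tiers, failed_zone):
    """Return, per tier, the instances still running after failed_zone is lost."""
    return {
        tier: [i for zone, ids in placement.items() if zone != failed_zone for i in ids]
        for tier, placement in tiers.items()
    }


def app_is_up(tiers, failed_zone):
    """The application is up only if every tier keeps at least one instance."""
    return all(surviving_tiers(tiers, failed_zone).values())


# Hypothetical zone names and instance IDs, for illustration only.
zones = ["zone-a", "zone-b"]
mixed = {
    "web": distribute(["web-1", "web-2"], ["zone-a"]),  # web tier in a single AZ
    "db": distribute(["db-1", "db-2"], zones),          # database tier in two AZs
}
multi = {
    "web": distribute(["web-1", "web-2"], zones),
    "db": distribute(["db-1", "db-2"], zones),
}

print(app_is_up(mixed, "zone-a"))  # False: the web tier is gone
print(app_is_up(multi, "zone-a"))  # True: both tiers keep an instance in zone-b
```

In the mixed layout the multi-AZ database survives the loss of zone-a, but the application is still down because the web tier had no instance elsewhere.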

Other important benefits from having multi-AZ deployments include:

  • No I/O delays during backups, as backups are taken from the standby instance.
  • No interruptions to I/O when applying patches or performing upgrades for maintenance purposes.
  • Better responsiveness when load balancing is used. If one AZ is constrained, the instances in other zones can absorb the traffic.

Of course, not all application use cases require a multi-AZ deployment. Temporary tests, dev deployments, or any use case that is not critical can be hosted in a single AZ and avoid the additional costs that come with running a multi-AZ deployment. There are even high-intensity, extremely low-latency use cases that fit the single-AZ model better.

AWS Placement Groups

Simply put, a Placement Group is a configuration option that AWS offers which lets you place a group of interdependent instances in a certain way across the underlying hardware on which those instances reside. The instances could be placed close together, spread through different racks, or spread through different Availability Zones. Let’s take a closer look at each one of the Placement Group types you can choose from and types of workloads that would best fit into each distribution option:

1. Cluster Placement Groups

The cluster placement group configuration allows you to place your group of interrelated instances close together in order to achieve the best throughput and low latency results possible. This option only lets you pack the instances together inside the same Availability Zone, either in the same VPC or between peered VPCs.

The advantage of cluster placement groups is higher network throughput between the instances: a single (point-to-point) flow can reach up to 10 Gbps instead of the usual 5 Gbps limit, with up to 25 Gbps of aggregate traffic. Network-bound HPC (High Performance Computing) applications are the best use cases for this deployment model. Computational engineering, live event streaming, genomics sequencing, astronomy models, and earth-climate compute models are examples of use cases for this type of grouping in the cloud.

2. Partition Placement Groups

With partition placement groups, you can group your instances into separate logical partitions that together form the placement group. The idea is to have each logical partition built on top of separate hardware racks in order to avoid common hardware failures. If one rack fails, it will only affect the instances residing in that logical partition. Each logical partition can hold multiple instances. The partition placement group option allows you to place those partitions within a single AZ or in a multi-AZ setup within the same region.

So, what type of workloads would best fit this model? Big data stores that need to be distributed and replicated are good examples. Large distributed systems such as HDFS (a file system) or Cassandra (a database) are also great fits. Partition placement groups allow you to see which instances are placed into which partitions, so you can make Hadoop or Cassandra topology-aware and configure data replication properly. Any use case needing big data analysis, data reporting, or large-scale indexing would also be a good fit for partition placement groups.
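
To make the idea of topology awareness concrete, here is a minimal sketch of replica placement that never puts two copies of the same data in one partition. It is not how Hadoop or Cassandra implement their placement policies; the function name, node names, and partition numbers are hypothetical, and the node-to-partition mapping is assumed to come from the partition information AWS reports for each instance.

```python
def pick_replica_nodes(partition_of, replication_factor):
    """Choose one node from each of `replication_factor` distinct partitions.

    partition_of maps a node name to the partition number it runs in.
    """
    by_partition = {}
    for node, partition in sorted(partition_of.items()):
        by_partition.setdefault(partition, []).append(node)
    if replication_factor > len(by_partition):
        raise ValueError("not enough partitions for the requested replication factor")
    # Take the lowest-numbered partitions and the first node in each one.
    chosen_partitions = sorted(by_partition)[:replication_factor]
    return [by_partition[p][0] for p in chosen_partitions]


# Hypothetical node names and partition numbers.
partition_of = {"node-1": 1, "node-2": 1, "node-3": 2, "node-4": 3}
replicas = pick_replica_nodes(partition_of, 3)
print(replicas)  # ['node-1', 'node-3', 'node-4']
```

Because each replica sits in a different partition, and therefore on different racks, a single rack failure can take out at most one copy of the data.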

3. Spread Placement Groups

With spread placement groups, each instance runs on its own physical hardware rack. So, if you deploy five instances into this type of placement group, each of those five instances will reside on a different rack with its own network access and power, either within a single AZ or in a multi-AZ architecture.

The spread placement group setup may be similar to partition placement groups, but the main difference is that partition placement groups are made of several instances on each partition, while spread groups are just single individual instances spread through different racks or AZs.

This model is recommended for a small number of instances that are critical to your business. You could, for example, run a small number of SQL database instances here, or your web application tier. This setup is ideal when redundancy is the goal and the workload has less need for the heavy computational power that partition and cluster placement groups are designed for.
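
The three strategies described above map onto the `Strategy` parameter of the EC2 `create_placement_group` call. The following is a minimal sketch that only builds and validates the request parameters without contacting AWS; the helper function and the group names are hypothetical, and the resulting dictionary is assumed to be passed to the boto3 EC2 client as shown in the final comment.

```python
VALID_STRATEGIES = {"cluster", "partition", "spread"}


def placement_group_params(name, strategy, partition_count=None):
    """Build keyword arguments for an EC2 create_placement_group request."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if partition_count is not None and strategy != "partition":
        raise ValueError("partition_count only applies to partition placement groups")
    params = {"GroupName": name, "Strategy": strategy}
    if partition_count is not None:
        params["PartitionCount"] = partition_count
    return params


# Hypothetical group names, one per strategy.
hpc = placement_group_params("hpc-cluster", "cluster")
bigdata = placement_group_params("cassandra-ring", "partition", partition_count=3)
critical = placement_group_params("critical-db", "spread")
print(bigdata)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2").create_placement_group(**bigdata)
```

Keeping the parameter building separate from the API call makes it easy to review or test the intended layout before any resources are created.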

Cloud Volumes ONTAP HA for AWS

The Cloud Volumes ONTAP HA configuration provides AWS high availability at the storage layer. It runs on two nodes of Amazon EC2 compute instances and stores all the data in the underlying Amazon EBS storage, so it can prevent data loss when a failure occurs and recover in less than 60 seconds.

In this Cloud Volumes ONTAP pair, all the data is mirrored between the two nodes, in either an active-active configuration, where both nodes serve clients, or in an active-passive configuration, in which one node is the standby. In both cases the data is synchronously mirrored each time there is new data written. This configuration can also be deployed either in a single AZ scenario or in a multi-AZ scenario:

  • Single-AZ: Both Cloud Volumes ONTAP nodes reside in the same Availability Zone. NetApp Cloud Manager automatically deploys both nodes with a spread placement group configuration in order to avoid common compute failures.
  • Multi-AZ: Each Cloud Volumes ONTAP node resides in a different Availability Zone, again eliminating the AZ as a single point of failure, which is not something that native Amazon EBS storage replication offers. With this type of high availability configuration you need to set up an AWS Transit Gateway with floating IP addresses for failover to work properly and to provide uninterrupted NAS or other data access.

Cloud Volumes ONTAP HA needs three Amazon EC2 instances to work: two main nodes doing all the storage work and one small t2.micro mediator instance in charge of coordinating the automatic failover and failback tasks. The RPO (Recovery Point Objective) is zero, because the data is synchronously mirrored and therefore always consistent. The RTO (Recovery Time Objective) is 60 seconds or less for data to be available again after a failover to the other node.
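
Why synchronous mirroring gives an RPO of zero can be shown with a toy model. This is only a sketch of the general technique, not the Cloud Volumes ONTAP implementation: the `Node` and `HAPair` classes are hypothetical, only the active-passive case is modeled, and the mediator and degraded-mode writes are left out.

```python
class Node:
    """A storage node that keeps blocks in memory (stand-in for EBS-backed storage)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.healthy = True

    def persist(self, key, value):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.blocks[key] = value


class HAPair:
    """Toy model of a synchronously mirrored two-node pair."""

    def __init__(self, primary, partner):
        self.active = primary
        self.partner = partner

    def write(self, key, value):
        # Acknowledge only after both copies are persisted (synchronous mirror).
        self.active.persist(key, value)
        self.partner.persist(key, value)
        return "ack"

    def failover(self):
        # The partner takes over; the failed node becomes the partner.
        self.active, self.partner = self.partner, self.active

    def read(self, key):
        return self.active.blocks[key]


pair = HAPair(Node("node-a"), Node("node-b"))
pair.write("block-1", "payload")

pair.active.healthy = False  # simulate the loss of the active node
pair.failover()
print(pair.active.name, pair.read("block-1"))  # node-b payload
```

Every acknowledged write already exists on the partner, so the surviving node can serve it immediately after failover. An asynchronous mirror would acknowledge before the second copy is made and could lose the most recent writes.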

AWS also offers native redundancy features at the storage layer, but they have limits. As mentioned earlier, Amazon EBS replicates data only across servers within a single Availability Zone. To provide redundancy across zones using Amazon EBS alone, you would need to take EBS snapshots, which are stored in Amazon S3, and restore them in a different Availability Zone, which comes at an additional cost. Amazon EFS, another native high-availability option, exports the stored data only through NFS and currently does not support Windows instances.

Conclusion

The information in this article about AWS high availability best practices leads to three main conclusions:

  1. A multi-AZ configuration adds reliability because the Availability Zone itself is ruled out as a single point of failure.
  2. Different workloads have different sets of requirements which can fit better into either single-AZ deployments or multi-AZ deployments.
  3. Whenever centralized storage is needed, Cloud Volumes ONTAP HA is a solution that brings redundancy and fast recovery at the storage layer, whether you are deploying in a single AZ or across multiple AZs, at a lower or comparable price compared to running on raw Amazon EBS storage.

Your business continuity is important. Cloud Volumes ONTAP gives you the ability to ensure it.

Aviv Degani, Cloud Solutions Architecture Manager, NetApp
