High Availability Cluster: Concepts and Architecture

What is a High Availability Cluster?

High availability clusters are groups of hosts (physical machines) that act as a single system and provide continuous availability. High availability clusters are used for mission critical applications like databases, eCommerce websites, and transaction processing systems.

High availability clusters are typically used for load balancing, backup, and failover purposes. To successfully configure a high availability (HA) cluster, all hosts in the cluster must have access to the same shared storage. If a host fails, a virtual machine (VM) running on it can fail over to another host with little or no downtime.

The number of nodes in a high availability cluster can range from two to dozens, but storage administrators should be aware that adding too many virtual machines and hosts to one HA cluster can make load balancing difficult.

This article is a conceptual overview of high availability architectures, presented as part of our series of articles on AWS high availability.

How Does High Availability Clustering Work?

High availability server clusters are groups of servers that support applications or services that need to run reliably with minimal downtime.

High availability architectures use redundant software performing a similar function, installed on multiple machines, so each of them can be used as a backup when another component fails. Without this clustering, if an application or website fails, the service will not be available until it is repaired.

A highly available architecture prevents this situation using the following process:

  1. Detecting failure
  2. Performing failover of the application to another redundant host
  3. Restarting or repairing the failed server without requiring manual intervention
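The three steps above can be sketched as a single recovery pass. This is a minimal simulation, not a real cluster manager: the `Host` class, the `recover` function, and the policy of picking the first healthy host as the failover target are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical cluster host, used only for this illustration."""
    name: str
    healthy: bool = True
    apps: list = field(default_factory=list)


def recover(hosts):
    """Run one detect -> fail over -> restart pass and return a log of actions."""
    log = []
    for failed in [h for h in hosts if not h.healthy]:
        # Step 1: detect the failure.
        log.append(f"detected failure on {failed.name}")
        # Step 2: fail the applications over to a redundant host.
        target = next((h for h in hosts if h.healthy), None)
        if target is None:
            log.append("no healthy host available")
            continue
        target.apps.extend(failed.apps)
        log.append(f"moved {len(failed.apps)} app(s) to {target.name}")
        failed.apps = []
        # Step 3: restart the failed host (simulated by flipping the flag).
        failed.healthy = True
        log.append(f"restarted {failed.name}")
    return log


hosts = [Host("host-a", healthy=False, apps=["web"]), Host("host-b")]
for line in recover(hosts):
    print(line)
```

In a real deployment each step would be carried out by the cluster software; the sketch only shows the order of operations.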

The heartbeat technique
High availability cluster servers typically use a monitoring technique known as heartbeat. The purpose of this technique is to monitor cluster node health via a dedicated network connection. Each node in the cluster constantly advertises its availability to the other nodes by sending a “heartbeat” over the dedicated network link.
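As a rough sketch of the idea, the monitor below records when each peer was last heard from and reports peers whose heartbeat is overdue. The class name, the 3-second timeout, and the injectable clock are assumptions for illustration and do not reflect any particular cluster product.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (illustrative only)."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed threshold, tune per environment
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def record(self, node):
        """Call whenever a heartbeat arrives from `node`."""
        self.last_seen[node] = self.clock()

    def suspected_down(self):
        """Return peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock so the example runs instantly.
fake_time = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_time[0])
monitor.record("node-a")
monitor.record("node-b")
fake_time[0] = 2.0
monitor.record("node-a")  # node-a keeps sending, node-b goes silent
fake_time[0] = 4.0
print(monitor.suspected_down())
```

The method is named `suspected_down` deliberately: a missed heartbeat shows only that a peer is silent, not whether the peer or the link has failed.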

One of the critical conditions that must be prevented in a high availability cluster is a “split brain”. Split brain happens when all private internal links are cut off at the same time, but the cluster nodes are still functioning.

In this case, every node in the cluster may mistakenly assume that all other nodes are down, and try to start services that other nodes are already running. The result is multiple instances of the same service, all of which may be exposed to users, which can lead to data corruption.

Related content: see our guide to Achieving Application High Availability in the Cloud

High Availability Cluster Concepts

Active/Passive Cluster

Failover ensures that when a node is lost within a service group, the loss is quickly offset by other nodes in the same group. When a node fails, its IP address automatically moves to a standby node. You can also use a network routing tool (for example, a load balancer) to redirect traffic away from a failed node.

In an active/passive model, initially, only one node serves customers, and continues working alone until it fails for some reason. At that point, both new and existing sessions are transferred to a backup or inactive node. Always add one more redundant component for each type of resource (n + 1 redundancy) to ensure you have sufficient resources for existing demand, while covering potential failure.
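A minimal sketch of the active/passive model and the n + 1 sizing rule follows. The `ActivePassivePair` class, the node names, the address `10.0.0.10`, and the demand and capacity figures are all hypothetical; real clusters move the address with their own cluster software.

```python
import math


class ActivePassivePair:
    """One active node owns the virtual IP; the standby waits (illustrative)."""

    def __init__(self, active, standby, virtual_ip):
        self.active = active
        self.standby = standby
        self.virtual_ip = virtual_ip

    def owner(self):
        """Node currently answering on the virtual IP."""
        return self.active

    def fail_over(self):
        """Swap roles so the standby takes over the virtual IP."""
        self.active, self.standby = self.standby, self.active
        return self.active


def n_plus_one(demand, capacity_per_node):
    """Nodes needed to serve `demand`, plus one redundant node."""
    return math.ceil(demand / capacity_per_node) + 1


pair = ActivePassivePair("node-1", "node-2", "10.0.0.10")
pair.fail_over()
print(pair.owner(), "now serves", pair.virtual_ip)
print(n_plus_one(demand=250, capacity_per_node=100), "nodes needed")
```

With the assumed figures, three nodes cover the demand of 250 units, and the fourth is the spare that absorbs a single failure.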

Active/Active Cluster

In a cluster with an active/active design, there are two or more nodes with the same configuration, each of which is directly accessed by clients.

If one node fails, clients automatically connect to the other node and start working with it, as long as it has enough resources (because one node is now handling the load for two nodes). After restoring or replacing the first node, clients are again split between the two original nodes.

The main benefit of running an active/active cluster is that load can be balanced effectively across nodes and networks. A load balancer forwards client requests to the available servers and monitors node and network activity. It moves traffic to the nodes that can best handle it, using predefined algorithms.

The routing strategy can follow a round robin model, in which requests are handed to the available nodes in rotation, or it may follow a weighting scheme, in which one node takes precedence over another by a certain percentage.
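Both routing strategies can be sketched in a few lines. Round robin cycles through the nodes in a fixed order, while the weighted version picks nodes at random in proportion to their weights. The function names, node names, and the 3:1 weighting are assumptions chosen for illustration.

```python
import itertools
import random


def round_robin(nodes):
    """Yield nodes in rotation, forever."""
    return itertools.cycle(nodes)


def weighted_choice(weights, rng):
    """Pick one node at random, in proportion to its weight."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


rr = round_robin(["node-a", "node-b", "node-c"])
print([next(rr) for _ in range(5)])

# node-a is assumed to take about 75% of the traffic, node-b about 25%.
rng = random.Random(0)
picks = [weighted_choice({"node-a": 3, "node-b": 1}, rng) for _ in range(1000)]
print(picks.count("node-a"), picks.count("node-b"))
```

Production load balancers usually combine such a policy with health checks, so that failed nodes are dropped from the rotation.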

In a cluster configuration combining active/active and active/passive, redundancy can be greatly improved by adding a passive node, alongside the active nodes. If a service cannot tolerate downtime, you should aim to combine active and passive high availability models.

Shared-Nothing vs. Shared-Disk Clusters

A core principle of distributed computing is that single points of failure should be avoided. This means that resources must be actively replicated (redundant) or replaceable (using a failover model), with no one factor that can disrupt the entire service if it goes down.

Imagine running dozens of nodes that depend on a single database server for their functionality. Regardless of the number of nodes, failure of one node does not affect the persistent state of the others. But if the database fails, the entire cluster is unusable. Therefore, the database is a single point of failure. This is known as a shared disk cluster.

By contrast, if each node maintains its own database (assuming the databases are kept synchronized for transactional consistency), the failure of one node will not affect the entire cluster. This is known as a shared nothing cluster.
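The difference can be made concrete with a back-of-the-envelope availability estimate. The sketch below assumes that every component fails independently and that each node and each database is up 99% of the time. Both are simplifying assumptions, and the figures are hypothetical.

```python
def shared_disk_availability(node_a, db_a, n):
    """Cluster is usable only if the single database AND at least one node are up."""
    at_least_one_node = 1 - (1 - node_a) ** n
    return db_a * at_least_one_node


def shared_nothing_availability(node_a, db_a, n):
    """Cluster is usable if at least one node together with its own database is up."""
    unit_up = node_a * db_a
    return 1 - (1 - unit_up) ** n


sd = shared_disk_availability(node_a=0.99, db_a=0.99, n=3)
sn = shared_nothing_availability(node_a=0.99, db_a=0.99, n=3)
print(f"shared disk:    {sd:.6f}")
print(f"shared nothing: {sn:.6f}")
```

Under these assumptions the shared disk cluster can never be more available than its single database, however many nodes are added, while the shared nothing cluster improves with each additional node. The estimate ignores the cost of keeping the per-node databases synchronized.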

4 Requirements of a Highly Available Architecture

A high availability cluster architecture has four key components:

1. Load balancing

A highly available system must have a carefully designed, pre-engineered mechanism for load balancing, to distribute client requests between cluster nodes. The load balancing mechanism must specify the exact failover process in case of node failure.  

2. Data scalability

A highly available system must take into account scalability of databases or disk storage units. The two common options for data scalability are using a centralized database and making it highly available with replication or partitioning; or ensuring individual application instances can maintain their own data storage.

3. Geographical diversity

In today’s IT environment, especially with the availability of cloud technology, it is essential to distribute highly available clusters across geographical locations. This ensures the service or application is resilient to a disaster affecting one physical location, because it can fail over to a node in another physical location.
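A simple way to picture this is a failover routine that only considers nodes outside the affected location. The function, the node inventory, and the site names below are hypothetical.

```python
from typing import Optional


def pick_failover_target(nodes, failed_location) -> Optional[str]:
    """Return the first healthy node (by name) outside the failed location."""
    for name, info in sorted(nodes.items()):
        if info["healthy"] and info["location"] != failed_location:
            return name
    return None  # no node survives outside the failed location


nodes = {
    "n1": {"location": "site-a", "healthy": False},
    "n2": {"location": "site-a", "healthy": True},
    "n3": {"location": "site-b", "healthy": True},
}
print(pick_failover_target(nodes, failed_location="site-a"))
```

If every node sits in the same location, the function returns `None`, which is exactly the situation geographical diversity is meant to avoid.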

4. Backup and recovery

Highly available architectures are still subject to errors that can bring down the entire service. If and when this happens, the system must have a backup and recovery strategy, so that the entire system can be restored within a predefined recovery time objective (RTO). A common rule for backups known as “3-2-1” states that you should keep three copies of the data, on two media types, with one copy stored off-site.

Related content: read our guide to Creating Highly Available Systems with RPO 0 on AWS

High Availability for Enterprise Data with NetApp Cloud Volumes ONTAP

NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure and Google Cloud. Cloud Volumes ONTAP supports a capacity of up to 368TB and a variety of use cases, such as file services, databases, DevOps or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.

In particular, Cloud Volumes ONTAP provides high availability, ensuring business continuity with no data loss (RPO=0) and minimal recovery times (RTO < 60 secs).
