A disaster can take many forms—from losing power in your data center to a construction crew accidentally digging through your connectivity—leading you to lose access to your data. Although rare, these kinds of disasters happen, and no company is immune to them.
It’s common to have a business continuity plan (BCP), which includes multiple plans of action to minimize interruption and ensure that business processes continue in the event of an unplanned incident. It’s also common to have a disaster recovery plan (DR plan), often included within the BCP, that deals with how to bring the company’s data back online and restore any lost data within an acceptable amount of time (the recovery time objective).
But while these plans are essential, DR solutions can be high-cost purchases that, in the best of cases, you may never need to use. An on-prem DR site essentially doubles your IT costs, as it requires the same expenditures for floor space, hardware, power, and maintenance as the primary location. Since DR deployments aren’t optional, you have to find a way to control those costs. Using the NetApp Cloud Tiering service is one option.
Tiering data from the DR solution to the cloud can reduce the cost significantly. In this article we will show you how employing Cloud Tiering can reduce the CAPEX and OPEX of your DR environment.
A disaster recovery plan typically requires a secondary storage system installed in another data center, with data replicated to it from the primary system. In this way, if there is ever a disaster, the secondary storage is promoted to primary, company systems switch over to use it, and business continues.
For NetApp users there are several solutions that can be used for disaster recovery. From a business continuity point of view, the main differences between them are the amount of data loss and the recovery time they allow, known as the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO):
Recovery Point Objective (RPO): The RPO is a time value that quantifies how much data loss is acceptable. The lost data might be recreated, re-ingested, manually reentered, or unrecoverable. The RPO is agreed in advance and stated in the BC/DR plan.
Recovery Time Objective (RTO): The RTO is a measure of the time it should take to recover from a business interruption. This is the total time agreed upon to bring application services back online.
For business-critical workloads, where data loss cannot be tolerated and recovery must be quick and automatic, a NetApp MetroCluster configuration is typically used. With MetroCluster, the strictest SLA (RPO = 0 and RTO < 2 minutes) is achieved through synchronous data replication and storage promotion that is seamless to the applications. For other workloads with less strict objectives, the NetApp SnapMirror replication engine is used, and it is the focus of the rest of this blog.
With SnapMirror, replication relationships are created between the primary and the secondary storage at the volume or Storage Virtual Machine (SVM) level. These relationships are configured to match the desired RPO using synchronous or asynchronous replication with schedule-based updates. In SnapMirror terminology, the resource on the primary system is called the source, and the resource on the secondary (DR) system is called the destination.
Synchronous SnapMirror (SM-S)
With synchronous SnapMirror relationships, all writes to the source—which in this case can only be a volume—are replicated simultaneously to the destination volume on the secondary system. Acknowledgment of the write occurs after both writes have completed.
SM-S provides two replication modes. Synchronous (Sync) mode provides zero-RPO replication and does not restrict primary I/O if a replication failure occurs, so the destination can temporarily fall out of sync. Strict Synchronous (StrictSync) mode also provides zero-RPO replication, but if a replication failure occurs it fails primary I/O, ensuring that the source and destination are always in sync.
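The difference between the two modes can be pictured with a small toy model. This is only an illustrative sketch of the behavior described above, not ONTAP code; the `Mode` enum and `handle_write` function are hypothetical names invented for this example.

```python
from enum import Enum


class Mode(Enum):
    SYNC = "Sync"
    STRICT_SYNC = "StrictSync"


def handle_write(mode: Mode, replication_ok: bool) -> dict:
    """Toy model of one client write under each SM-S mode (hypothetical helper)."""
    if replication_ok:
        # Both copies are written before the client gets an acknowledgment.
        return {"write_acknowledged": True, "in_sync": True}
    if mode is Mode.SYNC:
        # Primary I/O continues; the destination falls behind until it resyncs.
        return {"write_acknowledged": True, "in_sync": False}
    # StrictSync: the primary write fails, so the two copies never diverge.
    return {"write_acknowledged": False, "in_sync": True}


for mode in Mode:
    print(mode.value, handle_write(mode, replication_ok=False))
```

The trade-off is availability of the primary versus a guarantee that the destination never lags behind it.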
Asynchronous SnapMirror
With asynchronous SnapMirror, a replication relationship can be set at the volume level or at the SVM level, with updates happening at set time intervals. The update schedule configured for each replication relationship defines how frequently updates to the destination occur and, in case of a disaster, the maximum amount of data loss. The update schedule is effectively your RPO. For example, if your replication frequency is one hour and a catastrophe happens, you could lose up to one hour of data, so your RPO is also one hour.
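As a rough illustration of that arithmetic, the sketch below computes the data-loss window for a given disaster time and checks it against an agreed RPO. The function names and timestamps are hypothetical, and the model deliberately ignores how long each transfer takes to complete.

```python
from datetime import datetime, timedelta


def data_loss_window(last_update: datetime, disaster: datetime) -> timedelta:
    """Data written after the last completed update is not on the destination."""
    if disaster < last_update:
        raise ValueError("disaster time precedes the last update")
    return disaster - last_update


def meets_rpo(loss: timedelta, rpo: timedelta) -> bool:
    """True when the actual loss window is within the agreed objective."""
    return loss <= rpo


# Hypothetical timeline: hourly updates, disaster 45 minutes after the last one.
last_update = datetime(2024, 1, 1, 9, 0)
disaster = datetime(2024, 1, 1, 9, 45)
loss = data_loss_window(last_update, disaster)
print(loss)                                   # 0:45:00
print(meets_rpo(loss, timedelta(hours=1)))    # True
```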
Although the two replication types provide different recovery point objectives, it is quite common to use both, even for the same application. For example, database data files are asynchronously SnapMirrored to the DR system, while the transaction logs are synchronously SnapMirrored.
By starting from the replica of the database data files and replaying the transaction logs to apply only the changes made since the last update, you can quickly recover the database and ensure it is in the same state as before the failover.
With either asynchronous or synchronous SnapMirror relationships, when the primary storage system becomes unavailable, a failover procedure must be executed to resume application I/O on the secondary storage system. With synchronous relationships, the RPO is zero; with asynchronous relationships, the RPO is determined by the selected update schedule. In both cases the RTO is low, and it depends on the failover actions required and on whether they are performed manually or automated through scripts.
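The idea of combining an older data file replica with an up-to-date transaction log can be sketched as follows. This is a deliberately simplified model, assuming each log record carries a sequence number; real databases use their own recovery tooling, and the `recover_database` helper and record layout here are invented for illustration.

```python
def recover_database(data_replica: dict, replica_seq: int, log_records: list) -> dict:
    """Replay only the log records that are newer than the data file replica."""
    recovered = dict(data_replica)  # leave the replica itself untouched
    for seq, key, value in sorted(log_records, key=lambda record: record[0]):
        if seq > replica_seq:
            recovered[key] = value
    return recovered


# Hypothetical state: the data files were last replicated at sequence 1,
# while the synchronously replicated log contains records up to sequence 3.
replica = {"balance": 100}
logs = [(1, "balance", 100), (2, "balance", 120), (3, "owner", "ana")]
print(recover_database(replica, 1, logs))  # {'balance': 120, 'owner': 'ana'}
```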
Failover and Failback to the Primary Environment
In the event of a disruption, failover follows these steps:
- If you have lost connectivity to the primary system but the DR NetApp system has not, quiesce the relevant SnapMirror relationships.
- Break the relevant SnapMirror relationships on the DR system.
- Make the destination accessible to applications (most of the configuration should already be in place).
- Point the application to the DR system.
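The steps above can be sketched as a small runbook builder. It only assembles the ordered list of actions; it does not talk to a storage system, and the `failover_steps` function is a hypothetical helper rather than part of any NetApp tool.

```python
from typing import List


def failover_steps(dr_can_reach_primary: bool) -> List[str]:
    """Return the failover actions in order (illustrative only)."""
    steps = []
    if dr_can_reach_primary:
        # Quiescing only applies while the DR system still sees the primary.
        steps.append("quiesce the relevant SnapMirror relationships")
    steps += [
        "break the relevant SnapMirror relationships on the DR system",
        "make the destination accessible to applications",
        "point the application to the DR system",
    ]
    return steps


for number, step in enumerate(failover_steps(dr_can_reach_primary=False), start=1):
    print(number, step)
```

Encoding the order in a script is one way to automate the procedure, which is what keeps the RTO low.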
The failback to the primary system depends on the requirements, timescales, and the condition of the SnapMirror relationship, but the overall failback process is essentially always the same. If the SnapMirror relationship is still valid, all the changes made since the failover can be pushed back to the source by reversing and resyncing the relationship. Once that completes, an update operation is performed on the reversed relationship to move across any changes made since the resynchronization started.
If the SnapMirror relationship is damaged, a new SnapMirror relationship from the DR system to the primary must be created and initialized, which takes longer to complete, depending on the volume size and network throughput. Once the reversed SnapMirror relationship is in sync, you can fail back during any maintenance window and return to normal operations.
During failback, the application is stopped or disconnected from storage, and a final update is run to move over any last changes. Then, after a SnapMirror break operation, the application can be pointed back to the primary storage, and the SnapMirror relationship can be reversed and resynced once more to return to normal operations.
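The failback sequence can be sketched the same way. The branch on the state of the relationship mirrors the two cases described above; the `failback_steps` function and its step wording are assumptions made for this example, not an official procedure.

```python
from typing import List


def failback_steps(relationship_valid: bool) -> List[str]:
    """Return the failback actions in order (illustrative only)."""
    if relationship_valid:
        # Only the changes made since the failover need to travel back.
        steps = [
            "reverse and resync the relationship (DR -> primary)",
            "update the reversed relationship",
        ]
    else:
        # A full baseline transfer is needed, so this path takes longer.
        steps = ["create and initialize a new relationship (DR -> primary)"]
    steps += [
        "stop the application or disconnect it from storage",
        "run a final update",
        "break the reversed relationship",
        "point the application to the primary storage",
        "reverse and resync the relationship (primary -> DR)",
    ]
    return steps


for step in failback_steps(relationship_valid=True):
    print(step)
```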
Cloud Tiering DR Data
Typically, a DR environment consists of a secondary storage system that must have enough performance to maintain business operations during a disruption and enough capacity to store at least one copy of all the critical volumes. In most cases, DR data sits idle until a disruption happens, which means that by its nature DR data is cold. DR systems also often hold data backups for long-term retention, which consume extra storage and must be taken into account.
Storage capacity for DR data is a significant factor in the solution's overall cost, and this is where Cloud Tiering can make a substantial difference. Cloud Tiering enables automatic tiering of cold data from your on-premises ONTAP clusters to object storage in the public cloud, such as Amazon S3, Azure Blob, and Google Cloud Storage, or to a private cloud offering such as NetApp StorageGRID.
Cloud Tiering offers three policies that allow data to be tiered automatically:
- Cold Snapshots - Only cold snapshot blocks are moved to object storage.
- Cold User Data and Snapshots - Both cold user data blocks and cold snapshot blocks are moved to object storage.
- All - All the data in the volume is marked as cold and moved entirely to object storage.
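The three policies above can be summarized as a simple decision function. This is a conceptual sketch only: the `Block` class and `is_tiered` function are hypothetical, and how the service actually decides that a block is cold is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    is_cold: bool
    snapshot_only: bool  # True if the block belongs only to snapshots


def is_tiered(policy: str, block: Block) -> bool:
    """Decide whether a block moves to object storage under a given policy."""
    if policy == "All":
        return True  # everything is treated as cold
    if policy == "Cold User Data and Snapshots":
        return block.is_cold
    if policy == "Cold Snapshots":
        return block.is_cold and block.snapshot_only
    raise ValueError(f"unknown policy: {policy}")


cold_user_block = Block(is_cold=True, snapshot_only=False)
for policy in ("Cold Snapshots", "Cold User Data and Snapshots", "All"):
    print(policy, is_tiered(policy, cold_user_block))
```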
The recommended tiering policy for DR volumes is the All policy. It ensures that any data written to the destination volume through SnapMirror initialization and updates is immediately tiered to cloud object storage, which enables efficient DR storage management and dramatically reduces the total cost of ownership.
When calculating the performance tier capacity of the DR environment, consider how much hot data is in use by critical or essential applications. Because the DR volumes are tiered to the cloud, they will usually not deliver the required performance immediately after applications fail over to them, as all data is effectively read from cloud storage.
Most likely, when an outage occurs and you fail over to the DR volumes, you would change their tiering policy to “Cold User Data and Snapshots” (Auto). This policy ensures that all randomly read blocks become hot and are promoted to the DR performance tier, satisfying performance needs for subsequent read operations.
Consequently, the DR system needs only enough performance tier capacity to hold the promoted read data, plus a little breathing room for unexpected growth and metadata. It does not need to match the primary system’s capacity.
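A back-of-the-envelope sizing calculation might look like the sketch below. All figures are made up for illustration, and the function name and the headroom fraction are assumptions; real sizing should be based on measurements of your own workloads.

```python
def dr_performance_tier_tb(hot_data_tb: float, headroom_fraction: float) -> float:
    """Capacity for promoted hot data plus headroom for growth and metadata."""
    return hot_data_tb * (1 + headroom_fraction)


# Hypothetical environment: 100 TB on the primary, of which 20 TB is hot.
primary_tb = 100.0
needed_tb = dr_performance_tier_tb(hot_data_tb=20.0, headroom_fraction=0.25)
print(needed_tb)                # 25.0
print(needed_tb / primary_tb)   # 0.25
```

In this made-up case the DR performance tier is a quarter of the primary's capacity, with the rest of the DR copy held in object storage.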
This article has shown how using Cloud Tiering with volumes in a DR environment can significantly reduce the storage capacity required there, which lowers the cost of the final DR solution without affecting its function or failover timescales.