Cloud Availability Nightmares and How to Avoid Them in Cloud File Sharing

Those of you who’ve managed clustered file systems in a data center have probably never had to worry about maintaining a single source of truth (SSOT) — a data-storing structure where data exists in one copy that is only referenced, not duplicated, for access by various systems. The same can’t be said when it comes to cloud file sharing and maintaining the high availability of a file service.

Though maintaining an SSOT isn't really an issue in data centers, what if it’s not your box and you don't have shared storage? What if you're migrating or extending your services to AWS, where an Amazon EBS volume is assigned to a single server? How do you deal with cloud availability issues?

While Amazon Elastic File System (Amazon EFS), Amazon FSx, and Azure Files are excellent choices for cloud-based file shares, data center file share users may see limitations in those cloud services’ backup, throughput and scaling capabilities. Suddenly, finding a solution seems very difficult. In fact, it begins to look like a nightmare.

In this article we will show you some of the traditional high availability file services in the cloud and how they can fail, as well as offer you some solutions in order to avoid those nightmare scenarios, including how to use Cloud Volumes ONTAP as a high availability solution.

High Availability File Services in the Cloud

Maintaining a single source of truth in a data center hasn’t ever posed a problem: Most organizations build a file server cluster on shared storage backed by a Storage Area Network (SAN) so that the data is replicated or synced at the storage layer; or they choose an even easier option, such as exporting a CIFS/NFS-style server as a virtual device from an enterprise storage appliance. Either way, you get shared storage that is highly available, fast, backed by redundant hardware, and streamlined in a box. Maintaining a single source of truth in the cloud is a different story.

If you haven’t managed file replication in the cloud, you’re in for quite a task. There are a ton of available solutions, but each has strengths and weaknesses. Most often, people choose the free solutions: GlusterFS and Microsoft DFS with DFS-R.

GlusterFS (Red Hat Storage Server):

With GlusterFS you can export NFS shares, and that data will replicate between nodes. The configuration becomes more challenging when you set up Samba to export CIFS for Windows clients. If you configure Linux to join a Kerberos domain, such as Active Directory, the challenge compounds because Samba must be configured to pass extended ACLs properly.

Managing permissions becomes laborious and, even worse, your bricks (the storage units that make up a GlusterFS volume) can enter a split-brain scenario. In split brain, there is no consensus between two servers with regard to which data is the “good” data. This can happen in a variety of ways but, if both servers think they are master, there can be a deadlock and your data integrity is put in jeopardy.

Creating clusters with three or more nodes, or clusters with an arbiter voting node, helps to alleviate the split-brain issue, but catalog reconciliation can still be needed after a node restarts or fails. GlusterFS operates on a Master, Replicas model, so the source of truth remains with the master unless a split brain occurs.
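The way an arbiter vote breaks a tie can be sketched in a few lines of Python. This is a generic majority-quorum illustration, not GlusterFS code: the `Node` class and the `has_quorum` and `writes_allowed` functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    reachable: bool           # can this side of the partition see the node?
    is_arbiter: bool = False  # an arbiter votes but stores no file data


def has_quorum(view: list[Node]) -> bool:
    """Return True when a strict majority of all voting nodes is reachable."""
    votes = sum(1 for node in view if node.reachable)
    return votes > len(view) // 2


def writes_allowed(view: list[Node]) -> bool:
    """Allow writes only with quorum and at least one reachable data node."""
    has_data_node = any(n.reachable and not n.is_arbiter for n in view)
    return has_quorum(view) and has_data_node


# A network partition as seen from each side of a two-node cluster plus arbiter.
side_a = [Node("a", True), Node("b", False), Node("arbiter", True, True)]
side_b = [Node("a", False), Node("b", True), Node("arbiter", False, True)]

print(writes_allowed(side_a))  # the side that still reaches the arbiter
print(writes_allowed(side_b))  # the isolated side
```

Only the side that can still reach the arbiter keeps accepting writes, so the two data nodes cannot both diverge. With just two voters, neither side of a partition holds a strict majority, which is why the third vote matters.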

Microsoft Distributed File System (DFS) and DFS-Replication (DFS-R):

The Microsoft equivalent is simple to manage from a permissions standpoint, as AD permissions can be applied to DFS virtual folders in a DFS Root. The virtual folders are assigned “targets” that can be spread across nodes. This is where DFS-R comes into the game.

DFS-R will sync data between targets, and failover occurs after the first call to the failed referral. Referrals are where Microsoft shines in maintaining the source of truth. In setting up a two- or three-node cluster, DFS will allow you to set the referral order (that is, which node DFS returns to requesting clients).

In a three-node cluster, Node A can be assigned “Always refer First,” and Node C can be assigned “Refer last”. That way, you know your source of truth is on Node A and, if it fails, it will be on Node B and so on.

Using referrals is highly recommended so that you don’t get into the issue of trying to reconcile data after a failure, as the “good” files could be written to any server.
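The effect of a fixed referral order can be modeled with a short Python sketch. It is only an illustration of priority-based failover, not the DFS implementation: the `PRIORITY_RANK` labels and the `pick_target` function are hypothetical.

```python
from typing import Optional

# Hypothetical priority labels, loosely modeled on the referral settings above.
PRIORITY_RANK = {"first": 0, "normal": 1, "last": 2}


def pick_target(targets: dict[str, str], healthy: set[str]) -> Optional[str]:
    """Return the healthy target with the best referral priority, if any."""
    ordered = sorted(targets, key=lambda name: (PRIORITY_RANK[targets[name]], name))
    for name in ordered:
        if name in healthy:
            return name
    return None


targets = {"A": "first", "B": "normal", "C": "last"}

print(pick_target(targets, {"A", "B", "C"}))  # all nodes up
print(pick_target(targets, {"B", "C"}))       # Node A has failed
print(pick_target(targets, {"C"}))            # only Node C is left
```

Because every client is sent to the same node for as long as it is healthy, writes land in one place and the source of truth is always the highest-priority node that is still up.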

File Service Nightmares

The solutions above do work, but be aware — when they don’t work, your support folks are in for a long weekend. Here are two things that can go wrong:

1. Catalog Failure

DFS can fail. Each node maintains a catalog database of its files. If that database becomes corrupted, the replication service will stop and the node's data will become stale. When this happens, the only “easy” way to fix it is to wipe the data (to be sure) and Robocopy the data from your good source of truth.

If the referrals are turned off, how would you know which files are good and which are old? Consider this: DFS-R will not replicate a file while it is in use. It also decides which version of a file wins in replication based on a timestamp.

What if your files were highly transactional? Take this example: An OCR platform writes to an XML file ten times through the ingest and processing. If no referrals are used, the file flow could look like this:

  • Transaction 1 of 10 - written to A (Copied to B, not C)
  • Transaction 2,3 of 10 - written to C (Copied to A and B)
  • Transaction 4 of 10 - written to B (not copied)
  • Transaction 5,6,7 of 10 - written to A (Copied to B and C) 
  • And so on...

At this point, you can see that A has the newest timestamp, but only includes data from transactions 2, 3, 5, 6, and 7; transactions 1 and 4 have been lost. You could imagine how painful this would be to reconcile.
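The flow above can be replayed in a small Python simulation. It is a simplified model rather than DFS-R itself: each node's copy of the XML file is represented as the list of transactions it contains, and replication is assumed to overwrite the whole file on the destination. The `simulate` function and its helpers are hypothetical names.

```python
def simulate() -> dict[str, list[int]]:
    """Replay the transaction flow and return each node's final file contents."""
    files: dict[str, list[int]] = {"A": [], "B": [], "C": []}

    def write(node: str, transactions: list[int]) -> None:
        # The application appends transactions to whichever copy it was given.
        files[node].extend(transactions)

    def replicate(source: str, destinations: list[str]) -> None:
        # Whole-file replication: the newer file replaces the destination copy.
        for dest in destinations:
            files[dest] = list(files[source])

    write("A", [1])
    replicate("A", ["B"])          # copied to B, not C
    write("C", [2, 3])
    replicate("C", ["A", "B"])     # C never saw transaction 1
    write("B", [4])                # not copied anywhere
    write("A", [5, 6, 7])
    replicate("A", ["B", "C"])     # overwrites B's transaction 4
    return files


final = simulate()
missing = sorted(set(range(1, 8)) - set(final["A"]))
print(final["A"], "missing:", missing)
```

Under these assumptions every node ends up with the same file, and that file never contained transactions 1 and 4 together with the rest, so no copy anywhere holds the full history.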

2. Too Much to Sync

Whether using Gluster or DFS-R, moving data to the cloud is time consuming and often requires a syncing technology, most often rsync or Robocopy. Both tools are extremely robust and can copy your data any way you choose, but in essence, they copy all the data from point A to point B. Subsequent copies will file check, skip duplicates and copy deltas.
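The "skip duplicates, copy deltas" behavior of a follow-up pass can be illustrated with a minimal Python sketch. It is not how rsync or Robocopy are implemented; it only mimics the common quick check of comparing file size and modification time. The `sync_tree` function is a hypothetical name.

```python
import shutil
from pathlib import Path


def sync_tree(src: Path, dst: Path) -> list[Path]:
    """Copy files from src to dst, skipping files whose size and mtime match.

    Returns the relative paths that were actually copied in this pass.
    """
    copied = []
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        if dst_file.exists():
            s, d = src_file.stat(), dst_file.stat()
            # Quick check: same size and same whole-second mtime means "unchanged".
            if s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime):
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also preserves the mtime
        copied.append(rel)
    return copied
```

The first call copies everything, and later calls copy only files that changed. Note that even a pass that copies nothing still has to walk and stat every file, which is part of why repeated passes over very large trees are slow.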

What if your application required moving 220 TB to AWS? Not a big deal: you order a few Amazon Snowballs, sync your data, ship it, and AWS will upload it to Amazon S3. But what if your data wasn’t designed for object storage, but for standard POSIX/NTFS-style file storage? What if the rate of change was over 100 GB per day? How would you move that much data?

It’s likely that it would take a considerable amount of time and you may never catch up to the rate of change, much less perform additional passes to copy missed files. It may sound like a daunting amount of data, but it’s common for enterprise customers, and the challenge to migrate the service to AWS storage or Azure storage is great.  
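A rough back-of-the-envelope calculation shows why catching up is hard. The sketch below uses the 220 TB and 100 GB per day figures from the example; the link speeds and the 80% efficiency factor are assumptions chosen for illustration, and `days_to_catch_up` is a hypothetical helper.

```python
from typing import Optional


def days_to_catch_up(
    total_tb: float,
    change_gb_per_day: float,
    link_gbps: float,
    efficiency: float = 0.8,  # assumed share of the link usable for the sync
) -> Optional[float]:
    """Estimate days for an online sync to catch up, or None if it never does."""
    # Decimal units: 1 TB = 1000 GB, 1 GB = 8 gigabits, 86,400 seconds per day.
    throughput_gb_per_day = link_gbps / 8 * efficiency * 86_400
    net_gb_per_day = throughput_gb_per_day - change_gb_per_day
    if net_gb_per_day <= 0:
        return None  # data changes faster than the link can move it
    return total_tb * 1000 / net_gb_per_day


for gbps in (1.0, 0.1, 0.01):
    print(gbps, "Gbps:", days_to_catch_up(220, 100, gbps))
```

Under these assumptions a dedicated 1 Gbps link needs about 26 days, a 100 Mbps link needs about 288 days, and a 10 Mbps link moves less than 100 GB per day, so it never catches up at all.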

File Services Solutions

You could read article after article about how building a data center in the cloud is completely different than managing your on-premises data center. When it comes to high availability file services, the same thing is true: They are totally different.

In this section we’ll look at these three solutions: Refactoring for the cloud, file service microservices, and non-native data management solutions with NetApp.

1. Refactoring for the Cloud Data Center

If your applications need file services, refactor them to use highly available object storage, such as Amazon S3, and abandon that old POSIX/NTFS-style file server. Amazon S3 is the AWS object storage service; it scales practically without limit and, with versioning and replication enabled, reduces the need for traditional backups.

Refactoring applications can be extremely costly and, in some cases, impossible — say with a code base that hasn’t been updated in twenty years with no engineers on staff who were part of the build. If it can be done, it is a worthy effort to build for longevity. The storage cost alone can be its own appeal to pay back the refactoring effort.

2. File Service Microservices

If you simply must have a Microsoft or Linux file server, treat it as part of your application ecosystem and prepare for quick recovery. For example, store application A's files on its own server, on as small of an instance as it will run, and with the minimum needed IOPS.

Typically, General Purpose SSD (gp2) is perfect for file storage. Snapshot the volume often and, if there is a server issue, restore from a snapshot.

This isn't high availability, but the cost is low and recovery is fast. You won't be chasing the source of truth and one failure won't bring multiple ecosystems down. You can predict your failure points and have a DR runbook or, better yet, automation to recover.
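As one possible piece of that automation, the sketch below restores a new gp2 volume from the newest completed snapshot of a failed volume. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) with credentials and region already configured; `restore_latest_snapshot` is a hypothetical helper, and pagination, tagging, and attaching the volume to an instance are left out.

```python
def restore_latest_snapshot(ec2, volume_id: str, availability_zone: str) -> str:
    """Create a gp2 volume from the newest completed snapshot of volume_id.

    `ec2` is assumed to be a boto3 EC2 client. Returns the new volume's ID.
    """
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = response["Snapshots"]
    if not snapshots:
        raise RuntimeError(f"no completed snapshots found for {volume_id}")

    # StartTime is a datetime, so max() picks the most recent snapshot.
    latest = max(snapshots, key=lambda snap: snap["StartTime"])
    new_volume = ec2.create_volume(
        SnapshotId=latest["SnapshotId"],
        AvailabilityZone=availability_zone,
        VolumeType="gp2",
    )
    return new_volume["VolumeId"]
```

Passing the client in as a parameter keeps the function easy to exercise in a DR drill with a stub client before it is ever pointed at a real account.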

3. Data Management Solutions with NetApp: Cloud Volumes ONTAP and Cloud Sync

NetApp is a storage leader that has extended its data management solutions into the cloud with Cloud Volumes ONTAP.

If you are an existing NetApp customer, your on-premises appliances can sync with Cloud Volumes ONTAP using SnapMirror® data replication technology for file replication. The ONTAP instances can manage roughly 360TB each. Cloud Volumes ONTAP running on AWS also provides capabilities to launch highly available storage with zero data loss (RPO=0) and automatic failover.

This is an ideal situation for Azure or AWS high availability. You can use a Cloud Volumes ONTAP HA (high availability) pair as an active-active configuration in which both nodes serve data to clients, or as an active-passive configuration in which the passive node responds to data requests only if it has taken over storage for the active node.

Unlike the native OS solutions, Cloud Volumes ONTAP can be customized to use a virtual endpoint that leverages routing to the ENIs to maintain availability and auto-healing on the file systems, in case of a catalog or replication issue. Take the pain out of chasing failures and syncing volumes and let it happen automatically.

Lastly, when you are ready to make the migration from standard file systems to object storage, NetApp’s Cloud Sync can take your existing data and migrate it to Amazon S3 for you.

There’s no need to set up lengthy and complicated multi-part upload/file hash checking scripts. It can convert files into objects, sync, save the directory hierarchy and file attributes, provide progress reports and continue to move files as you remain in an interim state, until you finally cut over.


The storage layer of your solution is often the most challenging to manage in the cloud-centric, DevOps world. It is stateful and requires careful design considerations, and you want to run through the cloud availability failure scenarios before they happen, because failures will happen.

When they do, the last thing you want is to be managing disparate data sources and trying to figure out what data is the right data, all under the gun, after the solution fails. This is why managing the source of truth is so critical to every piece of your solution. This is why you must double down on a vendor you can trust to ensure your data resiliency when expanding to the hybrid cloud or completely migrating.

With NetApp, you'll have a partner that you can trust to maintain and transfer your data safely and securely. If you want to avoid the high availability file service nightmares of disparate data, find out more about NetApp data management solutions: Cloud Sync and Cloud Volumes ONTAP.

Want to get started? Try out Cloud Volumes ONTAP today with a 30-day free trial.