All of us are on the brink of failure.
“Every veteran IT professional knows what it means to have your site down. No matter what the cause, we need to get it back online as fast as possible. Why? Simple – when our sites are down, we are exposed and vulnerable. Our ecommerce cart is down so we lose money. Not to mention, our reputation takes a hit with every minute our users are banging their heads against their keyboards in frustration.
There are multiple causes for IT failure. On a larger scale, a whole cloud region might be down; on a smaller scale, one of the developers might have accidentally uploaded a few corrupted files.
In addition to identifying the root cause of the failure, there is always the question of who or what is to blame for it – be it the result of human error or a malfunction in an automated process or in a physical appliance.”
In this article, we will discuss concepts and tools that we hope will help you focus your efforts to better protect your online service so that when outages happen, you are less panicked and more prepared.
The words above in italics were written only a few weeks ago. Not long after, news broke of the devastating AWS outage in the US-East region (Virginia).
Many websites, services and even IoT gadgets were heavily affected. Perhaps the biggest story from the outage is that one of the businesses suffering from it was AWS itself, as the company relies on its own infrastructure for various services.
Amazon is a big name, but it’s by no means the only big company that took a hit of this kind.
Check out the list below:
|DNS provider Dyn||A DDoS attack by a botnet consisting of a large number of Internet-connected devices, such as printers and IP cameras, that were infected with the Mirai malware.|
|Microsoft Office 365||Physical components degraded due to heavy user demand.|
|Salesforce||A service disruption caused by a database failure.|
|Google Cloud Platform||An outage affecting the cloud instances and VPN service in all regions. The outage was caused by bugs in the network-management layer.|
|Amazon Web Services in Australia||As storms pummeled Sydney, Australia, on June 4, 2016, an Amazon Web Services region in the area lost power and a number of EC2 instances and EBS volumes hosting critical workloads for name-brand companies subsequently failed.|
Before continuing, it is important to understand where AWS’s responsibility ends and yours begins. The general rule is that Amazon is responsible for the security of the cloud itself (the underlying infrastructure), while you, the user, are responsible for what you put in the cloud (your data, configuration and applications).
To understand more, make sure you carefully read the AWS shared responsibility model – and make sure your colleagues do as well.
Responsibility at the Vendor Level
Over the last decade, both AWS and Azure – the cloud infrastructure industry leaders – have built up a great global presence. At the time of writing, AWS has 42 availability zones (AZs) spread across 16 different regions, while Azure is in 34 regions.
According to the cloud vendors, these AZs and regions are segregated physical data centers. They are well secured and adhere to strict compliance standards.
In addition, each cloud vendor has its own out-of-the-box replication and recovery mechanism for its Database-as-a-Service (DBaaS) offering or for its object storage pool.
Amazon lets you create a cross-region read replica of an RDS database, so in case of a failure you can promote the replica to become the new master database.
The same goes for Azure SQL active geo-replication. In short, the vendor gives you a robust physical site and the “cloud building blocks” that enable you to build and customize your own backup and recovery processes.
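As a rough illustration of the replica promotion described above, the following Python sketch promotes a read replica only after several consecutive failed health checks. It assumes `rds_client` is a boto3 RDS client created in the replica's region, and that `check_primary` is a hypothetical callable you supply that returns `True` while the primary is reachable. The retry counts are arbitrary placeholders, not recommendations.

```python
import time


def should_fail_over(check_primary, attempts=3, delay_seconds=10.0):
    """Return True only if every health-check attempt fails."""
    for attempt in range(attempts):
        if check_primary():
            return False  # primary answered, no failover needed
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before re-checking
    return True


def fail_over_to_replica(rds_client, replica_id, check_primary,
                         attempts=3, delay_seconds=10.0):
    """Promote the read replica if the primary looks down.

    Assumption: rds_client behaves like boto3.client("rds") and
    replica_id names an existing read replica.
    """
    if not should_fail_over(check_primary, attempts, delay_seconds):
        return False
    # Promotion turns the replica into a standalone, writable instance.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    return True
```

Requiring several failed checks in a row is one simple way to avoid promoting the replica because of a single network blip. Promotion cannot be undone automatically, so your application still needs to be pointed at the new database afterwards.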
You can use the AWS console to take your next EBS volume snapshot, but that won’t work at scale, and it definitely won’t work if you tend to forget things.
The same is true of running failover processes by hand. Instead:
- Automate your compute and block storage snapshots, continually and consistently.
- Stream your data between regions.
- Separate privileges between your backup repositories and your other environments, especially the production one.
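The first two points above could be sketched in Python roughly as follows. The sketch assumes both clients are boto3 EC2 clients (one per region) and that volumes to be backed up carry a hypothetical `Backup=true` tag. It ignores pagination and does not wait for a snapshot to complete before copying it, both of which a real job would need to handle.

```python
def snapshot_tagged_volumes(ec2_client, tag_key="Backup", tag_value="true"):
    """Create a snapshot of every EBS volume carrying the given tag.

    Assumption: ec2_client behaves like boto3.client("ec2"); the tag
    name and value are illustrative.
    """
    response = ec2_client.describe_volumes(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    snapshot_ids = []
    for volume in response["Volumes"]:
        snapshot = ec2_client.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description=f"Automated backup of {volume['VolumeId']}",
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids


def copy_snapshots_to_region(dest_ec2_client, source_region, snapshot_ids):
    """Copy snapshots into the region that dest_ec2_client points at."""
    copied_ids = []
    for snapshot_id in snapshot_ids:
        copy = dest_ec2_client.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"Cross-region copy of {snapshot_id}",
        )
        copied_ids.append(copy["SnapshotId"])
    return copied_ids
```

Run from a scheduler under credentials that can only write to the backup account or region, a job like this also goes some way toward the third point, separating backup privileges from production.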
4 Key Cloud Building Blocks at the Application Level
1. Data Replication and Backup
First, your cloud deployment blueprint should include data replication and backup, and these should be part of the complete application architecture plan.
2. External Disks
On the application level, you should also consider the link between your application and your storage. The basic and most obvious point (though still important to mention as it’s not an uncommon occurrence), is that data should be stored on external disks, and not on the application's local server.
Apart from protecting you when the server goes down, doing so also supports cloud elasticity, where instances come and go.
3. Persistent Backup
Another important checklist item is keeping a persistent backup. The DBaaS backup options mentioned above are convenient, but you should consider their limitations.
For example, Amazon RDS retains automated backups for a maximum of 35 days, so you will need both a data migration and a sync solution that move data to another region or even to another cloud. Mature databases such as Oracle do come with native support for persistent backup, but might not leverage the scalability of the cloud.
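To make the retention limit concrete, here is a small Python sketch that picks the backups which should be copied to long-term storage before they age out. The 35-day figure comes from the text above; the five-day safety margin, the `(backup_id, created_at)` tuple shape and the `archived_ids` set are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta
from typing import FrozenSet, Iterable, List, Tuple

# Maximum retention of RDS automated backups, as stated in the article.
RDS_MAX_RETENTION = timedelta(days=35)


def backups_to_archive(
    backups: Iterable[Tuple[str, datetime]],
    now: datetime,
    safety_margin: timedelta = timedelta(days=5),
    archived_ids: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return IDs of backups that are close to expiry and not yet archived.

    A backup qualifies once it is older than the retention window minus
    the safety margin. All datetimes must be either naive or aware,
    not a mix of the two.
    """
    cutoff = now - (RDS_MAX_RETENTION - safety_margin)
    return sorted(
        backup_id
        for backup_id, created_at in backups
        if created_at <= cutoff and backup_id not in archived_ids
    )
```

The function only decides what to move; the copy itself (to another region, an object store or another cloud) is left to whatever migration tool you choose.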
4. Backup Consistency
Backup consistency deserves attention as well. AWS provides EBS snapshots, which are point-in-time backups of a volume. Automating around this cloud building block supports the need for a consistent backup in the cloud, which in turn means reliable recovery without data loss.
Another option is the Linux Logical Volume Manager (LVM), which allows your disks to be backed up while the data on the volume keeps changing, eliminating the need to stop your storage device.
Automate Disaster Recovery Tests
No matter what the planning process, good backup and recovery processes are measured during a real event. So why not simulate these events on a regular basis, measuring your system robustness continually?
Some of the most interesting and inspiring developments in testing disaster recovery systems were presented by Netflix, the Amazon cloud poster child.
Over the years, Netflix engineers have built, and in some cases open-sourced, tools such as Chaos Monkey and Chaos Kong, which create random malfunctions across the Netflix cloud stack. These put the system’s self-healing processes to the test, making sure that points of failure are proactively revealed and fixed.
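The core idea behind this kind of tool can be sketched in a few lines of Python. This is not Netflix's implementation, only a minimal illustration: it picks at most one random instance per group as a candidate for termination and skips groups too small to survive the loss. The group names, instance IDs and parameters are all hypothetical, and actually terminating the chosen instances is left to your cloud API of choice.

```python
import random


def pick_victims(instances_by_group, rng, probability=1.0, min_group_size=2):
    """Choose at most one random instance per group to take down.

    instances_by_group maps a group name to a list of instance IDs.
    rng is a random.Random instance, passed in so runs can be seeded.
    """
    victims = {}
    for group, instances in sorted(instances_by_group.items()):
        if len(instances) < min_group_size:
            continue  # never take down a group's only instance
        if rng.random() < probability:
            victims[group] = rng.choice(sorted(instances))
    return victims
```

Passing in a seeded `random.Random` keeps an experiment reproducible, which helps when you need to replay a failure that exposed a problem.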
With the cloud, you should strive to automate a test routine, checking that your backup repositories are up to date and that your failover scripts are doing their job.
As a result, you will not only gain confidence in your system’s robustness but also uncover problems early. Even if a problem cannot be fixed on the spot, you will at least know the manual steps the team needs to take to resolve it in a timely manner.
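One piece of such a test routine, checking that backup repositories are up to date, might look like the following Python sketch. The 24-hour threshold and the shape of `latest_backup_times` (a mapping from a resource name to the time of its newest backup, or `None` if it has none) are assumptions for illustration; how you collect those timestamps depends on your backup tooling.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def find_stale_backups(
    latest_backup_times: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return the names of resources whose newest backup is missing or too old."""
    stale = []
    for name, last_backup in sorted(latest_backup_times.items()):
        if last_backup is None or now - last_backup > max_age:
            stale.append(name)
    return stale
```

A scheduled job could run this check and alert whenever the returned list is non-empty, so that a stale backup is discovered during a routine test and not during a real outage.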