Given the importance of IT systems, it would be easy to assume that they all rely on highly available infrastructure. Yet for a number of reasons, chiefly cost and complexity, that is often not the case, and many businesses simply accept the lack of high availability as a business risk.
How can you design a highly available deployment on Google Cloud? Google Cloud Platform offers a number of services that are global or multi-region by default and that can help engineering teams build robust systems. It is therefore worth exploring the native Google Cloud Platform capabilities for designing highly available systems from the ground up.
Google Cloud Platform, a leading voice in Site Reliability Engineering, currently has 24 regions, with a few more already planned to open soon. Each Google Cloud region is an independent geographical area that contains at least two Google Cloud availability zones. A good practice is to deploy across multiple zones to improve resilience against a single-zone failure, for example the loss of one data center. Using multiple zones within a region increases overall system availability, and failures of an entire region are uncommon. When those rare region failures do happen, however, they can have disastrous consequences and a very negative and visible public impact.
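The effect of spreading a deployment across zones can be estimated with simple arithmetic. The sketch below assumes that zones fail independently of each other and uses a made-up per-zone availability of 99.5%. Neither that figure nor the function names come from Google's documentation, and real zone failures are not perfectly independent.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumption: zones fail independently, which is an idealization.
    """
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The deployment is down only when every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per (365-day) year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    PER_ZONE = 0.995  # hypothetical figure, for illustration only
    for zone_count in (1, 2, 3):
        availability = combined_availability(PER_ZONE, zone_count)
        minutes = downtime_minutes_per_year(availability)
        print(f"{zone_count} zone(s): {availability:.6%} available, "
              f"about {minutes:.1f} minutes of downtime per year")
```

A region-wide outage takes all of its zones down together, so the independence assumption breaks in exactly the rare case described above. This estimate therefore says nothing about resilience to regional failures.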
Google Cloud Platform has multiple services that can offer multi-region availability by default. Good examples are Cloud Key Management Service, which lets you manage your data encryption and decryption keys, and Cloud Load Balancing, which provides load balancing for both internal and internet-facing applications. A few services are not just multi-region but global by default, such as Cloud CDN (Google's content delivery network) and Cloud DNS, one of the few cloud services that provides a 100% uptime SLA.
Data Resiliency and Availability
The Storage Challenge
Data storage and stateful workloads are the most challenging aspects when designing a highly available system. Keeping data replicated and in sync while geographically distant services are reading and writing data is incredibly challenging to implement when traffic and data volumes are high.
Google Cloud Platform has a few data storage services that provide built-in multi-region support, such as Cloud Firestore (a NoSQL database for web and mobile applications), Bigtable (a NoSQL database for large analytical and operational workloads), Spanner (a managed relational database), and BigQuery (a managed data warehouse).
Vendor Lock-In Considerations
These multi-regional database services provide a huge benefit by taking away the operational burden of making data highly available, but they are based on Google’s own proprietary technology and may therefore result in vendor lock-in. With some managed services, such as Google Cloud SQL, customers can use open-source engines, such as PostgreSQL and MySQL, which make it easier to migrate away from Google Cloud and run the same database engine elsewhere. However, these services are usually bound to a single region, so you will need to implement any multi-regional capabilities yourself. It is a tradeoff that requires careful planning.
When looking purely at object, block, and file storage capabilities, the high availability infrastructure options vary. Google Cloud Storage supports multi-region storage, while Cloud Persistent Disk and Cloud Filestore offer replication only within a single region. That multi-zone redundancy does not help if the entire region has an outage. With data storage in Google Cloud priced at a few cents per gigabyte, it is definitely worth replicating data to other Google Cloud regions, even if this means developing that logic yourself for Google's block and file storage services, which don’t have built-in capabilities to replicate between regions.
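As a rough illustration of what developing that replication logic yourself could involve, the sketch below plans a one-way sync between two mounted file shares by comparing checksum manifests. It is a minimal sketch under stated assumptions: both shares are reachable as local directories, the function names are hypothetical, and the actual copy step, scheduling, and error handling are left out. It does not use any Google Cloud API.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch,
            # but large files would need chunked hashing.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def plan_replication(
    source: Dict[str, str], replica: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (files to copy to the replica, files to delete from it)."""
    # Copy anything that is missing from the replica or whose content differs.
    to_copy = sorted(
        name for name, digest in source.items() if replica.get(name) != digest
    )
    # Delete anything the replica still holds that the source no longer has.
    to_delete = sorted(name for name in replica if name not in source)
    return to_copy, to_delete
```

In use, a scheduled job in the secondary region would build a manifest for each share, call plan_replication, and then copy or delete the listed files. How often that job runs determines how much recent data could be lost if the primary region fails.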
More Options for High Availability Infrastructure on Google Cloud
As you can see, there is a lot more than meets the eye when it comes to designing and running highly available workloads in Google Cloud. Understanding the native building blocks is important. It is equally important to understand that many of the services with built-in multi-region high availability come at the expense of possible Google lock-in.