With cloud adoption and the massive scale of software operations, the concepts of DevOps and Site Reliability Engineering (SRE) have become hugely popular.
The terms DevOps and SRE are deeply intertwined and are often associated with the same tooling and processes. The reason is that, while DevOps represents a cultural and mindset transformation, SRE provides a concrete model for implementing that DevOps philosophy, offering a set of practices that closely align with the DevOps foundations.
What Is the SRE Model?
The SRE model was developed at Google, based on the company's experience of operating several production systems at large scale over the years. Google standardized a set of engineering practices that balance the velocity of feature development against operational reliability risks.
SRE practices encourage teams to share ownership and implement changes gradually to reduce the overall cost of failure. Combined with an organizational culture that supports this SRE mindset, teams start to accept operational failures as normal and learn from their own mistakes and incidents in a blameless manner.
On a very practical level, one of the key practices of an SRE team is to leverage tooling and automation to measure and improve different aspects of its day-to-day operations. Any data, be it key business indicators, system health metrics, or the time required to perform the main operational activities, helps the team implement concrete improvements that can be quantified and evaluated for their business and technical impact.
What Is the Benefit of SRE?
The SRE model was created by Google with the goal of making it easier for developers to focus on the feature velocity and innovation while enabling system operators to focus on the areas of consistency and reliability. The model can be applied to any organization and its popularity has been growing over recent years.
The companies that have adopted the SRE model share interesting stories with many similarities about the SRE benefits and how they were able to translate those practices into a concrete positive business impact.
Raising the operational excellence bar in engineering teams
From an engineering team perspective, SRE benefits are tangible from the very start. Traditionally, engineering teams are primarily concerned with application development and the ability to ship new features as fast as they can. A modern, experienced team knows that this is only part of the equation. Investing time in testing strategies, continuous integration and deployment workflows, and cloud automation, among other practices, contributes to a healthy software system.
The SRE mindset helps teams raise the bar of operational excellence by providing software engineering practices applied to their IT operations. These practices allow them to improve across several areas such as availability, latency, performance, and capacity.
As a discipline, SRE focuses on minimizing toil and on making software solutions gradually easier to operate and maintain, and its practices cover different aspects of the entire software lifecycle.
A team that successfully adopts SRE practices folds its operational workload into its day-to-day development tasks, embracing the engineering complexities that come with scale and new features rather than avoiding changes to its software solution.
Unified engineering vision and cross-team collaboration
When applied to multiple software engineering teams, the SRE mindset brings a unified engineering vision to the organization that promotes collaboration, knowledge sharing, and a common language across different teams.
Unlike adopting a new software library or deployment tool, successfully adopting SRE is not only up to the software engineering team. Business and other non-technical stakeholders have a tremendous influence in enabling and fostering a culture where the SRE mindset can thrive.
It is important to create a culture where engineering teams have psychological safety to fail, and learn from those failures, combined with a discussion among all stakeholders on what reliability really means to the business. Only a dialog between business and technical stakeholders is able to create the necessary alignment to define the different service levels—a key SRE concept—and understand their associated effort and business impact.
Key SRE Service Level Concepts
The key SRE service level concepts are:
- Service Level Indicators (SLIs) are one or more quantifiable reliability measures of the software solution from the perspective of your customers. In a web solution, good examples would be the HTTP status codes returned (2xx, 5xx, etc.) and the overall end-to-end latency.
- Service Level Objectives (SLOs) are one or more targets for a specific SLI over a fixed period of time. In a web solution, that could be aiming for a maximum of 1% HTTP 5xx (server error) responses per month, or for a latency under 200 ms on every HTTP request each day.
- Reliability is a value that can be obtained by simply dividing the number of successful actions (how many times it worked well) by the total number of actions.
- Error budget refers to the amount of unreliability that the stakeholders are willing to tolerate. In essence, it can be obtained by subtracting the target reliability value from 100%. For example, a 99% reliability target leaves a 1% error budget.
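The arithmetic behind these definitions can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `SloReport` and `evaluate_slo` are hypothetical, and it assumes the SLI is a simple count of successful versus total requests and that the error budget is measured against the SLO target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Hypothetical container for the values defined in the list above."""
    reliability: float      # observed fraction of successful actions
    error_budget: float     # fraction of actions allowed to fail under the SLO
    budget_consumed: float  # share of the error budget already spent (1.0 = exhausted)
    slo_met: bool           # True when observed reliability reaches the target


def evaluate_slo(successful: int, total: int, slo_target: float) -> SloReport:
    """Compare observed reliability with an SLO target expressed as a fraction.

    Assumption: slo_target is strictly between 0 and 1 (e.g. 0.99 for 99%),
    so the error budget is never zero.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of actions")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # Reliability: successful actions divided by total actions.
    reliability = successful / total
    # Error budget: the unreliability the stakeholders agreed to tolerate.
    error_budget = 1 - slo_target
    # How much of that budget the observed failures have used up.
    budget_consumed = (1 - reliability) / error_budget

    return SloReport(
        reliability=reliability,
        error_budget=error_budget,
        budget_consumed=budget_consumed,
        slo_met=reliability >= slo_target,
    )


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests in a month, 5,000 of them HTTP 5xx.
    report = evaluate_slo(successful=995_000, total=1_000_000, slo_target=0.99)
    print(f"reliability:     {report.reliability:.3%}")
    print(f"error budget:    {report.error_budget:.3%}")
    print(f"budget consumed: {report.budget_consumed:.0%}")
    print(f"SLO met:         {report.slo_met}")
```

Tracking the share of the budget consumed, and not only whether the SLO was met, gives business and engineering stakeholders a shared number to discuss when deciding whether to keep shipping features or to prioritize reliability work.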
While simple to understand, the SRE service level concepts and the associated reliability and error budget values are incredibly useful in establishing that common language across the organization, and they create a shared understanding between business, developers, and SRE engineers when defining expectations and responsibilities. Equally important, setting a target of 100% reliability is unrealistic and counterproductive, since it prevents the team from innovating and learning from mistakes and slows the release of new features.
Operations as a Value Creation Center
The SRE model created a significant change in how business leaders perceive operational roles and practices in software engineering. Traditionally, business stakeholders considered only the development of new features to be valuable; IT operations and other activities were treated as an inconvenient expense.
In today's modern software development, enlightened organizational leaders know that it's not enough to value the development and fine tuning of a great engine if it’s to be placed in an old rusty chassis. The whole software solution, throughout the different phases of its lifecycle, needs to be considered and valued accordingly.
With an SRE mindset and the associated cultural shift and benefits, that value became clearer. The ability to scale a software solution to meet unexpected traffic demands on special occasions (e.g., Black Friday) has less to do with the application business logic than with the way system operations are handled and managed. Other examples, such as cloud cost optimization or having adequate backup and disaster recovery strategies in place, reduce costs and risks while increasing operational excellence, and they show that operations is (and should be treated as) an important value creation center.
Site reliability engineers continuously drive structural improvements and have a combination of deep technical and customer-centric skills that is hard to find. Thus, trying to reduce the cost of an SRE team by cutting personnel or outsourcing it completely is often a misstep.
However, it is important to understand that not every software solution requires a dedicated SRE team or dedicated SRE roles. Even at Google, SRE teams are optional. The development team can own and drive the SRE work if the scale of the solution or the maturity stage of the project does not require that level of support. Still, it’s important to remember that the SRE mindset and practices need to be in place either way, and that this culture should be fostered.
As a young and rapidly growing discipline, the SRE model can be found today in several organizations regardless of their industry and at different levels of digital transformation. While organizations that are more mature in their journey are able to reap more benefits, the SRE mindset and practices can (and should) be applied early on.
Several companies, such as LinkedIn, Twitter, Zalando, Facebook, Microsoft, Apple and Dropbox, have been paving the way by sharing their experiences and best practices. A great example comes from Uber with their tech talk about the history of SRE in the company. What is particularly interesting in this talk is seeing the evolution of the adoption of the SRE model across the company, from a few engineers to an entire organizational culture shift.
There are many other similar stories that can serve as examples for organizations that want to learn about the site reliability engineering mindset and start reaping SRE benefits. NetApp Cloud Volumes ONTAP, the data management platform for hybrid and multicloud deployments on AWS, Azure, or GCP, is one great way enterprises are gaining the tooling capabilities to make the SRE model a reality. Cloud Volumes ONTAP gives users a full DevOps toolkit, including:
- Automation and IaC tools and integration
- Instant, zero-capacity data cloning
- Persistent storage for containerized workloads
- Cost-cutting storage efficiencies
To find out more, read about all the SRE benefits that come with Cloud Volumes ONTAP.