As more organizations adopt cloud-based computing and the demand for digital services increases, site reliability engineering (SRE) practices have become essential. These practices help organizations meet service level agreements (SLAs) for availability, performance, user experience, and business KPIs.
But what exactly is SRE, and what do site reliability engineers do?
What is site reliability engineering?
Site reliability engineering (SRE) is the practice of applying software engineering principles to operations and infrastructure processes to help organizations create highly reliable and scalable software systems. As a discipline, SRE focuses on improving software system reliability across key categories including availability, performance, latency, efficiency, capacity, and incident response. Those who perform the tasks involved are known as site reliability engineers.
The term “site reliability engineering” was coined in 2003 by Google VP of Engineering Ben Sloss, who famously noted on his LinkedIn profile, “If Google ever stops working, it’s my fault.” According to Google, “SRE is what you get when you treat operations as a software problem.”
Although every organization and software system is unique, it’s important to understand the fundamentals of SRE — as well as the skills and mindset of its engineers — as you think about optimizing the reliability and overall quality of your software.
The benefits of Site Reliability Engineering
Site Reliability Engineering highlights reliability, scalability, and efficiency. The benefits of SRE include:
- Increased reliability and uptime: SRE focuses on preventing and mitigating incidents to ensure that systems and applications are always available and performant.
- Improved scalability: By optimizing resource usage and minimizing waste, SRE can help organizations scale their infrastructure and applications more efficiently.
- Improved user experience: SRE can ensure that applications and services are always available and responsive, which can directly impact customer satisfaction, brand reputation, and revenue.
- Continuous improvement: SRE highlights the use of data and metrics to identify areas for improvement and drive ongoing optimization and innovation.
- Increased security: SRE can help to guarantee that systems and applications are secure and compliant with industry standards and regulations.
- Predictable performance: By monitoring and analyzing usage patterns, SRE can help to predict and prevent performance issues before they occur, ensuring that systems and applications perform predictably and consistently.
- Cost savings: SRE can reduce costs by automating routine tasks and optimizing resource usage, reducing the need for manual intervention and saving time and money.
- Collaboration between development and operations teams: SRE emphasizes cross-functional teams and shared ownership of reliability and performance, promoting a culture of collaboration and accountability.
The benefits of Site Reliability Engineering
Five things to know about site reliability engineering
1. SRE focuses on automation
A major goal of SRE is to reduce duplication or redundancy of effort as much as possible. SRE teams focus on automating manual tasks, such as provisioning access and infrastructure, setting up accounts, and building self-service tools. That enables development teams to focus on delivering features, and operations teams can focus on managing infrastructure.
Automating processes is even more critical as organizations speed up delivery of new features into production. On one hand, speed comes from DevOps teams who leverage automation to increase continuous integration and continuous delivery (CI/CD). On the other hand, the move to microservice architectures and the adoption of cloud-native technology, containers, Kubernetes, and serverless architectures offer even more ways to delivery smaller changes faster. These methods increase efficiency and speed, but also demand consistent, repeatable processes that reduce risk and provide feedback loops for measuring operations, so teams can identify areas for improvement.
2. SRE bridges the gap between Dev and Ops
Everything the organization does in the value stream process should answer the question “how do we ensure this runs in production reliably?” SREs drive resiliency-based engineering. They can become mentors and ensure that resiliency is a top priority for both developers and operations.
Applying the DevOps mindset and skills to software reliability helps reduce silos between development and operations teams by sharing responsibility for detecting reliability and performance issues early in the development life cycle. Collaboration between developers, operations, and product owners enables site reliability engineers to define and meet uptime and availability targets.
3. SRE drives a “shift-left” mindset
SRE is a constantly evolving discipline, presenting opportunities to build methods, policies, and processes into the delivery pipeline that allow applications to “auto-remediate” or users to solve their own problems. A shift-left mindset means SREs can embed reliability principles from Dev to Ops, baking reliability and resiliency into each process, app, and code change to improve the quality of software that goes to production.
Here are some ways SRE helps to drive a “shift-left” mindset:
- Develop quality gates based on production-level service level objectives (SLOs) to detect issues earlier in the development cycle.
- Automate build testing and validation using service-level indicators (SLIs) and SLOs
- Influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development.
The goal is to take early, proactive steps to ensure quality and reliability are built-in from the beginning. SRE can influence processes more broadly and expand to coordinating testing across the enterprise in support of CI/CD practices.
To learn more about how Dynatrace enables SRE with “shift-left SLIs,” join us for the on-demand performance clinic Automated SRE-driven performance engineering with Dynatrace.
4. SRE builds services and tools to help operations and support
Traditionally, a major goal of operations teams is to improve uptime. This single-dimensional approach looks for the coveted “five nines” of uptime, or 99.999%, which translates to just over five minutes of downtime per year.
But the higher frequency of change in distributed cloud-native environments requires a multi-dimensional approach.
The goal of SRE is to enable higher change rates while maintaining resiliency and that coveted 99.999% uptime. In multicloud environments, resiliency is measured across key metrics such as performance, user experience, responsiveness, conversion rates, and so on. SRE teams need to build and implement services that improve operations and facilitate the release process across all these areas. This can be anything from adjusting monitoring and alerting to making code changes in production. Site reliability engineers often build custom tooling from scratch to meet specific needs in the software delivery or incident management workflow.
Adopting an SRE approach also requires standardizing the technologies and tools teams use. Standardization makes it easier to manage operations and reduces the burden of managing incompatible technologies, which gives teams more time to collaborate and innovate.
5. SRE requires a cultural change
Because SRE is a practice, it requires changing how teams across multiple disciplines communicate, solve problems, and implement solutions. To adopt a successful SRE culture, organizations must adopt new approaches to managing risk. It also means they must adapt governance processes, invest in hiring, and educate a collaborative workforce that’s versed in engineering and operations and learns and adapts quickly.
Organizations can then integrate these skilled engineers at key points in the DevOps lifecycle. In development and testing teams, SRE specialists develop automation that helps developers test early and often without impeding agile delivery schedules. At a system level, SRE specialists develop tooling that coordinates releases and launches, evaluates system architecture readiness, and meets system-wide SLOs. At a governance level, SRE specialists help to define and oversee enterprise architecture, establish best practices, and select tools and resources that support company-wide site reliability.
What does a site reliability engineer do?
To get an expert’s view on what site reliability engineers do, I asked our DevOps Activist, Andi Grabner.
“Site reliability engineers use good practices around software engineering to provide resilient infrastructure and resilient services to their organizations and the people that actually deliver new applications,” he explains. He also notes that SREs often come from traditional operations roles, such as systems engineers who keep systems up and running. “Site reliability engineers ensure systems stay reliable, resilient, and available,” he adds.
State of SRE Report
We asked 450 SREs across various industries to share their unfiltered perspective into how site reliability engineering (SRE) is evolving as a discipline. The report uncovers the challenges SREs must overcome, and what the future of SRE looks like.
Typical expectations for SREs
Typically, SREs are tasked with ensuring that the speed of delivery doesn’t result in security, service, or solution interruption. But as Grabner notes, “Expectations are a little different for every company. There’s no golden rule. Many are responsible for monitoring and observability and maintaining systems — and providing automation to spin up required environments.”
Grabner highlights the role of SREs in providing the frameworks and platforms for service and application deployment. “When things go wrong, SREs often take on the role of first-line defenders if there’s an alert,” he says. “In a great organization, they don’t do it alone — they constantly work with and within individual application teams to deal with apps that are under fire.”
Perhaps the most important role of SREs is architecting resiliency. “You can’t buy resiliency as-a-service,” Grabner observes. “You have to architect for it by building systems that are resilient by design.” This architectural approach recently helped Dynatrace to withstand an AWS outage in Germany. Automatic delivery, resiliency, and auto-remediation helped to ensure critical systems weren’t affected.
Roles and responsibilities of an SRE
The role of an SRE has become invaluable as technology continues to evolve and businesses become more dependent on digital infrastructure. Here are some of the standard responsibilities of an SRE:
- Monitoring and Alerting
One of the primary duties of an SRE is to monitor a company’s digital infrastructure. This involves setting up monitoring tools and systems to detect concerns before they become significant problems. SREs set up alert systems that notify the appropriate people when issues are detected. - Incident Response
The SRE responds quickly and effectively when issues are detected by identifying the root cause, developing and implementing a plan, and communicating with relevant stakeholders. - Automation and Tooling
SREs develop and maintain the tools and systems used to manage a company’s digital infrastructure. This includes developing automation scripts to streamline processes and reduce the risk of human error. SREs also identify areas where tooling can be improved and create new tools to meet the changing needs of the business. - Capacity Planning
SREs ensure that a company’s digital infrastructure can meet the needs of the business. This involves analyzing usage patterns to predict and guarantee the required capacity to meet future demand. - Collaboration
SREs work closely with other teams to ensure the company’s digital infrastructure is reliable, scalable, and secure.
What makes a great SRE?
Great SREs are risk-takers, tinkerers, and innovators. They figure out what it takes to scale a system from 100 users to 100,000 users to 1,000,000 users while maintaining uptime and resiliency. They’re systems thinkers who consider how decisions made in development affect production environments, and how the needs of production systems can influence design.
This requires constant testing, accepting failure, and adapting, automating repeatable processes. Successful SREs bring a resiliency and adaptation mindset to every situation.
Grabner highlights the need for SREs to learn from their mistakes. “Some companies run ‘chaos days’ to deal with worst-case scenarios to understand what could happen and how to deal with it,” he says.
Automation is another marker of SRE success. “People excel in this role when they try to automate all the tasks that can and must be automated,” says Grabner. “This frees them up to deliver true innovation.” He notes that while anyone can innovate under the right circumstances, teams are often held back by the “toil” of manual and repetitive tasks. “Your goal is to automate yourself out of your current role and into your next role.”
Finally, Grabner made it clear that SREs can’t operate in isolation. “You need to let people educate themselves with new technologies and practices,” he says. “Show the world. Don’t keep secrets — be open and share your own learnings, along with learning from others. There are a lot of great conferences out there — it’s worth getting inspired by what others do and inspiring others with what you do.”
DevOps vs. SRE
Where DevOps teams focus on streamlining change, SREs help ensure these changes don’t increase overall failure rates. In effect, they’re two sides of the same coin: DevOps automates speed, while SRE automates reliability. “It’s a balance between speed and safety,” Grabner says.
He sees DevOps processes as moving left to right along the development life cycle, using automation to speed up new capabilities that are typically measured by deployment frequency and lead time for changes. In comparison, SRE moves right to left using production-level requirements in development, with a focus on limiting failure rates and reducing the time required to restore service. “SRE is about making sure that even though there is a lot of change, these changes don’t break things.”
Grabner sees SRE and DevOps overlapping when it comes to SLOs. “SLOs are all about supporting business goals,” he said. “Companies may need systems at 99% reliability. They may want to increase their user base or improve the end-user experience.” Satisfying these goals is the role of DevOps. “But underneath these goals are technical goals that are specific to your objectives,” he says. “They contribute to business success with the right features at the right time and help you cope with change.” Delivering on these goals is the job of SRE staff. As a result, “SLOs are a great way to bring DevOps and SREs together.”
In another blog post you can learn more about the differences and similarities of SRE vs DevOps.
Solving for site reliability
Site reliability isn’t and never will be a “solved problem”. New services and applications combined with evolving enterprise demands mean there’s always work for SRE teams, and there’s always room for improvement.
As Grabner notes, when it comes to improving SRE impact, “The biggest thing is to be open and share your own learnings, along with learning from others. There are a lot of great conferences out there — it’s worth getting inspired by what others do and inspiring others with what you do.” He also highlights the need to learn from your mistakes. “Some companies run ‘chaos days’ to deal with worst-case scenarios to understand what could happen and how to deal with it.” Finally, Grabner made it clear that SREs can’t operate in isolation. “You need to let people educate themselves with new technologies and practices. Show the world. Don’t keep secrets — don’t see it as a silo.”
Looking for a solution to up-level your SRE practices? Dynatrace can help. Providing automatic and intelligent observability for even the most complex distributed cloud environments, the Dynatrace Software Intelligence Platform empowers SRE and DevOps teams to identify problems before they occur. Driven by continuous automation with AI at its core, Dynatrace delivers precise root-cause answers to site reliability issues at every step of the software development lifecycle. From early development in pre-production environments through delivery and operations in production environments, Dynatrace helps SRE teams improve reliability, availability, and latency, and mitigate the business impact of service outages and slowdowns.
Watch webinar
You can also join us for the on-demand webinar The State of SRE in 2022 on to learn more about SRE trends in 2022, how to become a better SRE and hear from three panelists about the state of SRE in their organizations.
Looking for answers?
Start a new discussion or ask for help in our Q&A forum.
Go to forum