With companies rapidly adopting SRE to help manage the growing complexity of cloud environments, we've identified the SRE best practices you need to know.
Organizations everywhere are adopting site reliability engineering (SRE) to cope with the growing complexity of hybrid and cloud-native environments. Indeed, more than 1,000 solutions are now incubating in the Cloud Native Computing Foundation, and modern applications comprise thousands of discrete microservices. Without SRE best practices, the observability landscape is too complex for any single organization to manage.
SRE is still in the early adoption stages. Like any evolving discipline, it is characterized by a lack of commonly accepted practices and tools. Organizations that have achieved SRE maturity have a better handle on the state of their infrastructure, the ability to tie reliability metrics more tightly to business objectives, and the means to ensure a consistent and responsive customer experience.
Dynatrace’s 2022 State of SRE Report surveyed 450 SREs across the globe. While 62% rated themselves as being well along the path toward SRE adoption, just one in five self-identified as SRE-mature. The following five key steps to success are essential for the journey to SRE maturity.
1. Make SRE accessible
More than half of all respondents cited two key SRE adoption barriers: the perceived difficulty of training existing IT professionals in SRE best practices, and the cost and difficulty of finding skilled professionals.
In a talent-constrained market, the best strategy could be to develop expertise from within the organization. Adopting tools with high levels of automation can help reduce the learning curve. Weaving DevOps and the related disciplines of DevSecOps and AIOps tightly into the development process can also accelerate adoption. These measures reduce the need to hire specialists while providing a unified view of processes and infrastructure that makes SRE more focused and effective.
2. Integrate toolchains and adopt an everything-as-code approach
Tools and practices tend to be fragmented in maturing markets. SRE is no exception, with practitioners and development teams each having their favorites. However, maintaining bespoke toolchains requires a significant investment of time and effort, and ultimately distracts SREs from more impactful tasks. Organizations should seek to standardize on a single set of tools that everyone will use. Consider selecting platform-based solutions, whether open source or from a commercial vendor, that support open ecosystems.
Virtualization has revolutionized system administration by making it possible for software to manage systems, storage, and networks. Because virtualization removes physical dependencies, automation can help perform SRE at scale. This can reduce labor costs and enhance reliability by enabling systems to self-heal. With self-service features and an everything-as-code architecture, labor requirements can decrease significantly and SRE best practices become easier to establish.
3. Automate as much as possible
An overwhelming 85% of SREs said that scaling their SRE practices requires automation and AI. Not surprisingly, more than 71% are increasing their use of automation across the lifecycle, 58% are applying automation to the continuous integration/continuous delivery (CI/CD) pipeline, and 46% are modernizing tool stacks.
Automation has the clear advantage of reducing manual labor. It also enables organizations to tie business-level objectives (BLOs) — such as user satisfaction and system responsiveness — to service-level objectives (SLOs), linking SRE best practices and bottom-line impact. Just over one in five survey respondents has gone beyond the automatic evaluation of SLOs to include BLOs. This number will likely increase as the SRE discipline matures.
4. Design, implement, and tune effective SLOs
While SLOs are the North Star of SRE, 99% of survey respondents stated they are running into challenges defining and creating them. The chief inhibitors reported were information silos, incompatible toolsets, and growing complexity.
Consolidating on a single observability platform can help organizations solve these issues by reducing tool profusion and enabling the organization to create a set of performance standards that teams can observe and manage in a unified way. SREs should agree upon SLO dashboards, error tolerances, remediation plans, and alerting tactics, and test them in advance. The most popular SLOs among organizations with high levels of SRE proficiency relate to availability, user satisfaction survey scores, the ratio of failed requests to total requests, and the crash rate across all supported devices.
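To make the failed-request ratio and error tolerance concrete, here is a minimal sketch of how such an SLO could be evaluated. It is illustrative only: the names `SloReport` and `evaluate_slo`, the 99.9% target, and the request counts are assumptions for the example, not figures from the report or the API of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Result of checking one request-based SLO (hypothetical structure)."""
    success_ratio: float    # fraction of requests that succeeded
    budget_consumed: float  # share of the error tolerance used; > 1.0 means overspent
    slo_met: bool


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare the observed failure ratio with the tolerance implied by the target.

    `target` is the desired success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    failure_ratio = failed_requests / total_requests
    allowed_failure_ratio = 1.0 - target  # the error tolerance

    return SloReport(
        success_ratio=1.0 - failure_ratio,
        budget_consumed=failure_ratio / allowed_failure_ratio,
        slo_met=failure_ratio <= allowed_failure_ratio,
    )


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests in the window, 500 of them failed.
    report = evaluate_slo(total_requests=1_000_000, failed_requests=500, target=0.999)
    print(f"success ratio:   {report.success_ratio:.4%}")
    print(f"budget consumed: {report.budget_consumed:.0%}")
    print(f"SLO met:         {report.slo_met}")
```

In this sketch, the share of the error tolerance already consumed is what a team might show on an SLO dashboard or alert on, since it indicates how close the service is to breaching the SLO before it actually does.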
5. Apply AIOps for analysis and automation
AIOps — an AI-driven approach to managing IT operations such as monitoring, automation, and service desk — has become increasingly popular in the SRE lifecycle. Platforms are now offering the ability to gather data from multiple sources much faster than human operators, automatically remediate common problems, and reduce false or unnecessary alerts.
More than two-thirds of the report’s respondents said they’re increasing their use of AIOps across every part of the lifecycle. They’re keeping the following key drivers in mind:
- Automating additional processes critical to ensuring service levels
- Improving the ability of teams to prioritize problems
- Identifying security vulnerabilities more quickly
- Enabling predictive monitoring to identify problems before they cause a disruption
SRE best practices may very well be in their infancy. But by taking these early steps now, your organization can be ahead of the competition in gaining complete insight into your entire application portfolio.
Eager to learn more about how SRE is evolving? Read Dynatrace’s 2022 State of SRE Report.