Header background

SLOs done right: how DevOps teams can build better service-level objectives

In their Perform 2023 session, Andreas Grabner and Michael Cabrera offer advice for companies looking to build better service level objectives (SLOs).

Service-level objectives help IT teams define technical success and align with top-line business objectives.

But not all service-level objectives (SLOs) are created equal. So how do development and operations (DevOps) teams and site reliability engineers (SREs) distinguish among good, great, and suboptimal SLOs? In the 2023 Perform session “SLOs done right: A practitioners guide,” Michael Cabrera, SRE lead at Vivint, and Andreas Grabner, DevSecOps activist at Dynatrace, break down the state of SLOs and discuss how teams can adopt successful SLOs, avoid less-than-ideal objectives, and ultimately build better SLOs.

The state of service-level objectives

Download the State of SRE Report 2022

While SLOs play a critical role in helping DevOps and SRE teams align technical objectives with business goals, they’re not always easy to define. In many cases, issues are tied to operational overload. Enterprises now have access to myriad metrics they can track and measure, but an abundance of choice doesn’t equal actionable insight.

Indeed, 54% of SREs say they handle too many metrics, making it increasingly difficult to find the most relevant ones for a particular service, according to the Dynatrace State of SRE Report. 22% of SREs said they don’t know what makes a good SLO, while 18% said they don’t know which metrics to track. The same number said they’re not sure how or where to get started.

The result? From issues with an overwhelming set of metrics to problems with implementation, 99% of SREs said they encounter challenges when defining and creating SLOs.

The four components of a good SLO

The power of the SLO ensures global service levels

To help SREs and DevOps teams create and deploy effective SLOs, Cabrera and Grabner examine the following four factors that make a good SLO:

1. Monitors signals

The first attribute of a good SLO is the ability to monitor the four “golden signals”: latency, traffic, error rates, and resource saturation. While not every SLO must report on all these signals simultaneously, good service-level objectives help teams uncover and understand how at least one of the four signals can affect current operations.

2. Includes the end user

While SLOs for back-end silos and processes help eliminate internal issues, back-end-based objectives tell only half the story. As a result, effective SLOs must account for end users. “You might have an internal dashboard where everything is green and it makes you feel good,” Grabner noted. “But on the outside, people are suffering. That means you can’t define SLOs purely on back-end metrics.”

3. Bridges the business gap

Good SLOs should also help bridge the gap between chief executive officers’ concerns and those of chief technology officers. By providing data relevant to both parties — such as user adoption metrics for CEOs and application crash data for CTOs — organizations can find common ground.

4. Prioritizes incident response

SLOs should target problems that represent potential business impact. This enables teams to prioritize incident response and ensure immediate issues are resolved before working to address more long-term concerns.

SLOs that teams should avoid

Don't just alert on metrics

Any SLO that accurately measures and reports data can benefit organizations. In practice, however, SLOs’ value varies significantly based on how teams design, deploy, and manage them.

Grabner and Cabrera offer the example of an iOS app experiencing crash issues after a team deploys a new version. Dynatrace OneAgent provided information about failure rates, latency, and throughput, along with iOS data for users, crashes, and error rates. Creating objectives tied to these metrics alone, however, didn’t benefit end users. Put simply, these metrics are the symptom rather than the root cause; knowing how many crashes have taken place and how often provides evidence that applications or other IT systems aren’t working as intended, but it doesn’t offer insight to help fix the problem at scale.

As a result, SLOs can’t simply alert on metrics. Instead, they must alert on both front- and back-end processes to help teams understand where problems occur and how they’re related, setting the stage for meaningful correction.

How to build a better SLO

We run SLO workshops with our customers

So, how do teams build a better SLO?

Building better SLOs starts by understanding the dual nature of these service-level objectives. To be effective, SLOs must deliver business and technical results. Cabrera and Grabner cited the example of a tandem approach in monitoring application performance. On the technical side, SLOs account for metrics such as availability, response time, and error rate. On the consumer side, SLOs focus on application adoption, overall application rating, and the total number of mobile crashes.

By combining these data sets, SRE and DevOps teams can make decisions that improve back-end performance and end-user experience. When it comes to creating a better SLO, “Make it visible and tangible,” Grabner advised.

It’s also worth considering implementing automation to help scale SLOs and encourage shift-left operations that allow teams to quickly and easily build their own SLO frameworks.

Ready to get SLOs done right? Check out the full session: “SLOs done right: a practitioner’s guide”

[cf]skyword_tracking_tag[/cf]