Site reliability engineering (SRE) using DevOps requires high levels of automation to accommodate rapid testing and release schedules.
When it comes to site reliability engineering (SRE) initiatives adopting DevOps practices, developers and operations teams frequently find themselves at odds with one another. Developers want to write high-quality code and deploy it quickly. Operations teams want to make sure the system doesn’t break.
Observability seeks to find a happy medium between the two. Observability is a set of practices and technologies that helps IT teams understand what’s happening across complex environments so that teams can detect and resolve issues quickly, without disruption to users.
Andreas Grabner, DevOps Activist at Dynatrace, took to the virtual stage at the recent Dynatrace Perform conference to describe how the open source Keptn project automates the configuration of observability tools, dashboards, and alerting based on service-level objectives (SLOs).
Keptn: A reference implementation of Google’s SRE principles
Keptn is an open source control plane that enables cloud-native continuous delivery and automated operations. The goal is to accelerate innovation by eliminating the need for custom automation scripts and point-to-point tool integrations.
Dynatrace developed Keptn and released it as open source in 2020. Software engineer Taras Tsugrii of Meta (formerly Facebook) paid Keptn a high compliment, saying it feels like a reference implementation of Google’s SRE principles, which are the search giant’s techniques for ensuring the integrity of its sites and services. Keptn’s global user base is growing, as is adoption within companies including Citrix, Amasol, ERT, and Kitopi. One particular use case for Austrian banking software developer Raiffeisen involves using Keptn to automate the production release and readiness validation of all its products using scoring metrics.
One of Keptn’s key advantages is that it automates SLO-driven, multi-stage delivery of apps and services. Why is automated orchestration critical?
Too many SLOs create complexity for DevOps
DevOps and SRE engineers are under heavy pressure to deliver applications faster while adhering to standards like “the five nines” of availability, which results in many new service-level requirements. Developers also need to automate the release process to speed up deployment and improve reliability.
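To make “the five nines” concrete, the following minimal Python sketch converts an availability target into a downtime budget. The function name `allowed_downtime` and the choice of a 365-day year are assumptions made for illustration; they do not come from Keptn or Dynatrace.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return how much downtime an availability target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    year = timedelta(days=365)  # assumption: a non-leap year
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime(target, year)
        print(f"{target}% -> {budget.total_seconds() / 60:.1f} minutes per year")
```

At 99.999%, the budget works out to a little over five minutes of downtime per year, which is why each additional nine puts more pressure on release quality.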
SLOs are a great way to define what software should do. But as teams adopt them more widely, they sometimes don’t break SLOs down into individual goals that make sense for development and release management.
By shifting SLOs left from operations back to development, teams can detect problems more quickly. SREs can then use SLOs for release quality checks, such as big bang, blue/green, and canary testing.
Bringing SLOs back into development also gives developers production-stage feedback on critical metrics that may later impact business SLOs. With instant feedback enabling teams to release clean software, developers can react faster and speed up the delivery of high-quality releases.
With many pipelines to maintain, DevOps teams need automated orchestration. Standard automation techniques like scripting can be very powerful, but also quite complex.
Limits of scripting for DevOps and SRE
Classic automation has limits. Dieter Landenahuf, a senior ACE Engineer at Dynatrace, built Jenkins pipelines for new microservice architectures by creating templates, then copying, pasting, and slightly modifying them. This approach produced a classic “snowflake effect”: the code is duplicated across pipelines, so if something breaks, you need to fix it in multiple places. The process is manual, error-prone, and doesn’t scale.
Keptn reduces the complexity of pipelines while bringing automation into the delivery processes so that developers can focus on SRE work.
Keptn presents a declarative way to define orchestration so that developers can create automation sequences based on what should happen for delivery, remediation, and testing. Developers can define any sequence with chains of preferred tools. Keptn retains the individual tool configurations under version control.
Orchestration decisions are based on SLOs, meaning Keptn evaluates SLOs to decide whether a sequence should move forward. Keptn doesn’t replace the tools developers prefer: it connects them. Keptn is an open source project, and because the tools it connects all adhere to the same event standard, it is easier to eventually replace any one of them. Teams can plug in new tools for testing, deployment, or even monitoring without having to modify many existing automation tools.
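The idea of a declarative sequence gated by SLOs can be sketched in a few lines of Python. This is an illustration only, not Keptn’s schema or API: the stage names, task names, `run_sequence`, and `make_slo_gate` are all hypothetical, and the tool handlers are stand-ins that always succeed.

```python
from typing import Callable, Dict, List

# Hypothetical declarative description of a delivery sequence.
# Stage and task names are illustrative, not Keptn's actual schema.
DELIVERY_SEQUENCE = [
    {"stage": "dev", "tasks": ["deploy", "test", "evaluate"]},
    {"stage": "staging", "tasks": ["deploy", "test", "evaluate"]},
    {"stage": "production", "tasks": ["deploy"]},
]

# A task handler receives the stage name and reports success or failure.
TaskHandler = Callable[[str], bool]


def run_sequence(sequence: List[dict], handlers: Dict[str, TaskHandler]) -> List[str]:
    """Run every task of every stage in order and stop at the first failure."""
    log: List[str] = []
    for stage in sequence:
        for task in stage["tasks"]:
            succeeded = handlers[task](stage["stage"])
            log.append(f"{stage['stage']}:{task}:{'ok' if succeeded else 'failed'}")
            if not succeeded:
                return log  # a failed task, such as the SLO gate, blocks promotion
    return log


def make_slo_gate(error_rates: Dict[str, float], max_error_rate: float) -> TaskHandler:
    """Build an 'evaluate' handler that passes when a stage meets its SLO."""
    return lambda stage: error_rates[stage] <= max_error_rate


if __name__ == "__main__":
    handlers = {
        "deploy": lambda stage: True,  # stand-in for a real deployment tool
        "test": lambda stage: True,    # stand-in for a real testing tool
        "evaluate": make_slo_gate({"dev": 0.2, "staging": 3.5}, max_error_rate=1.0),
    }
    for entry in run_sequence(DELIVERY_SEQUENCE, handlers):
        print(entry)
```

Because the sequence only names tasks, swapping a tool means replacing one entry in `handlers`; the sequence definition itself stays unchanged.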
Keptn lets developers keep their favorite tools
Developers can use their existing tools to build artifacts. If they need to add more tasks, such as enforcing SLOs, automating testing, or connecting an observability platform, Keptn handles that for them. Keptn orchestrates all tools so that developers don’t have to. It simply reaches out to monitoring platforms like Dynatrace to extract the necessary SLOs. There is no need to write or maintain any customizations, or to figure out when a job is done and how to parse the results.
Keptn includes best practices that help developers choose which sequences to use. Ultimately, Keptn reduces automation code by 90% and makes every component SLO-driven. It’s based on open standards and is fully declarative. Everything is defined as code and version-controlled, following GitOps practices.
Dynatrace has an enterprise version of Keptn as part of its cloud software-as-a-service (SaaS). Here are some of the advantages of using the enterprise version:
- Manages itself.
- Authenticates using Dynatrace single sign-on.
- Upgrades automatically.
- Enables developers to plug in additional tools.
Developers can stay in production sequences. Every time the system creates a new artifact, Keptn triggers a new delivery sequence using tools based on events that the developer has subscribed to. Keptn keeps all the configuration information for every state along with Helm scripts and SLOs. Developers can also create different sequences, such as deployment using Monaco.
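The subscription model described above can be illustrated with a small publish/subscribe sketch in Python. The `EventBus` class, the event type `artifact.created`, and the payload fields are hypothetical names chosen for this example; Keptn’s real event types and payloads differ.

```python
from typing import Callable, Dict, List

# A handler receives the event payload and returns a short result description.
Handler = Callable[[dict], str]


class EventBus:
    """Minimal publish/subscribe hub; tools register for the event types they handle."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, event_type: str, handler: Handler) -> None:
        """Register a handler for one event type."""
        self._subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, payload: dict) -> List[str]:
        """Deliver the payload to every subscriber and collect their results."""
        return [handler(payload) for handler in self._subscribers.get(event_type, [])]


if __name__ == "__main__":
    bus = EventBus()
    # Two stand-in tools subscribe to the same hypothetical event type.
    bus.subscribe("artifact.created", lambda event: f"deploy {event['image']}")
    bus.subscribe("artifact.created", lambda event: f"notify chat about {event['image']}")
    for result in bus.publish("artifact.created", {"image": "cart:1.4.2"}):
        print(result)
```

The publisher never calls a tool directly, so adding or removing a tool is a matter of changing subscriptions rather than editing a pipeline script.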
SLO-based release validation made easy
Release validation centers around SLO-based evaluations. To check quality, developers can simply trigger a release that evaluates performance against SLOs. The dashboard provides a heat map and every service level indicator (SLI) metric, criteria, and scoring.
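As a rough illustration of how SLI values, criteria, and scoring can fit together, here is a minimal Python sketch. It assumes a simple scheme in which a passing SLI earns its full weight, a warning earns half, and a failure earns nothing, with the total compared against pass and warning thresholds. The class and function names, the SLI names, and the 90% and 75% thresholds are assumptions for this example, not a statement of Keptn’s exact algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Objective:
    sli: str          # name of the service-level indicator
    pass_max: float   # a measured value at or below this passes
    warn_max: float   # a measured value at or below this earns a warning
    weight: int = 1   # relative importance in the total score


def evaluate(objectives: List[Objective], measured: Dict[str, float]) -> Tuple[float, Dict[str, str]]:
    """Return the total score in percent and the status of each SLI."""
    total_weight = sum(objective.weight for objective in objectives)
    if total_weight <= 0:
        raise ValueError("objectives must have a positive total weight")
    earned = 0.0
    statuses: Dict[str, str] = {}
    for objective in objectives:
        value = measured[objective.sli]
        if value <= objective.pass_max:
            statuses[objective.sli] = "pass"
            earned += objective.weight       # full points
        elif value <= objective.warn_max:
            statuses[objective.sli] = "warning"
            earned += objective.weight / 2   # half points
        else:
            statuses[objective.sli] = "fail"  # no points
    return 100 * earned / total_weight, statuses


def verdict(score: float, pass_at: float = 90.0, warn_at: float = 75.0) -> str:
    """Map a total score to an overall result using assumed thresholds."""
    if score >= pass_at:
        return "pass"
    if score >= warn_at:
        return "warning"
    return "fail"


if __name__ == "__main__":
    objectives = [
        Objective("response_time_p95_ms", pass_max=500, warn_max=800, weight=2),
        Objective("error_rate_percent", pass_max=1, warn_max=2),
    ]
    score, statuses = evaluate(
        objectives, {"response_time_p95_ms": 600, "error_rate_percent": 0.5}
    )
    print(statuses, f"{score:.1f}%", verdict(score))
```

Weights let a team express that some SLIs matter more than others, so a warning on a heavily weighted metric can fail a release even when every other metric passes.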
There are two configuration options. The first is to configure as code using YAML and upload the configuration to a Git repository. The second, for Dynatrace users, is the Dashboard Link, which defines a dashboard with all the relevant SLOs and technical metrics. The dashboard becomes the basis for automated validation, with all SLIs and SLOs stored in Dynatrace and problems detected by Dynatrace’s Davis automated dependency analysis.
Developers can also configure Slack and send notifications from Keptn cloud automation every time the system orchestrates a process. They can have multiple tools that subscribe to events, as well as a webhook service that lets them send events to an external tool.
Charting the course with Keptn
Developers who have been building many automation scripts for SRE should take a look at Keptn. The current version is 0.12, and the roadmap includes high-availability improvements, role-based access controls, and improved GitOps and auto-remediation. A Continuous Delivery Foundation special interest group is now standardizing events and the team is preparing a proposal for incubation in the Cloud Native Computing Foundation. Dynatrace’s hosted cloud version of Keptn, delivered through Cloud Automation, has planned support for a custom execution plane and custom integrations.
Check out Andreas’ presentation here along with other sessions from Perform 2022.