What are the four main DevOps metrics? DORA’s Four Keys
Google’s DevOps Research and Assessment (DORA) team established known as “The Four Keys.” Through six years of research, the DORA team identified these four key metrics as those that indicate the performance of a DevOps team, ranking them from “low” to “elite,” where elite teams are twice as likely to meet or exceed their organizational performance goals. Let’s dive into how these DevOps KPIs can help your team perform better and deliver better code.
Deployment frequency
Deployment frequency measures how often a team successfully releases to production.
As more organizations adopt continuous integration/continuous delivery (CI/CD), teams can release more frequently, often multiple times per day. A high deployment frequency helps organizations deliver bug fixes, improvements, and new features more quickly. It also means developers can receive valuable real-world feedback more quickly, which enables them to prioritize fixes and new features that will have the most impact.
Deployment frequency measures both long-term and short-term efficiency. For example, by measuring deployment frequency daily or weekly, you can determine how efficiently your team is responding to process changes. Tracking deployment frequency over a longer period can indicate whether your deployment velocity is improving over time. It can also indicate any bottlenecks or service delays that need to be addressed.
Lead time for changes
Lead time for changes measures the amount of time it takes for committed code to get into production.
This metric is important for understanding how quickly your team responds to specific application-related issues. Shorter lead times are generally better, but a longer lead time doesn’t always indicate an issue. It could just indicate a complex project that naturally takes more time. Lead time for changes helps teams understand how effective their processes are.
To measure lead time for changes, you need to capture when the commit happened and when deployment happened. Two important ways to improve this metric are to implement quality assurance testing throughout multiple development environments and to automate testing and DevOps processes.
Change failure rate
Change failure rate measures the percentage of deployments that result in a failure in production that requires a bug fix or roll-back.
Change failure rate looks at how many deployments were attempted and how many of those deployments resulted in failures when released into production. This metric gauges the stability and efficiency of your DevOps processes. To calculate the change failure rate, you need the total count of deployments, and the ability to link them to incident reports resulting from bugs, labels on GitHub incidents, issue management systems, and so on.
A change failure rate above 40% can indicate poor testing procedures, which means teams will need to make more changes than necessary, eroding efficiency.
The goal behind measuring change failure rate is to automate more DevOps processes. Increased automation means released software that’s more consistent and reliable and more likely to be successful in production.
Mean time to restore service
Mean time to restore (MTTR) service measures how long it takes an organization to recover from a failure in production.
In a world where 99.999% availability is the standard, measuring MTTR is a crucial practice to ensure resiliency and stability. In the case of unplanned outages or service degradations, MTTR helps teams understand what response processes need improvement. To measure MTTR, you need to know when an incident occurred and when it was effectively resolved. For a clearer picture, it’s also helpful to know what deployment resolved the incident and to analyze user experience data to understand whether service has been restored effectively.
For most systems, an optimum MTTR could be less than one hour while others have an MTTR of less than one day. Anything that takes more than a day could indicate poor alerting or poor monitoring and can result in a larger number of affected systems.
To achieve quick MTTR metrics, deploy software in small increments to reduce risk and deploy automated monitoring solutions to preempt failure.