DORA metrics are the most widely cited framework for measuring software delivery performance. Four numbers — deployment frequency, lead time for changes, change failure rate, and mean time to recovery — have become the standard language for engineering productivity conversations at every level of an organisation. This guide explains exactly what each metric measures, where the data comes from, what the research-backed benchmarks mean, and how teams commonly misuse them.

Where DORA Metrics Come From

DORA stands for DevOps Research and Assessment, a research programme that began in 2014 under Dr. Nicole Forsgren, Jez Humble, and Gene Kim. Their multi-year study surveyed tens of thousands of technology professionals and identified the specific practices and outcomes that separated high-performing engineering organisations from low-performing ones. The findings were published in the 2018 book Accelerate and continue to be updated annually in the State of DevOps Report, now maintained by Google Cloud.

The core finding: four metrics reliably measure software delivery performance, and that performance in turn predicts organisational outcomes including commercial performance, employee burnout, and team retention. They're not arbitrary KPIs — they emerged from statistical analysis of what high performers actually do differently.

The Four Metrics

1. Deployment Frequency

What it measures: How often your team deploys code to production.

This is a direct measure of throughput. A team that deploys once a week is moving slower than one that deploys multiple times per day — not because they're lazier, but because their batch sizes are larger, their processes have more friction, and their feedback loops are longer.

How to collect it: Count production deployments over a period. Most CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, ArgoCD) log every deployment with a timestamp. You can query deployment events directly or use a dedicated DORA dashboard in Google Cloud, LinearB, or Sleuth.
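
If you want to sanity-check the numbers yourself, here is a minimal sketch of the counting step, assuming a hypothetical deployments.csv export with one row per production deployment and an ISO-8601 deployed_at column (the file name and columns are placeholders; your platform's export will differ):

    # Minimal sketch: weekly deployment counts from a deployment log export.
    # Assumes a hypothetical deployments.csv with an ISO-8601 "deployed_at" column.
    import csv
    from collections import Counter
    from datetime import datetime

    weekly = Counter()
    with open("deployments.csv", newline="") as f:
        for row in csv.DictReader(f):
            deployed = datetime.fromisoformat(row["deployed_at"])
            iso_year, iso_week, _ = deployed.isocalendar()
            weekly[(iso_year, iso_week)] += 1

    for (year, week), count in sorted(weekly.items()):
        print(f"{year}-W{week:02d}: {count} deployments")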

DORA benchmarks (2023 report):

  • Elite: Multiple times per day
  • High: Once per day to once per week
  • Medium: Once per week to once per month
  • Low: Less than once per month

What it doesn't measure: Quality. A team deploying broken code ten times a day is not elite. Deployment frequency is only meaningful alongside change failure rate.

2. Lead Time for Changes

What it measures: The time from code commit to that code running in production.

Lead time captures the end-to-end speed of your delivery pipeline. It includes time in code review, time waiting in a CI queue, time in QA or staging, time waiting for a deployment window, and time for the actual deployment. Every step that adds clock time without adding value shows up here.

How to collect it: Take the timestamp of the first commit in a change set and subtract it from the timestamp of the deployment that included it. The tricky part is linking commits to deployments accurately. Tools like Sleuth, LinearB, and Faros AI track this automatically by connecting your Git provider to your deployment platform. If you're doing it manually, a deployment log with commit SHAs is the starting point.
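
If you're scripting it yourself, a minimal sketch of the calculation, assuming you already know which commit SHAs shipped in a given deployment and have the repository available locally (the deployment record in the usage comment is hypothetical):

    # Minimal sketch: lead time = deployment timestamp minus the oldest commit it shipped.
    # Assumes the commit SHAs included in the deployment are already known and the
    # repository is cloned locally so commit timestamps can be read from git.
    import subprocess
    from datetime import datetime, timedelta

    def commit_time(sha: str) -> datetime:
        # %cI prints the committer date in strict ISO-8601 format.
        result = subprocess.run(
            ["git", "show", "-s", "--format=%cI", sha],
            capture_output=True, text=True, check=True,
        )
        return datetime.fromisoformat(result.stdout.strip())

    def lead_time(deployed_at: datetime, shas: list[str]) -> timedelta:
        oldest = min(commit_time(sha) for sha in shas)
        return deployed_at - oldest

    # Hypothetical usage: a deployment at 14:00 UTC that shipped one commit.
    # print(lead_time(datetime.fromisoformat("2024-05-01T14:00:00+00:00"), ["abc1234"]))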

DORA benchmarks:

  • Elite: Less than one hour
  • High: One day to one week
  • Medium: One week to one month
  • Low: One to six months

Common misreading: Teams often measure lead time from ticket creation rather than first commit. This blurs delivery performance with product planning cycles — two very different things. DORA lead time starts at code, not idea.

3. Change Failure Rate

What it measures: The percentage of deployments that cause a failure in production requiring a hotfix, rollback, or patch.

This is the quality counterbalance to deployment frequency and lead time. You can deploy fast and frequently while still delivering high-quality changes — change failure rate tells you whether you're actually doing that. A team with high deployment frequency and high change failure rate has a testing and review problem, not a speed advantage.

How to collect it: Divide the number of deployments that resulted in a production incident requiring remediation by the total number of deployments. The definition of "failure" needs to be consistent — most teams use: a deployment that triggered a P1/P2 incident, required a rollback, or required an emergency hotfix within 24–48 hours.
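
The division itself is trivial; the real work is applying the failure definition consistently. A minimal sketch, assuming each deployment record already carries a failed flag set according to whatever definition your team agreed on:

    # Minimal sketch: change failure rate = failed deployments / total deployments.
    # Assumes each deployment record carries a "failed" flag, set when the deployment
    # triggered an incident, rollback, or emergency hotfix within the agreed window.
    def change_failure_rate(deployments: list[dict]) -> float:
        if not deployments:
            return 0.0
        failed = sum(1 for d in deployments if d["failed"])
        return failed / len(deployments)

    # Hypothetical month of deployments: one of four needed a rollback.
    deploys = [
        {"sha": "abc1234", "failed": False},
        {"sha": "def5678", "failed": True},
        {"sha": "0a1b2c3", "failed": False},
        {"sha": "4d5e6f7", "failed": False},
    ]
    print(f"Change failure rate: {change_failure_rate(deploys):.0%}")  # 25%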

DORA benchmarks:

  • Elite: 0–5%
  • High: 5–10%
  • Medium: 10–15%
  • Low: 46–60%

What it doesn't capture: Latent failures — bugs that exist in production but haven't triggered an incident yet. Change failure rate measures detected failures, not all failures.

4. Mean Time to Recovery (MTTR)

What it measures: How long it takes to restore service after a production failure.

MTTR reflects your entire incident response capability: observability tooling, on-call processes, runbooks, deployment rollback mechanisms, and team coordination. A team with good observability and practiced incident response will consistently outperform one that relies on ad-hoc debugging. The metric makes that difference visible.

How to collect it: Time from incident detection (alert fired or incident declared) to full service restoration. This data lives in your incident management tool (PagerDuty, Opsgenie, incident.io). The start time is when the incident was opened; the end time is when it was marked resolved.
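
A minimal sketch of the calculation, assuming a hypothetical incidents.csv export with ISO-8601 opened_at and resolved_at columns (most incident management tools can export something equivalent):

    # Minimal sketch: mean time to recovery from an incident export.
    # Assumes a hypothetical incidents.csv with ISO-8601 "opened_at" and
    # "resolved_at" columns, one row per production incident.
    import csv
    from datetime import datetime

    durations = []
    with open("incidents.csv", newline="") as f:
        for row in csv.DictReader(f):
            opened = datetime.fromisoformat(row["opened_at"])
            resolved = datetime.fromisoformat(row["resolved_at"])
            durations.append((resolved - opened).total_seconds())

    if durations:
        mttr_minutes = sum(durations) / len(durations) / 60
        print(f"MTTR over {len(durations)} incidents: {mttr_minutes:.0f} minutes")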

DORA benchmarks:

  • Elite: Less than one hour
  • High: Less than one day
  • Medium: One day to one week
  • Low: More than six months

Naming note: The 2023 DORA report renamed this to "Failed Deployment Recovery Time" to be more precise — it specifically measures recovery from deployment-caused failures rather than all incidents. Many teams still use the broader MTTR definition, which is fine as long as you're consistent.

How the Four Metrics Relate to Each Other

The four metrics form two pairs that balance each other.

Speed pair: Deployment frequency + lead time for changes. Together they measure how fast your team moves changes from code to production.

Stability pair: Change failure rate + MTTR. Together they measure how reliably your team delivers working software and recovers when things go wrong.

The research finding that surprised many practitioners: speed and stability are not in tension. High-performing teams deploy more frequently and have lower failure rates and faster recovery. The assumption that moving faster means breaking more things is a characteristic of low-maturity delivery processes, not an inherent trade-off.

This is why DORA metrics must always be viewed together. Optimising deployment frequency in isolation by skipping tests produces a team that looks fast on one metric and terrible on the other three.

How to Start Collecting Them

You don't need a specialised platform to get started. The minimal viable approach:

Deployment frequency: Add a step to your CI/CD pipeline that writes a row to a spreadsheet or database on every successful production deployment — date, service name, commit SHA. Run for 30 days and count.

Lead time: For each deployment, record the timestamp of the oldest commit it includes. Subtract from deployment time. Average over 30 days. Git log timestamps and your deployment log are all you need.

Change failure rate: After each deployment, record whether it resulted in a production incident requiring remediation within 48 hours. Divide incidents by total deployments monthly.

MTTR: Your incident management tool almost certainly logs open and close timestamps already. Export and calculate the average duration for production incidents over the past 90 days.
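
The first three steps all read from the same deployment log, so the piece worth automating first is the recording step. A minimal sketch of a script the pipeline could run after each successful production deployment, assuming the service name and commit SHA are exposed as environment variables (the variable names here are hypothetical):

    # Minimal sketch: append one row per successful production deployment.
    # Intended as the final step of a CI/CD pipeline. SERVICE_NAME and GIT_SHA are
    # hypothetical variable names; use whatever your pipeline actually exposes.
    import csv
    import os
    from datetime import datetime, timezone

    LOG_FILE = "deployments.csv"

    row = {
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "service": os.environ.get("SERVICE_NAME", "unknown"),
        "sha": os.environ.get("GIT_SHA", "unknown"),
    }

    new_file = not os.path.exists(LOG_FILE) or os.path.getsize(LOG_FILE) == 0
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["deployed_at", "service", "sha"])
        if new_file:
            writer.writeheader()  # write the header only for a new or empty file
        writer.writerow(row)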

Once you have baseline numbers, decide whether dedicated tooling (Sleuth, LinearB, Faros, or the built-in DORA dashboards in GitLab and Google Cloud) is worth the investment.

Common Ways Teams Game These Metrics

Understanding the failure modes helps you design a measurement system that stays honest.

Splitting deployments artificially to inflate frequency. Deploying the same change as five micro-deployments instead of one does not make your team faster — it just makes the number bigger. Frequency should be measured at the meaningful change level, not at the pipeline execution level.

Narrowing the definition of "failure" to lower change failure rate. If incidents have to meet a very high severity bar to count, teams can run with persistent low-level failures that never get classified. Define failure consistently upfront, not after the numbers come back looking bad.

Closing incidents prematurely to reduce MTTR. Marking an incident resolved before full service restoration produces an artificially low number with no useful signal.

Measuring from commit merge instead of first commit for lead time. This hides time spent in code review, which is often where the most significant delays accumulate.

The metrics are most useful when the team collecting them is also the team using them. When they become targets for management reporting, the incentive to game them increases significantly.

What the Benchmarks Don't Tell You

The DORA performance bands (elite / high / medium / low) are useful for rough orientation but should not be treated as absolute targets.

A regulated financial services team deploying to core banking systems twice a month may be operating appropriately for their risk environment. A startup deploying ten times a day to a non-critical internal tool is not demonstrating more organisational capability than the bank.

The more useful comparison is your own team's trajectory over time. Are your lead times trending down? Is your change failure rate stable or improving as deployment frequency increases? Directional improvement within your context is more meaningful than hitting a benchmark band.

The Fifth Metric: Reliability

The 2021 State of DevOps Report added a fifth metric — reliability — defined as whether teams meet their availability and performance targets. Unlike the original four, reliability is an outcome metric rather than a process metric, and harder to operationalise consistently across different types of systems.

Many teams track it as SLO compliance: what percentage of the time did the system meet its defined availability and latency targets? If your SLO is 99.9% availability and you achieved 99.7%, that's a concrete reliability data point.
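
As a worked example of that arithmetic, the sketch below converts an availability target into an error budget over a 30-day window, using the 99.9% target and 99.7% result above:

    # Minimal sketch: SLO attainment and error-budget arithmetic for a 30-day window.
    # Mirrors the example above: a 99.9% availability target, 99.7% achieved.
    WINDOW_MINUTES = 30 * 24 * 60            # 43,200 minutes in a 30-day window
    slo_target = 0.999
    measured_availability = 0.997

    allowed_downtime = WINDOW_MINUTES * (1 - slo_target)             # 43.2 minutes
    actual_downtime = WINDOW_MINUTES * (1 - measured_availability)   # ~129.6 minutes

    print(f"Error budget: {allowed_downtime:.1f} min, downtime used: {actual_downtime:.1f} min")
    print(f"SLO met: {measured_availability >= slo_target}")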

The fifth metric hasn't achieved the same adoption as the original four, but it closes an important gap: a team can have excellent DORA scores on the delivery metrics while still running a system that doesn't meet user expectations.

Key Takeaways

DORA metrics measure four things: how often you deploy (frequency), how fast changes move from code to production (lead time), how often deployments cause failures (change failure rate), and how quickly you recover when they do (MTTR).

They're research-backed, not invented by a vendor. Elite performers score well on all four simultaneously — the data consistently shows that speed and stability reinforce each other rather than trade off.

Start by collecting baselines with the data you already have in your CI/CD tool and incident management system. Use them to track your own improvement over time. Be precise about definitions — especially what counts as a "deployment" and what counts as a "failure" — before you start measuring, not after the numbers come back looking bad.