Tags: Observability, SLOs, SRE, Monitoring, Alert Design, DevOps

Observability-Driven Development — SLOs, Error Budgets, and Alert Design

A practical guide to building reliability into systems from day one: defining SLIs that measure what users experience, writing SLOs with meaningful targets, calculating error budgets, designing symptom-based alerts with burn rate thresholds, and implementing multi-window multi-burn-rate alerting with Prometheus and OpenTelemetry.

2026-04-24

Why Monitoring Servers Is Not Observability

Most teams know when a server is down. They get paged at 3 AM because CPU hit 95% or a health check failed. What they do not know — until a user emails support — is that checkout has been returning errors for 40% of requests for the past six hours while every individual service metric looked green.

This is the core gap that Observability-Driven Development (ODD) addresses. Instead of instrumenting infrastructure and hoping it correlates to user experience, ODD starts with a question: what does it mean for this system to work correctly from a user's perspective? That question produces Service Level Indicators (SLIs). The targets you set on those indicators become Service Level Objectives (SLOs). The gap between 100% and your SLO target is your error budget — the operational currency that governs how fast you can move and what incidents to prioritize.

| Concept | Definition | Who owns it |
|---|---|---|
| SLI | A quantitative measure of service behavior (e.g., request success rate) | Engineering |
| SLO | A target for an SLI over a rolling time window (e.g., 99.5% over 30 days) | Engineering + Product |
| SLA | A contractual commitment to customers, typically looser than the internal SLO | Business + Legal |
| Error Budget | Allowed failure headroom = 1 − SLO target. Consumed by incidents and risky deployments. | Engineering + Product |
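The budget arithmetic behind that last row is worth making concrete. A minimal sketch, using the 30-day rolling window this post assumes throughout:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed failure headroom, in minutes, over a rolling window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


# A 99.9% SLO over a 30-day window allows 43.2 minutes of full downtime
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

Everything else in this post — burn rates, alert thresholds, budget policies — is derived from this one number.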

Defining SLIs That Measure User Experience

A good SLI answers the question: "was this interaction with my service good for the user?" The canonical framework from the Google SRE Workbook models this as a ratio of good events to total events. A request is a "good event" if it was fast enough and returned a non-error response. Everything else is a bad event that consumes error budget.

Three SLI categories cover the vast majority of services: availability (did it respond?), latency (was it fast enough?), and quality (was the response correct and complete?). For data pipelines, add freshness and correctness as first-class SLIs.

Availability SLI

Proportion of requests that returned a non-5xx HTTP response (or equivalent domain-specific success criterion). Exclude health check endpoints and synthetic probes — count only real user traffic. In Prometheus: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Latency SLI

Proportion of requests that completed within a latency threshold that users consider acceptable. Define the threshold from user research or product requirements — not from what your system currently achieves. A common starting point: p99 < 500ms for interactive APIs. Track at multiple percentiles (p50, p95, p99) because p99 degradation is invisible in averages.

Pipeline Freshness SLI

For batch and streaming data pipelines: proportion of time windows where output data arrived within an expected lag threshold. If your dashboard is supposed to refresh every 5 minutes, a freshness SLI measures what fraction of 5-minute windows actually delivered on time. Stale data is a correctness failure, not just a performance issue.
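A freshness SLI can be computed with a Prometheus subquery, assuming the pipeline exports a last-success timestamp gauge. The metric name below is illustrative, not a standard:

```yaml
# Fraction of 5-minute evaluation points over 30d where pipeline output
# was at most 300s stale. Assumes the pipeline exports a
# pipeline_last_success_timestamp_seconds gauge (name is illustrative).
groups:
  - name: sli_freshness
    interval: 30s
    rules:
      - record: sli:pipeline_freshness_5m:ratio30d
        expr: |
          avg_over_time(
            ((time() - pipeline_last_success_timestamp_seconds) <= bool 300)[30d:5m]
          )
```

The `bool` modifier turns the staleness comparison into a 0/1 series, so averaging it over the window yields the good-window ratio directly.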

Note

Avoid defining SLIs on infrastructure metrics (CPU, memory, disk I/O). These are useful for root cause analysis but they do not directly measure user experience. A pod can be at 90% CPU while serving requests normally, and a pod at 10% CPU can be returning 500s. SLIs must reflect what users see.
# Prometheus recording rules — compute SLI ratios for efficient alerting
# Store these in a dedicated sli-recording-rules.yml file

groups:
  - name: sli_availability
    interval: 30s
    rules:
      # Good requests in the last 5m (used as the base for burn rate math)
      - record: sli:http_availability:rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5..",job="api"}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))

      # 30-day rolling window availability (for SLO dashboard)
      - record: sli:http_availability:rate30d
        expr: |
          sum(rate(http_requests_total{status!~"5..",job="api"}[30d]))
          /
          sum(rate(http_requests_total{job="api"}[30d]))

  - name: sli_latency
    interval: 30s
    rules:
      # Proportion of requests completing under 500ms threshold
      - record: sli:http_latency_500ms:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{job="api"}[5m]))

Writing SLOs with Meaningful Targets

The instinct when setting an SLO target is to aim for five nines (99.999%). Resist it. A 99.999% availability SLO over 30 days allows 26 seconds of downtime. For most services, this is not only unachievable — it is actively harmful because it means every deployment, every experiment, and every dependency failure instantly burns budget and triggers an incident response. The right SLO is the lowest target that users will not notice you missing.

Start from data, not intuition. Look at your current error rate over the past 90 days, find the percentile that represents your normal operating state (not incident peaks), subtract a small buffer for improvement headroom, and set that as your initial SLO. Tighten it as your reliability engineering matures.
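One way to sketch that procedure, assuming you can export roughly 90 days of daily success rates. The function, quantile choice, and buffer size are all illustrative:

```python
def initial_slo_target(daily_success_rates, quantile=0.25, buffer=0.0005):
    """Suggest an initial SLO target from historical daily success rates.

    A low quantile picks a "normal bad day" rather than your best days;
    known incident days should be filtered out by the caller first.
    The buffer leaves headroom so routine variation does not burn budget.
    """
    rates = sorted(daily_success_rates)
    idx = int(quantile * (len(rates) - 1))
    return round(rates[idx] - buffer, 4)


# Four illustrative daily availability samples
print(initial_slo_target([0.9990, 0.9995, 0.9992, 0.9998]))  # → 0.9985
```

The output is a starting point, not a commitment — revisit it once you have a quarter of SLO data.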

| SLO Target | 30-day error budget | Suitable for |
|---|---|---|
| 99.9% | 43.2 minutes | Consumer apps, payment flows |
| 99.5% | 3.6 hours | Internal tools, batch APIs |
| 99.0% | 7.2 hours | Non-critical microservices |
| 95.0% | 36 hours | Experimental features, dev environments |

Note

Set your internal SLO 10–20% tighter than your SLA. If you commit to 99.5% availability in contracts, operate to a 99.9% internal SLO. The gap gives you time to detect breaches and course-correct before violating customer commitments. Never expose your internal SLO targets in contracts — they are engineering targets, not promises.

Error Budgets: The Currency of Reliability

An error budget makes the trade-off between reliability and velocity explicit and quantitative. A team with a 99.9% SLO over a rolling 30-day window has 43.2 minutes of allowed downtime in that window. Every incident, risky deployment, and infrastructure failure consumes from that budget. When the budget is exhausted, the implicit agreement is: stop shipping features, focus on reliability.

This is the mechanism that makes SLOs operationally useful rather than just aspirational. Without error budgets, reliability and velocity are two teams arguing. With error budgets, both teams share a number: how much of the month's allowance is left. Product can make an informed decision about whether a risky release is worth the budget cost; engineering can quantify the reliability debt.

# Error budget consumption query in Prometheus
# Shows percentage of 30-day error budget remaining

(
  1
  -
  (
    1 - sli:http_availability:rate30d
  )
  /
  (1 - 0.999)   # Replace 0.999 with your SLO target
) * 100

# Budget remaining as absolute minutes (30-day window = 43200 minutes)
(
  (1 - 0.999) * 43200   # Total budget in minutes
  -
  (
    sum(increase(http_requests_total{status=~"5..",job="api"}[30d]))
    /
    sum(increase(http_requests_total{job="api"}[30d]))
    * 43200
  )
)

Error budget policies answer the question: "what does the team do when the budget hits a threshold?" Define these thresholds in advance, in writing:

>50% remaining — Normal operations. Ship features at normal velocity. Reliability investment proportional to roadmap share.

25–50% remaining — Caution zone. Defer high-risk changes. Require post-mortems for all incidents. Engineering lead must approve deployments during business-critical windows.

0–25% remaining — Feature freeze. No new features. Only reliability work, on-call fixes, and security patches. All engineering bandwidth goes to reliability improvements until the budget recovers.

0% — Budget exhausted, escalation. Incident retro required. No deployments without explicit engineering director approval. Review the SLO target: either improve reliability or reset it to a realistic level.

Alert Design: From Noise to Signal

Most alert configurations in the wild violate one fundamental principle: they page on causes (CPU > 80%, memory at 90%, pod restart) instead of symptoms (users are experiencing errors, latency is degraded). Cause-based alerts create alert fatigue — engineers get paged dozens of times for infrastructure events that never affect users. When the real outage hits, the alert is lost in the noise.

The Google SRE Workbook recommends a two-tier model: page on symptoms (SLO burn rate alerts), ticket on causes (infrastructure anomalies that warrant investigation but not immediate human response). This dramatically reduces overnight pages while catching real user-facing incidents faster.
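In Alertmanager terms, the two-tier model is a routing decision. A sketch of that routing, with placeholder receiver names and credentials:

```yaml
# Alertmanager routing sketch for the two-tier model.
# Receiver names, keys, and URLs are placeholders.
route:
  receiver: ticket-queue            # default: causes become tickets, not pages
  routes:
    - matchers:
        - severity="page"           # SLO burn rate alerts only
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-engineering

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-engineering
    slack_configs:
      - api_url: "<slack-webhook-url>"
        channel: "#eng-alerts"
  - name: ticket-queue
    webhook_configs:
      - url: "https://tickets.example.internal/alertmanager"
```

With this structure, adding a new cause-based alert never adds a page — only alerts explicitly labeled `severity: page` reach on-call.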

Multi-Window Multi-Burn-Rate Alerting

A burn rate tells you how fast you are consuming your error budget relative to normal. A burn rate of 1 means you are consuming budget exactly as fast as your SLO allows. A burn rate of 14.4 means you will exhaust a 30-day budget in 50 hours — a real incident. A burn rate of 720 means the budget is gone in 1 hour — a P1.
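The arithmetic behind those numbers, as a pair of small helpers (function names are illustrative):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast budget burns relative to the allowed rate (1.0 = on pace)."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the error budget is gone at a constant burn rate."""
    return (window_days * 24) / rate


print(hours_to_exhaustion(1.0))           # → 720.0 (budget lasts the full window)
print(round(hours_to_exhaustion(14.4)))   # → 50
print(hours_to_exhaustion(720))           # → 1.0
```

A 1.44% error ratio against a 99.9% SLO, for example, is a 14.4× burn — the classic critical-page threshold.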

The multi-window variant adds a short confirmation window to each alert to reduce false positives. Only alert if burn rate is high in both a long window (catches sustained degradation) and a short window (confirms it is still happening). This catches slow burns that simple threshold alerts miss while avoiding pages from brief traffic spikes.

# Multi-window multi-burn-rate SLO alerts
# Based on the Google SRE Workbook pattern

groups:
  - name: slo_alerts
    rules:

      # --- P1: 2% budget in 1 hour (burn rate 14.4×) ---
      # Long window: 1h, Short window: 5m
      - alert: SLOBurnRateCritical
        expr: |
          (
            (1 - sli:http_availability:rate1h) / (1 - 0.999) > 14.4
          )
          and
          (
            (1 - sli:http_availability:rate5m) / (1 - 0.999) > 14.4
          )
        for: 2m
        labels:
          severity: page
          team: on-call
        annotations:
          summary: "Critical SLO burn — {{ $labels.job }}"
          description: >
            Availability SLO burning at >14.4× the allowed rate,
            consuming ~2% of the 30-day error budget per hour.
            Current burn rate: {{ $value | humanize }}×.

      # --- P2: 5% budget in 6 hours (burn rate 6×) ---
      # Long window: 6h, Short window: 30m
      - alert: SLOBurnRateHigh
        expr: |
          (
            (1 - sli:http_availability:rate6h) / (1 - 0.999) > 6
          )
          and
          (
            (1 - sli:http_availability:rate30m) / (1 - 0.999) > 6
          )
        for: 15m
        labels:
          severity: page
          team: on-call
        annotations:
          summary: "High SLO burn — {{ $labels.job }}"
          description: >
            Availability SLO burning at >6× the allowed rate,
            consuming ~5% of the 30-day error budget every 6 hours.
            Budget will be exhausted in ~5 days if unaddressed.

      # --- Warning: 10% budget in 3 days (burn rate 1×) ---
      # Not a page — creates a ticket / Slack alert
      - alert: SLOBurnRateElevated
        expr: |
          (1 - sli:http_availability:rate3d) / (1 - 0.999) > 1
        for: 1h
        labels:
          severity: warning
          team: engineering
        annotations:
          summary: "Elevated SLO burn — {{ $labels.job }}"
          description: >
            Availability is below SLO target over the past 3 days.
            Not an immediate incident, but reliability attention needed.

Note

You also need recording rules for the intermediate windows (1h, 6h, 30m, 3d) used in the alert expressions above. Add these to your sli-recording-rules.yml using the same pattern as the 5m and 30d rules shown earlier. Recording rules precompute expensive range queries, keeping alert evaluation fast even at scale.

Latency Burn Rate Alerts

Apply the same multi-window pattern to latency SLIs. The good-event ratio for latency is: requests completing under the threshold divided by total requests. Prometheus histograms make this straightforward with the _bucket metric and the le label.

groups:
  - name: sli_latency_windows
    interval: 30s
    rules:
      - record: sli:http_latency_500ms:rate1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[1h]))
          /
          sum(rate(http_request_duration_seconds_count{job="api"}[1h]))

      - record: sli:http_latency_500ms:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{job="api"}[5m]))

      - record: sli:http_latency_500ms:rate6h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[6h]))
          /
          sum(rate(http_request_duration_seconds_count{job="api"}[6h]))

  - name: slo_latency_alerts
    rules:
      - alert: LatencySLOBurnRateCritical
        expr: |
          (
            (1 - sli:http_latency_500ms:rate1h) / (1 - 0.99) > 14.4
          )
          and
          (
            (1 - sli:http_latency_500ms:rate5m) / (1 - 0.99) > 14.4
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Latency SLO burn critical — {{ $labels.job }}"
          description: >
            p99 latency SLO burning at >14.4×. Check for slow queries,
            upstream dependency degradation, or traffic spikes.

Instrumenting with OpenTelemetry for SLO-Ready Metrics

SLO alerting is only as good as the underlying metrics. OpenTelemetry provides a vendor-neutral instrumentation layer that produces the histogram and counter metrics Prometheus needs for SLI computation. The key is instrumenting at the correct granularity: per-endpoint, per-status-class, with histogram buckets at your latency SLO threshold.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
import time


# Expose a /metrics endpoint on :9090 for Prometheus to scrape.
# Prometheus is pull-based, so use PrometheusMetricReader rather than
# a periodic (push) exporter.
start_http_server(port=9090)
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

meter = metrics.get_meter("api.slo", version="1.0.0")

# Counter for availability SLI
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests by status class",
    unit="1",
)

# Histogram for latency SLI
# Buckets must include your SLO threshold (0.5s here).
# (explicit_bucket_boundaries_advisory requires a recent OpenTelemetry SDK)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="Request duration in seconds",
    unit="s",
    explicit_bucket_boundaries_advisory=[
        0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
        0.5,   # SLO threshold — keep this exact value
        1.0, 2.5, 5.0, 10.0,
    ],
)


def record_request(endpoint: str, method: str, status_code: int, duration_s: float) -> None:
    """Record a single request for SLI computation."""
    status_class = f"{status_code // 100}xx"

    request_counter.add(
        1,
        attributes={
            "endpoint": endpoint,
            "method": method,
            "status": str(status_code),
            "status_class": status_class,
        },
    )

    request_duration.record(
        duration_s,
        attributes={
            "endpoint": endpoint,
            "method": method,
            "status_class": status_class,
        },
    )


# Middleware pattern (FastAPI / Starlette)
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request


class SLOMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start

        record_request(
            endpoint=request.url.path,
            method=request.method,
            status_code=response.status_code,
            duration_s=duration,
        )
        return response

SLO Dashboard: What to Show and Why

An SLO dashboard has one job: give an engineer landing on it within 30 seconds a clear picture of whether the service is within SLO, how fast the error budget is burning, and how much budget remains in the window. It should not require reading six charts to form an opinion.

Structure the dashboard into three rows: current state (live SLI values and burn rates), budget health (rolling 30-day budget remaining + burn rate trend), and incident history (annotations marking deploys and incidents). Use Grafana Stat panels with threshold coloring for instant status — green is within SLO, red is burning.

# Grafana dashboard — key panel queries

# Panel: Current availability (Stat, threshold: red < SLO target)
sum(rate(http_requests_total{status!~"5..",job="$job"}[5m]))
/
sum(rate(http_requests_total{job="$job"}[5m]))

# Panel: Current burn rate (Stat, threshold: red > 1)
(1 - sli:http_availability:rate1h{job="$job"})
/
(1 - 0.999)

# Panel: Error budget remaining % (Gauge, 0-100, thresholds 25/50)
(
  1
  -
  (1 - sli:http_availability:rate30d{job="$job"})
  /
  (1 - 0.999)
) * 100

# Panel: Budget burn over time (Time series — shows burn rate history)
(1 - sli:http_availability:rate1h{job="$job"})
/
(1 - 0.999)

# Panel: p50 / p95 / p99 latency over time
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="$job"}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="$job"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="$job"}[5m])) by (le))

Note

Add Grafana annotations for every deployment and incident. When investigating a burn rate spike, the first question is always "did someone deploy?" Annotations answer it without requiring a git blame or Slack search. Use the Grafana Annotations API in your CI/CD pipeline to post deploy markers automatically.
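A minimal sketch of posting a deploy marker from CI, using Grafana's `POST /api/annotations` endpoint. The URL and token are placeholders:

```python
import json
import time
import urllib.request

GRAFANA_URL = "https://grafana.example.internal"  # placeholder
API_TOKEN = "<service-account-token>"             # placeholder


def deploy_annotation(service: str, version: str) -> dict:
    """Build the payload for Grafana's POST /api/annotations endpoint."""
    return {
        "time": int(time.time() * 1000),  # epoch milliseconds
        "tags": ["deploy", service],
        "text": f"Deployed {service} {version}",
    }


def post_annotation(payload: dict) -> None:
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    urllib.request.urlopen(req)  # raises on HTTP errors


# In CI/CD, after a successful deploy:
# post_annotation(deploy_annotation("api", "v2.41.0"))
```

Tagging annotations (`deploy`, service name) lets a single Grafana annotation query overlay deploy markers on every SLO panel.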

Production Checklist

Define SLIs before writing alerts

For every service in production, document what a "good request" means — success criteria and latency threshold. Store SLI definitions in a YAML file alongside your recording rules so the definition and implementation stay in sync. Without a written SLI definition, your SLO target is just a number without meaning.
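There is no standard schema for SLI definitions; a hypothetical sketch of what such a file might contain (all field names are illustrative):

```yaml
# sli-definitions.yml — lives alongside sli-recording-rules.yml
service: api
owner: platform-team
slis:
  - name: availability
    description: Non-5xx responses on real user traffic
    good_events: http_requests_total{status!~"5..",job="api"}
    total_events: http_requests_total{job="api"}
    exclusions: ["/healthz", "synthetic probes"]
  - name: latency
    description: Requests completing under the 500ms threshold
    threshold_seconds: 0.5
slo:
  target: 0.999
  window_days: 30
```

Reviewing this file in the same pull request as the recording rules is what keeps definition and implementation in sync.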

Use recording rules for all SLI ratio queries

Range-vector queries over 30d are expensive at evaluation time. Recording rules precompute the ratio every 30s and store the scalar, making multi-window burn rate alert evaluation near-zero cost. Without recording rules, a cluster under load will lag alert evaluation — and you will miss incidents.

Page on burn rate, ticket on causes

Move all infrastructure alerts (CPU, memory, disk, pod restarts, queue depth) to non-paging notification channels — Slack, ticketing systems, dashboards. Reserve on-call pages for SLO burn rate alerts only. Measure and publish your monthly on-call page count. If it exceeds 5 pages per engineer per week, you have a false positive problem that needs systematic review.

Write error budget policies before incidents happen

Define in writing what the team does at 50%, 25%, and 0% budget remaining. Get sign-off from product and engineering leadership. Without pre-approved policies, every budget conversation during an incident becomes a political negotiation. The policies also force teams to discuss the trade-off between reliability and velocity explicitly, usually surfacing misaligned assumptions.

Review SLO targets quarterly

SLO targets should evolve as your reliability improves. A target you set at launch based on your initial error rate may be trivially achievable months later. Review quarterly: if you never breach your SLO, the target is probably too loose (wasting reliability investment); if you breach it monthly, it is too tight (causing constant alert fatigue). Aim for a target you breach once or twice per year.

Run post-mortems on every SLO breach, not every incident

Many incidents happen in the error budget — that is what the budget is for. Run post-mortems on incidents that breached the SLO target, not on every alert or on-call page. This keeps post-mortem quality high (engineers take them seriously) and focuses learning on the incidents that matter to users. Template post-mortems should include an error budget section showing how much budget the incident consumed.
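The error budget section of a post-mortem reduces to simple arithmetic. A sketch, assuming roughly constant traffic during the incident (function name is illustrative):

```python
def budget_consumed_pct(incident_minutes: float,
                        incident_error_ratio: float,
                        slo_target: float = 0.999,
                        window_days: int = 30) -> float:
    """Share of the window's error budget one incident consumed (percent).

    Approximates budget use as incident duration weighted by the error
    ratio observed during the incident.
    """
    budget_minutes = (1 - slo_target) * window_days * 24 * 60
    error_minutes = incident_minutes * incident_error_ratio
    return round(100 * error_minutes / budget_minutes, 1)


# A 30-minute incident at 40% errors, against a 99.9% / 30-day SLO:
print(budget_consumed_pct(30, 0.40))  # → 27.8
```

A single number like "this incident consumed 27.8% of the monthly budget" is far more persuasive in a retro than raw error counts.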

Building observability infrastructure or struggling with noisy alerts and undefined reliability targets?

We design SLO frameworks, alert hierarchies, and observability pipelines for production systems — from SLI definition and Prometheus rule configuration to dashboards, burn rate alerts, and error budget workflows. Let’s talk.

Get in Touch
