Why Monitoring Servers Is Not Observability
Most teams know when a server is down. They get paged at 3 AM because CPU hit 95% or a health check failed. What they do not know — until a user emails support — is that checkout has been returning errors for 40% of requests for the past six hours while every individual service metric looked green.
This is the core gap that Observability-Driven Development (ODD) addresses. Instead of instrumenting infrastructure and hoping it correlates to user experience, ODD starts with a question: what does it mean for this system to work correctly from a user's perspective? That question produces Service Level Indicators (SLIs). The targets you set on those indicators become Service Level Objectives (SLOs). The gap between 100% and your SLO target is your error budget — the operational currency that governs how fast you can move and what incidents to prioritize.
| Concept | Definition | Who owns it |
|---|---|---|
| SLI | A quantitative measure of service behavior (e.g., request success rate) | Engineering |
| SLO | A target for an SLI over a rolling time window (e.g., 99.5% over 30 days) | Engineering + Product |
| SLA | A contractual commitment to customers, typically looser than the internal SLO | Business + Legal |
| Error Budget | Allowed failure headroom = 1 − SLO target. Budget consumed by incidents and risky deployments. | Engineering + Product |
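The error-budget arithmetic in the table is worth making concrete; a minimal sketch (the helper name is ours, not a standard API):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget = (1 - SLO target) x window length, in minutes."""
    return (1 - slo_target) * window_days * 24 * 60

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime;
# a 99.5% SLO allows ~3.6 hours (216 minutes)
print(round(error_budget_minutes(0.999), 1))  # 43.2
print(round(error_budget_minutes(0.995), 1))  # 216.0
```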
Defining SLIs That Measure User Experience
A good SLI answers the question: "was this interaction with my service good for the user?" The canonical framework from the Google SRE Workbook models this as a ratio of good events to total events. A request is a "good event" if it was fast enough and returned a non-error response. Everything else is a bad event that consumes error budget.
Three SLI categories cover the vast majority of services: availability (did it respond?), latency (was it fast enough?), and quality (was the response correct and complete?). For data pipelines, add freshness and correctness as first-class SLIs.
Availability SLI
Proportion of requests that returned a non-5xx HTTP response (or equivalent domain-specific success criterion). Exclude health check endpoints and synthetic probes — count only real user traffic. In Prometheus: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Latency SLI
Proportion of requests that completed within a latency threshold that users consider acceptable. Define the threshold from user research or product requirements — not from what your system currently achieves. A common starting point: p99 < 500ms for interactive APIs. Track at multiple percentiles (p50, p95, p99) because p99 degradation is invisible in averages.
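The good-event ratio for a latency SLI can be computed directly from cumulative histogram buckets — the same shape Prometheus exposes via the `le` label. A minimal sketch with made-up numbers:

```python
def latency_sli(bucket_counts: dict[float, int], total: int, threshold: float) -> float:
    """Good-event ratio: share of requests completing under the threshold.

    bucket_counts maps cumulative upper bounds (Prometheus's `le` label)
    to request counts; `threshold` must be one of the bucket bounds.
    """
    return bucket_counts[threshold] / total

# Hypothetical numbers: 9,930 of 10,000 requests finished under 0.5s
buckets = {0.1: 8000, 0.25: 9500, 0.5: 9930, 1.0: 9990}
print(latency_sli(buckets, total=10000, threshold=0.5))  # 0.993
```

This is why the histogram buckets must include the exact SLO threshold: if 0.5 is not a bucket bound, the ratio cannot be computed without interpolation error.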
Pipeline Freshness SLI
For batch and streaming data pipelines: proportion of time windows where output data arrived within an expected lag threshold. If your dashboard is supposed to refresh every 5 minutes, a freshness SLI measures what fraction of 5-minute windows actually delivered on time. Stale data is a correctness failure, not just a performance issue.
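A freshness SLI reduces to the same good-events-over-total ratio; a sketch with a hypothetical helper:

```python
from datetime import timedelta

def freshness_sli(arrival_lags: list[timedelta], max_lag: timedelta) -> float:
    """Fraction of delivery windows whose output arrived within the lag budget."""
    good = sum(1 for lag in arrival_lags if lag <= max_lag)
    return good / len(arrival_lags)

# Hypothetical lags for four 5-minute windows; one arrived 9 minutes late
lags = [timedelta(minutes=3), timedelta(minutes=4),
        timedelta(minutes=9), timedelta(minutes=2)]
print(freshness_sli(lags, max_lag=timedelta(minutes=5)))  # 0.75
```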
# Prometheus recording rules — compute SLI ratios for efficient alerting
# Store these in a dedicated sli-recording-rules.yml file
groups:
- name: sli_availability
interval: 30s
rules:
# Good requests in the last 5m (used as the base for burn rate math)
- record: sli:http_availability:rate5m
expr: |
sum(rate(http_requests_total{status!~"5..",job="api"}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
# 30-day rolling window availability (for SLO dashboard)
- record: sli:http_availability:rate30d
expr: |
sum(rate(http_requests_total{status!~"5..",job="api"}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
- name: sli_latency
interval: 30s
rules:
# Proportion of requests completing under 500ms threshold
- record: sli:http_latency_500ms:rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="api"}[5m]))

Writing SLOs with Meaningful Targets
The instinct when setting an SLO target is to aim for five nines (99.999%). Resist it. A 99.999% availability SLO over 30 days allows 26 seconds of downtime. For most services, this is not only unachievable — it is actively harmful because it means every deployment, every experiment, and every dependency failure instantly burns budget and triggers an incident response. The right SLO is the lowest target that users will not notice you missing.
Start from data, not intuition. Look at your current error rate over the past 90 days, find the percentile that represents your normal operating state (not incident peaks), subtract a small buffer for improvement headroom, and set that as your initial SLO. Tighten it as your reliability engineering matures.
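One way to mechanize this (the percentile choice and buffer size are judgment calls, not a standard formula):

```python
def initial_slo_target(daily_success_rates: list[float], buffer: float = 0.001) -> float:
    """Derive a starting SLO target from observed history.

    Takes a low-but-normal percentile of daily success rates (10th here,
    to exclude incident peaks) and subtracts a small buffer as improvement
    headroom. Both choices should be tuned to your own history.
    """
    ordered = sorted(daily_success_rates)
    p10 = ordered[max(0, int(len(ordered) * 0.10) - 1)]
    return round(p10 - buffer, 4)

# 90 days of history would go here; ten illustrative values:
history = [0.990, 0.995, 0.996, 0.997, 0.998, 0.998, 0.999, 0.999, 0.999, 1.000]
print(initial_slo_target(history))  # 0.989
```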
| SLO Target | 30-day error budget | Suitable for |
|---|---|---|
| 99.9% | 43.2 minutes | Consumer apps, payment flows |
| 99.5% | 3.6 hours | Internal tools, batch APIs |
| 99.0% | 7.2 hours | Non-critical microservices |
| 95.0% | 36 hours | Experimental features, dev environments |
Error Budgets: The Currency of Reliability
An error budget makes the trade-off between reliability and velocity explicit and quantitative. A team with a 99.9% SLO over 30 days starts each window with 43.2 minutes of allowed downtime. Every incident, risky deployment, and infrastructure failure consumes from that budget. When the budget is exhausted, the implicit agreement is: stop shipping features, focus on reliability.
This is the mechanism that makes SLOs operationally useful rather than just aspirational. Without error budgets, reliability and velocity are two teams arguing. With error budgets, both teams share a number: how much of the month's allowance is left. Product can make an informed decision about whether a risky release is worth the budget cost; engineering can quantify the reliability debt.
# Error budget consumption query in Prometheus
# Shows percentage of 30-day error budget remaining
(
1
-
(
1 - sli:http_availability:rate30d
)
/
(1 - 0.999) # Replace 0.999 with your SLO target
) * 100
# Budget remaining as absolute minutes (30-day window = 43200 minutes)
(
(1 - 0.999) * 43200 # Total budget in minutes
-
(
sum(increase(http_requests_total{status=~"5..",job="api"}[30d]))
/
sum(increase(http_requests_total{job="api"}[30d]))
* 43200
)
)

Error budget policies answer the question: "what does the team do when the budget hits a threshold?" Define these thresholds in advance, in writing:
Normal operations
Ship features at normal velocity. Reliability investment proportional to roadmap share.
Caution zone
Defer high-risk changes. Require post-mortems for all incidents. Engineering lead must approve deployments during business-critical windows.
Feature freeze
No new features. Only reliability, on-call fixes, and security patches. All eng bandwidth on reliability improvements until budget recovers.
Budget exhausted — escalation
Incident retro required. No deployments without explicit engineering director approval. SLO target review: either improve reliability or reset to a realistic target.
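Policy zones only work if the mapping from budget remaining to required behavior is unambiguous; a sketch with illustrative thresholds (agree on your own in writing):

```python
def policy_zone(budget_remaining_pct: float) -> str:
    """Map error-budget remaining (%) to a pre-agreed policy zone.

    The 50/25/0 thresholds here are illustrative examples, not a standard.
    """
    if budget_remaining_pct > 50:
        return "normal"
    if budget_remaining_pct > 25:
        return "caution"
    if budget_remaining_pct > 0:
        return "freeze"
    return "escalation"

print(policy_zone(80))  # normal
print(policy_zone(12))  # freeze
```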
Alert Design: From Noise to Signal
Most alert configurations in the wild violate one fundamental principle: they page on causes (CPU > 80%, memory at 90%, pod restart) instead of symptoms (users are experiencing errors, latency is degraded). Cause-based alerts create alert fatigue — engineers get paged dozens of times for infrastructure events that never affect users. When the real outage hits, the alert is lost in the noise.
The Google SRE Workbook recommends a two-tier model: page on symptoms (SLO burn rate alerts), ticket on causes (infrastructure anomalies that warrant investigation but not immediate human response). This dramatically reduces overnight pages while catching real user-facing incidents faster.
Multi-Window Multi-Burn-Rate Alerting
A burn rate tells you how fast you are consuming your error budget relative to the rate your SLO allows. A burn rate of 1 means you are consuming budget exactly as fast as your SLO permits — the budget lasts the full window. A burn rate of 14.4 means you will exhaust a 30-day budget in 50 hours — a real incident worth paging on. A burn rate of 720 means the budget is gone in 1 hour — an all-hands emergency.
The multi-window variant adds a short confirmation window to each alert to reduce false positives. Only alert if burn rate is high in both a long window (catches sustained degradation) and a short window (confirms it is still happening). This catches slow burns that simple threshold alerts miss while avoiding pages from brief traffic spikes.
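The burn-rate arithmetic behind these tiers is simple enough to verify; a sketch (the helper name is ours):

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, hours until the full error budget is spent."""
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(1))     # 720.0 -- budget lasts exactly the window
print(hours_to_exhaustion(14.4))  # ~50 hours
print(hours_to_exhaustion(720))   # 1.0 -- gone in an hour
```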
# Multi-window multi-burn-rate SLO alerts
# Based on the Google SRE Workbook pattern
groups:
- name: slo_alerts
rules:
# --- P1: 2% budget in 1 hour (burn rate 14.4×) ---
# Long window: 1h, Short window: 5m
- alert: SLOBurnRateCritical
expr: |
(
(1 - sli:http_availability:rate1h) / (1 - 0.999) > 14.4
)
and
(
(1 - sli:http_availability:rate5m) / (1 - 0.999) > 14.4
)
for: 2m
labels:
severity: page
team: on-call
annotations:
summary: "Critical SLO burn — {{ $labels.job }}"
description: >
Availability SLO burning at >14.4× the allowed rate.
At this rate the 30-day error budget will be exhausted in ~1 hour.
Current 1h burn rate: {{ $value | humanize }}x the allowed error rate.
# --- P2: 5% budget in 6 hours (burn rate 6×) ---
# Long window: 6h, Short window: 30m
- alert: SLOBurnRateHigh
expr: |
(
(1 - sli:http_availability:rate6h) / (1 - 0.999) > 6
)
and
(
(1 - sli:http_availability:rate30m) / (1 - 0.999) > 6
)
for: 15m
labels:
severity: page
team: on-call
annotations:
summary: "High SLO burn — {{ $labels.job }}"
description: >
Availability SLO burning at >6× the allowed rate.
Budget will be exhausted in ~6 hours if unaddressed.
# --- Warning: 10% budget in 3 days (burn rate 1×) ---
# Not a page — creates a ticket / Slack alert
- alert: SLOBurnRateElevated
expr: |
(1 - sli:http_availability:rate3d) / (1 - 0.999) > 1
for: 1h
labels:
severity: warning
team: engineering
annotations:
summary: "Elevated SLO burn — {{ $labels.job }}"
description: >
Availability is below SLO target over the past 3 days.
Not an immediate incident, but reliability attention needed.
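The availability alerts above reference recording rules at 30m, 1h, 6h, and 3d windows that the earlier recording-rules file did not define. They follow the same pattern — a sketch assuming the same http_requests_total metric and job="api" label:

```yaml
groups:
  - name: sli_availability_windows
    interval: 30s
    rules:
      - record: sli:http_availability:rate30m
        expr: |
          sum(rate(http_requests_total{status!~"5..",job="api"}[30m]))
          /
          sum(rate(http_requests_total{job="api"}[30m]))
      - record: sli:http_availability:rate1h
        expr: |
          sum(rate(http_requests_total{status!~"5..",job="api"}[1h]))
          /
          sum(rate(http_requests_total{job="api"}[1h]))
      - record: sli:http_availability:rate6h
        expr: |
          sum(rate(http_requests_total{status!~"5..",job="api"}[6h]))
          /
          sum(rate(http_requests_total{job="api"}[6h]))
      - record: sli:http_availability:rate3d
        expr: |
          sum(rate(http_requests_total{status!~"5..",job="api"}[3d]))
          /
          sum(rate(http_requests_total{job="api"}[3d]))
```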
Latency Burn Rate Alerts
Apply the same multi-window pattern to latency SLIs. The good-event ratio for latency is: requests completing under the threshold divided by total requests. Prometheus histograms make this straightforward with the _bucket metric and the le label.
groups:
- name: sli_latency_windows
interval: 30s
rules:
- record: sli:http_latency_500ms:rate1h
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[1h]))
/
sum(rate(http_request_duration_seconds_count{job="api"}[1h]))
# sli:http_latency_500ms:rate5m is already recorded in sli-recording-rules.yml —
# do not define it a second time, or both groups will write duplicate samples
# for the same series
- record: sli:http_latency_500ms:rate6h
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5",job="api"}[6h]))
/
sum(rate(http_request_duration_seconds_count{job="api"}[6h]))
- name: slo_latency_alerts
rules:
- alert: LatencySLOBurnRateCritical
expr: |
(
(1 - sli:http_latency_500ms:rate1h) / (1 - 0.99) > 14.4
)
and
(
(1 - sli:http_latency_500ms:rate5m) / (1 - 0.99) > 14.4
)
for: 2m
labels:
severity: page
annotations:
summary: "Latency SLO burn critical — {{ $labels.job }}"
description: >
p99 latency SLO burning at >14.4×. Check for slow queries,
upstream dependency degradation, or traffic spikes.

Instrumenting with OpenTelemetry for SLO-Ready Metrics
SLO alerting is only as good as the underlying metrics. OpenTelemetry provides a vendor-neutral instrumentation layer that produces the histogram and counter metrics Prometheus needs for SLI computation. The key is instrumenting at the correct granularity: per-endpoint, per-status-class, with histogram buckets at your latency SLO threshold.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
import time
# Expose a /metrics endpoint on :9090 for Prometheus to scrape
start_http_server(9090)
# PrometheusMetricReader is pull-based: metrics are collected at scrape
# time, so no PeriodicExportingMetricReader is needed
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("api.slo", version="1.0.0")
# Counter for availability SLI
request_counter = meter.create_counter(
"http_requests_total",
description="Total HTTP requests by status class",
unit="1",
)
# Histogram for latency SLI
# Buckets must include your SLO threshold (0.5s here)
request_duration = meter.create_histogram(
"http_request_duration_seconds",
description="Request duration in seconds",
unit="s",
explicit_bucket_boundaries_advisory=[
0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
0.5, # SLO threshold — keep this exact value
1.0, 2.5, 5.0, 10.0,
],
)
def record_request(endpoint: str, method: str, status_code: int, duration_s: float) -> None:
"""Record a single request for SLI computation."""
status_class = f"{status_code // 100}xx"
request_counter.add(
1,
attributes={
"endpoint": endpoint,
"method": method,
"status": str(status_code),
"status_class": status_class,
},
)
request_duration.record(
duration_s,
attributes={
"endpoint": endpoint,
"method": method,
"status_class": status_class,
},
)
# Middleware pattern (FastAPI / Starlette)
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
class SLOMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request: Request, call_next):
start = time.perf_counter()
response = await call_next(request)
duration = time.perf_counter() - start
record_request(
endpoint=request.url.path,
method=request.method,
status_code=response.status_code,
duration_s=duration,
)
return response

SLO Dashboard: What to Show and Why
An SLO dashboard has one job: give an engineer landing on it within 30 seconds a clear picture of whether the service is within SLO, how fast the error budget is burning, and how much budget remains in the window. It should not require reading six charts to form an opinion.
Structure the dashboard into three rows: current state (live SLI values and burn rates), budget health (rolling 30-day budget remaining + burn rate trend), and incident history (annotations marking deploys and incidents). Use Grafana Stat panels with threshold coloring for instant status — green is within SLO, red is burning.
# Grafana dashboard — key panel queries
# Panel: Current availability (Stat, threshold: red < SLO target)
sum(rate(http_requests_total{status!~"5..",job="$job"}[5m]))
/
sum(rate(http_requests_total{job="$job"}[5m]))
# Panel: Current burn rate (Stat, threshold: red > 1)
(1 - sli:http_availability:rate1h{job="$job"})
/
(1 - 0.999)
# Panel: Error budget remaining % (Gauge, 0-100, thresholds 25/50)
(
1
-
(1 - sli:http_availability:rate30d{job="$job"})
/
(1 - 0.999)
) * 100
# Panel: Budget burn over time (Time series — shows burn rate history)
(1 - sli:http_availability:rate1h{job="$job"})
/
(1 - 0.999)
# Panel: p50 / p95 / p99 latency over time
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="$job"}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="$job"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="$job"}[5m])) by (le))
Production Checklist
Define SLIs before writing alerts
For every service in production, document what a "good request" means — success criteria and latency threshold. Store SLI definitions in a YAML file alongside your recording rules so the definition and implementation stay in sync. Without a written SLI definition, your SLO target is just a number without meaning.
Use recording rules for all SLI ratio queries
Range-vector queries over 30d are expensive at evaluation time. Recording rules precompute the ratio every 30s and store the scalar, making multi-window burn rate alert evaluation near-zero cost. Without recording rules, a cluster under load will lag alert evaluation — and you will miss incidents.
Page on burn rate, ticket on causes
Move all infrastructure alerts (CPU, memory, disk, pod restarts, queue depth) to non-paging notification channels — Slack, ticketing systems, dashboards. Reserve on-call pages for SLO burn rate alerts only. Measure and publish your monthly on-call page count. If it exceeds 5 pages per engineer per week, you have a false positive problem that needs systematic review.
Write error budget policies before incidents happen
Define in writing what the team does at 50%, 25%, and 0% budget remaining. Get sign-off from product and engineering leadership. Without pre-approved policies, every budget conversation during an incident becomes a political negotiation. The policies also force teams to discuss the trade-off between reliability and velocity explicitly, usually surfacing misaligned assumptions.
Review SLO targets quarterly
SLO targets should evolve as your reliability improves. A target you set at launch based on your initial error rate may be trivially easy to hit a few months later. Review quarterly: if you never breach your SLO, the target is probably too loose (and you may be over-investing in reliability); if you breach it monthly, it is too tight (causing constant alert fatigue). Aim for a target you breach once or twice per year.
Run post-mortems on every SLO breach, not every incident
Many incidents happen in the error budget — that is what the budget is for. Run post-mortems on incidents that breached the SLO target, not on every alert or on-call page. This keeps post-mortem quality high (engineers take them seriously) and focuses learning on the incidents that matter to users. Template post-mortems should include an error budget section showing how much budget the incident consumed.
Building observability infrastructure or struggling with noisy alerts and undefined reliability targets?
We design SLO frameworks, alert hierarchies, and observability pipelines for production systems — from SLI definition and Prometheus rule configuration to dashboards, burn rate alerts, and error budget workflows. Let’s talk.
Get in Touch