The Problem with Logs and Metrics in Microservices
In a monolithic application, debugging a slow request is straightforward: grep the log file, find the slow SQL query, fix the index. In a microservices architecture, a single user-facing request might fan out across a dozen services — an API gateway, an auth service, a product catalogue, an inventory check, a payment processor, and a notification service. When that request takes four seconds instead of 400 milliseconds, structured logs tell you what each service did independently, and metrics tell you the aggregate latency per service. But neither tells you which path this specific request took, where it waited, or which downstream call caused the tail latency.
Distributed tracing solves this. A trace is a directed acyclic graph of spans — each span representing a unit of work in one service, with its start time, duration, parent span, and arbitrary key-value attributes. A trace ID propagated in HTTP headers stitches all spans for a single request into a single timeline. You see exactly where time was spent, which service called which, and where errors were thrown.
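Concretely, the propagation header is tiny. The sketch below illustrates the W3C Trace Context traceparent format (version, 128-bit trace ID, 64-bit parent span ID, flags) that stitches spans together — illustrative only, ignoring tracestate and the finer points of the spec:

```python
import os

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context: "version-traceid-parentid-flags", all lowercase hex
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "trace_id": trace_id,         # identifies the whole request tree
        "parent_span_id": parent_id,  # the caller's span; our span becomes its child
        "sampled": flags == "01",     # downstream services honour the caller's decision
    }

trace_id = os.urandom(16).hex()  # 128-bit trace ID, generated once at the root
span_id = os.urandom(8).hex()    # 64-bit span ID, new for every span
header = make_traceparent(trace_id, span_id)
```

Every outbound call carries this header; each service parses it, creates its span as a child of parent_span_id, and forwards a fresh traceparent carrying its own span ID — exactly what the OpenTelemetry propagators configured below do for you.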
Jaeger is the CNCF-graduated distributed tracing system originally open-sourced by Uber. It accepts the OpenTelemetry Protocol (OTLP) natively and stores traces in Elasticsearch, Cassandra, Badger (embedded), or ClickHouse (via community support). Jaeger 2.x unifies the previously separate agent, collector, and ingester components into a single binary, simplifying deployment significantly.
Jaeger Architecture — Components and Data Flow
Understanding which components Jaeger runs helps you size resources correctly and design your pipeline for reliability. The modern Jaeger 2.x architecture is simpler than in earlier versions:
OpenTelemetry SDK (in your service)
Your application code (or an auto-instrumentation agent) creates spans and exports them via OTLP — the OpenTelemetry Protocol — over gRPC (port 4317) or HTTP (port 4318). The SDK handles sampling decisions, context propagation, and batching before export.
OpenTelemetry Collector (optional but recommended)
A vendor-neutral proxy that receives OTLP from services, applies processors (tail-based sampling, attribute enrichment, PII redaction), and fans out to multiple backends simultaneously. It also decouples your services from any particular backend, avoiding vendor lock-in.
Jaeger Collector
Receives spans from the OTEL Collector (or directly from SDKs) via OTLP, validates them, and writes to the configured storage backend. In Jaeger 2.x, the collector and query components run as a single binary.
Storage Backend
Jaeger supports Elasticsearch/OpenSearch (recommended for large scale), Apache Cassandra (high write throughput), Badger (embedded key-value, suitable for development and single-node deployments), and ClickHouse via a community plugin.
Jaeger Query + UI
Exposes the HTTP API and serves the React-based Jaeger UI on port 16686. The API is also used by Grafana's Jaeger data source and can be queried directly for programmatic trace analysis.
Instrumenting Services with OpenTelemetry
The OpenTelemetry Python SDK offers two instrumentation modes: auto-instrumentation via a zero-code agent that patches popular frameworks at startup, and manual instrumentation for custom spans on business logic. Most production setups use both: auto-instrumentation handles HTTP, database, and messaging framework spans automatically; manual spans add business-level context.
# Install dependencies
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
# pip install opentelemetry-instrumentation-fastapi opentelemetry-instrumentation-httpx
# pip install opentelemetry-instrumentation-sqlalchemy opentelemetry-instrumentation-redis
# tracing.py — centralised tracing setup, imported once at startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import os
def setup_tracing(app=None, db_engine=None):
    resource = Resource.create({
        SERVICE_NAME: os.getenv("SERVICE_NAME", "unknown-service"),
        SERVICE_VERSION: os.getenv("SERVICE_VERSION", "0.0.0"),
        "deployment.environment": os.getenv("DEPLOY_ENV", "development"),
        "host.name": os.uname().nodename,
    })
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317"),
        insecure=True,
    )
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        BatchSpanProcessor(
            exporter,
            max_queue_size=2048,
            schedule_delay_millis=5000,
            max_export_batch_size=512,
        )
    )
    trace.set_tracer_provider(provider)
    # Auto-instrument frameworks
    if app:
        FastAPIInstrumentor.instrument_app(
            app,
            excluded_urls="/health,/metrics,/ready",
        )
    HTTPXClientInstrumentor().instrument()
    if db_engine:
        SQLAlchemyInstrumentor().instrument(engine=db_engine)
    return trace.get_tracer(__name__)
# main.py — FastAPI service with tracing
import httpx
from fastapi import FastAPI, HTTPException
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from tracing import setup_tracing
app = FastAPI(title="order-service")
tracer = setup_tracing(app)
@app.post("/orders")
async def create_order(payload: dict):
    with tracer.start_as_current_span("create_order") as span:
        # Add business context to the span
        span.set_attributes({
            "order.customer_id": payload.get("customer_id"),
            "order.item_count": len(payload.get("items", [])),
            "order.total_usd": payload.get("total"),
        })
        try:
            # Validate inventory — child span created automatically by HTTPX instrumentation
            async with httpx.AsyncClient() as client:
                resp = await client.post(
                    "http://inventory-service/reserve",
                    json={"items": payload["items"]},
                )
                resp.raise_for_status()
            # Charge payment — manual span for business-level visibility
            with tracer.start_as_current_span("charge_payment") as payment_span:
                payment_span.set_attribute("payment.method", payload.get("payment_method"))
                order_id = await charge_and_persist(payload)
                payment_span.set_attribute("order.id", order_id)
            span.set_attribute("order.id", order_id)
            span.add_event("order_created", {"order_id": order_id})
            return {"order_id": order_id, "status": "confirmed"}
        except httpx.HTTPStatusError as exc:
            # Record the error on the span — visible in Jaeger as a red span
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise HTTPException(status_code=503, detail="Inventory unavailable")

@app.get("/health")
async def health():
    return {"status": "ok"}
Note
Call span.record_exception(exc) inside exception handlers — this attaches the stack trace to the span and marks it as an error in Jaeger's service graph, making error propagation visible across service boundaries without manual log correlation.
Go Instrumentation — Manual Spans and Baggage
Go services benefit from the OpenTelemetry Go SDK and the otelhttp and otelgrpc interceptors for HTTP and gRPC auto-instrumentation. Baggage — key-value pairs propagated alongside the trace context — lets you attach business identifiers (tenant ID, user ID, feature flags) to every span in a request tree without threading them through every function signature.
// tracing/setup.go
package tracing
import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func SetupTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
	if endpoint == "" {
		endpoint = "otel-collector:4317"
	}
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName(os.Getenv("SERVICE_NAME")),
		semconv.ServiceVersion(os.Getenv("SERVICE_VERSION")),
		attribute.String("deployment.environment", os.Getenv("DEPLOY_ENV")),
	)
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		// Head-based sampling: 10% in production, 100% in dev
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(samplingRate())),
		),
	)
	otel.SetTracerProvider(tp)
	// Propagate both W3C TraceContext and Baggage headers
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}

func samplingRate() float64 {
	if os.Getenv("DEPLOY_ENV") == "production" {
		return 0.10
	}
	return 1.0
}
// handlers/payment.go — attaching baggage to propagate tenant context
package handlers

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("payment-service")

// parseAmount and callPaymentGateway are defined elsewhere in this package.
func ProcessPayment(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	// Attach tenant ID to baggage — propagates to all downstream spans
	tenantMember, _ := baggage.NewMember("tenant.id", r.Header.Get("X-Tenant-ID"))
	bag, _ := baggage.New(tenantMember)
	ctx = baggage.ContextWithBaggage(ctx, bag)
	ctx, span := tracer.Start(ctx, "process_payment",
		trace.WithAttributes(
			attribute.String("payment.method", r.FormValue("method")),
			attribute.Float64("payment.amount_usd", parseAmount(r)),
		),
	)
	defer span.End()
	if err := callPaymentGateway(ctx, r); err != nil {
		span.SetStatus(codes.Error, err.Error())
		span.RecordError(err)
		http.Error(w, "payment failed", http.StatusBadGateway)
		return
	}
	span.SetStatus(codes.Ok, "")
	w.WriteHeader(http.StatusOK)
}
Sampling Strategies — Head-Based, Probabilistic, and Tail-Based
At scale, recording every span for every request is prohibitively expensive — a service handling 50,000 RPS would generate millions of spans per minute. Sampling reduces this load while preserving high-value traces. There are two families of sampling strategy:
Head-based sampling makes the keep/drop decision at the root span, before any downstream calls. It is fast and requires no buffering, but it is blind to outcomes — it cannot preferentially keep traces that contain errors or high latency, because those facts are not yet known at decision time. Tail-based sampling, performed by the OpenTelemetry Collector after all spans for a trace arrive, can make intelligent decisions: keep 100% of error traces, keep 100% of traces over 2 seconds, sample 1% of successful fast traces. This is the recommended strategy for production.
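The head-based decision itself is worth seeing in miniature. The sketch below illustrates the idea behind a trace-ID-ratio sampler (a simplified stand-in for OTel's TraceIDRatioBased, not its exact algorithm): the verdict is a pure function of the trace ID, so every service holding the same ID reaches the same decision without coordination.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    # Treat the upper 64 bits of the 128-bit trace ID as a number in
    # [0, 2^64) and keep the trace when it falls below ratio * 2^64.
    # Deterministic: no per-request randomness, so the decision is
    # reproducible in every service that sees this trace ID.
    upper = int(trace_id_hex[:16], 16)
    return upper < ratio * (1 << 64)

kept = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10)  # W3C example trace ID
```

Because this decision is made at the root before any child spans exist, it cannot see latency or errors — which is precisely the limitation tail-based sampling removes.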
# otel-collector-config.yaml — tail-based sampling in the OpenTelemetry Collector
# Deploy alongside your services; receives OTLP, samples, and forwards to Jaeger
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  # Tail-based sampling — waits for all spans in a trace before deciding
  tail_sampling:
    decision_wait: 10s                  # wait up to 10s for late-arriving spans
    num_traces: 100000                  # max in-memory traces (size your memory accordingly)
    expected_new_traces_per_sec: 10000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (> 2 seconds end-to-end)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000
      # Keep all traces for specific critical endpoints
      - name: critical-endpoints
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/checkout", "/payment", "/orders"]
          enabled_regex_matching: false
      # Sample 5% of healthy fast traces (baseline visibility)
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  # Enrich spans with k8s metadata (pod name, namespace, node)
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    filter:
      node_from_env_var: K8S_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.namespace.name
        - k8s.node.name
  # Redact sensitive attributes before storing
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["http.request.header.authorization"], "Bearer .+", "Bearer [REDACTED]")
          - delete_key(attributes, "db.statement") where attributes["db.system"] == "postgresql" and IsMatch(attributes["db.statement"], ".*password.*")
  batch:
    timeout: 5s
    send_batch_size: 1000
exporters:
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  # Also export to Prometheus for trace-derived metrics
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, k8sattributes, transform, batch]
      exporters: [otlp/jaeger]
Note
The decision_wait window in tail sampling determines how long the Collector buffers spans before making a sampling decision. Set it higher than your p99 request duration — if a slow trace takes 8 seconds but decision_wait is 5 seconds, its root span may be evaluated before the final child spans arrive, causing premature or incorrect decisions.
Deploying Jaeger — Docker Compose for Local Dev
Jaeger 2.x ships as a single binary that subsumes the old agent, collector, ingester, and query roles. For local development and integration testing, a Docker Compose stack with Jaeger and the OTEL Collector is sufficient. For production, the Helm chart on Kubernetes with Elasticsearch storage is the standard approach.
# docker-compose.yml — Jaeger + OTEL Collector for local development
# Start: docker compose up -d
# Jaeger UI: http://localhost:16686
# OTLP gRPC endpoint for your services: localhost:4317
version: "3.9"
services:
  jaeger:
    image: jaegertracing/jaeger:2.2.0
    container_name: jaeger
    ports:
      - "16686:16686"   # Jaeger UI
      - "14269:14269"   # Admin / health check
      # OTLP ports are not published to the host here — services send to the
      # OTEL Collector below, which forwards to jaeger:4317 on the compose network
    environment:
      - SPAN_STORAGE_TYPE=badger
      - BADGER_EPHEMERAL=false
      - BADGER_DIRECTORY_VALUE=/badger/data
      - BADGER_DIRECTORY_KEY=/badger/key
    volumes:
      - jaeger-badger:/badger
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:14269/"]
      interval: 10s
      timeout: 5s
      retries: 5
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.99.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC from services
      - "4318:4318"   # OTLP HTTP from services
      - "8889:8889"   # Prometheus metrics exporter
    depends_on:
      jaeger:
        condition: service_healthy
  # Example: your order service instrumented with OTEL
  order-service:
    build: ./services/order-service
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - SERVICE_NAME=order-service
      - SERVICE_VERSION=1.0.0
      - DEPLOY_ENV=development
    depends_on:
      - otel-collector
volumes:
  jaeger-badger:
Jaeger on Kubernetes with Helm and Elasticsearch
The official Jaeger Helm chart supports both Badger (for single-node, low-traffic deployments) and Elasticsearch/OpenSearch (for production). Pair the Jaeger chart with the OpenTelemetry Collector Helm chart deployed as a DaemonSet (one collector pod per node, receiving traffic from all pods on that node) or a Deployment (centralised collector with horizontal scaling). Note that tail-based sampling requires every span of a trace to reach the same collector instance; with a DaemonSet, a trace whose spans originate on different nodes can be sampled inconsistently, so at scale place a trace-ID-aware routing tier (the contrib loadbalancing exporter) in front of the sampling collectors.
# jaeger-values.yaml — Jaeger Helm chart values for production
# helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
# helm install jaeger jaegertracing/jaeger -f jaeger-values.yaml -n observability
provisionDataStore:
  cassandra: false
  elasticsearch: false   # assume externally managed ES
storage:
  type: elasticsearch
  elasticsearch:
    host: elasticsearch.observability.svc.cluster.local
    port: 9200
    scheme: http
    indexPrefix: jaeger
    useAliases: true
    # ILM policy name — must exist in your Elasticsearch cluster
    indexDateSeparator: "-"
collector:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi
  # Receive OTLP directly (if not using separate OTEL Collector)
  service:
    otlp:
      grpc:
        port: 4317
      http:
        port: 4318
query:
  replicaCount: 2
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: jaeger-basic-auth
    hosts:
      - host: jaeger.internal.example.com
        paths:
          - path: /
            pathType: Prefix
# otel-collector-daemonset-values.yaml — OTEL Collector as DaemonSet
# helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
# helm install otel-collector open-telemetry/opentelemetry-collector -f otel-collector-daemonset-values.yaml -n observability
mode: daemonset
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
  processors:
    tail_sampling:
      decision_wait: 10s
      num_traces: 50000
      policies:
        - name: errors
          type: status_code
          status_code:
            status_codes: [ERROR]
        - name: slow-traces
          type: latency
          latency:
            threshold_ms: 1500
        - name: baseline
          type: probabilistic
          probabilistic:
            sampling_percentage: 3
    k8sattributes:
      auth_type: serviceAccount
      extract:
        metadata:
          - k8s.pod.name
          - k8s.namespace.name
          - k8s.deployment.name
          - k8s.node.name
    batch:
      timeout: 5s
  exporters:
    otlp/jaeger:
      endpoint: "jaeger-collector.observability.svc.cluster.local:4317"
      tls:
        insecure: true
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [tail_sampling, k8sattributes, batch]
        exporters: [otlp/jaeger]
# Pod-level RBAC for k8s metadata enrichment
clusterRole:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "namespaces", "nodes"]
      verbs: ["get", "watch", "list"]
# Resource requests tuned for DaemonSet (runs on every node)
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Querying Traces — Jaeger API, Grafana, and Programmatic Analysis
The Jaeger HTTP API (available at /api/traces and /api/services) makes trace data queryable beyond the Jaeger UI. This is useful for automated analysis — finding traces above a latency threshold, extracting error rates per service, or identifying the most common call paths. The Jaeger UI also supports deep dependency graphs that visualise the live call topology derived from actual trace data, complementing static service diagrams.
# Python script — query Jaeger API for high-latency traces and alert
# Usage: python3 trace_analysis.py --service order-service --min-duration 2000000
import argparse
import httpx
from datetime import datetime, timedelta, timezone

JAEGER_QUERY_URL = "http://jaeger-query.observability.svc.cluster.local:16686"

def fetch_slow_traces(service: str, min_duration_us: int, lookback_hours: int = 1):
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)
    params = {
        "service": service,
        "minDuration": f"{min_duration_us}us",
        "start": int(start_time.timestamp() * 1_000_000),  # Jaeger uses microseconds
        "end": int(end_time.timestamp() * 1_000_000),
        "limit": 200,
    }
    resp = httpx.get(f"{JAEGER_QUERY_URL}/api/traces", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]

def summarise_slow_traces(traces):
    slow = []
    for trace in traces:
        spans = trace["spans"]
        root = next((s for s in spans if not s["references"]), None)
        if not root:
            continue
        duration_ms = root["duration"] / 1000
        error_spans = [s for s in spans if any(
            t["key"] == "error" and t["value"] == "true"
            for t in s.get("tags", [])
        )]
        slow.append({
            "trace_id": trace["traceID"],
            "duration_ms": round(duration_ms, 1),
            "span_count": len(spans),
            "has_error": len(error_spans) > 0,
            "operation": root["operationName"],
            "url": f"{JAEGER_QUERY_URL}/trace/{trace['traceID']}",
        })
    return sorted(slow, key=lambda x: x["duration_ms"], reverse=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--service", required=True)
    parser.add_argument("--min-duration", type=int, default=2_000_000)  # 2s in µs
    args = parser.parse_args()
    traces = fetch_slow_traces(args.service, args.min_duration)
    summary = summarise_slow_traces(traces)
    for t in summary[:10]:
        flag = "🔴" if t["has_error"] else "🟡"
        print(f"{flag} {t['duration_ms']}ms {t['operation']} → {t['url']}")
Grafana's Jaeger data source surfaces traces alongside metrics and logs in the same dashboard. The Explore view lets you jump from a Loki log line (matching a trace ID) directly to the Jaeger UI trace view, and from a Prometheus high-latency alert to the offending trace. This correlated observability — logs, metrics, and traces navigable from a single interface — is what makes Grafana the natural frontend for Jaeger in production.
# Grafana datasource provisioning for Jaeger + Tempo link-back
# /etc/grafana/provisioning/datasources/jaeger.yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    uid: jaeger
    url: http://jaeger-query.observability.svc.cluster.local:16686
    access: proxy
    isDefault: false
    jsonData:
      # Node graph view — shows service dependency graph inside Grafana
      nodeGraph:
        enabled: true
      # Trace-to-logs: jump from a span to matching log lines in Loki
      tracesToLogs:
        datasourceUid: loki
        tags: ["service.name", "k8s.pod.name"]
        mapTagNamesEnabled: true
        mappedTags:
          - key: service.name
            value: app
        filterByTraceID: true
        filterBySpanID: false
      # Trace-to-metrics: jump from span to related Prometheus panel
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: job
        queries:
          - name: "Request rate"
            query: 'sum(rate(http_server_duration_bucket{job="$__tags.job"}[5m]))'
          - name: "Error rate"
            query: 'sum(rate(http_server_duration_count{job="$__tags.job",http_status_code=~"5.."}[5m]))'
Extracting SLO Metrics from Trace Data
The OTEL Collector's spanmetrics connector converts trace spans into Prometheus-compatible histograms and counters, giving you latency and error-rate metrics derived from the same request data as your trace store — no separate metric instrumentation to maintain, and no risk of metrics and traces telling different stories.
# otel-collector-spanmetrics.yaml — generate RED metrics from trace spans
# RED: Rate, Errors, Duration — the three golden signals for microservices
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
connectors:
  # spanmetrics: derives Prometheus histograms from span data
  spanmetrics:
    namespace: traces
    histogram:
      explicit:
        buckets: [10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s]
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: http.route
      - name: service.name
      - name: k8s.namespace.name
    exemplars:
      enabled: true   # attach trace IDs to histogram buckets for exemplar support
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  batch:
    timeout: 5s
exporters:
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    # Unsampled pipeline feeds spanmetrics BEFORE tail sampling, so the
    # derived rate/error metrics reflect all traffic, not just kept traces
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics]
    # Sampled pipeline: only kept traces reach Jaeger storage
    traces/sampled:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
    # Metrics pipeline: receives from spanmetrics connector, exposes as Prometheus
    metrics/spanmetrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
# Prometheus alerting rules derived from spanmetrics
# /etc/prometheus/rules/trace-slos.yaml
groups:
  - name: trace-slos
    interval: 30s
    rules:
      # Error rate SLO: < 1% errors per service over 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service_name)
          /
          sum(rate(traces_spanmetrics_calls_total[5m])) by (service_name)
          > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service_name }}"
          description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes"
          runbook: "https://runbooks.internal/high-error-rate"
      # Latency SLO: p99 < 2s for all services
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(traces_spanmetrics_duration_milliseconds_bucket[5m])) by (service_name, le)
          ) > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s on {{ $labels.service_name }}"
          description: "p99 is {{ $value }} ms"
          runbook: "https://runbooks.internal/high-latency"
Note
To see exemplars (the trace IDs attached to histogram buckets) in Grafana, start Prometheus with --enable-feature=exemplar-storage and set exemplars: enabled: true in the spanmetrics connector.
Distributed Tracing Production Checklist
W3C TraceContext headers are propagated across all service boundaries
HTTP services must pass `traceparent` and `tracestate` headers on every outbound call. gRPC services propagate the same context in gRPC metadata (the otelgrpc interceptors handle this automatically). Missing propagation creates disconnected traces — you'll see individual service spans but no request graph. Audit propagation by searching Jaeger for traces with a single span.
Tail-based sampling is configured in the OTEL Collector
Head-based sampling at 10% means you miss 90% of error traces — the most valuable data. Deploy the OTEL Collector with tail_sampling configured to keep 100% of error and high-latency traces, and a low percentage (1–5%) of healthy traces for baseline visibility. Size collector memory for `num_traces × average_trace_size`.
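The sizing rule is simple arithmetic; the bytes-per-span figure below is an assumption you should replace with a value measured from your own spans:

```python
def tail_sampling_buffer_mib(num_traces: int, avg_spans_per_trace: int,
                             avg_span_bytes: int = 1024) -> float:
    # Worst case the Collector holds num_traces full traces in memory for
    # the duration of decision_wait before any sampling decision frees them.
    return num_traces * avg_spans_per_trace * avg_span_bytes / (1024 ** 2)

# num_traces: 100000 as in the earlier config, 20 spans/trace, ~1 KiB/span
estimate = tail_sampling_buffer_mib(100_000, 20)  # ≈ 1953 MiB
```

So the 100,000-trace buffer from the earlier config wants roughly 2 GiB of headroom on top of the Collector's baseline — set the container memory limit accordingly.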
Sensitive attributes are redacted before storage
Spans automatically capture HTTP headers, SQL queries, and URL parameters. These may contain PII, tokens, and passwords. Use the OTEL Collector's `transform` processor to redact or delete sensitive attributes (Authorization headers, full DB statements for tables containing PII) before traces reach Jaeger storage.
Elasticsearch index lifecycle management is configured for trace data
Jaeger writes one Elasticsearch index per day by default. Without ILM, trace indices grow unbounded. Configure an ILM policy with a hot phase (7 days on SSD), a warm phase (21 days on HDD), and a delete phase (30 days). Reference the policy in `jaeger-values.yaml`.
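A policy matching those phases might look like the following — sketched as the JSON body you would PUT to the Elasticsearch ILM API; the policy name, node attribute, and rollover thresholds are illustrative assumptions, not values the Jaeger chart requires:

```python
import json

# Hot until day 7, warm until day 30, then delete — mirroring the phases
# described above. The "data" node attribute is a hypothetical example.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {"rollover": {"max_age": "1d",
                                         "max_primary_shard_size": "50gb"}},
            },
            "warm": {
                "min_age": "7d",
                "actions": {"allocate": {"require": {"data": "warm"}}},
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

body = json.dumps(ilm_policy)
# Apply with e.g.:
#   curl -X PUT http://elasticsearch:9200/_ilm/policy/jaeger-ilm-policy \
#     -H 'Content-Type: application/json' -d "$body"
```

Reference the resulting policy name from the Jaeger Helm values so new daily indices pick it up automatically.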
Span attributes follow OpenTelemetry semantic conventions
Use `http.method`, `http.route`, `db.system`, `db.statement`, `messaging.system` — not custom names. Semantic convention attributes are automatically understood by Jaeger's service graph, Grafana's spanmetrics, and future OTEL-aware tooling. Inconsistent naming makes cross-service analysis impossible.
Service names are stable and environment-scoped
A service named `order-service` in production and `orders` in staging produces separate Jaeger service graphs, breaking cross-environment comparison. Use consistent naming via the `SERVICE_NAME` environment variable and inject `deployment.environment` as a resource attribute to filter in Jaeger's UI.
spanmetrics connector generates RED metrics for all services
Traces alone cannot power Prometheus alerts — alerting requires metrics. The spanmetrics connector in the OTEL Collector auto-generates rate, error, and duration histograms from span data. This is more reliable than maintaining separate metric instrumentation and ensures metrics and traces tell the same story.
Jaeger Collector is horizontally scaled with anti-affinity rules
The Jaeger Collector is stateless — scale it horizontally behind a load balancer. Set Kubernetes pod anti-affinity to spread replicas across nodes. Memory-limit the Collector pods (it buffers spans in memory before flushing to storage) and configure HPA to scale on CPU. Monitor the `jaeger_collector_queue_length` metric for back-pressure.
Work with us
Instrumenting microservices for distributed tracing or building a production observability stack with Jaeger?
We design and implement distributed tracing pipelines — from OpenTelemetry SDK instrumentation and context propagation to tail-based sampling configuration, Jaeger Helm deployments on Kubernetes, spanmetrics-driven SLO alerting, and Grafana trace-to-log-to-metric correlation. Let’s talk.
Get in touch