Grafana, Loki, Alertmanager, Observability, Prometheus, DevOps

Grafana + Loki + Alertmanager — Complete Observability Stack Without Elasticsearch

A practical guide to building a production observability stack with Grafana, Loki, and Alertmanager: Loki’s label-based log indexing, Promtail scraping pipelines, LogQL log and metric queries, Ruler alert rules, Alertmanager routing trees and inhibition, S3-backed object storage, cardinality management, and a full Docker Compose deployment.

2026-05-03

Why Replace Elasticsearch for Logs?

The Elastic Stack has been the de facto standard for log aggregation for over a decade. It works, it scales, and the query language is powerful. But running Elasticsearch in production has a cost that goes beyond the license bill: heap tuning, shard management, replica overhead, JVM garbage collection pauses, and the operational burden of keeping clusters healthy at scale. For teams that already run Prometheus and Grafana, adding a full Elasticsearch cluster for logs creates a second observability stack with its own operational surface area.

Grafana Loki was designed as an answer to that problem. It is a horizontally scalable, highly available log aggregation system inspired by Prometheus. Where Elasticsearch indexes every field in every log line, Loki indexes only metadata labels — the same label model Prometheus uses for metrics. Log content is compressed and stored as chunks in object storage (S3, GCS, ADLS). The result is dramatically lower storage and operational cost at the expense of full-text search flexibility.

The PLG stack — Prometheus, Loki, Grafana — gives you metrics, logs, and alerting in a single unified Grafana interface, with a shared label vocabulary and correlated query experience. This article covers everything needed to deploy and operate this stack in production: Loki architecture, Promtail scraping, LogQL queries, Ruler-based alerting, Alertmanager routing, and the production patterns that prevent common failure modes.

Loki Architecture: Labels, Streams, and Chunks

Understanding Loki's architecture is the prerequisite for operating it correctly. Loki's fundamental data model is the log stream: a sequence of log lines sharing an identical set of label key-value pairs. Labels like {app="api", env="prod", pod="api-7d4c9b-xkpz2"} uniquely identify a stream. All logs matching those labels are stored in the same compressed chunk. The index tracks only which chunk files correspond to which label sets — not the log content itself.

Loki runs as a set of microservices that can be deployed as a single binary (monolithic mode for development) or as independent components (microservices mode for production scale):

Distributor

Receives log streams from agents (Promtail, Fluent Bit, OpenTelemetry Collector), validates labels, applies rate limits, and fans out to ingesters via consistent hashing. The distributor is stateless and horizontally scalable.

Ingester

Holds log chunks in memory, orders them by timestamp, and periodically flushes compressed chunks to object storage. Ingesters are stateful — they own specific streams via the consistent hash ring. WAL (write-ahead log) on local disk protects against data loss on crash before flush.

Querier

Handles LogQL queries. It fetches data from both in-memory ingesters (recent data) and object storage (historical data), merges and deduplicates, and returns results. The query frontend sits in front of queriers to handle query splitting, caching, and retries.

Compactor

Merges index tables and applies retention policies. The compactor deduplicates index entries created by multiple ingesters and runs retention deletion on chunk data in object storage.
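
If you want to see exactly what the distributor accepts, the push path is a plain HTTP endpoint. The sketch below pushes a single log line with curl against the default listen address; the stream labels and message are illustrative, and timestamps are Unix nanoseconds.

# Push one log line to Loki's HTTP API (labels and message are examples)
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d '{
    "streams": [
      {
        "stream": { "app": "api", "env": "prod" },
        "values": [
          [ "'$(date +%s%N)'", "{\"level\":\"info\",\"msg\":\"user login\"}" ]
        ]
      }
    ]
  }'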

# Loki storage layout on S3
s3://my-observability/loki/
├── index/
│   ├── loki_index_20208/          ← daily index table (days since epoch)
│   │   ├── ingester-1-1746000000  ← index file per ingester per period
│   │   └── ingester-2-1746000000
│   └── loki_index_20209/
│       ├── ingester-1-1746086400
│       └── ingester-2-1746086400
└── chunks/
    ├── fake/                      ← tenant ID ("fake" = single-tenant)
    │   ├── 01HW3K.../             ← chunk directory (first 10 chars of hash)
    │   │   └── <chunk-hash>       ← compressed chunk: snappy/gzip encoded log lines
    │   └── 01HW4M.../
    └── ...

# Each chunk file contains log lines for ONE stream (one unique label set)
# Chunk format: header + sorted, compressed log entries
# Typical compressed size: 10-50KB per chunk (many MB of raw logs)
# Chunk age: flushed after chunk_idle_period (5m in this config) or max_chunk_age (1h in this config)

Note

Loki's label cardinality limit is the most important operational constraint to understand. Each unique label combination creates a new stream. If you use high-cardinality values — request IDs, user IDs, trace IDs — as Loki labels, you create millions of streams, overwhelming the index and causing out-of-memory errors in ingesters. Labels should be low-cardinality dimensions: app, env, namespace, job. High-cardinality values belong in the log line content, queryable via LogQL's parsed filters.
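
A quick illustration of the difference, using the label vocabulary from this article (the trace_id value is a made-up example):

# BAD: trace_id as a stream label — one stream per request, the index explodes
{app="api", env="prod", trace_id="4bf92f3577b34da6"}

# GOOD: trace_id stays in the log line and is extracted at query time
{app="api", env="prod"} | json | trace_id="4bf92f3577b34da6"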

Full Stack: Docker Compose Deployment

The following Docker Compose configuration deploys a complete PLG observability stack: Prometheus for metrics, Loki for logs, Promtail for log scraping, Alertmanager for alert routing, and Grafana as the unified UI. This is suitable for development environments and small production deployments (single-node Loki with local filesystem storage).

# docker-compose.yml — complete PLG observability stack
version: "3.9"

networks:
  observability:
    driver: bridge

volumes:
  grafana_data:
  prometheus_data:
  loki_data:
  alertmanager_data:

services:

  # ── Loki ──────────────────────────────────────────────────────────────────
  loki:
    image: grafana/loki:3.1.0
    container_name: loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/loki.yaml
    volumes:
      - ./config/loki.yaml:/etc/loki/loki.yaml:ro
      - loki_data:/loki
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3100/ready"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ── Promtail ──────────────────────────────────────────────────────────────
  promtail:
    image: grafana/promtail:3.1.0
    container_name: promtail
    command: -config.file=/etc/promtail/promtail.yaml
    volumes:
      - ./config/promtail.yaml:/etc/promtail/promtail.yaml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - observability
    depends_on:
      loki:
        condition: service_healthy
    restart: unless-stopped

  # ── Prometheus ────────────────────────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
      - "--web.enable-remote-write-receiver"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./config/rules/:/etc/prometheus/rules/:ro
      - prometheus_data:/prometheus
    networks:
      - observability
    restart: unless-stopped

  # ── Alertmanager ──────────────────────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
      - "--web.external-url=http://alertmanager:9093"
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    networks:
      - observability
    restart: unless-stopped

  # ── Grafana ───────────────────────────────────────────────────────────────
  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
      GF_FEATURE_TOGGLES_ENABLE: correlations,lokiQuerySplitting
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./config/grafana/datasources:/etc/grafana/provisioning/datasources:ro
      - ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
    networks:
      - observability
    depends_on:
      loki:
        condition: service_healthy
    restart: unless-stopped
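
Assuming the configuration files referenced above exist under ./config/, bringing the stack up and checking each component takes a handful of commands. The readiness endpoints below are the standard ones for these images; exact response bodies may vary by version.

# Start the stack and verify each component is ready
docker compose up -d

curl -s http://localhost:3100/ready          # Loki
curl -s http://localhost:9090/-/ready        # Prometheus
curl -s http://localhost:9093/-/ready        # Alertmanager
curl -s http://localhost:3000/api/health     # Grafana (JSON health report)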

Loki Configuration: Storage, Limits, and Retention

Loki's configuration controls storage backends, ingestion rate limits, query timeouts, compaction, and retention. The following configuration uses local filesystem storage for simplicity — replace the filesystem backend with an S3 configuration for production deployments that survive node restarts.

# config/loki.yaml
auth_enabled: false   # set true for multi-tenant (requires X-Scope-OrgID header)

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

# Schema configuration: defines how data is stored over time
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb          # TSDB index (preferred over BoltDB in Loki 3.x)
      object_store: filesystem
      schema: v13          # current recommended schema version
      index:
        prefix: loki_index_
        period: 24h        # one index table per day

ingester:
  chunk_idle_period: 5m        # flush chunk if idle for 5 minutes
  chunk_retain_period: 30s     # retain chunk in memory after flush
  max_chunk_age: 1h            # always flush after 1 hour regardless
  chunk_target_size: 1572864   # target 1.5 MB per chunk
  wal:
    enabled: true
    dir: /loki/wal

# Global and per-tenant ingestion limits
limits_config:
  # Ingestion rate limits (per tenant in multi-tenant, global in single-tenant)
  ingestion_rate_mb: 32               # MB/s ingestion rate
  ingestion_burst_size_mb: 64         # burst allowance
  per_stream_rate_limit: 3MB          # max rate per individual stream
  per_stream_rate_limit_burst: 15MB

  # Cardinality limits — CRITICAL for production stability
  max_label_names_per_series: 30      # max number of label keys per stream
  max_label_value_length: 2048        # max label value length in bytes

  # Query limits
  max_query_series: 5000              # max streams returned per query
  max_entries_limit_per_query: 50000  # max log entries per query
  max_query_lookback: 30d             # how far back queries can reach
  query_timeout: 5m                   # per-query timeout

  # Retention — compactor deletes chunks older than this
  retention_period: 30d               # global retention: 30 days
  # Longer retention per tenant via per_tenant_override_config, or per stream via retention_stream

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true        # enable retention deletion
  retention_delete_delay: 2h     # wait before deleting (safety window)
  retention_delete_worker_count: 150
  delete_request_store: filesystem

querier:
  max_concurrent: 20
  engine:
    max_look_back_period: 30d    # look-back window for instant queries that omit a range

query_range:
  results_cache:
    cache:
      embedded_cache:             # fifocache is removed in Loki 3.x — use the embedded cache
        enabled: true
        max_size_mb: 1024         # 1 GB query result cache
        ttl: 10m

ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  alertmanager_url: http://alertmanager:9093
  enable_api: true

Note

For production deployments on Kubernetes, replace the filesystem object store with S3. Add a storage_config.aws block with your bucket name, region, and access credentials (or use an IAM role). The TSDB index should also be stored in S3 by setting object_store: s3 in the schema config. See the Loki storage documentation for the full S3 configuration block, including SSE-KMS encryption and endpoint overrides for MinIO.
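
A minimal sketch of that change, assuming a bucket named my-loki-chunks-prod in eu-central-1 and credentials supplied by an IAM role — adjust names to your environment and consult the Loki storage docs for the full option list:

# loki.yaml excerpt — swap the filesystem store for S3 (sketch)
storage_config:
  aws:
    region: eu-central-1
    bucketnames: my-loki-chunks-prod
    s3forcepathstyle: false        # set true for MinIO / other S3-compatible endpoints

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3             # index pages ship to the same bucket
      schema: v13
      index:
        prefix: loki_index_
        period: 24h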

Promtail: Scraping Logs from Files and Docker

Promtail is the official log agent for Loki, designed to discover log targets, apply label extraction pipelines, and ship log streams to a Loki push endpoint. It supports file tailing, Docker container log discovery via the Docker socket, Kubernetes pod log discovery via the API, syslog (UDP/TCP), and HTTP push. The following configuration handles both file-based logs and Docker container logs with automatic label extraction.

# config/promtail.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml    # read offsets; mount a volume here to survive container recreation

clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s                  # wait up to 1s before sending batch
    batchsize: 1048576             # send when batch reaches 1 MB
    timeout: 10s
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10

scrape_configs:

  # ── Docker container logs ────────────────────────────────────────────────
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: status
            values: ["running"]
    relabel_configs:
      # Map Docker metadata to Loki labels
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
      - source_labels: ["__meta_docker_container_log_stream"]
        target_label: logstream      # stdout or stderr
      - source_labels: ["__meta_docker_container_label_com_docker_compose_service"]
        target_label: service
      - source_labels: ["__meta_docker_container_label_com_docker_compose_project"]
        target_label: compose_project
      # Drop containers without a compose service label (unrelated containers)
      - source_labels: ["service"]
        regex: ".+"
        action: keep

    pipeline_stages:
      # Parse JSON log lines (for services that emit structured logs)
      - json:
          expressions:
            level: level
            msg: msg
            ts: ts
            trace_id: trace_id
      # Extract log level as a label (low cardinality — safe!)
      - labels:
          level:
      # Convert timestamp from service-emitted value to Loki's expected format
      - timestamp:
          source: ts
          format: RFC3339Nano
          fallback_formats:
            - RFC3339
            - UnixMs
      # Drop debug logs in production to reduce volume
      # (level was already extracted by the json stage above)
      - drop:
          source: level
          expression: "debug"
          drop_counter_reason: debug_filtered

  # ── System logs ──────────────────────────────────────────────────────────
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: system
          env: prod
          host: ${HOSTNAME}        # requires running promtail with -config.expand-env=true
          __path__: /var/log/{syslog,auth.log,kern.log}
    pipeline_stages:
      # Extract syslog severity as label
      - regex:
          expression: '^(?P<timestamp>\w+\s+\d+ \d+:\d+:\d+) (?P<host>\S+) (?P<process>\S+)\[(?P<pid>\d+)\]: (?P<message>.*)'
      - labels:
          host:
          process:
      - timestamp:
          source: timestamp
          format: "Jan _2 15:04:05"

  # ── Application log files ─────────────────────────────────────────────────
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: api
          env: prod
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      - multiline:
          # Treat lines not starting with timestamp as continuation of previous
          firstline: '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'
          max_wait_time: 3s
          max_lines: 128
      - json:
          expressions:
            level: level
            message: message
            trace_id: trace_id
            user_id: user_id
            status_code: status_code
      - labels:
          level:
          status_code:      # low-cardinality: 200, 400, 500, etc.
      - output:
          source: message   # use parsed "message" field as log line

LogQL: Querying Logs Like Metrics

LogQL is Loki's query language, inspired by PromQL. It has two expression types: log queries (return raw log lines) and metric queries (aggregate log data into time-series metrics). Log queries start with a log stream selector — a set of label matchers in curly braces — followed by optional filter and parsing stages. Metric queries wrap a log query in an aggregation function like rate(), count_over_time(), or sum by().

# ── Log Queries ──────────────────────────────────────────────────────────

# Stream selector: required. All logs from the api service in prod
{service="api", env="prod"}

# Filter expressions: chain with |
{service="api", env="prod"} |= "ERROR"                    # contains string
{service="api", env="prod"} != "health_check"              # does not contain
{service="api", env="prod"} |~ "timeout|connection reset"  # regex match
{service="api", env="prod"} !~ "GET /metrics"              # regex not match

# Parser + filter: parse JSON, then filter on extracted field
{service="api", env="prod"} | json | level="error" | status_code >= 500

# Pattern parser: extract fields from unstructured nginx logs
# Pattern uses backtick-delimited named capture groups (see Loki docs):
#   {service="nginx"} | pattern `<ip> - - [<ts>] "<method> <path>" <status> <bytes>`
# Equivalent using regexp:
{service="nginx", env="prod"}
  | regexp `(?P<ip>[\d.]+) - - \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d+) (?P<bytes>\d+)`
  | status >= 500

# Logfmt parser (for key=value log lines)
{job="system"} | logfmt | process="sshd" | msg =~ "Failed password"

# Label filter after parsing: complex conditions
{service="api"} | json
  | duration > 1s           # numeric comparison (parsed from "duration": "1.2s")
  | status_code != 200
  | user_id != ""           # only logs with a non-empty user_id


# ── Metric Queries ────────────────────────────────────────────────────────

# Log rate: lines per second over a 5-minute range
rate({service="api", env="prod"} [5m])

# Error rate per service over the last hour
sum by (service) (
  rate({env="prod"} | json | level="error" [1h])
)

# HTTP 5xx error rate
sum by (service) (
  rate({service=~"api|worker"} | json | status_code >= 500 [5m])
)
/
sum by (service) (
  rate({service=~"api|worker"} | json [5m])
)

# 99th percentile latency from parsed JSON logs
quantile_over_time(0.99,
  {service="api"} | json | unwrap duration_ms [5m]
) by (service)

# Unique users per service in the last 24h
# (LogQL has no count_distinct — group by the parsed user_id and count the resulting
#  series; expensive over long ranges, so use sparingly)
count by (service) (
  sum by (service, user_id) (
    count_over_time({service="api"} | json | user_id != "" [24h])
  )
)

# Top 10 slowest endpoints
topk(10,
  avg_over_time(
    {service="api"} | json | unwrap duration_ms [5m]
  ) by (path)
)

Note

LogQL metric queries over large time ranges are expensive because they must scan all matching chunks. Use the query_range cache in Loki config and split long-range queries automatically via the query frontend (split_queries_by_interval: 15m). For frequently queried aggregations — error rates, request counts — consider a Prometheus recording rule or a Loki-native metric extraction rule to pre-compute the time series rather than scanning logs on every dashboard load.
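
The Ruler supports recording rules in addition to alerts, so a hot aggregation can be evaluated once per minute and remote-written to Prometheus instead of being re-scanned on every dashboard refresh. The following is a hedged sketch: it assumes the ruler is given a WAL directory and a remote-write target (the Prometheus container in the Compose file above already enables --web.enable-remote-write-receiver), and the rule name and expression are illustrative.

# loki.yaml excerpt — the ruler needs a WAL and a remote-write client for recording rules
#   ruler:
#     wal:
#       dir: /loki/ruler-wal
#     remote_write:
#       enabled: true
#       client:
#         url: http://prometheus:9090/api/v1/write

# /loki/rules/fake/recording-rules.yaml — pre-compute the prod error rate once per minute
groups:
  - name: precomputed
    interval: 1m
    rules:
      - record: service:log_errors:rate5m
        expr: |
          sum by (service) (
            rate({env="prod"} | json | level="error" [5m])
          )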

Alertmanager: Routing, Grouping, and Inhibition

Prometheus Alertmanager handles alert routing from both Prometheus (metric alerts) and Loki Ruler (log-based alerts). It groups related alerts to reduce notification noise, routes alerts to the appropriate receivers based on label matchers, silences alerts during maintenance windows, and inhibits child alerts when a parent alert is already firing. Understanding Alertmanager's routing tree is essential to avoiding alert storms and missed pages.

# config/alertmanager.yml
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "alertmanager@example.com"
  smtp_auth_password: "${SMTP_PASSWORD}"   # injected via env or secret mount
  resolve_timeout: 5m                        # mark an alert resolved if no updates arrive within this window

# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# ── Routing tree ──────────────────────────────────────────────────────────
# Routes are evaluated top-to-bottom. First match wins (unless continue: true).
route:
  receiver: "default-slack"
  group_by: ["alertname", "env", "service"]
  group_wait: 30s         # wait before sending first notification (group more alerts)
  group_interval: 5m      # resend after group changes
  repeat_interval: 4h     # resend if alert stays active (prevent alert fatigue)

  routes:
    # Critical alerts: page the on-call engineer immediately
    - match:
        severity: critical
      receiver: pagerduty-critical
      group_wait: 10s         # page fast
      repeat_interval: 30m    # re-page every 30 minutes until resolved

    # Production errors: Slack #alerts-prod channel
    - match:
        env: prod
        severity: error
      receiver: slack-prod-errors
      group_interval: 2m
      repeat_interval: 2h

    # Staging: no pages, just Slack
    - match:
        env: staging
      receiver: slack-staging
      repeat_interval: 24h

    # Watchdog is an always-firing heartbeat alert — discard it (or route it to a dead man's switch)
    - match:
        alertname: "Watchdog"
      receiver: "devnull"

# ── Receivers ─────────────────────────────────────────────────────────────
receivers:
  - name: "devnull"

  - name: "default-slack"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#alerts-dev"
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

  - name: "slack-prod-errors"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL_PROD}"
        channel: "#alerts-prod"
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Service:* {{ .Labels.service }}
          *Env:* {{ .Labels.env }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}

  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          service: '{{ .CommonLabels.service }}'
          env: '{{ .CommonLabels.env }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

# ── Inhibition rules ──────────────────────────────────────────────────────
# Suppress child alerts when parent is already firing
inhibit_rules:
  # If a node is down, inhibit all service alerts on that node
  - source_match:
      alertname: NodeDown
      severity: critical
    target_match:
      severity: error
    equal: ["host"]

  # If the entire cluster is unreachable, inhibit individual service alerts
  - source_match:
      alertname: ClusterUnreachable
    target_match_re:
      alertname: .+
    equal: ["env"]

Ruler: Log-Based Alerting with LogQL

Loki's Ruler component evaluates LogQL metric queries on a schedule and fires alerts to Alertmanager, exactly like Prometheus recording and alert rules — but sourced from log data. This allows you to alert on log patterns: error rates derived from log parsing, the absence of expected log lines (heartbeat monitoring), and SLO violations computed from log-extracted latency values.

# config/loki/rules/prod/api-rules.yaml
# Place in Loki's rules directory: /loki/rules/fake/ (tenant "fake" for single-tenant)
groups:
  - name: api_errors
    interval: 1m      # evaluate every minute
    rules:

      # Alert when error rate exceeds 5% over 5 minutes
      - alert: HighAPIErrorRate
        expr: |
          (
            sum by (service) (
              rate({env="prod"} | json | level="error" [5m])
            )
            /
            sum by (service) (
              rate({env="prod"} | json [5m])
            )
          ) > 0.05
        for: 2m       # must be true for 2 minutes before firing
        labels:
          severity: error
          env: prod
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }} in prod"
          runbook_url: "https://runbooks.internal/api-errors"

      # Alert when error rate exceeds 20% — critical, page immediately
      - alert: CriticalAPIErrorRate
        expr: |
          (
            sum by (service) (
              rate({env="prod"} | json | level="error" [5m])
            )
            /
            sum by (service) (
              rate({env="prod"} | json [5m])
            )
          ) > 0.20
        for: 1m
        labels:
          severity: critical
          env: prod
        annotations:
          summary: "Critical error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} — exceeds 20% critical threshold"
          runbook_url: "https://runbooks.internal/api-errors-critical"

      # Alert on high p99 latency extracted from log lines
      - alert: APIHighLatency
        expr: |
          quantile_over_time(0.99,
            {service="api", env="prod"} | json | unwrap duration_ms [5m]
          ) by (service) > 2000
        for: 5m
        labels:
          severity: error
          env: prod
        annotations:
          summary: "p99 latency above 2s on {{ $labels.service }}"
          description: "p99 latency is {{ $value | humanizeDuration }}ms"

  - name: heartbeat
    interval: 2m
    rules:
      # Alert if no logs received in last 5 minutes (dead service / log pipeline broken)
      - alert: ServiceLogsMissing
        expr: |
          absent_over_time(
            {service="api", env="prod"}[5m]
          )
        for: 0m       # fire immediately — absence of logs is urgent
        labels:
          severity: error
          env: prod
        annotations:
          summary: "No logs from api service in prod for 5 minutes"
          description: "Either the service is down or the log pipeline (Promtail) has failed"
          runbook_url: "https://runbooks.internal/missing-logs"

  - name: security
    interval: 1m
    rules:
      # Alert on repeated SSH authentication failures (brute force detection)
      - alert: SSHBruteForce
        expr: |
          sum by (host) (
            count_over_time(
              {job="system"}
              | logfmt
              | msg =~ "Failed password"
              [5m]
            )
          ) > 20
        for: 0m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "Possible SSH brute force on {{ $labels.host }}"
          description: "More than 20 failed SSH attempts in 5 minutes"

---
# config/prometheus/rules/infra-alerts.yaml
# Prometheus metric-based alerts (complement to Loki log-based alerts)
groups:
  - name: infrastructure
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.90
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
          description: "Current usage: {{ $value | humanizePercentage }}"

      - alert: LokiIngesterHighMemory
        expr: |
          container_memory_working_set_bytes{container="loki"} /
          container_spec_memory_limit_bytes{container="loki"} > 0.85
        for: 3m
        labels:
          severity: error
          service: loki
        annotations:
          summary: "Loki ingester memory pressure — cardinality issue likely"
          description: "Check for high-cardinality label values being pushed to Loki"

Grafana Provisioning: Datasources and Dashboards as Code

Grafana supports provisioning datasources and dashboards from YAML configuration files, enabling GitOps management of your observability UI. The following provisioning configuration connects Grafana to both Loki and Prometheus, and enables correlations — the feature that lets you jump from a metric spike in Prometheus directly to the correlated log lines in Loki using shared label values.

# config/grafana/datasources/datasources.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    uid: prometheus-uid
    isDefault: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo-uid    # link to Tempo if you have distributed tracing

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    uid: loki-uid
    jsonData:
      maxLines: 1000
      timeout: 300
      # Derived fields: extract trace_id from logs and link to Tempo
      derivedFields:
        - matcherRegex: '"trace_id":"([^"]+)"'
          name: TraceID
          url: '${__value.raw}'
          datasourceUid: tempo-uid

  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    uid: alertmanager-uid
    jsonData:
      implementation: prometheus

# config/grafana/dashboards/dashboards.yaml
apiVersion: 1

providers:
  - name: default
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true

Cardinality Management: The Most Common Production Problem

Cardinality is the number of unique label combinations (streams) in Loki. Every unique stream consumes memory in the ingester for its active chunk. High cardinality is the root cause of most Loki production incidents: ingesters running out of memory, slow query planning, and disk I/O spikes from too many tiny chunk files.

The most common cardinality mistakes are using Kubernetes pod names, request trace IDs, or user IDs as Loki labels. A 1000-pod deployment creates 1000 streams just from the pod name label. Use Loki's cardinality analysis tools and the Grafana Loki dashboard to identify problematic labels before they become incidents.

# Cardinality analysis via Loki API

# Count active streams matching a selector (series API)
curl -G 'http://loki:3100/loki/api/v1/series' \
  --data-urlencode 'match[]={env="prod"}' \
  --data-urlencode 'start=1746000000000000000' \
  --data-urlencode 'end=1746086400000000000' \
  | jq '.data | length'   # total stream count

# Find the streams with the highest ingestion rates via the query API
curl -G 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query=topk(20, sum by (pod) (rate({env="prod"}[5m])))' \
  --data-urlencode 'step=60' \
  --data-urlencode 'start=1746000000' \
  --data-urlencode 'end=1746086400' \
  | jq '.data.result[] | {pod: .metric.pod, rate: .values[-1][1]}'

# Promtail label relabeling to reduce cardinality
# In promtail.yaml scrape_configs, use relabel_configs to DROP high-cardinality labels

scrape_configs:
  - job_name: kubernetes_pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # KEEP only low-cardinality labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_label_version]
        target_label: version
      # DROP pod name — use app+namespace instead
      # (pod name is high-cardinality: different for every restart)
      - action: labeldrop
        regex: pod

      # DROP any label matching these patterns (Helm-generated labels)
      - action: labeldrop
        regex: (helm_chart|chart|heritage|managed_by)

# Loki limits: enforce cardinality limits per tenant in limits_config:
limits_config:
  max_streams_per_user: 10000       # alert if approaching this
  max_streams_matchers_per_query: 1000
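
If you have logcli installed (Loki's query CLI, shipped separately), its series command can summarize label cardinality directly, which is quicker than counting streams with curl. The address and selector below are examples.

# logcli: per-label cardinality summary for streams matching a selector
export LOKI_ADDR=http://loki:3100
logcli series '{env="prod"}' --analyze-labels
# Prints the total stream count plus unique value counts per label —
# any label with hundreds of values is a cardinality problem in the making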

Production Hardening: S3 Storage and Helm Deployment

For production Kubernetes deployments, the Grafana Loki Helm chart is the recommended deployment path. The following values configure Loki in microservices mode with S3 object storage, appropriate resource requests, and production-ready compaction.

# values-loki-prod.yaml for grafana/loki Helm chart (Loki 3.x)
deploymentMode: Distributed    # separate ingester, querier, distributor pods

loki:
  auth_enabled: false           # set true for multi-tenant
  commonConfig:
    replication_factor: 3       # quorum writes (2 of 3) tolerate one ingester pod failure
  storage:
    type: s3
    s3:
      region: eu-central-1
      bucketnames: my-loki-chunks-prod
      # Use IRSA (IAM Role for Service Account) instead of static credentials
      # Annotate the ServiceAccount below with the role ARN
  structuredConfig:
    limits_config:
      ingestion_rate_mb: 64
      ingestion_burst_size_mb: 128
      per_stream_rate_limit: 8MB
      max_streams_per_user: 10000
      retention_period: 30d
    compactor:
      retention_enabled: true
      compaction_interval: 10m
    schema_config:
      configs:
        - from: "2024-01-01"
          store: tsdb
          object_store: s3
          schema: v13
          index:
            prefix: loki_index_
            period: 24h

# Ingester: stateful, needs PVC for WAL
ingester:
  replicas: 3
  persistence:
    enabled: true
    storageClass: gp3
    size: 10Gi
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: ingester
          topologyKey: kubernetes.io/hostname

querier:
  replicas: 2
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      memory: 1Gi

distributor:
  replicas: 2
  resources:
    requests:
      cpu: 250m
      memory: 256Mi

compactor:
  replicas: 1
  persistence:
    enabled: true
    storageClass: gp3
    size: 10Gi

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/loki-s3-role

# Promtail is not part of the grafana/loki chart — deploy it separately as a DaemonSet
# via the grafana/promtail chart, pointing its client at the Loki gateway Service:
#
#   config:
#     clients:
#       - url: http://<release-name>-loki-gateway/loki/api/v1/push
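
Deploying with these values is then a standard Helm install; the release name and namespace below are examples:

# Install or upgrade Loki with the production values
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki \
  --namespace observability --create-namespace \
  -f values-loki-prod.yaml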

Note

Enable Loki's write-ahead log (WAL) on all ingester pods. Without WAL, a pod crash before chunk flush loses whatever unflushed chunks are still in memory — up to max_chunk_age (one hour in the configuration above) of log data per stream. With WAL enabled and a persistent volume mounted at /loki/wal, ingesters replay the WAL on restart and recover in-flight chunks. The performance overhead is small — typically 5–10% extra disk I/O per ingester.

ELK vs PLG: When to Choose Which

Loki is not a drop-in replacement for Elasticsearch in all scenarios. The architectural differences translate into real capability differences that should inform your choice. Understanding when each stack excels prevents re-platforming projects caused by choosing the wrong tool for the use case.

Choose Loki + Grafana when...

You already run Prometheus and Grafana for metrics. Your primary log use case is operational troubleshooting — navigating from a metric alert to the correlated log context. You have structured (JSON) or semi-structured logs that can be parsed with LogQL. Storage cost is a priority: Loki stores 10–30x more logs per GB than Elasticsearch because it compresses raw content rather than indexing fields. Your team prefers a single unified UI for metrics, logs, and traces (Grafana with Tempo).

Choose Elasticsearch + Kibana when...

You need full-text search over log content without knowing the exact field names in advance. Your compliance or security team requires fast arbitrary field queries across months of data — SIEM use cases where analysts search across hundreds of fields in unstructured log data. You need Kibana's visual aggregation builder for non-engineers. You have existing Logstash pipelines and enrichment logic that would be costly to rewrite. Your log volume is moderate (under 50 GB/day) and the operational cost of Elasticsearch is acceptable.

Running both stacks

Large organizations often run both: Loki for operational/DevOps log access (cheap, fast for structured queries, integrated with Grafana), and Elasticsearch for security/compliance log retention (full-text search, SIEM integrations, long-term audit). Route application and infrastructure logs to Loki via Promtail; route security events and audit trails to Elasticsearch via Filebeat or Logstash. The two stacks serve different audiences and shouldn't need to compete.

Building or migrating your observability stack to the PLG (Prometheus, Loki, Grafana) model?

We design and implement production observability stacks — from Loki deployment and Promtail pipeline configuration to LogQL-based alerting, Alertmanager routing trees, and Grafana dashboard provisioning. Let’s talk.

Get in Touch
