Why Most Teams Never Validate Snapshots Until It's Too Late
Elasticsearch snapshots are the primary disaster recovery mechanism for most clusters. They protect against hardware failure, accidental index deletion, configuration mistakes, and cluster-wide corruption. Yet in our experience the large majority of teams running Elasticsearch in production have never performed a full restore test — and many never check whether their snapshots are actually valid until the moment they need them.
The problem is operational inertia. Snapshot repositories are configured once during cluster setup, SLM policies are enabled, and the team moves on — trusting that because snapshot creation did not throw an error, the data is safe. In practice, repositories become unreachable (S3 bucket policy changes, GCS credential rotation, NFS mount failures), SLM policies silently stop executing (after cluster restarts, permission changes, or version upgrades), and snapshot metadata becomes corrupt. None of these failures produce an alert unless you actively check for them.
es-snapshot-check is a zero-dependency bash CLI that performs four checks on every run: repository accessibility, repository storage verification, SLM policy health, and snapshot freshness. It produces Nagios-compatible exit codes (0/1/2/3), supports JSON output for log aggregation, and emits Prometheus textfile metrics for continuous alerting — all with no external dependencies beyond curl and a bash 4+ shell.
| Approach | Checks | Alert integration | Dependencies |
|---|---|---|---|
| Manual curl | Ad-hoc only | None | curl, human |
| Kibana Stack Monitoring | Status only, no age check | Kibana alerting only | Kibana, Watcher license |
| Custom Python script | Whatever you code | Custom | Python, elasticsearch-py |
| es-snapshot-check | Repo + verify + SLM + freshness | Nagios, Prometheus, JSON | curl + bash only |
The Four Checks That Matter
Before diving into the implementation, it helps to understand what each check catches and why it cannot be safely skipped. Many monitoring setups perform only one or two of these checks, leaving blind spots that only appear at the worst possible moment.
1. Repository accessibility
Calls GET _cat/repositories?format=json and verifies that at least one repository is registered. An empty repository list means either no backup has been configured, or repositories were accidentally deregistered — neither is recoverable without operator intervention. This check is instantaneous and should never fail in a healthy cluster.
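For a quick manual spot check, the same call can be run with curl directly; the host and repository names in this and the following sketches are placeholders to adapt.
# Manual equivalent of check 1: healthy output is a non-empty JSON array
curl -s 'http://localhost:9200/_cat/repositories?format=json'
# [{"id":"logs-s3","type":"s3"},{"id":"archive-gcs","type":"gcs"}]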
2. Repository storage verification
Calls POST _snapshot/{repo}/_verify to confirm that every data node can reach the underlying storage. A repository can be registered but unreachable if the S3 endpoint changed, a GCS service account was rotated, or an NFS mount dropped. The verify API creates a temporary test file in the repository and deletes it — if any node fails the round-trip, the API returns an error. Without this check, a snapshot can be accepted by the master node while shard uploads from the data nodes quietly fail.
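The manual equivalent looks like this; a healthy response lists every node that completed the storage round-trip (node IDs and names below are illustrative).
# Manual equivalent of check 2: every data node should appear under "nodes"
curl -s -X POST 'http://localhost:9200/_snapshot/logs-s3/_verify'
# {"nodes":{"fY3abc...":{"name":"es-data-1"},"kQ9def...":{"name":"es-data-2"}}}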
3. SLM policy health
Calls GET _slm/policy and checks each policy's last successful execution time; the implementation below uses a simplified rule and flags any policy whose last success is older than twice the critical age threshold. SLM policies stop executing silently after certain cluster operations — particularly during rolling restarts, version upgrades, and security plugin reconfiguration. A policy that last succeeded three days ago on an hourly schedule is a broken backup, even if the policy itself shows as active.
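To eyeball a single policy by hand (the policy name is an example, and the exact response fields vary across Elasticsearch versions), fetch it directly and look at the last_success block.
# Manual equivalent of check 3
curl -s 'http://localhost:9200/_slm/policy/hourly-logs-snapshot?human'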
4. Snapshot freshness
Calls GET _snapshot/{repo}/_all?size=10&order=desc (the ten most recent snapshots) and verifies that at least one has state SUCCESS within the configured warning window (default 25h) and critical window (default 49h). This catches all the cases that fall through the previous checks: snapshots that start but fail partway through, snapshots that create a metadata entry but fail to write shard data, and missed schedule windows caused by cluster overload.
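The manual equivalent, for versions that support these list parameters (sort defaults to start_time, so order=desc returns newest first):
# Manual equivalent of check 4: scan the newest entries for "state":"SUCCESS"
curl -s 'http://localhost:9200/_snapshot/logs-s3/_all?size=10&order=desc'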
Full Implementation
The complete script is about 160 lines of bash. It uses only curl for HTTP calls and standard POSIX utilities for JSON parsing. No Python, no Go toolchain, no package manager — copy the file to any host with curl and it works. The script is structured in four phases: argument parsing, auth configuration, the four health checks, and output formatting.
#!/usr/bin/env bash
# es-snapshot-check — validate Elasticsearch snapshot health
# https://github.com/datasops/es-snapshot-check
#
# Exit codes (Nagios-compatible):
# 0 = OK — all checks passed, recent SUCCESS snapshot exists
# 1 = WARNING — snapshot older than --warn-age hours
# 2 = CRITICAL — snapshot older than --crit-age hours, verify failed, or SLM broken
# 3 = UNKNOWN — cannot reach cluster or parse response
#
# Usage:
# es-snapshot-check [--host URL] [--repo NAME] [--warn-age H] [--crit-age H]
# [--slm-policy NAME] [--format text|json|prometheus]
# [--user USER:PASS] [--api-key KEY] [--ca-cert FILE]
# [--skip-verify] [--skip-slm] [--timeout SEC]
set -euo pipefail
# ── Defaults ───────────────────────────────────────────────────────────────────
ES_HOST="${ES_HOST:-http://localhost:9200}"
REPO=""
WARN_AGE_HOURS=25
CRIT_AGE_HOURS=49
SLM_POLICY=""
FORMAT="text"
AUTH_HEADER=""
CA_CERT=""
SKIP_VERIFY=0
SKIP_SLM=0
TIMEOUT=15
# ── Argument parsing ───────────────────────────────────────────────────────────
while [[ $# -gt 0 ]]; do
case "$1" in
--host) ES_HOST="$2"; shift 2 ;;
--repo) REPO="$2"; shift 2 ;;
--warn-age) WARN_AGE_HOURS="$2"; shift 2 ;;
--crit-age) CRIT_AGE_HOURS="$2"; shift 2 ;;
--slm-policy) SLM_POLICY="$2"; shift 2 ;;
--format) FORMAT="$2"; shift 2 ;;
# auth/TLS options live in arrays so values with spaces survive quoting
--user) AUTH_ARGS=(-u "$2"); shift 2 ;;
--api-key) AUTH_ARGS=(-H "Authorization: ApiKey $2"); shift 2 ;;
--ca-cert) CA_ARGS=(--cacert "$2"); shift 2 ;;
--skip-verify) SKIP_VERIFY=1; shift ;;
--skip-slm) SKIP_SLM=1; shift ;;
--timeout) TIMEOUT="$2"; shift 2 ;;
-h|--help)
sed -n '2,15p' "$0"; exit 0 ;;
*)
echo "Unknown option: $1" >&2; exit 3 ;;
esac
done
# ── Helper: curl wrapper ───────────────────────────────────────────────────────
es_get() {
local path="$1"
# empty arrays expand to nothing; the ${arr[@]+...} form keeps set -u happy
curl -sf --max-time "$TIMEOUT" \
${AUTH_ARGS[@]+"${AUTH_ARGS[@]}"} ${CA_ARGS[@]+"${CA_ARGS[@]}"} \
"${ES_HOST}${path}" 2>/dev/null
}
# ── State tracking ─────────────────────────────────────────────────────────────
STATUS=0 # 0=ok 1=warn 2=crit 3=unknown
MESSAGES=()
METRICS=()
NOW_EPOCH=$(date +%s)
raise() {
local level="$1" msg="$2"
MESSAGES+=("$msg")
# use an if, not a bare `&&`: a false final command would trip set -e
if [[ "$level" -gt "$STATUS" ]]; then
STATUS="$level"
fi
}
# ── Check 1: Repository list ───────────────────────────────────────────────────
REPOS_JSON=$(es_get "/_cat/repositories?format=json") || {
# nothing else is checkable if the cluster is unreachable; bail out as UNKNOWN
echo "UNKNOWN: Cannot reach Elasticsearch at ${ES_HOST}" >&2
exit 3
}
if [[ "$REPOS_JSON" == "[]" ]]; then
raise 2 "CRITICAL: No snapshot repositories registered"
fi
# If no --repo specified, use first registered repository
if [[ -z "$REPO" ]]; then
REPO=$(echo "$REPOS_JSON" | grep -o '"id":"[^"]*"' | head -1 | cut -d'"' -f4)
fi
if [[ -z "$REPO" ]]; then
raise 3 "UNKNOWN: Could not determine repository name"
fi
METRICS+=("es_snapshot_repository_count $(echo "$REPOS_JSON" | grep -o '"id"' | wc -l | tr -d ' ')")
# ── Check 2: Repository verification ──────────────────────────────────────────
if [[ "$SKIP_VERIFY" -eq 0 ]]; then
VERIFY_RESP=$(curl -sf --max-time "$TIMEOUT" $AUTH_HEADER $CA_CERT \
-X POST "${ES_HOST}/_snapshot/${REPO}/_verify" 2>/dev/null) || VERIFY_RESP=""
if [[ -z "$VERIFY_RESP" ]]; then
raise 2 "CRITICAL: Repository '${REPO}' verify call failed — storage may be unreachable"
METRICS+=("es_snapshot_repository_verified{repo="${REPO}"} 0")
elif echo "$VERIFY_RESP" | grep -q '"error"'; then
ERR_TYPE=$(echo "$VERIFY_RESP" | grep -o '"type":"[^"]*"' | head -1 | cut -d'"' -f4)
raise 2 "CRITICAL: Repository '${REPO}' verify error: ${ERR_TYPE}"
METRICS+=("es_snapshot_repository_verified{repo="${REPO}"} 0")
else
METRICS+=("es_snapshot_repository_verified{repo="${REPO}"} 1")
fi
fi
# ── Check 3: SLM policy health ─────────────────────────────────────────────────
if [[ "$SKIP_SLM" -eq 0 ]]; then
SLM_JSON=$(es_get "/_slm/policy${SLM_POLICY:+/${SLM_POLICY}}") || SLM_JSON="{}"
# Parse last_success_time for each policy and compare to policy schedule
# Simplified: flag any policy whose last_success is > 2 * crit-age hours ago
CRIT_EPOCH=$(( NOW_EPOCH - CRIT_AGE_HOURS * 2 * 3600 ))
while IFS= read -r policy_name; do
last_success=$(echo "$SLM_JSON" | grep -A5 "\"name\":\"${policy_name}\"" \
| grep '"last_success_date_millis"' | grep -o '[0-9]*' | head -1)
if [[ -z "$last_success" ]]; then
raise 1 "WARNING: SLM policy '${policy_name}' has no recorded successful execution"
METRICS+=("es_slm_policy_last_success_age_seconds{policy="${policy_name}"} -1")
continue
fi
last_success_epoch=$(( last_success / 1000 ))
age_seconds=$(( NOW_EPOCH - last_success_epoch ))
METRICS+=("es_slm_policy_last_success_age_seconds{policy="${policy_name}"} ${age_seconds}")
if [[ "$last_success_epoch" -lt "$CRIT_EPOCH" ]]; then
raise 2 "CRITICAL: SLM policy '${policy_name}' last succeeded $(( age_seconds / 3600 ))h ago"
fi
done < <(echo "$SLM_JSON" | grep -o '"name":"[^"]*"' | cut -d'"' -f4 | sort -u)
fi
# ── Check 4: Snapshot freshness ────────────────────────────────────────────────
# Only filter by SLM policy when one was requested: slm_policy_filter=* would
# exclude snapshots taken outside SLM entirely
SNAP_JSON=$(es_get "/_snapshot/${REPO}/_all?size=10&order=desc${SLM_POLICY:+&slm_policy_filter=${SLM_POLICY}}") \
|| SNAP_JSON='{"snapshots":[]}'
# Find most recent SUCCESS snapshot
LATEST_SUCCESS_MS=""
SNAP_COUNT=0
while IFS= read -r line; do
if echo "$line" | grep -q '"state":"SUCCESS"'; then
SNAP_COUNT=$(( SNAP_COUNT + 1 ))
ts=$(echo "$line" | grep -o '"start_time_in_millis":[0-9]*' | grep -o '[0-9]*$')
if [[ -n "$ts" ]]; then
[[ -z "$LATEST_SUCCESS_MS" || "$ts" -gt "$LATEST_SUCCESS_MS" ]] \
&& LATEST_SUCCESS_MS="$ts"
fi
fi
done < <(echo "$SNAP_JSON" | tr '{' '\n')
METRICS+=("es_snapshot_success_count{repo="${REPO}"} ${SNAP_COUNT}")
if [[ -z "$LATEST_SUCCESS_MS" ]]; then
raise 2 "CRITICAL: No SUCCESS snapshots found in repository '${REPO}'"
METRICS+=("es_snapshot_latest_success_age_seconds{repo="${REPO}"} -1")
else
LATEST_SUCCESS_EPOCH=$(( LATEST_SUCCESS_MS / 1000 ))
AGE_SECONDS=$(( NOW_EPOCH - LATEST_SUCCESS_EPOCH ))
AGE_HOURS=$(( AGE_SECONDS / 3600 ))
METRICS+=("es_snapshot_latest_success_age_seconds{repo="${REPO}"} ${AGE_SECONDS}")
WARN_EPOCH=$(( NOW_EPOCH - WARN_AGE_HOURS * 3600 ))
CRIT_EPOCH=$(( NOW_EPOCH - CRIT_AGE_HOURS * 3600 ))
if [[ "$LATEST_SUCCESS_EPOCH" -lt "$CRIT_EPOCH" ]]; then
raise 2 "CRITICAL: Latest SUCCESS snapshot is ${AGE_HOURS}h old (threshold: ${CRIT_AGE_HOURS}h)"
elif [[ "$LATEST_SUCCESS_EPOCH" -lt "$WARN_EPOCH" ]]; then
raise 1 "WARNING: Latest SUCCESS snapshot is ${AGE_HOURS}h old (threshold: ${WARN_AGE_HOURS}h)"
else
MESSAGES+=("OK: Latest SUCCESS snapshot ${AGE_HOURS}h old, repo '${REPO}' verified")
fi
fi
METRICS+=("es_snapshot_check_status ${STATUS}")
# ── Output ─────────────────────────────────────────────────────────────────────
STATUS_LABEL=("OK" "WARNING" "CRITICAL" "UNKNOWN")
case "$FORMAT" in
json)
printf '{"status":%d,"label":"%s","messages":[' \
"$STATUS" "${STATUS_LABEL[$STATUS]}"
first=1
for msg in "${MESSAGES[@]:-}"; do
[[ -z "$msg" ]] && continue
[[ "$first" -eq 0 ]] && printf ','
printf '"%s"' "$msg"
first=0
done
printf '],"repo":"%s","timestamp":%d}\n' "$REPO" "$NOW_EPOCH"
;;
prometheus)
printf '# HELP es_snapshot_check_status 0=ok 1=warn 2=crit 3=unknown\n'
printf '# TYPE es_snapshot_check_status gauge\n'
for metric in "${METRICS[@]:-}"; do
[[ -z "$metric" ]] && continue
printf '%s\n' "$metric"
done
;;
*)
for msg in "${MESSAGES[@]:-}"; do
[[ -n "$msg" ]] && echo "$msg"
done
;;
esac
exit "$STATUS"Note
The _snapshot/{repo}/_verify API call is the most expensive operation in this script — it performs a round-trip to the underlying storage from every data node. On large clusters with remote storage, this can take 2–5 seconds. Set --timeout 30 when running verify checks in cron or monitoring scripts to avoid false UNKNOWN results from the default 15-second timeout.
Installation and Quick Start
Since the script has no external dependencies, installation is a single curl command. The recommended location is /usr/local/bin/ so it is available system-wide. The script requires bash 4.0+, which has been the default on major Linux distributions for well over a decade; note that macOS still ships bash 3.2 by default, so install a newer bash there first.
# Install from GitHub releases
curl -fsSL https://github.com/datasops/es-snapshot-check/releases/latest/download/es-snapshot-check \
-o /usr/local/bin/es-snapshot-check
chmod +x /usr/local/bin/es-snapshot-check
# Verify installation
es-snapshot-check --help
# Basic check — uses defaults: localhost:9200, first registered repo
es-snapshot-check
# Check a specific host with basic auth
es-snapshot-check \
--host https://es.internal:9200 \
--user elastic:changeme \
--repo my_s3_repo
# Check with API key auth (Elastic Cloud / ECE)
es-snapshot-check \
--host https://my-cluster.es.us-east-1.aws.elastic-cloud.com:9243 \
--api-key VnVhQ2ZHY0JDZGJrUFRmVFVp \
--repo found-snapshots \
--warn-age 26 \
--crit-age 50
# Check a specific SLM policy by name
es-snapshot-check \
--host https://es.prod.internal:9200 \
--repo logs-s3 \
--slm-policy hourly-logs-snapshot \
--warn-age 2 \
--crit-age 4
# JSON output for log aggregation (Splunk, Loki, CloudWatch Logs)
es-snapshot-check --format json | tee /var/log/es-snapshot-check.json
# Prometheus textfile format — write to node_exporter textfile directory
es-snapshot-check \
--format prometheus \
--repo logs-s3 > /var/lib/node_exporter/textfile_collector/es_snapshots.prom
# Skip verify (faster, use when verify latency is a concern)
es-snapshot-check --skip-verify --format json
# Skip SLM check (for clusters without SLM or Basic+ license)
es-snapshot-check --skip-slm
Exit Code Reference
Exit codes follow the Nagios plugin standard, making the script directly usable as a check plugin in Nagios, Icinga, or any monitoring system that understands the Nagios plugin interface (Zabbix can wrap it as an external check). The highest severity among all failed checks determines the final exit code.
| Exit code | Label | Meaning |
|---|---|---|
| 0 | OK | All checks passed, recent SUCCESS snapshot found |
| 1 | WARNING | Snapshot older than --warn-age, or SLM warning condition |
| 2 | CRITICAL | Snapshot too old, repo verify failed, SLM broken, or no snapshots |
| 3 | UNKNOWN | Cannot reach cluster, invalid arguments, or unparseable response |
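As an illustration, wiring the script into Nagios or Icinga takes one command and one service definition; the host and repository names here are placeholders.
# /etc/nagios/conf.d/es-snapshot-check.cfg (illustrative wiring)
define command {
    command_name    check_es_snapshots
    command_line    /usr/local/bin/es-snapshot-check --host '$ARG1$' --repo '$ARG2$' --warn-age 25 --crit-age 49
}
define service {
    use                     generic-service
    host_name               es-monitor-1
    service_description     Elasticsearch Snapshots
    check_command           check_es_snapshots!https://es.internal:9200!logs-s3
}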
Prometheus Textfile Integration
The --format prometheus mode produces Prometheus text exposition format output, suitable for use with node_exporter's textfile collector. Run the check via cron and write the output to the textfile collector directory. node_exporter picks up the metrics on every scrape cycle, making them available in Prometheus for alerting and dashboards without any custom exporter development.
# /etc/cron.d/es-snapshot-check
# Run every 15 minutes; write Prometheus metrics to the textfile collector directory.
# Note: crontab entries must be a single line (cron does not support backslash
# continuations). The tmp+mv pattern prevents node_exporter from reading a partial file.
# Store the API key in a root-readable /etc/es-snapshot-check.env and source it,
# or use --user for basic auth.
*/15 * * * * root . /etc/es-snapshot-check.env && /usr/local/bin/es-snapshot-check --host https://es.internal:9200 --api-key "$ES_API_KEY" --repo logs-s3 --format prometheus > /var/lib/node_exporter/textfile_collector/es_snapshots.prom.tmp && mv /var/lib/node_exporter/textfile_collector/es_snapshots.prom.tmp /var/lib/node_exporter/textfile_collector/es_snapshots.prom
The script emits six Prometheus metrics. The most important for alerting is es_snapshot_latest_success_age_seconds, which lets you express freshness thresholds directly in seconds against your RPO target.
# Metrics emitted by es-snapshot-check --format prometheus
# Current overall check status (0=ok 1=warn 2=crit 3=unknown)
# TYPE es_snapshot_check_status gauge
es_snapshot_check_status 0
# Number of registered snapshot repositories
# TYPE es_snapshot_repository_count gauge
es_snapshot_repository_count 2
# Whether repository verify succeeded (1=ok, 0=failed)
# TYPE es_snapshot_repository_verified gauge
es_snapshot_repository_verified{repo="logs-s3"} 1
# Number of SUCCESS snapshots in recent window
# TYPE es_snapshot_success_count gauge
es_snapshot_success_count{repo="logs-s3"} 8
# Age in seconds of most recent SUCCESS snapshot (-1 if none)
# TYPE es_snapshot_latest_success_age_seconds gauge
es_snapshot_latest_success_age_seconds{repo="logs-s3"} 3187
# SLM policy last success age in seconds (-1 if no recorded success)
# TYPE es_slm_policy_last_success_age_seconds gauge
es_slm_policy_last_success_age_seconds{policy="hourly-logs-snapshot"} 3187
Prometheus Alerting Rules
With these metrics in Prometheus, write alerting rules that page the on-call engineer when snapshots are too old and open a warning ticket when freshness is merely degraded. The two-level thresholds (warning at 25h, critical at 49h for daily snapshots) give a window to investigate and fix the issue without waking anyone at 2 AM on the first miss. One caveat: es_snapshot_latest_success_age_seconds is -1 when no SUCCESS snapshot exists at all, so the age rules below will not fire in that case; add a rule on es_snapshot_check_status >= 2 if you want a never-succeeded repository to page as well.
# prometheus/rules/elasticsearch-snapshots.yaml
# Load via rule_files in prometheus.yml, or wrap in a PrometheusRule CRD
# if you run the Prometheus operator
groups:
  - name: elasticsearch.snapshots
    rules:
      # Snapshot freshness — warn before critical
      - alert: ESSnapshotTooOld
        expr: |
          es_snapshot_latest_success_age_seconds{repo=~".+"} > (25 * 3600)
        for: 5m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Elasticsearch snapshot is stale"
          description: >
            Repository {{ $labels.repo }} last successful snapshot was
            {{ $value | humanizeDuration }} ago.
            Expected: at most 25h. Investigate SLM policy and cluster state.
          runbook_url: "https://wiki.internal/runbooks/es-snapshot-freshness"
      - alert: ESSnapshotCriticallyOld
        expr: |
          es_snapshot_latest_success_age_seconds{repo=~".+"} > (49 * 3600)
        for: 5m
        labels:
          severity: critical
          team: data-platform
          pagerduty: "true"
        annotations:
          summary: "Elasticsearch snapshot is critically stale — DR at risk"
          description: >
            Repository {{ $labels.repo }} last successful snapshot was
            {{ $value | humanizeDuration }} ago.
            RPO SLA breach if cluster fails now. Immediate investigation required.
      # Repository unreachable
      - alert: ESSnapshotRepositoryUnreachable
        expr: |
          es_snapshot_repository_verified == 0
        for: 2m
        labels:
          severity: critical
          team: data-platform
          pagerduty: "true"
        annotations:
          summary: "Elasticsearch snapshot repository storage is unreachable"
          description: >
            Repository {{ $labels.repo }} verify call failed.
            Data nodes cannot write to backup storage.
            Check S3/GCS credentials, bucket policies, and network connectivity.
      # SLM policy stalled
      - alert: ESSnapshotSLMPolicyStalled
        expr: |
          es_slm_policy_last_success_age_seconds > (72 * 3600)
        for: 5m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Elasticsearch SLM policy has not executed successfully"
          description: >
            SLM policy {{ $labels.policy }} last succeeded
            {{ $value | humanizeDuration }} ago.
            Check cluster state, ILM/SLM status, and Elasticsearch logs.
      # Check script itself is running
      - alert: ESSnapshotCheckNotRunning
        expr: |
          absent(es_snapshot_check_status) == 1
        for: 30m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "es-snapshot-check metrics are missing"
          description: >
            The es_snapshot_check_status metric has not been reported for 30 minutes.
            The cron job may have stopped or node_exporter textfile collector is broken.
Note
The absent(es_snapshot_check_status) == 1 rule fires when the metric disappears entirely, for example when the .prom file is deleted or node_exporter stops. Be aware that the textfile collector keeps serving the last file the cron job wrote, so a cron job that silently stops after a host reboot leaves stale metrics (and a frozen age value) in place; pair this rule with one that compares node_textfile_mtime_seconds against time() to catch a file that has stopped updating. Without these meta-alerts, you can lose all snapshot monitoring without any notification.
CI/CD Integration with GitHub Actions
Running es-snapshot-check as part of a deployment pipeline ensures that no deployment proceeds if backups are stale. This is especially useful for stateful deployments — Elasticsearch index mapping changes, data migrations, and cluster version upgrades — where a valid recent snapshot is a prerequisite for safe execution.
# .github/workflows/es-snapshot-gate.yml
# Runs before any deployment job that touches Elasticsearch
# Blocks the workflow if snapshots are stale or repository is unhealthy
name: Elasticsearch Snapshot Gate
on:
  workflow_call:
    inputs:
      es_host:
        description: "Elasticsearch host URL"
        required: true
        type: string
      es_repo:
        description: "Snapshot repository name"
        required: false
        type: string
        default: ""
      warn_age_hours:
        description: "Warning threshold in hours"
        required: false
        type: number
        default: 25
      crit_age_hours:
        description: "Critical threshold in hours"
        required: false
        type: number
        default: 49
jobs:
  snapshot-gate:
    name: Validate ES Snapshots
    runs-on: ubuntu-latest
    steps:
      - name: Install es-snapshot-check
        run: |
          curl -fsSL https://github.com/datasops/es-snapshot-check/releases/latest/download/es-snapshot-check \
            -o /usr/local/bin/es-snapshot-check
          chmod +x /usr/local/bin/es-snapshot-check
      - name: Run snapshot health check
        id: snapshot_check
        env:
          ES_API_KEY: ${{ secrets.ES_API_KEY }}
        run: |
          # The script exits nonzero on WARNING/CRITICAL; append `|| true` so the
          # step's default `bash -e` does not abort before we can parse the status
          OUTPUT=$(es-snapshot-check \
            --host "${{ inputs.es_host }}" \
            --api-key "$ES_API_KEY" \
            --repo "${{ inputs.es_repo }}" \
            --warn-age "${{ inputs.warn_age_hours }}" \
            --crit-age "${{ inputs.crit_age_hours }}" \
            --format json) || true
          echo "result=$OUTPUT" >> "$GITHUB_OUTPUT"
          STATUS=$(echo "$OUTPUT" | grep -o '"status":[0-9]' | cut -d: -f2)
          # Treat an unparseable status as UNKNOWN (3) and fail the gate
          if [[ "${STATUS:-3}" -ge 2 ]]; then
            echo "::error::Snapshot health check FAILED: $OUTPUT"
            exit 2
          elif [[ "$STATUS" -eq 1 ]]; then
            echo "::warning::Snapshot health WARNING: $OUTPUT"
            # Do not block for warning — proceed but leave an audit trail
          else
            echo "Snapshot health OK: $OUTPUT"
          fi
      - name: Annotate summary
        if: always()
        run: |
          # Single-quote the fence lines — double quotes would trigger
          # backtick command substitution in bash
          echo "## Elasticsearch Snapshot Gate" >> "$GITHUB_STEP_SUMMARY"
          echo '```json' >> "$GITHUB_STEP_SUMMARY"
          echo "${{ steps.snapshot_check.outputs.result }}" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
# Call this gate from your deployment workflow:
#
# jobs:
#   snapshot-gate:
#     uses: ./.github/workflows/es-snapshot-gate.yml
#     with:
#       es_host: https://es.staging.internal:9200
#       es_repo: staging-snapshots
#     secrets: inherit
#
#   deploy:
#     needs: snapshot-gate
#     ...
Snapshot validation is one part of a complete backup strategy. The checklist below covers the full lifecycle: from repository configuration to actual restore testing, which is the only true validation that your backups are usable.
Configure a separate repository per retention policy
Do not mix short-retention snapshots (hourly, kept 48h) and long-retention snapshots (daily, kept 30d) in the same repository. SLM makes this easy with multiple policies pointing to different repositories. Mixing retention classes in one repository makes the SLM retention logic fragile and complicates restore operations.
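A sketch of the two-policy layout; the policy names, schedules, repositories, and retention values are illustrative and should match your own RPO.
# Hourly snapshots, short retention, dedicated repository
curl -s -X PUT 'http://localhost:9200/_slm/policy/hourly-logs' \
-H 'Content-Type: application/json' -d '{
"schedule": "0 0 * * * ?",
"name": "<hourly-logs-{now/d}>",
"repository": "logs-hourly",
"config": { "indices": ["logs-*"], "include_global_state": false },
"retention": { "expire_after": "48h", "min_count": 4 }
}'
# Daily snapshots, long retention, separate repository
curl -s -X PUT 'http://localhost:9200/_slm/policy/daily-logs' \
-H 'Content-Type: application/json' -d '{
"schedule": "0 30 1 * * ?",
"name": "<daily-logs-{now/d}>",
"repository": "logs-daily",
"config": { "indices": ["logs-*"], "include_global_state": false },
"retention": { "expire_after": "30d", "min_count": 7 }
}'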
Verify after every credential rotation
S3 IAM key rotation, GCS service account updates, and Azure storage access key changes are the most common causes of silent snapshot repository failures. Run es-snapshot-check immediately after any credential rotation to confirm the repository is still accessible before the next scheduled snapshot window.
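A minimal guard to append to a rotation script; the host, repository, and env-file path are placeholders.
# Refuse to finish the rotation if the repository went dark
. /etc/es-snapshot-check.env
if ! /usr/local/bin/es-snapshot-check --host "$ES_HOST" --api-key "$ES_API_KEY" --repo logs-s3 --skip-slm; then
echo "Snapshot repository unreachable after credential rotation; investigate before the next snapshot window" >&2
exit 1
fi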
Test restores to a staging cluster monthly
A snapshot that cannot be restored is not a backup. Schedule a monthly restore test that mounts the most recent snapshot to a separate staging cluster (read-only using searchable snapshots) and runs a minimal query suite to confirm index integrity. Automate this and fail the test if any shard fails to recover.
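A restore-test sketch against a staging cluster; the staging URL, index pattern, and rename scheme are assumptions to adapt, and <snapshot-name> is a placeholder for the newest SUCCESS snapshot.
# Restore the snapshot's indices under a restored-* prefix
curl -s -X POST "https://es.staging.internal:9200/_snapshot/logs-s3/<snapshot-name>/_restore" \
-H 'Content-Type: application/json' -d '{
"indices": "logs-*",
"rename_pattern": "(.+)",
"rename_replacement": "restored-$1",
"include_global_state": false
}'
# Then poll recovery and fail the test on any shard that does not finish
curl -s "https://es.staging.internal:9200/_cat/recovery/restored-*?format=json"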
Monitor snapshot size trends
A snapshot that is abnormally small (compared to the 30-day rolling average) may indicate that a large index was accidentally deleted before the snapshot ran, or that the SLM policy is excluding indices it should capture. Track the size_in_bytes figures reported for the most recent ~30 snapshots and alert on >30% deviation from the mean.
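Sizes are exposed by the _status API rather than the snapshot list itself; a spot check might look like this (the field layout is the 7.x+ shape and may differ on older versions).
# Total size of one snapshot; <snapshot-name> is a placeholder
curl -s "$ES_HOST/_snapshot/logs-s3/<snapshot-name>/_status" \
| grep -o '"size_in_bytes":[0-9]*' | head -1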
Use index patterns in SLM, not wildcards
Avoid indices: ["*"] in SLM policies. Explicitly list index patterns (e.g., ["logs-*", "metrics-*", "app-data-*"]) to prevent system indices, internal Kibana indices, and temporary reindex targets from inflating snapshot size and duration. Use include_global_state: false unless you specifically need to snapshot cluster settings and templates.
Document the restore procedure in a runbook
A valid, fresh snapshot is useless if no one knows how to restore from it. Write a runbook that covers: listing available snapshots, restoring a single index vs. full cluster, handling rename patterns, verifying shard recovery, and re-enabling SLM after a full cluster restore (it is often accidentally left disabled). Keep this runbook in the repository next to the snapshot configuration and review it annually.
Running Elasticsearch in production and need confidence in your backup strategy?
We audit and harden Elasticsearch backup pipelines — from snapshot repository configuration and SLM policy design to freshness monitoring, restore testing, and disaster recovery runbooks. Let’s talk.
Get in Touch