Why Most Teams Never Validate Snapshots Until It's Too Late
Elasticsearch snapshots are the primary disaster recovery mechanism for most clusters. They protect against hardware failure, accidental index deletion, configuration mistakes, and cluster-wide corruption. Yet in our experience the large majority of teams running Elasticsearch in production have never performed a full restore test — and many never check whether their snapshots are actually valid until the moment they need them.
The problem is operational inertia. Snapshot repositories are configured once during cluster setup, SLM policies are enabled, and the team moves on — trusting that because snapshot creation did not throw an error, the data is safe. In practice, repositories become unreachable (S3 bucket policy changes, GCS credential rotation, NFS mount failures), SLM policies silently stop executing (after cluster restarts, permission changes, or version upgrades), and snapshot metadata becomes corrupt. None of these failures produce an alert unless you actively check for them.
es-snapshot-check is a zero-dependency bash CLI that performs four checks on every run: repository accessibility, repository storage verification, SLM policy health, and snapshot freshness. It produces Nagios-compatible exit codes (0/1/2/3), supports JSON output for log aggregation, and emits Prometheus textfile metrics for continuous alerting — all with no external dependencies beyond curl and a bash 4+ shell.
| Approach | Checks | Alert integration | Dependencies |
|---|---|---|---|
| Manual curl | Ad-hoc only | None | curl, human |
| Kibana Stack Monitoring | Status only, no age check | Kibana alerting only | Kibana, Watcher license |
| Custom Python script | Whatever you code | Custom | Python, elasticsearch-py |
| es-snapshot-check | Repo + verify + SLM + freshness | Nagios, Prometheus, JSON | curl + bash only |
The Four Checks That Matter
Before diving into the implementation, it helps to understand what each check catches and why it cannot be safely skipped. Many monitoring setups perform only one or two of these checks, leaving blind spots that only appear at the worst possible moment.
1. Repository accessibility
Calls GET _cat/repositories?format=json and verifies that at least one repository is registered. An empty repository list means either no backup has been configured, or repositories were accidentally deregistered — neither is recoverable without operator intervention. This check is instantaneous and should never fail in a healthy cluster.
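For a quick manual spot check, the same call can be run with curl directly; the host and repository names in this and the following sketches are placeholders to adapt.
# Manual equivalent of check 1: healthy output is a non-empty JSON array
curl -s 'http://localhost:9200/_cat/repositories?format=json'
# [{"id":"logs-s3","type":"s3"},{"id":"archive-gcs","type":"gcs"}]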
2. Repository storage verification
Calls POST _snapshot/{repo}/_verify to confirm that every data node can reach the underlying storage. A repository can be registered but unreachable if the S3 endpoint changed, a GCS service account was rotated, or an NFS mount dropped. The verify API creates a temporary test file in the repository and deletes it — if any node fails the round-trip, the API returns an error. Without this check, a snapshot can be accepted by the master node while shard uploads from the data nodes quietly fail.
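The manual equivalent looks like this; a healthy response lists every node that completed the storage round-trip (node IDs and names below are illustrative).
# Manual equivalent of check 2: every data node should appear under "nodes"
curl -s -X POST 'http://localhost:9200/_snapshot/logs-s3/_verify'
# {"nodes":{"fY3abc...":{"name":"es-data-1"},"kQ9def...":{"name":"es-data-2"}}}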
3. SLM policy health
Calls GET _slm/policy and checks each policy's last successful execution time; the implementation below uses a simplified rule and flags any policy whose last success is older than twice the critical age threshold. SLM policies stop executing silently after certain cluster operations — particularly during rolling restarts, version upgrades, and security plugin reconfiguration. A policy that last succeeded three days ago on an hourly schedule is a broken backup, even if the policy itself shows as active.
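To eyeball a single policy by hand (the policy name is an example, and the exact response fields vary across Elasticsearch versions), fetch it directly and look at the last_success block.
# Manual equivalent of check 3
curl -s 'http://localhost:9200/_slm/policy/hourly-logs-snapshot?human'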
4. Snapshot freshness
Calls GET _snapshot/{repo}/_all?size=10&order=desc (the ten most recent snapshots) and verifies that at least one has state SUCCESS within the configured warning window (default 25h) and critical window (default 49h). This catches all the cases that fall through the previous checks: snapshots that start but fail partway through, snapshots that create a metadata entry but fail to write shard data, and missed schedule windows caused by cluster overload.
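The manual equivalent, for versions that support these list parameters (sort defaults to start_time, so order=desc returns newest first):
# Manual equivalent of check 4: scan the newest entries for "state":"SUCCESS"
curl -s 'http://localhost:9200/_snapshot/logs-s3/_all?size=10&order=desc'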
Full Implementation
The complete script is about 160 lines of bash. It uses only curl for HTTP calls and standard POSIX utilities for JSON parsing. No Python, no Go toolchain, no package manager — copy the file to any host with curl and it works. The script is structured in four phases: argument parsing, auth configuration, the four health checks, and output formatting.
#!/usr/bin/env bash
# es-snapshot-check — validate Elasticsearch snapshot health
# https://github.com/datasops/es-snapshot-check
#
# Exit codes (Nagios-compatible):
# 0 = OK — all checks passed, recent SUCCESS snapshot exists
# 1 = WARNING — snapshot older than --warn-age hours
# 2 = CRITICAL — snapshot older than --crit-age hours, verify failed, or SLM broken
# 3 = UNKNOWN — cannot reach cluster or parse response
#
# Usage:
# es-snapshot-check [--host URL] [--repo NAME] [--warn-age H] [--crit-age H]
# [--slm-policy NAME] [--format text|json|prometheus]
# [--user USER:PASS] [--api-key KEY] [--ca-cert FILE]
# [--skip-verify] [--skip-slm] [--timeout SEC]
set -euo pipefail
# ── Defaults ───────────────────────────────────────────────────────────────────
ES_HOST="${ES_HOST:-http://localhost:9200}"
REPO=""
WARN_AGE_HOURS=25
CRIT_AGE_HOURS=49
SLM_POLICY=""
FORMAT="text"
AUTH_HEADER=""
CA_CERT=""
SKIP_VERIFY=0
SKIP_SLM=0
TIMEOUT=15
# ── Argument parsing ───────────────────────────────────────────────────────────
while [[ $# -gt 0 ]]; do
case "$1" in
--host) ES_HOST="$2"; shift 2 ;;
--repo) REPO="$2"; shift 2 ;;
--warn-age) WARN_AGE_HOURS="$2"; shift 2 ;;
--crit-age) CRIT_AGE_HOURS="$2"; shift 2 ;;
--slm-policy) SLM_POLICY="$2"; shift 2 ;;
--format) FORMAT="$2"; shift 2 ;;
# auth/TLS options live in arrays so values with spaces survive quoting
--user) AUTH_ARGS=(-u "$2"); shift 2 ;;
--api-key) AUTH_ARGS=(-H "Authorization: ApiKey $2"); shift 2 ;;
--ca-cert) CA_ARGS=(--cacert "$2"); shift 2 ;;
--skip-verify) SKIP_VERIFY=1; shift ;;
--skip-slm) SKIP_SLM=1; shift ;;
--timeout) TIMEOUT="$2"; shift 2 ;;
-h|--help)
sed -n '2,15p' "$0"; exit 0 ;;
*)
echo "Unknown option: $1" >&2; exit 3 ;;
esac
done
# ── Helper: curl wrapper ───────────────────────────────────────────────────────
es_get() {
local path="$1"
# empty arrays expand to nothing; the ${arr[@]+...} form keeps set -u happy
curl -sf --max-time "$TIMEOUT" \
${AUTH_ARGS[@]+"${AUTH_ARGS[@]}"} ${CA_ARGS[@]+"${CA_ARGS[@]}"} \
"${ES_HOST}${path}" 2>/dev/null
}
# ── State tracking ─────────────────────────────────────────────────────────────
STATUS=0 # 0=ok 1=warn 2=crit 3=unknown
MESSAGES=()
METRICS=()
NOW_EPOCH=$(date +%s)
raise() {
local level="$1" msg="$2"
MESSAGES+=("$msg")
# use an if, not a bare `&&`: a false final command would trip set -e
if [[ "$level" -gt "$STATUS" ]]; then
STATUS="$level"
fi
}
# ── Check 1: Repository list ───────────────────────────────────────────────────
REPOS_JSON=$(es_get "/_cat/repositories?format=json") || {
# nothing else is checkable if the cluster is unreachable; bail out as UNKNOWN
echo "UNKNOWN: Cannot reach Elasticsearch at ${ES_HOST}" >&2
exit 3
}
if [[ "$REPOS_JSON" == "[]" ]]; then
raise 2 "CRITICAL: No snapshot repositories registered"
fi
# If no --repo specified, use first registered repository
if [[ -z "$REPO" ]]; then
REPO=$(echo "$REPOS_JSON" | grep -o '"id":"[^"]*"' | head -1 | cut -d'"' -f4)
fi
if [[ -z "$REPO" ]]; then
raise 3 "UNKNOWN: Could not determine repository name"
fi
METRICS+=("es_snapshot_repository_count $(echo "$REPOS_JSON" | grep -o '"id"' | wc -l | tr -d ' ')")
# ── Check 2: Repository verification ──────────────────────────────────────────
if [[ "$SKIP_VERIFY" -eq 0 ]]; then
VERIFY_RESP=$(curl -sf --max-time "$TIMEOUT" $AUTH_HEADER $CA_CERT \
-X POST "${ES_HOST}/_snapshot/${REPO}/_verify" 2>/dev/null) || VERIFY_RESP=""
if [[ -z "$VERIFY_RESP" ]]; then
raise 2 "CRITICAL: Repository '${REPO}' verify call failed — storage may be unreachable"
METRICS+=("es_snapshot_repository_verified{repo="${REPO}"} 0")
elif echo "$VERIFY_RESP" | grep -q '"error"'; then
ERR_TYPE=$(echo "$VERIFY_RESP" | grep -o '"type":"[^"]*"' | head -1 | cut -d'"' -f4)
raise 2 "CRITICAL: Repository '${REPO}' verify error: ${ERR_TYPE}"
METRICS+=("es_snapshot_repository_verified{repo="${REPO}"} 0")
else
METRICS+=("es_snapshot_repository_verified{repo="${REPO}"} 1")
fi
fi
# ── Check 3: SLM policy health ─────────────────────────────────────────────────
if [[ "$SKIP_SLM" -eq 0 ]]; then
SLM_JSON=$(es_get "/_slm/policy${SLM_POLICY:+/${SLM_POLICY}}") || SLM_JSON="{}"
# Parse last_success_time for each policy and compare to policy schedule
# Simplified: flag any policy whose last_success is > 2 * crit-age hours ago
CRIT_EPOCH=$(( NOW_EPOCH - CRIT_AGE_HOURS * 2 * 3600 ))
while IFS= read -r policy_name; do
last_success=$(echo "$SLM_JSON" | grep -A5 "\"name\":\"${policy_name}\"" \
| grep '"last_success_date_millis"' | grep -o '[0-9]*' | head -1)
if [[ -z "$last_success" ]]; then
raise 1 "WARNING: SLM policy '${policy_name}' has no recorded successful execution"
METRICS+=("es_slm_policy_last_success_age_seconds{policy="${policy_name}"} -1")
continue
fi
last_success_epoch=$(( last_success / 1000 ))
age_seconds=$(( NOW_EPOCH - last_success_epoch ))
METRICS+=("es_slm_policy_last_success_age_seconds{policy="${policy_name}"} ${age_seconds}")
if [[ "$last_success_epoch" -lt "$CRIT_EPOCH" ]]; then
raise 2 "CRITICAL: SLM policy '${policy_name}' last succeeded $(( age_seconds / 3600 ))h ago"
fi
done < <(echo "$SLM_JSON" | grep -o '"name":"[^"]*"' | cut -d'"' -f4 | sort -u)
fi
# ── Check 4: Snapshot freshness ────────────────────────────────────────────────
# Only filter by SLM policy when one was requested: slm_policy_filter=* would
# exclude snapshots taken outside SLM entirely
SNAP_JSON=$(es_get "/_snapshot/${REPO}/_all?size=10&order=desc${SLM_POLICY:+&slm_policy_filter=${SLM_POLICY}}") \
|| SNAP_JSON='{"snapshots":[]}'
# Find most recent SUCCESS snapshot
LATEST_SUCCESS_MS=""
SNAP_COUNT=0
while IFS= read -r line; do
if echo "$line" | grep -q '"state":"SUCCESS"'; then
SNAP_COUNT=$(( SNAP_COUNT + 1 ))
ts=$(echo "$line" | grep -o '"start_time_in_millis":[0-9]*' | grep -o '[0-9]*$')
if [[ -n "$ts" ]]; then
[[ -z "$LATEST_SUCCESS_MS" || "$ts" -gt "$LATEST_SUCCESS_MS" ]] \
&& LATEST_SUCCESS_MS="$ts"
fi
fi
done < <(echo "$SNAP_JSON" | tr '{' '\n')
METRICS+=("es_snapshot_success_count{repo="${REPO}"} ${SNAP_COUNT}")
if [[ -z "$LATEST_SUCCESS_MS" ]]; then
raise 2 "CRITICAL: No SUCCESS snapshots found in repository '${REPO}'"
METRICS+=("es_snapshot_latest_success_age_seconds{repo="${REPO}"} -1")
else
LATEST_SUCCESS_EPOCH=$(( LATEST_SUCCESS_MS / 1000 ))
AGE_SECONDS=$(( NOW_EPOCH - LATEST_SUCCESS_EPOCH ))
AGE_HOURS=$(( AGE_SECONDS / 3600 ))
METRICS+=("es_snapshot_latest_success_age_seconds{repo="${REPO}"} ${AGE_SECONDS}")
WARN_EPOCH=$(( NOW_EPOCH - WARN_AGE_HOURS * 3600 ))
CRIT_EPOCH=$(( NOW_EPOCH - CRIT_AGE_HOURS * 3600 ))
if [[ "$LATEST_SUCCESS_EPOCH" -lt "$CRIT_EPOCH" ]]; then
raise 2 "CRITICAL: Latest SUCCESS snapshot is ${AGE_HOURS}h old (threshold: ${CRIT_AGE_HOURS}h)"
elif [[ "$LATEST_SUCCESS_EPOCH" -lt "$WARN_EPOCH" ]]; then
raise 1 "WARNING: Latest SUCCESS snapshot is ${AGE_HOURS}h old (threshold: ${WARN_AGE_HOURS}h)"
else
MESSAGES+=("OK: Latest SUCCESS snapshot ${AGE_HOURS}h old, repo '${REPO}' verified")
fi
fi
METRICS+=("es_snapshot_check_status ${STATUS}")
# ── Output ─────────────────────────────────────────────────────────────────────
STATUS_LABEL=("OK" "WARNING" "CRITICAL" "UNKNOWN")
case "$FORMAT" in
json)
printf '{"status":%d,"label":"%s","messages":[' \
"$STATUS" "${STATUS_LABEL[$STATUS]}"
first=1
for msg in "${MESSAGES[@]:-}"; do
[[ -z "$msg" ]] && continue
[[ "$first" -eq 0 ]] && printf ','
printf '"%s"' "$msg"
first=0
done
printf '],"repo":"%s","timestamp":%d}\n' "$REPO" "$NOW_EPOCH"
;;
prometheus)
printf '# HELP es_snapshot_check_status 0=ok 1=warn 2=crit 3=unknown\n'
printf '# TYPE es_snapshot_check_status gauge\n'
for metric in "${METRICS[@]:-}"; do
[[ -z "$metric" ]] && continue
printf '%s\n' "$metric"
done
;;
*)
for msg in "${MESSAGES[@]:-}"; do
[[ -n "$msg" ]] && echo "$msg"
done
;;
esac
exit "$STATUS"Note
The _snapshot/{repo}/_verify API call is the most expensive operation in this script — it performs a round-trip to the underlying storage from every data node. On large clusters with remote storage, this can take 2–5 seconds. Set --timeout 30 when running verify checks in cron or monitoring scripts to avoid false UNKNOWN results from the default 15-second timeout.
Installation and Quick Start
Since the script has no external dependencies, installation is a single curl command. The recommended location is /usr/local/bin/ so it is available system-wide. The script requires bash 4.0+, which has been the default on major Linux distributions for well over a decade; note that macOS still ships bash 3.2 by default, so install a newer bash there first.
# Install from GitHub releases
curl -fsSL https://github.com/datasops/es-snapshot-check/releases/latest/download/es-snapshot-check \
-o /usr/local/bin/es-snapshot-check
chmod +x /usr/local/bin/es-snapshot-check
# Verify installation
es-snapshot-check --help
# Basic check — uses defaults: localhost:9200, first registered repo
es-snapshot-check
# Check a specific host with basic auth
es-snapshot-check \
--host https://es.internal:9200 \
--user elastic:changeme \
--repo my_s3_repo
# Check with API key auth (Elastic Cloud / ECE)
es-snapshot-check \
--host https://my-cluster.es.us-east-1.aws.elastic-cloud.com:9243 \
--api-key VnVhQ2ZHY0JDZGJrUFRmVFVp \
--repo found-snapshots \
--warn-age 26 \
--crit-age 50
# Check a specific SLM policy by name
es-snapshot-check \
--host https://es.prod.internal:9200 \
--repo logs-s3 \
--slm-policy hourly-logs-snapshot \
--warn-age 2 \
--crit-age 4
# JSON output for log aggregation (Splunk, Loki, CloudWatch Logs)
es-snapshot-check --format json | tee /var/log/es-snapshot-check.json
# Prometheus textfile format — write to node_exporter textfile directory
es-snapshot-check \
--format prometheus \
--repo logs-s3 > /var/lib/node_exporter/textfile_collector/es_snapshots.prom
# Skip verify (faster, use when verify latency is a concern)
es-snapshot-check --skip-verify --format json
# Skip SLM check (for clusters without SLM or Basic+ license)
es-snapshot-check --skip-slm
Exit Code Reference
Exit codes follow the Nagios plugin standard, making the script directly usable as a check plugin in Nagios, Icinga, or any monitoring system that understands the Nagios plugin interface (Zabbix can wrap it as an external check). The highest severity among all failed checks determines the final exit code.
| Exit code | Label | Meaning |
|---|---|---|
| 0 | OK | All checks passed, recent SUCCESS snapshot found |
| 1 | WARNING | Snapshot older than --warn-age, or SLM warning condition |
| 2 | CRITICAL | Snapshot too old, repo verify failed, SLM broken, or no snapshots |
| 3 | UNKNOWN | Cannot reach cluster, invalid arguments, or unparseable response |
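As an illustration, wiring the script into Nagios or Icinga takes one command and one service definition; the host and repository names here are placeholders.
# /etc/nagios/conf.d/es-snapshot-check.cfg (illustrative wiring)
define command {
    command_name    check_es_snapshots
    command_line    /usr/local/bin/es-snapshot-check --host '$ARG1$' --repo '$ARG2$' --warn-age 25 --crit-age 49
}
define service {
    use                     generic-service
    host_name               es-monitor-1
    service_description     Elasticsearch Snapshots
    check_command           check_es_snapshots!https://es.internal:9200!logs-s3
}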
Prometheus Textfile Integration
The --format prometheus mode produces Prometheus text exposition format output, suitable for use with node_exporter's textfile collector. Run the check via cron and write the output to the textfile collector directory. node_exporter picks up the metrics on every scrape cycle, making them available in Prometheus for alerting and dashboards without any custom exporter development.
# /etc/cron.d/es-snapshot-check
# Run every 15 minutes; write Prometheus metrics to the textfile collector directory.
# Note: crontab entries must be a single line (cron does not support backslash
# continuations). The tmp+mv pattern prevents node_exporter from reading a partial file.
# Store the API key in a root-readable /etc/es-snapshot-check.env and source it,
# or use --user for basic auth.
*/15 * * * * root . /etc/es-snapshot-check.env && /usr/local/bin/es-snapshot-check --host https://es.internal:9200 --api-key "$ES_API_KEY" --repo logs-s3 --format prometheus > /var/lib/node_exporter/textfile_collector/es_snapshots.prom.tmp && mv /var/lib/node_exporter/textfile_collector/es_snapshots.prom.tmp /var/lib/node_exporter/textfile_collector/es_snapshots.prom
The script emits six Prometheus metrics. The most important for alerting is es_snapshot_latest_success_age_seconds, which lets you express freshness thresholds directly in seconds against your RPO target.
# Metrics emitted by es-snapshot-check --format prometheus
# Current overall check status (0=ok 1=warn 2=crit 3=unknown)
# TYPE es_snapshot_check_status gauge
es_snapshot_check_status 0
# Number of registered snapshot repositories
# TYPE es_snapshot_repository_count gauge
es_snapshot_repository_count 2
# Whether repository verify succeeded (1=ok, 0=failed)
# TYPE es_snapshot_repository_verified gauge
es_snapshot_repository_verified{repo="logs-s3"} 1
# Number of SUCCESS snapshots in recent window
# TYPE es_snapshot_success_count gauge
es_snapshot_success_count{repo="logs-s3"} 8
# Age in seconds of most recent SUCCESS snapshot (-1 if none)
# TYPE es_snapshot_latest_success_age_seconds gauge
es_snapshot_latest_success_age_seconds{repo="logs-s3"} 3187
# SLM policy last success age in seconds (-1 if no recorded success)
# TYPE es_slm_policy_last_success_age_seconds gauge
es_slm_policy_last_success_age_seconds{policy="hourly-logs-snapshot"} 3187
Prometheus Alerting Rules
With these metrics in Prometheus, write alerting rules that page the on-call engineer when snapshots are too old and open a warning ticket when freshness is merely degraded. The two-level thresholds (warning at 25h, critical at 49h for daily snapshots) give a window to investigate and fix the issue without waking anyone at 2 AM on the first miss. One caveat: es_snapshot_latest_success_age_seconds is -1 when no SUCCESS snapshot exists at all, so the age rules below will not fire in that case; add a rule on es_snapshot_check_status >= 2 if you want a never-succeeded repository to page as well.
# prometheus/rules/elasticsearch-snapshots.yaml
# Load via rule_files in prometheus.yml, or wrap in a PrometheusRule CRD
# if you run the Prometheus operator
groups:
  - name: elasticsearch.snapshots
    rules:
      # Snapshot freshness — warn before critical
      - alert: ESSnapshotTooOld
        expr: |
          es_snapshot_latest_success_age_seconds{repo=~".+"} > (25 * 3600)
        for: 5m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Elasticsearch snapshot is stale"
          description: >
            Repository {{ $labels.repo }} last successful snapshot was
            {{ $value | humanizeDuration }} ago.
            Expected: at most 25h. Investigate SLM policy and cluster state.
          runbook_url: "https://wiki.internal/runbooks/es-snapshot-freshness"
      - alert: ESSnapshotCriticallyOld
        expr: |
          es_snapshot_latest_success_age_seconds{repo=~".+"} > (49 * 3600)
        for: 5m
        labels:
          severity: critical
          team: data-platform
          pagerduty: "true"
        annotations:
          summary: "Elasticsearch snapshot is critically stale — DR at risk"
          description: >
            Repository {{ $labels.repo }} last successful snapshot was
            {{ $value | humanizeDuration }} ago.
            RPO SLA breach if cluster fails now. Immediate investigation required.
      # Repository unreachable
      - alert: ESSnapshotRepositoryUnreachable
        expr: |
          es_snapshot_repository_verified == 0
        for: 2m
        labels:
          severity: critical
          team: data-platform
          pagerduty: "true"
        annotations:
          summary: "Elasticsearch snapshot repository storage is unreachable"
          description: >
            Repository {{ $labels.repo }} verify call failed.
            Data nodes cannot write to backup storage.
            Check S3/GCS credentials, bucket policies, and network connectivity.
      # SLM policy stalled
      - alert: ESSnapshotSLMPolicyStalled
        expr: |
          es_slm_policy_last_success_age_seconds > (72 * 3600)
        for: 5m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "Elasticsearch SLM policy has not executed successfully"
          description: >
            SLM policy {{ $labels.policy }} last succeeded
            {{ $value | humanizeDuration }} ago.
            Check cluster state, ILM/SLM status, and Elasticsearch logs.
      # Check script itself is running
      - alert: ESSnapshotCheckNotRunning
        expr: |
          absent(es_snapshot_check_status) == 1
        for: 30m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "es-snapshot-check metrics are missing"
          description: >
            The es_snapshot_check_status metric has not been reported for 30 minutes.
            The cron job may have stopped or node_exporter textfile collector is broken.
Note
The absent(es_snapshot_check_status) == 1 rule fires when the metric disappears entirely, for example when the .prom file is deleted or node_exporter stops. Be aware that the textfile collector keeps serving the last file the cron job wrote, so a cron job that silently stops after a host reboot leaves stale metrics (and a frozen age value) in place; pair this rule with one that compares node_textfile_mtime_seconds against time() to catch a file that has stopped updating. Without these meta-alerts, you can lose all snapshot monitoring without any notification.
CI/CD Integration with GitHub Actions
Running es-snapshot-check as part of a deployment pipeline ensures that no deployment proceeds if backups are stale. This is especially useful for stateful deployments — Elasticsearch index mapping changes, data migrations, and cluster version upgrades — where a valid recent snapshot is a prerequisite for safe execution.
# .github/workflows/es-snapshot-gate.yml
# Runs before any deployment job that touches Elasticsearch
# Blocks the workflow if snapshots are stale or repository is unhealthy
name: Elasticsearch Snapshot Gate
on:
  workflow_call:
    inputs:
      es_host:
        description: "Elasticsearch host URL"
        required: true
        type: string
      es_repo:
        description: "Snapshot repository name"
        required: false
        type: string
        default: ""
      warn_age_hours:
        description: "Warning threshold in hours"
        required: false
        type: number
        default: 25
      crit_age_hours:
        description: "Critical threshold in hours"
        required: false
        type: number
        default: 49
jobs:
  snapshot-gate:
    name: Validate ES Snapshots
    runs-on: ubuntu-latest
    steps:
      - name: Install es-snapshot-check
        run: |
          curl -fsSL https://github.com/datasops/es-snapshot-check/releases/latest/download/es-snapshot-check \
            -o /usr/local/bin/es-snapshot-check
          chmod +x /usr/local/bin/es-snapshot-check
      - name: Run snapshot health check
        id: snapshot_check
        env:
          ES_API_KEY: ${{ secrets.ES_API_KEY }}
        run: |
          # The script exits nonzero on WARNING/CRITICAL; append `|| true` so the
          # step's default `bash -e` does not abort before we can parse the status
          OUTPUT=$(es-snapshot-check \
            --host "${{ inputs.es_host }}" \
            --api-key "$ES_API_KEY" \
            --repo "${{ inputs.es_repo }}" \
            --warn-age "${{ inputs.warn_age_hours }}" \
            --crit-age "${{ inputs.crit_age_hours }}" \
            --format json) || true
          echo "result=$OUTPUT" >> "$GITHUB_OUTPUT"
          STATUS=$(echo "$OUTPUT" | grep -o '"status":[0-9]' | cut -d: -f2)
          # Treat an unparseable status as UNKNOWN (3) and fail the gate
          if [[ "${STATUS:-3}" -ge 2 ]]; then
            echo "::error::Snapshot health check FAILED: $OUTPUT"
            exit 2
          elif [[ "$STATUS" -eq 1 ]]; then
            echo "::warning::Snapshot health WARNING: $OUTPUT"
            # Do not block for warning — proceed but leave an audit trail
          else
            echo "Snapshot health OK: $OUTPUT"
          fi
      - name: Annotate summary
        if: always()
        run: |
          # Single-quote the fence lines — double quotes would trigger
          # backtick command substitution in bash
          echo "## Elasticsearch Snapshot Gate" >> "$GITHUB_STEP_SUMMARY"
          echo '```json' >> "$GITHUB_STEP_SUMMARY"
          echo "${{ steps.snapshot_check.outputs.result }}" >> "$GITHUB_STEP_SUMMARY"
          echo '```' >> "$GITHUB_STEP_SUMMARY"
# Call this gate from your deployment workflow:
#
# jobs:
#   snapshot-gate:
#     uses: ./.github/workflows/es-snapshot-gate.yml
#     with:
#       es_host: https://es.staging.internal:9200
#       es_repo: staging-snapshots
#     secrets: inherit
#
#   deploy:
#     needs: snapshot-gate
#     ...
Snapshot validation is one part of a complete backup strategy. The checklist below covers the full lifecycle: from repository configuration to actual restore testing, which is the only true validation that your backups are usable.
Configure a separate repository per retention policy
Do not mix short-retention snapshots (hourly, kept 48h) and long-retention snapshots (daily, kept 30d) in the same repository. SLM makes this easy with multiple policies pointing to different repositories. Mixing retention classes in one repository makes the SLM retention logic fragile and complicates restore operations.
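A sketch of the two-policy layout; the policy names, schedules, repositories, and retention values are illustrative and should match your own RPO.
# Hourly snapshots, short retention, dedicated repository
curl -s -X PUT 'http://localhost:9200/_slm/policy/hourly-logs' \
-H 'Content-Type: application/json' -d '{
"schedule": "0 0 * * * ?",
"name": "<hourly-logs-{now/d}>",
"repository": "logs-hourly",
"config": { "indices": ["logs-*"], "include_global_state": false },
"retention": { "expire_after": "48h", "min_count": 4 }
}'
# Daily snapshots, long retention, separate repository
curl -s -X PUT 'http://localhost:9200/_slm/policy/daily-logs' \
-H 'Content-Type: application/json' -d '{
"schedule": "0 30 1 * * ?",
"name": "<daily-logs-{now/d}>",
"repository": "logs-daily",
"config": { "indices": ["logs-*"], "include_global_state": false },
"retention": { "expire_after": "30d", "min_count": 7 }
}'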
Verify after every credential rotation
S3 IAM key rotation, GCS service account updates, and Azure storage access key changes are the most common causes of silent snapshot repository failures. Run es-snapshot-check immediately after any credential rotation to confirm the repository is still accessible before the next scheduled snapshot window.
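A minimal guard to append to a rotation script; the host, repository, and env-file path are placeholders.
# Refuse to finish the rotation if the repository went dark
. /etc/es-snapshot-check.env
if ! /usr/local/bin/es-snapshot-check --host "$ES_HOST" --api-key "$ES_API_KEY" --repo logs-s3 --skip-slm; then
echo "Snapshot repository unreachable after credential rotation; investigate before the next snapshot window" >&2
exit 1
fi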
Test restores to a staging cluster monthly
A snapshot that cannot be restored is not a backup. Schedule a monthly restore test that mounts the most recent snapshot to a separate staging cluster (read-only using searchable snapshots) and runs a minimal query suite to confirm index integrity. Automate this and fail the test if any shard fails to recover.
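A restore-test sketch against a staging cluster; the staging URL, index pattern, and rename scheme are assumptions to adapt, and <snapshot-name> is a placeholder for the newest SUCCESS snapshot.
# Restore the snapshot's indices under a restored-* prefix
curl -s -X POST "https://es.staging.internal:9200/_snapshot/logs-s3/<snapshot-name>/_restore" \
-H 'Content-Type: application/json' -d '{
"indices": "logs-*",
"rename_pattern": "(.+)",
"rename_replacement": "restored-$1",
"include_global_state": false
}'
# Then poll recovery and fail the test on any shard that does not finish
curl -s "https://es.staging.internal:9200/_cat/recovery/restored-*?format=json"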
Monitor snapshot size trends
A snapshot that is abnormally small (compared to the 30-day rolling average) may indicate that a large index was accidentally deleted before the snapshot ran, or that the SLM policy is excluding indices it should capture. Track the size_in_bytes figures reported for the most recent ~30 snapshots and alert on >30% deviation from the mean.
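Sizes are exposed by the _status API rather than the snapshot list itself; a spot check might look like this (the field layout is the 7.x+ shape and may differ on older versions).
# Total size of one snapshot; <snapshot-name> is a placeholder
curl -s "$ES_HOST/_snapshot/logs-s3/<snapshot-name>/_status" \
| grep -o '"size_in_bytes":[0-9]*' | head -1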
Use index patterns in SLM, not wildcards
Avoid indices: ["*"] in SLM policies. Explicitly list index patterns (e.g., ["logs-*", "metrics-*", "app-data-*"]) to prevent system indices, internal Kibana indices, and temporary reindex targets from inflating snapshot size and duration. Use include_global_state: false unless you specifically need to snapshot cluster settings and templates.
Document the restore procedure in a runbook
A valid, fresh snapshot is useless if no one knows how to restore from it. Write a runbook that covers: listing available snapshots, restoring a single index vs. full cluster, handling rename patterns, verifying shard recovery, and re-enabling SLM after a full cluster restore (it is often accidentally left disabled). Keep this runbook in the repository next to the snapshot configuration and review it annually.
Running Elasticsearch in production and need confidence in your backup strategy?
We audit and harden Elasticsearch backup pipelines — from snapshot repository configuration and SLM policy design to freshness monitoring, restore testing, and disaster recovery runbooks. Let’s talk.
Get in Touch