Elasticsearch, ILM, Index Lifecycle, Hot-Warm-Cold, Observability, Performance

Elasticsearch Index Lifecycle Management — Automate Hot-Warm-Cold Architectures

A practical guide to ILM policies in Elasticsearch: hot-warm-cold-frozen tier architecture, node roles, rollover triggers, force-merge, searchable snapshots, composable index templates, data streams, and monitoring ILM execution in production clusters.

2026-04-19

The Index Sprawl Problem

A production Elasticsearch cluster ingesting log or metric data will create hundreds of indices over its lifetime. Without automation, operations teams end up manually curating storage: deleting old indices when disks fill up, moving cold data to cheaper nodes by hand, force-merging neglected indices whose segment counts eat file handles and heap. This is not a workflow — it is a liability.

Index Lifecycle Management (ILM) is Elasticsearch's built-in policy engine for automating exactly this. Define a policy once — rollover at 50 GB, warm after 7 days, cold after 30 days, delete after 90 — and every index governed by that policy moves through its lifecycle without human intervention. When paired with a tiered node architecture, ILM can cut Elasticsearch storage costs by 60–80% while maintaining sub-second query latency for recent data.

The Five ILM Phases Explained

ILM organizes an index's life into up to five sequential phases. Not every policy uses all five — most production setups use hot, warm, and delete at minimum, adding cold or frozen for cost-sensitive long-term retention.

HOT: Active Indexing Phase

The index is actively written to. Hot-phase actions include rollover (create a new write index when size, document count, or age thresholds are crossed) and readonly (mark the index read-only once it has rolled over). Hot nodes carry NVMe SSDs and hold all primary and replica shards for actively written indices.

WARM: Recent Data, Read-Only

The index is no longer written to but is queried frequently. ILM can shrink the index to fewer shards, force-merge segments to 1 per shard (reduces heap pressure and speeds up queries), and migrate shards to warm-tier nodes with cheaper SATA SSDs.

COLD: Infrequent Access

Data queried only occasionally. Cold-phase options include migrating to cold-tier nodes (spinning disks, or fewer replicas) and enabling searchable snapshots — storing data in S3/GCS and mounting it on-demand, eliminating replica overhead entirely.

FROZEN: Archive Tier

Data rarely queried. Frozen-phase searchable snapshots store index data entirely in object storage; only a tiny metadata footprint lives on the node. Query performance is slower (on-demand loading), but storage cost approaches object-storage pricing — often 10–20x cheaper than hot-tier SSDs.

DELETE: End of Life

The index (and any associated searchable snapshot) is permanently removed. ILM can optionally wait for a snapshot to complete before deleting, ensuring object-storage backups are intact before the cluster-side data disappears.
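That snapshot wait is expressed with the wait_for_snapshot action, which references a snapshot lifecycle (SLM) policy by name — "nightly-snapshots" below is an illustrative name, not a built-in:

# Delete phase that waits for an SLM snapshot before removing the index
# ("nightly-snapshots" is an example SLM policy name — substitute your own)
"delete": {
  "min_age": "90d",
  "actions": {
    "wait_for_snapshot": { "policy": "nightly-snapshots" },
    "delete": {}
  }
}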

Configuring Node Tiers and Data Roles

ILM's tier migration actions work by routing shards to nodes with specific data tier roles. Each node in your cluster declares which tiers it belongs to in elasticsearch.yml. Nodes can hold multiple roles, but production clusters should keep hot and warm nodes separate for hardware and performance isolation.

# elasticsearch.yml — hot-tier node
# Dedicated hot nodes: NVMe SSDs, high IOPS, all memory for indexing
node.roles: [ data_hot, ingest ]
node.name: hot-01
path.data: /mnt/nvme/elasticsearch

---

# elasticsearch.yml — warm-tier node
# Dedicated warm nodes: SATA SSDs, more storage per dollar
node.roles: [ data_warm ]
node.name: warm-01
path.data: /mnt/sata/elasticsearch

---

# elasticsearch.yml — cold-tier node
# Dedicated cold nodes: spinning disks or high-density SSDs
node.roles: [ data_cold ]
node.name: cold-01
path.data: /mnt/hdd/elasticsearch

---

# elasticsearch.yml — frozen-tier node
# Minimal local cache — data lives in object storage (S3/GCS/Azure Blob)
node.roles: [ data_frozen ]
node.name: frozen-01
# Frozen nodes need a snapshot cache directory
path.data: /mnt/cache/elasticsearch

---

# elasticsearch.yml — content/master nodes (not part of data tiers)
node.roles: [ master, data_content ]
node.name: master-01

Note

A single-tier cluster (every node assigned the generic data role) still supports ILM, but tier migration actions will be no-ops — all nodes are considered hot. To get meaningful cost savings from ILM you need at least two distinct tier roles. If you cannot add hardware, consider using node attributes with custom allocation filters instead of named roles, though the tier-based approach is the modern recommended path.
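A minimal sketch of that attribute-based fallback — box_type is a conventional attribute name, not anything built in, and the allocate fragment below would replace migrate in the warm phase:

# elasticsearch.yml — custom attribute instead of a tier role
node.attr.box_type: warm

# ...then in the policy's warm phase, allocate by attribute instead of
# relying on tier migration (disable the implicit migrate to avoid
# conflicting routing rules):
# "warm": {
#   "actions": {
#     "migrate": { "enabled": false },
#     "allocate": {
#       "require": { "box_type": "warm" }
#     }
#   }
# }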

Writing Your First ILM Policy

ILM policies are defined as JSON documents and stored in the Elasticsearch cluster state. They can be created via the REST API, Kibana's Stack Management > Index Lifecycle Policies UI, or infrastructure-as-code tools like the elasticstack Terraform provider. The REST API is the most explicit and reproducible approach.

# Create a production ILM policy for application logs
# 7 days hot → 30 days warm → 60 days cold → delete at 90 days

PUT _ilm/policy/logs-production
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d",
            "max_docs": 100000000
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "migrate": {},
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "migrate": {},
          "searchable_snapshot": {
            "snapshot_repository": "s3-logs-archive",
            "force_merge_index": true
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}

Note

The min_age in warm, cold, and delete phases is measured from the index rollover time (not creation time) when the index was created by rollover. For indices not created by rollover, it is measured from the index creation date. This distinction matters when bootstrapping ILM on existing indices — their effective age starts from creation, so a 30-day-old index added to ILM may jump directly to the warm phase on the next ILM check cycle.
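Adopting existing indices is a single settings update. The index pattern and policy name below are illustrative — note that a policy containing a rollover action also requires index.lifecycle.rollover_alias to be set, so a retention-only policy (no rollover) is the simpler fit for legacy indices:

# Attach ILM to pre-existing indices (ages measured from creation date)
# "logs-retention-only" is a hypothetical policy without a rollover action
PUT logs-legacy-*/_settings
{
  "index.lifecycle.name": "logs-retention-only"
}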

Rollover: The Core of Hot-Phase Management

Rollover is the action that creates a new write index and switches the write alias to it; the previous index stops receiving new writes. Rollover triggers when any one of the configured conditions is met — size, document count, or age. Multiple conditions combine with OR logic, so the policy rolls over as soon as the first threshold is crossed.

ILM rollover requires an index alias or data stream with a write target. Without one, ILM has no way to know which index to write to after rollover. The conventional naming pattern is logs-app-000001, logs-app-000002, etc., with the alias logs-app pointing to the current write index.

# Bootstrap the first index with proper ILM settings and alias
# Run this once before your first document ingestion

PUT logs-app-000001
{
  "aliases": {
    "logs-app": {
      "is_write_index": true
    }
  },
  "settings": {
    "index": {
      "lifecycle": {
        "name": "logs-production",
        "rollover_alias": "logs-app"
      },
      "number_of_shards": 2,
      "number_of_replicas": 1,
      "routing.allocation.include._tier_preference": "data_hot"
    }
  }
}

# Verify ILM is tracking the index
GET logs-app-000001/_ilm/explain
# Example output from _ilm/explain — healthy ILM tracking
{
  "indices": {
    "logs-app-000001": {
      "index": "logs-app-000001",
      "managed": true,
      "policy": "logs-production",
      "lifecycle_date_millis": 1745000000000,
      "phase": "hot",
      "phase_time_millis": 1745000000000,
      "action": "rollover",
      "action_time_millis": 1745000100000,
      "step": "check-rollover-ready",
      "step_time_millis": 1745000200000,
      "phase_execution": {
        "policy": "logs-production",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_primary_shard_size": "50gb",
              "max_age": "1d"
            }
          }
        }
      }
    }
  }
}

Attaching ILM to Index Templates

In production, you should not configure ILM settings on individual indices by hand. Instead, use composable index templates to inject ILM policy, shard count, and tier routing automatically for any index matching a pattern. This ensures every new index — whether created by rollover, Beats, Logstash, or a data stream — inherits the correct lifecycle settings without manual intervention.

# Step 1: Create a component template for ILM settings
# Component templates are reusable building blocks

PUT _component_template/logs-ilm-settings
{
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-production",
      "index.lifecycle.rollover_alias": "logs-app",
      "index.number_of_shards": 2,
      "index.number_of_replicas": 1,
      "index.routing.allocation.include._tier_preference": "data_hot"
    }
  }
}

# Step 2: Create a component template for mappings
PUT _component_template/logs-mappings
{
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp":    { "type": "date" },
        "log.level":     { "type": "keyword" },
        "service.name":  { "type": "keyword" },
        "trace.id":      { "type": "keyword" },
        "message":       { "type": "text", "norms": false },
        "http.status":   { "type": "short" },
        "http.method":   { "type": "keyword" },
        "url.path":      { "type": "keyword" },
        "duration_ms":   { "type": "float" }
      }
    }
  }
}

# Step 3: Create the composable index template combining both components
PUT _index_template/logs-app-template
{
  "index_patterns": ["logs-app-*"],
  "composed_of": ["logs-ilm-settings", "logs-mappings"],
  "priority": 200,
  "template": {
    "settings": {
      "index.codec": "best_compression"
    }
  },
  "_meta": {
    "description": "Production logs — ILM managed, hot-warm-cold",
    "team": "platform-engineering"
  }
}

Data Streams: The Modern Approach

For append-only time-series data, data streams are the recommended abstraction over manual alias management. A data stream is a named collection of backing indices: writes are routed to the current write index and rollover is managed automatically via ILM — without requiring you to create the first index, manage aliases, or track write targets manually.

Data streams also enforce the @timestamp field requirement and expose a clean API for querying across all backing indices. Kibana, Beats, Elastic Agent, and the OpenTelemetry Collector all write to data streams natively.

# Create a data stream index template (no alias bootstrapping required)
# ILM automatically manages backing indices

PUT _index_template/metrics-services-template
{
  "index_patterns": ["metrics-services-*"],
  "data_stream": {},
  "composed_of": ["metrics-ilm-settings", "metrics-mappings"],
  "priority": 200
}

# Create the data stream — Elasticsearch creates the first backing index automatically
PUT _data_stream/metrics-services-production

# Write to the data stream (always routed to the current write index)
POST metrics-services-production/_doc
{
  "@timestamp": "2026-04-19T12:00:00Z",
  "service.name": "payment-api",
  "http.status": 200,
  "duration_ms": 45.3
}

# Inspect backing indices and ILM phase for each
GET _data_stream/metrics-services-production

# Manually trigger rollover (useful for testing ILM transitions)
POST metrics-services-production/_rollover

Monitoring ILM Execution and Troubleshooting

ILM checks policies on a configurable interval (default: 10 minutes). Steps within a phase execute asynchronously — a force-merge on a large shard may take hours. Understanding how to inspect ILM state is critical for diagnosing stuck transitions, failed actions, and unexpected rollover behavior.

# Check ILM status for all managed indices
GET */_ilm/explain?human=true

# Check a specific index with human-readable timestamps
GET logs-app-000005/_ilm/explain?human=true

# ILM step failure — look for "failed_step" and "step_info"
# Example output when an action fails:
# {
#   "phase": "warm",
#   "action": "forcemerge",
#   "step": "forcemerge",
#   "failed_step": "forcemerge",
#   "step_info": {
#     "type": "exception",
#     "reason": "primary shard is not active"
#   }
# }

# Retry a failed step after fixing the underlying issue
POST logs-app-000005/_ilm/retry

# Change the ILM poll interval (default 10m — reduce for testing)
PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "1m"
  }
}

# Move an index to a specific phase/step manually (for testing or recovery)
POST _ilm/move/logs-app-000005
{
  "current_step": {
    "phase": "warm",
    "action": "complete",
    "name": "complete"
  },
  "next_step": {
    "phase": "cold",
    "action": "migrate",
    "name": "migrate"
  }
}

Note

The most common cause of stuck ILM transitions is a mismatch between the cluster's node roles and the policy's expected tier. If you define a cold phase migrate action but have no nodes with the data_cold role, the migration will wait indefinitely — the index stays on warm nodes, accruing age, while ILM reports it is in the cold phase waiting for allocation. Always verify tier role coverage before deploying a multi-tier ILM policy.
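A quick pre-deployment check: list node roles and confirm every tier the policy references has at least one matching node (in the node.role column, h = data_hot, w = data_warm, c = data_cold, f = data_frozen):

# Verify tier role coverage before deploying a multi-tier policy
GET _cat/nodes?v&h=name,node.role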

Managing ILM Policies Programmatically

For infrastructure-as-code workflows, the official Elasticsearch Python client exposes the full ILM API. The pattern below is suitable for CI/CD pipelines that apply ILM policies as part of cluster bootstrapping or upgrades.

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import NotFoundError
import json

es = Elasticsearch(
    hosts=["https://elasticsearch:9200"],
    api_key=("key-id", "key-secret"),  # prefer API key auth in CI
)


def upsert_ilm_policy(name: str, policy: dict) -> None:
    """Create or update an ILM policy — idempotent."""
    try:
        existing = es.ilm.get_lifecycle(name=name)
        current_policy = existing[name]["policy"]
        if current_policy == policy:
            print(f"ILM policy '{name}' is up to date — no changes.")
            return
    except NotFoundError:
        pass

    es.ilm.put_lifecycle(name=name, policy=policy)
    print(f"ILM policy '{name}' applied.")


LOGS_POLICY = {
    "phases": {
        "hot": {
            "min_age": "0ms",
            "actions": {
                "rollover": {
                    "max_primary_shard_size": "50gb",
                    "max_age": "1d",
                },
                "set_priority": {"priority": 100},
            },
        },
        "warm": {
            "min_age": "7d",
            "actions": {
                "migrate": {},
                "shrink": {"number_of_shards": 1},
                "forcemerge": {"max_num_segments": 1},
                "set_priority": {"priority": 50},
            },
        },
        "cold": {
            "min_age": "30d",
            "actions": {
                "migrate": {},
                "searchable_snapshot": {
                    "snapshot_repository": "s3-logs-archive",
                    "force_merge_index": True,
                },
                "set_priority": {"priority": 0},
            },
        },
        "delete": {
            "min_age": "90d",
            "actions": {
                "delete": {"delete_searchable_snapshot": True},
            },
        },
    }
}

upsert_ilm_policy("logs-production", LOGS_POLICY)


def list_indices_by_phase(policy_name: str) -> dict[str, list[str]]:
    """Group managed indices by their current ILM phase."""
    response = es.ilm.explain_lifecycle(index="*")
    by_phase: dict[str, list[str]] = {}
    for index_name, info in response["indices"].items():
        if not info.get("managed"):
            continue
        if info.get("policy") != policy_name:
            continue
        phase = info.get("phase", "unknown")
        by_phase.setdefault(phase, []).append(index_name)
    return by_phase


phases = list_indices_by_phase("logs-production")
for phase, indices in sorted(phases.items()):
    print(f"{phase:10s}: {len(indices)} indices")
    for idx in sorted(indices)[:5]:  # show first 5 per phase
        print(f"           {idx}")

Searchable Snapshots: Cost-Efficient Cold and Frozen Storage

Searchable snapshots store index data in an object-storage repository (Amazon S3, Google Cloud Storage, Azure Blob Storage, or self-hosted S3-compatible stores like MinIO) while making it queryable without full restore. In the cold tier, Elasticsearch caches the most-accessed segment files locally. In the frozen tier, only a minimal metadata cache is maintained — each query fetches the required segments on demand.

Enabling searchable snapshots for ILM requires a registered snapshot repository. The repository is created once per cluster and referenced by name in the ILM policy.

# Register an S3 snapshot repository
# Requires the repository-s3 plugin (bundled as a module since 8.0) and AWS credentials in the keystore

PUT _snapshot/s3-logs-archive
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-snapshots",
    "region": "eu-west-1",
    "base_path": "logs-archive",
    "compress": true,
    "server_side_encryption": true,
    "storage_class": "standard_ia"  // S3 Infrequent Access — cheaper for cold data
  }
}

# Verify the repository is reachable
POST _snapshot/s3-logs-archive/_verify

# ILM cold phase with searchable snapshots enabled
# (reference the same repository name in your policy)
# "cold": {
#   "min_age": "30d",
#   "actions": {
#     "searchable_snapshot": {
#       "snapshot_repository": "s3-logs-archive"
#     }
#   }
# }

# Check a mounted searchable snapshot index
GET .ds-logs-app-000003/_stats?human=true

# Searchable snapshot indices have a special _tier value
GET .ds-logs-app-000003/_settings?filter_path=**.routing.allocation

Note

When ILM transitions an index to cold with searchable snapshots, it creates a new index (prefixed with restored- or partial-) that points to the object-store snapshot. The original index is deleted. Queries against the data stream alias are automatically routed to the correct backing index — this transition is transparent to clients. Frozen-tier indices use partial- prefixed indices and fetch segment data per-query; expect 2–10× higher query latency compared to warm-tier indices.

Production Best Practices

Start with conservative rollover thresholds

50 GB primary shard size is a safe default. Shards larger than 50–100 GB slow down recovery, segment merging, and replication. Pair the size limit with a 1-day max age so indices still roll over daily even on low-traffic days — consistent rollover timing simplifies retention calculations.

Always shrink before force-merging

Shrink reduces the number of shards (e.g., 5 hot shards → 1 warm shard) before force-merge runs on the single shard. Force-merging 5 shards in parallel on a warm node is significantly more expensive than merging 1. ILM runs these in the correct order when both are defined in the warm phase.

Store ILM policies in version control

Treat ILM policies like application configuration: store them in a Git repository, apply them via a CI/CD pipeline, and require code review for changes. A policy change affects every index using it — the blast radius is cluster-wide. The Python upsert_ilm_policy pattern above is designed to be idempotent and safe to run in CI.

Monitor ILM lag, not just phase

An index in the warm phase with an age of 45 days that should have entered cold at 30 days has an ILM lag of 15 days. This is usually caused by a long-running force-merge or a stuck allocation. Export _ilm/explain to your monitoring stack and alert when age significantly exceeds the expected phase transition time.
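That lag calculation can be sketched as a pure function over one entry from the _ilm/explain response. This is an assumption-laden sketch, not part of the Elasticsearch client: the expected_phase_age_days threshold is whatever your policy's next-phase min_age is (30 days for the cold phase in the policy above).

```python
from datetime import datetime, timezone


def ilm_lag_days(index_info: dict, expected_phase_age_days: float) -> float:
    """Days an index has overstayed its expected phase transition.

    index_info is one entry from the "indices" map of an _ilm/explain
    response; lifecycle_date_millis is the rollover (or creation) time
    that ILM measures min_age against. Returns 0.0 while the index is
    still within the expected window.
    """
    lifecycle_seconds = index_info["lifecycle_date_millis"] / 1000
    age_days = (datetime.now(timezone.utc).timestamp() - lifecycle_seconds) / 86400
    return max(0.0, age_days - expected_phase_age_days)
```

Feed it each managed index from es.ilm.explain_lifecycle(index="*") and alert when the returned lag exceeds a tolerance (say, one poll interval plus expected force-merge time).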

Test phase transitions with a fast-forward policy

Create a test ILM policy with all min_age values set to seconds (e.g., "5s", "30s"). Apply it to a test index, reduce the poll interval to 1m, and watch the index progress through all phases within minutes. This validates your node tier configuration, snapshot repository, and allocation rules before deploying to production.
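A sketch of such a fast-forward policy — the thresholds are chosen purely for testing, never for production use:

# Fast-forward test policy: rollover on the first document,
# warm after 30 seconds, delete after 2 minutes
PUT _ilm/policy/logs-fastforward-test
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_docs": 1 }
        }
      },
      "warm": {
        "min_age": "30s",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "2m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}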

Running Elasticsearch at scale with growing index management challenges?

We design ILM policies and tiered storage architectures for production Elasticsearch clusters — from rollover strategies and hot-warm-cold configuration to searchable snapshots and cost optimization. Let’s talk.

