The Elastic Stack — formerly the ELK Stack — is the most widely deployed open-source observability and search platform in the world. It powers log analytics at Netflix, real-time search at GitHub, security monitoring at government agencies, and APM pipelines at thousands of engineering teams. This guide covers every layer of the stack from first principles to production hardening, with practical examples you can run today.
Note
This is a pillar guide. It covers the full Elastic Stack breadth. For deep dives into specific areas, see our companion articles: Elasticsearch Read Optimization for query-level tuning, and OpenTelemetry in Practice for integrating OTel data into Elasticsearch.
What Is the Elastic Stack?
The Elastic Stack is a collection of four open-source projects maintained by Elastic, designed to work together as a unified pipeline for ingesting, storing, searching, and visualizing data at any scale:
- Elasticsearch — the distributed search and analytics engine at the core. Stores data as JSON documents, indexes them with an inverted index, and exposes a REST API for full-text search, structured queries, and aggregations.
- Logstash — a server-side data processing pipeline. Ingests data from multiple sources, transforms it through filter plugins (grok, mutate, date, GeoIP), and outputs it to Elasticsearch or other destinations.
- Kibana — the visualization and exploration UI. Provides Discover for ad-hoc log exploration, Dashboards for operational monitoring, Lens for drag-and-drop charts, and Alerting for threshold-based notifications.
- Beats — lightweight data shippers written in Go. Filebeat tails log files, Metricbeat collects system and service metrics, Packetbeat captures network traffic, and Heartbeat monitors uptime. All ship directly to Elasticsearch or Logstash.
Together they form a complete observability data platform — but each component can also be used independently or replaced by alternatives like Fluentd (instead of Logstash) or Grafana (instead of Kibana).
Elasticsearch Architecture Deep Dive
Nodes and Roles
An Elasticsearch cluster is a group of nodes that collectively hold your data and provide indexing and search capabilities. Each node has one or more roles that determine what work it performs:
- master — manages cluster state (index creation, node membership, shard allocation). Deploy 3 dedicated master-eligible nodes for production HA.
- data — holds shard data and handles CRUD, search, and aggregations. Subdivide into data_hot, data_warm, data_cold for ILM tiering.
- ingest — runs ingest pipelines (pre-indexing transformations). Use dedicated ingest nodes if you have heavy enrichment workloads.
- coordinating — routes requests and merges results. Every node can coordinate, but dedicated coordinating nodes reduce load on data nodes in large clusters.
- ml — runs machine learning jobs for anomaly detection and data frame analytics.
# elasticsearch.yml — dedicated hot data node
node.roles: [ data_hot ]
node.name: hot-01
cluster.name: prod-cluster
# JVM heap: half of available RAM, max 31GB
# -Xms16g -Xmx16g in jvm.options

Indices, Shards, and Replicas
An index is a logical namespace for a collection of documents. Elasticsearch divides each index into shards — independent Lucene instances that can be distributed across nodes. Shards are the unit of parallelism: queries fan out to all shards in parallel, and results are merged.
Each shard has a configurable number of replicas — exact copies that serve read requests and provide fault tolerance. If a primary shard's node fails, a replica is promoted to primary automatically.
Warning
The number of primary shards is fixed at index creation time. Over-sharding is the most common Elasticsearch anti-pattern — it degrades performance and inflates cluster state. Target 20–50 GB per shard for search workloads, up to 50–200 GB for logging where search is less intensive. Use force-merge and ILM rollover to control shard sizes over time.
# Create an index with explicit shard settings
PUT /my-index
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.routing.allocation.require._tier_preference": "data_hot"
},
"mappings": {
"properties": {
"timestamp": { "type": "date" },
"service": { "type": "keyword" },
"message": { "type": "text" },
"level": { "type": "keyword" },
"duration_ms": { "type": "integer" }
}
}
}

How Indexing Works
When a document is indexed, Elasticsearch routes it to a primary shard using a deterministic hash of the document ID (or a custom routing value). The primary shard writes the document to its in-memory buffer and the transaction log (translog). A refresh (every 1 second by default) turns the buffer into a new Lucene segment, making the document visible to searches. A flush performs a Lucene commit, fsyncing segments durably to disk, and truncates the translog. Background merges compact multiple small segments into fewer large ones, improving query performance.
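The routing step described above can be sketched in a few lines. This is illustrative only: Elasticsearch actually uses a murmur3 hash, and md5 here is just a deterministic stand-in to demonstrate the principle.

```python
# Illustrative stand-in for Elasticsearch's routing formula:
#   shard = murmur3(_routing) % number_of_primary_shards
# (ES uses murmur3; md5 here only demonstrates the deterministic mapping)
import hashlib

def shard_for(routing_value: str, num_primary_shards: int) -> int:
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# The same routing value always maps to the same shard:
assert shard_for("order-42", 3) == shard_for("order-42", 3)
# Changing the primary count remaps existing documents, which is why
# number_of_shards is immutable after index creation.
```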
Tip
For bulk indexing, increase refresh_interval to 30s or -1 (disable) and restore after loading. This eliminates refresh overhead and can increase throughput by 10×. Also set number_of_replicas: 0 during bulk load, then re-enable replicas after.
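The tip above translates into two update-settings calls around the bulk load (index name assumed):

```
# Before bulk load: disable refresh and replicas
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

# After bulk load: restore defaults
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}
```

Re-enabling replicas after the load is cheaper than indexing with them, because replicas are rebuilt by copying completed segments rather than re-indexing each document.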
Query DSL — Searching and Filtering
Elasticsearch's Query DSL is a JSON-based language for expressing searches. The fundamental distinction is between query context (affects relevance scoring) and filter context (binary yes/no, cached). Filters are faster and should be used for exact-match conditions.
Bool Query — The Foundation
The bool query is the primary building block. It combines clauses with four operators:
GET /logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "message": "payment failed" } }
],
"filter": [
{ "term": { "level": "ERROR" } },
{ "range": { "timestamp": { "gte": "now-1h" } } },
{ "terms": { "service": ["checkout", "billing"] } }
],
"must_not": [
{ "term": { "env": "staging" } }
]
}
},
"sort": [{ "timestamp": "desc" }],
"size": 100
}

must clauses score and are required. filter clauses don't score and are cached — use them for all structured conditions. should boosts relevance but doesn't require a match (unless it's the only clause). must_not excludes matching documents.
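One subtlety worth showing: when a bool query contains must or filter clauses, should clauses become purely optional. To require at least one, set minimum_should_match explicitly. A sketch (index and field names assumed):

```
GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } }
      ],
      "should": [
        { "match": { "message": "timeout" } },
        { "match": { "message": "connection reset" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Without minimum_should_match, this query would return all ERROR documents, merely scoring those mentioning timeouts higher.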
Aggregations
Aggregations provide analytics over your data — equivalent to SQL GROUP BY, COUNT, and AVG, but executing in parallel across shards. Bucket aggregations group documents; metric aggregations compute values over a group.
GET /logs-*/_search
{
"size": 0,
"query": {
"range": { "timestamp": { "gte": "now-24h" } }
},
"aggs": {
"errors_per_service": {
"terms": {
"field": "service",
"size": 20,
"order": { "_count": "desc" }
},
"aggs": {
"error_rate_over_time": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "1h"
}
},
"avg_duration": {
"avg": { "field": "duration_ms" }
},
"p99_duration": {
"percentiles": {
"field": "duration_ms",
"percents": [50, 90, 95, 99]
}
}
}
}
}
}

This single query returns, in one round trip: the top 20 services by error count, plus per-service hourly error trends, average response times, and p50/p90/p95/p99 latency percentiles — all filtered to the last 24 hours.
Index Lifecycle Management (ILM)
ILM automates the movement of indices through a lifecycle — hot → warm → cold → frozen → delete — based on age, size, or document count thresholds. This is the cornerstone of cost-efficient log storage: you keep recent data on fast SSD nodes, and older data on cheaper warm/cold nodes or object storage.
Defining an ILM Policy
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "1d"
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "3d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": { "priority": 50 },
"allocate": {
"require": { "_tier_preference": "data_warm" }
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "s3-repo"
}
}
},
"delete": {
"min_age": "90d",
"actions": { "delete": {} }
}
}
}
}

Index Templates and Data Streams
Rather than managing individual indices, use data streams — an abstraction over a sequence of time-series indices (called backing indices) with automatic rollover. An index template applies settings, mappings, and the ILM policy to all new backing indices automatically.
# 1. Create an index template
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"data_stream": {},
"template": {
"settings": {
"index.lifecycle.name": "logs-policy",
"number_of_shards": 2,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"service": { "type": "keyword" },
"level": { "type": "keyword" },
"message": { "type": "text" }
}
}
}
}
# 2. Create the data stream (backing index auto-created)
PUT _data_stream/logs-app

Tip
Data streams enforce that every document has a @timestamp field and that all writes go to the current write index. Rollover happens automatically when the hot policy condition is met. This eliminates the need to manage index aliases manually.
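For example, appending a document to the data stream above (field values illustrative):

```
POST /logs-app/_doc
{
  "@timestamp": "2026-04-16T12:00:00Z",
  "service": "checkout",
  "level": "INFO",
  "message": "order placed"
}
```

A document without a @timestamp field is rejected, and the write lands in the current backing index regardless of how many rollovers have occurred.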
Logstash — Data Pipeline Engine
Logstash is a stateful ETL pipeline with a plugin-based architecture. A pipeline has three stages: input → filter → output. Multiple pipelines can run concurrently, each isolated in their own thread pool, enabling you to separate high-volume log ingestion from lower-volume metric pipelines.
A Production Pipeline
# /etc/logstash/conf.d/app-logs.conf
input {
beats {
port => 5044
ssl => true
ssl_certificate_authorities => ["/etc/logstash/certs/ca.crt"]
}
kafka {
bootstrap_servers => "kafka-01:9092,kafka-02:9092"
topics => ["app.logs.production"]
codec => "json"
consumer_threads => 4
}
}
filter {
# Parse structured fields from message
if [message] =~ /^{/ {
json { source => "message" }
} else {
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:service}\] %{GREEDYDATA:body}"
}
}
}
# Normalize timestamp
date {
match => ["timestamp", "ISO8601"]
target => "@timestamp"
timezone => "UTC"
}
# Enrich with GeoIP if IP present
if [client_ip] {
geoip { source => "client_ip" target => "geoip" }
}
# Drop health-check noise
if [path] =~ "/health" or [level] == "DEBUG" {
drop {}
}
mutate {
remove_field => ["timestamp", "host", "agent"]
}
}
output {
elasticsearch {
hosts => ["https://es-01:9200", "https://es-02:9200"]
data_stream => true
data_stream_type => "logs"
data_stream_dataset => "app"
data_stream_namespace => "production"
ssl_certificate_authorities => ["/etc/logstash/certs/ca.crt"]
api_key => "${ES_API_KEY}"
  }
}

Persistent Queues and Dead Letter Queues
By default, Logstash uses in-memory queues — events are lost if Logstash crashes between input and output. Enable Persistent Queues (PQ) to buffer events to disk, surviving restarts without data loss:
# logstash.yml
queue.type: persisted
queue.max_bytes: 4gb
queue.checkpoint.writes: 1024
# Dead Letter Queue — capture failed events for inspection
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 1gb

Events that fail processing (e.g., invalid JSON, mapping conflicts) land in the Dead Letter Queue. You can replay them with the dead_letter_queue input plugin after fixing the pipeline.
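A replay pipeline might look like the following sketch, assuming the default DLQ path and a pipeline id of "main" (adjust both to your deployment):

```
# /etc/logstash/conf.d/dlq-replay.conf
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"
    pipeline_id => "main"
    commit_offsets => true   # remember position so events aren't replayed twice
  }
}
output {
  # Re-send to Elasticsearch once the mapping or pipeline fix is deployed
  elasticsearch {
    hosts => ["https://es-01:9200"]
    api_key => "${ES_API_KEY}"
  }
}
```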
Beats — Lightweight Data Shippers
Beats are single-purpose agents written in Go, designed for minimal footprint (typically under 50 MB RAM). They ship data directly to Elasticsearch or via Logstash for additional processing. The most commonly deployed:
Filebeat — Log Collection
Filebeat reads log files, tracks its position in each file (so it survives restarts), and ships events. It includes dozens of pre-built modules for common log formats — NGINX, MySQL, Kubernetes, AWS, etc. — that auto-configure parsing and Kibana dashboards.
# filebeat.yml
filebeat.inputs:
- type: filestream
id: app-logs
paths:
- /var/log/app/*.log
- /var/log/app/**/*.log
parsers:
- multiline:
type: pattern
pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
negate: true
match: after
fields:
service: my-app
env: production
fields_under_root: true
# Use modules for common services
filebeat.modules:
- module: nginx
access:
enabled: true
var.paths: ["/var/log/nginx/access.log"]
error:
enabled: true
output.logstash:
hosts: ["logstash-01:5044"]
ssl.certificate_authorities: ["/etc/filebeat/ca.crt"]
processors:
- add_host_metadata: ~
- add_kubernetes_metadata: ~

Metricbeat — System and Service Metrics
Metricbeat collects system metrics (CPU, memory, disk, network) and service-specific metrics from MySQL, Redis, Elasticsearch itself, Kubernetes, Docker, and many others. It's the standard way to feed the Elasticsearch monitoring cluster.
# metricbeat.yml — monitor Elasticsearch itself
metricbeat.modules:
- module: elasticsearch
xpack.enabled: true
period: 10s
hosts: ["https://es-01:9200", "https://es-02:9200"]
ssl.certificate_authorities: ["/etc/metricbeat/ca.crt"]
api_key: "${ES_MONITORING_API_KEY}"
- module: system
period: 30s
metricsets:
- cpu
- memory
- filesystem
- network
- process
processes: ['.*']
output.elasticsearch:
hosts: ["https://monitoring-cluster:9200"]

Kibana — Visualization and Operations
Kibana is much more than a dashboard tool. For day-to-day operations, the most important features are Discover, Dashboards, Alerting, and the Dev Tools Console.
Discover — Ad-Hoc Exploration
Discover is your primary interface for log investigation. It provides a time-series histogram, a document table, and a KQL (Kibana Query Language) search bar. KQL is a simplified syntax on top of the Query DSL:
# KQL examples in Kibana Discover
# Find all errors from the checkout service in the last hour
level: ERROR and service: checkout
# Find 5xx responses with slow response times
http.response.status_code >= 500 and duration_ms > 5000
# Wildcard search in message (unquoted wildcard values can't contain spaces)
message: *refused*
# Phrase match
message: "null pointer exception"
# Range on numeric field
duration_ms >= 1000 and duration_ms <= 5000

Alerting Rules
Kibana Alerting (the successor to Elasticsearch's commercial Watcher feature, with core rule types now available in the free Basic tier) lets you define threshold-based rules that run on a schedule and trigger actions (Slack, PagerDuty, email, webhook). An Elasticsearch query rule is the most flexible type:
# Via Kibana UI → Stack Management → Rules
# Rule: alert when error rate spikes
Rule type: Elasticsearch query
Index: logs-app
Query:
{
"bool": {
"filter": [
{ "term": { "level": "ERROR" } },
{ "range": { "@timestamp": { "gte": "now-5m" } } }
]
}
}
Threshold: count > 50 in the last 5 minutes
Action: Slack webhook → #alerts-prod

Security — TLS, Authentication, and RBAC
Since Elasticsearch 8.0, security is enabled by default: a fresh install auto-configures TLS, and a node with security enabled will refuse to start without transport TLS configured. The key security layers are:
Transport and HTTP TLS
# elasticsearch.yml — minimal TLS config
xpack.security.enabled: true
xpack.security.transport.ssl:
enabled: true
verification_mode: certificate
keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.http.ssl:
enabled: true
keystore.path: /etc/elasticsearch/certs/http.p12

Generate certificates with the bundled elasticsearch-certutil tool. For production, use a proper CA (Let's Encrypt for HTTP, internal CA for transport).
RBAC — Roles and Users
Elasticsearch uses role-based access control. A role defines which indices a user can access and what operations are permitted. Define a read-only role for Kibana dashboard viewers:
PUT _security/role/logs-viewer
{
"cluster": ["monitor"],
"indices": [
{
"names": ["logs-*"],
"privileges": ["read", "view_index_metadata"],
"query": {
"term": { "env": "production" }
}
}
],
"applications": [
{
"application": "kibana-.kibana",
"privileges": ["feature_discover.read", "feature_dashboard.read"],
"resources": ["space:default"]
}
]
}
PUT _security/user/alice
{
"password": "...",
"roles": ["logs-viewer"],
"full_name": "Alice Smith"
}

Tip
Use API keys (POST _security/api_key) for service accounts instead of username/password. API keys are scoped, revocable, and don't expose credentials in config files. Store them in your secrets manager and reference via environment variables in Logstash/Beats configs.
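A sketch of creating a scoped API key for a log shipper (key name, role name, and privileges are illustrative; adjust to your indices):

```
POST /_security/api_key
{
  "name": "filebeat-prod",
  "expiration": "90d",
  "role_descriptors": {
    "logs-writer": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["logs-*"],
          "privileges": ["create_doc", "auto_configure"]
        }
      ]
    }
  }
}
```

The response contains the key id and secret; combine them as id:api_key, base64-encode, and reference the result from an environment variable in the shipper's config.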
Ingest Pipelines — Pre-Indexing Transformation
Ingest pipelines run on Elasticsearch's ingest nodes before a document is written to an index. They are lighter-weight than Logstash but support most common transformations. Use them when you want to enrich data without a separate Logstash deployment.
PUT _ingest/pipeline/parse-app-logs
{
"description": "Parse and enrich application logs",
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log.level} \\[%{DATA:service.name}\\] %{GREEDYDATA:log.original}"
]
}
},
{
"date": {
"field": "timestamp",
"formats": ["ISO8601"],
"target_field": "@timestamp"
}
},
{
"geoip": {
"field": "client.ip",
"target_field": "client.geo",
"ignore_missing": true
}
},
{
"set": {
"field": "data_stream.dataset",
"value": "app"
}
},
{
"remove": {
"field": ["timestamp", "message"],
"ignore_missing": true
}
}
],
"on_failure": [
{
"set": {
"field": "_index",
"value": "failed-{{ _index }}"
}
}
]
}

Production Best Practices
Cluster Sizing Rules of Thumb
- JVM heap: no more than 50% of RAM, capped at 31 GB (beyond this, compressed ordinary object pointers are disabled and performance degrades sharply).
- Data nodes: size for 1:30 heap-to-disk ratio for hot data (e.g., 16 GB heap → 480 GB usable storage per node).
- Master nodes: always 3 dedicated masters — never collocate master + data on large clusters to avoid GC pauses affecting cluster stability.
- Shard count: keep total shard count per node below 20 shards per GB of heap (e.g., 16 GB heap → max 320 shards).
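The rules of thumb above are easy to sanity-check in code. A small sketch, pure arithmetic, using only the numbers from the bullets above:

```python
# Back-of-envelope hot-node sizing from the rules of thumb above.
def size_hot_node(ram_gb: int) -> dict:
    heap_gb = min(ram_gb // 2, 31)   # heap = 50% of RAM, capped at 31 GB
    return {
        "heap_gb": heap_gb,
        "disk_gb": heap_gb * 30,     # 1:30 heap-to-disk ratio for hot data
        "max_shards": heap_gb * 20,  # keep below ~20 shards per GB of heap
    }

print(size_hot_node(32))
# → {'heap_gb': 16, 'disk_gb': 480, 'max_shards': 320}
print(size_hot_node(128)["heap_gb"])
# → 31 (compressed-oops cap applies even with 128 GB of RAM)
```

Note the second case: beyond 62 GB of RAM, extra memory no longer grows the heap; it still helps, because Lucene relies heavily on the OS filesystem cache.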
Operating System Configuration
# /etc/sysctl.d/elasticsearch.conf
vm.max_map_count = 262144 # Required — Lucene memory-mapped files
vm.swappiness = 1 # Minimize swap; set to 0 or disable entirely
net.core.somaxconn = 65535
# /etc/security/limits.conf for elasticsearch user
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

Snapshot and Restore
Register an S3 (or GCS/Azure) snapshot repository and automate daily snapshots with Snapshot Lifecycle Management (SLM). This is your last line of defense against data loss and the cheapest way to implement cold storage.
# Register S3 repository
PUT _snapshot/s3-repo
{
"type": "s3",
"settings": {
"bucket": "my-es-snapshots",
"region": "eu-central-1",
"server_side_encryption": true
}
}
# SLM policy — daily snapshot, retain 30 days
PUT _slm/policy/daily-snapshots
{
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "s3-repo",
"config": {
"include_global_state": false,
"indices": ["logs-*", "metrics-*"]
},
"retention": {
"expire_after": "30d",
"min_count": 7,
"max_count": 30
}
}

Monitoring the Cluster
Run a dedicated monitoring cluster — never ship metrics to the same cluster you're monitoring. If that cluster goes down, you lose visibility exactly when you need it most. Use Metricbeat with the elasticsearch module pointing to your production cluster, outputting to the monitoring cluster.
Key metrics to alert on:
- Cluster status: alert on anything other than green. Yellow = unassigned replicas. Red = unassigned primaries (data unavailable).
- JVM heap used: alert above 75% sustained. GC pressure above 85% will cause performance cliffs.
- Search latency p99: track via GET _nodes/stats/indices/search. Alert if query_time_in_millis / query_total trends upward.
- Indexing throughput: indexing.index_total delta. Alert on sudden drops (ingest pipeline down) or spikes (log storms).
- Disk usage: alert at 75% — ILM needs headroom to complete rollovers and force-merges. Elasticsearch will refuse writes at 95% (flood_stage).
Common Anti-Patterns to Avoid
Dynamic Mapping Gone Wrong
By default, Elasticsearch auto-detects field types from the first document. This works for prototyping but causes problems in production: a field mapped as long can't store a string value from a later deployment. The mapping explosion problem occurs when dynamic mapping creates thousands of fields — often from JSON logs with arbitrary key names. Mitigate by:
# Disable dynamic mapping for unknown fields, but allow known ones
PUT /logs-app
{
"mappings": {
"dynamic": "strict",
"properties": {
"@timestamp": { "type": "date" },
"service": { "type": "keyword" },
"level": { "type": "keyword" },
"message": { "type": "text" },
"duration_ms": { "type": "integer" },
"labels": {
"type": "object",
"dynamic": true
}
}
}
}

Deep Pagination
Using from + size for deep pagination is expensive — Elasticsearch must fetch and discard from + size documents from every shard. For paginating through large result sets, use search_after with a sort tiebreaker:
# First page
GET logs-app/_search
{
"size": 100,
"sort": [{ "@timestamp": "desc" }, { "_id": "asc" }]
}
# Subsequent pages — pass last sort values from previous response
GET logs-app/_search
{
"size": 100,
"sort": [{ "@timestamp": "desc" }, { "_id": "asc" }],
"search_after": ["2026-04-16T12:34:56.000Z", "doc-id-xyz"]
}

Unbounded Wildcard Queries
Leading wildcards (*foo) require scanning every term in the index and are among the most expensive queries possible. Use edge N-gram tokenizers at index time to support prefix-style autocomplete without runtime wildcard cost. For full-text search, use match or match_phrase instead.
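A sketch of the edge n-gram approach (index and field names assumed). The key detail is using a different analyzer at search time, so queries match the pre-generated prefixes rather than being n-grammed themselves:

```
PUT /products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_edge",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
```

At index time, "elasticsearch" is stored as the prefixes el, ela, elas, and so on; a match query for "elas" then hits via a cheap exact term lookup instead of a wildcard scan.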
Running Elastic Stack in production or planning a migration?
We design, deploy, and optimize Elastic Stack environments at enterprise scale — from cluster architecture and ILM policies to ingest pipelines, security hardening, and cost-efficient tiered storage. Let’s talk.
Send a Message