Why Redis is not just a cache
Redis started its life in 2009 as a simple in-memory key-value store. In 2026 it is one of the most widely deployed pieces of infrastructure in the modern backend stack — not because it does one thing well, but because it provides a small set of primitives that compose into solutions for an enormous range of problems: session storage, rate limiting, distributed locks, job queues, leaderboards, real-time analytics, message fanout, and event streaming. The Redis data model is not a relational table or a document store — it is a collection of typed data structures, each with its own set of atomic operations.
Running Redis correctly in production requires understanding more than just GET and SET. The choice of eviction policy determines what happens when memory is exhausted. The choice between RDB snapshots and AOF logging determines your durability guarantees after a crash. The choice between Pub/Sub and Streams determines whether a slow consumer causes message loss. This article covers the patterns that separate a Redis deployment that mostly works from one that is genuinely production-ready.
The Redis documentation is exceptionally well-written and is the authoritative source for command semantics. The Redis in Action book (freely available online) remains the most thorough treatment of Redis patterns in the wild.
The Redis Data Model — Choosing the Right Structure
Every Redis key maps to a typed value. The type determines which commands are valid and what performance characteristics to expect. Using the wrong type for a problem — storing a JSON blob as a String when a Hash would give you field-level atomic updates — is the most common source of unnecessary complexity in Redis integrations.
String
The base type. Stores any binary-safe value up to 512 MB. Used for simple key-value caching, counters (INCR, INCRBY are atomic), feature flags, and distributed locks. The set-if-not-exists semantics of SETNX underpin Redis-based distributed locks, including the Redlock algorithm. The NX + EX flags on SET replace the old SETNX + EXPIRE pair with a single atomic operation.
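A minimal redis-py sketch of these String idioms — an atomic counter and a set-if-not-exists flag with expiry; the key names are invented for illustration:

# string_basics.py — atomic counter and SET NX EX (sketch; key names illustrative)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# INCR executes atomically on the server — safe under concurrent writers
views = r.incr("counter:page_views")

# SET with NX (only if absent) and EX (TTL in seconds) in one atomic command.
# Exactly one concurrent caller sees True — the basis of a simple lock or flag.
acquired = r.set("flag:nightly_job", "1", nx=True, ex=600)
if acquired:
    print(f"flag acquired; page views so far: {views}")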
Hash
A map of field-value pairs stored under a single key. Ideal for representing objects (user sessions, product records) where individual fields need to be read or updated atomically without fetching the entire object. HGET, HSET, HINCRBY, and HGETALL give you fine-grained access. Small hashes use the compact listpack encoding (ziplist before Redis 7) below a configurable threshold, making them significantly more memory-efficient than individual String keys for small objects.
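For example, a session object stored as a Hash can be updated one field at a time (a sketch; the session key and fields are illustrative):

# hash_session.py — field-level access to an object (sketch; fields illustrative)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
session_key = "session:abc123"

# Store the object as individual fields — no whole-blob read-modify-write
r.hset(session_key, mapping={"user_id": "42", "theme": "dark", "cart_items": "0"})
r.expire(session_key, 1800)

r.hincrby(session_key, "cart_items", 1)   # atomic update of ONE field
theme = r.hget(session_key, "theme")      # read ONE field
session = r.hgetall(session_key)          # full object only when needed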
List
A linked list (a quicklist of compact nodes in modern Redis) supporting O(1) push/pop from both ends. LPUSH + BRPOP is the classic pattern for a simple blocking job queue, as sketched below. Lists are ordered by insertion time. LRANGE gives you range queries over the list. For bounded lists (recent activity feeds), LTRIM keeps the list at a fixed length. Not suitable for O(1) membership testing — use a Set for that.
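The queue pattern in two functions (a sketch, assuming an illustrative queue name):

# list_queue.py — LPUSH + BRPOP blocking job queue (sketch)
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
QUEUE = "jobs:thumbnails"  # illustrative

def enqueue(job: dict) -> None:
    r.lpush(QUEUE, json.dumps(job))

def worker_loop() -> None:
    while True:
        # Blocks up to 5s; returns (queue_name, payload) or None on timeout
        item = r.brpop(QUEUE, timeout=5)
        if item is None:
            continue
        print(f"processing {json.loads(item[1])}")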
Set
An unordered collection of unique strings. SADD, SREM, SISMEMBER, and SMEMBERS provide O(1) membership testing and set algebra (SUNION, SINTER, SDIFF). Used for tagging, unique visitor tracking, and maintaining whitelist/blacklist sets. SADD is idempotent — adding an existing member is a no-op. Random sampling with SRANDMEMBER avoids the need to load the full set.
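A short sketch of unique-visitor tracking and set algebra (the daily key names are illustrative):

# set_uniques.py — unique visitor tracking (sketch; key names illustrative)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

r.sadd("visitors:2026-01-15", "user:42")            # idempotent
r.sadd("visitors:2026-01-15", "user:42", "user:7")  # re-add is a no-op

r.sismember("visitors:2026-01-15", "user:42")       # O(1) -> True
r.scard("visitors:2026-01-15")                      # cardinality -> 2

# Set algebra: users seen on both days
returning = r.sinter("visitors:2026-01-15", "visitors:2026-01-16")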
Sorted Set (ZSet)
Members are unique strings, each associated with a float score. Members are ordered by score ascending. ZADD, ZRANK, ZRANGE, ZRANGEBYSCORE give you leaderboard queries, priority queue semantics, and time-window aggregations. The skip-list data structure gives O(log N) for add, remove, and range queries. Classic use cases: rate limiting windows (score = timestamp), leaderboards (score = points), and scheduling future tasks (score = scheduled epoch).
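A leaderboard sketch covering the classic query shapes (key and member names are illustrative):

# zset_leaderboard.py — Sorted Set leaderboard queries (sketch)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
board = "leaderboard:weekly"  # illustrative

r.zadd(board, {"alice": 3100, "bob": 2750})
r.zincrby(board, 50, "bob")                        # atomic score bump

top10 = r.zrevrange(board, 0, 9, withscores=True)  # highest scores first
rank = r.zrevrank(board, "alice")                  # 0-based rank from the top

# Range by score — e.g. a time window when score = epoch timestamp
window = r.zrangebyscore(board, 1000, 2000)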
Stream
An append-only log of entries, each with an auto-generated ID (millisecond timestamp + sequence) and a set of field-value pairs. Supports consumer groups for at-least-once delivery with acknowledgement, similar to Kafka consumer groups but embedded in Redis. Used for event sourcing, activity feeds, audit logs, and lightweight message queues where Kafka is too heavy. XADD appends, XREAD reads, XREADGROUP reads for a consumer group, XACK acknowledges.
Caching Strategies — Cache-Aside, Write-Through, and Write-Behind
The relationship between Redis and your primary database defines your caching strategy. Each pattern trades off consistency, write amplification, and implementation complexity differently. Choosing the wrong pattern for your read/write ratio or consistency requirements is the most common cause of stale data bugs in Redis-backed systems.
Cache-Aside (Lazy Loading)
The application is responsible for populating the cache. On a cache miss, the application reads from the database, writes the result to Redis, and returns it. The cache is only populated with data that is actually requested — no wasteful pre-warming of cold data. The tradeoff is that the first request after a cache miss (or after TTL expiry) hits the database, which adds latency.
# cache_aside.py — Cache-Aside pattern with Redis and PostgreSQL
import json
import redis
import psycopg2
from typing import Any
r = redis.Redis(host="redis", port=6379, decode_responses=True)
def get_product(product_id: str, db_conn) -> dict[str, Any] | None:
cache_key = f"product:{product_id}"
# 1. Try cache first
cached = r.get(cache_key)
if cached is not None:
return json.loads(cached)
# 2. Cache miss — query the database
with db_conn.cursor() as cur:
cur.execute(
"SELECT id, name, price, stock FROM products WHERE id = %s",
(product_id,),
)
row = cur.fetchone()
if row is None:
# Cache negative results to prevent cache stampede on non-existent keys
r.setex(cache_key, 30, json.dumps(None))
return None
product = {"id": row[0], "name": row[1], "price": float(row[2]), "stock": row[3]}
# 3. Populate the cache with a TTL
r.setex(cache_key, 300, json.dumps(product)) # 5-minute TTL
return product
def invalidate_product(product_id: str) -> None:
"""Call this whenever a product is updated in the database."""
r.delete(f"product:{product_id}")Write-Through
Every write goes to both the cache and the database synchronously. The cache is always consistent with the database — a cache hit is guaranteed to return current data. The cost is write latency (two writes per update) and potential cache churn for data that is rarely read after being written. Write-Through is the right choice when read-after-write consistency is critical and write throughput is moderate.
# write_through.py — Write-Through pattern
import json
import redis
import psycopg2
r = redis.Redis(host="redis", port=6379, decode_responses=True)
def update_product_price(product_id: str, new_price: float, db_conn) -> None:
# 1. Write to the primary database first
with db_conn.cursor() as cur:
cur.execute(
"UPDATE products SET price = %s, updated_at = now() WHERE id = %s",
(new_price, product_id),
)
db_conn.commit()
# 2. Update the cache immediately — fetch the full current record
with db_conn.cursor() as cur:
cur.execute(
"SELECT id, name, price, stock FROM products WHERE id = %s",
(product_id,),
)
row = cur.fetchone()
if row:
product = {"id": row[0], "name": row[1], "price": float(row[2]), "stock": row[3]}
r.setex(f"product:{product_id}", 300, json.dumps(product))
else:
# Row deleted between UPDATE and SELECT — invalidate
r.delete(f"product:{product_id}")Write-Behind (Write-Back)
Writes go to Redis immediately; the database write is deferred and batched. This dramatically reduces write latency for write-heavy workloads (session state, counters, real-time analytics). The tradeoff is durability risk: if Redis crashes before the write-behind worker flushes to the database, those writes are lost. Write-Behind is appropriate when some data loss is tolerable and write throughput is the primary constraint.
# write_behind.py — Write-Behind with a Redis List as a write queue
import json
import time
import redis
import threading
r = redis.Redis(host="redis", port=6379, decode_responses=True)
WRITE_QUEUE = "write_queue:products"
def update_product_price_async(product_id: str, new_price: float) -> None:
"""Write to Redis immediately, queue the DB write."""
# Update cache value immediately
cache_key = f"product:{product_id}"
product = r.get(cache_key)
if product:
data = json.loads(product)
data["price"] = new_price
r.setex(cache_key, 300, json.dumps(data))
# Enqueue the DB write
r.rpush(WRITE_QUEUE, json.dumps({
"product_id": product_id,
"price": new_price,
"queued_at": time.time(),
}))
def write_behind_worker(db_conn) -> None:
"""Background thread that flushes queued writes to the database."""
while True:
# Blocking pop — waits up to 5 seconds for a new entry
item = r.blpop(WRITE_QUEUE, timeout=5)
if item is None:
continue
payload = json.loads(item[1])
try:
with db_conn.cursor() as cur:
cur.execute(
"UPDATE products SET price = %s, updated_at = now() WHERE id = %s",
(payload["price"], payload["product_id"]),
)
db_conn.commit()
except Exception as exc:
# On failure, re-enqueue for retry (dead letter queue for repeated failures)
print(f"Write-behind flush failed: {exc}")
r.rpush(WRITE_QUEUE, item[1])
# Start the flush worker in a background thread
# In production: run as a separate process or Kubernetes Job
def start_write_behind(db_conn) -> None:
t = threading.Thread(target=write_behind_worker, args=(db_conn,), daemon=True)
    t.start()
TTL Management and Eviction Policies
Every key in Redis can have a TTL (time to live) set in seconds or milliseconds. When the TTL expires, Redis lazily removes the key on the next access, and also actively samples expired keys in the background. TTL design is as important as data structure selection — a key with no TTL in a cache is a memory leak waiting to happen.
# ttl_patterns.py — TTL patterns and cache stampede prevention
import json
import math
import random
import time
import redis
r = redis.Redis(host="redis", port=6379, decode_responses=True)
# ----- Jitter to prevent thundering herd -----
def set_with_jitter(key: str, value: str, base_ttl: int, jitter_pct: float = 0.1) -> None:
"""
Add random jitter to TTLs so mass-expiry events don't spike DB load.
jitter_pct=0.1 means ±10% of base_ttl.
"""
jitter = int(base_ttl * jitter_pct * (random.random() * 2 - 1))
ttl = max(1, base_ttl + jitter)
r.setex(key, ttl, value)
# ----- Probabilistic early expiry (XFetch algorithm) -----
def get_or_compute(
key: str,
compute_fn,
ttl: int,
beta: float = 1.0,
) -> dict:
"""
XFetch: recompute slightly before expiry to avoid cache stampede.
A worker recomputes when: current_time - beta * delta * log(random()) > expiry
"""
raw = r.get(key)
if raw is not None:
data = json.loads(raw)
remaining = r.ttl(key)
# Probabilistic early recompute
if remaining < ttl * 0.2: # within 20% of expiry
            threshold = -beta * data.get("_delta", 1.0) * math.log(max(random.random(), 1e-10))
if remaining < threshold:
raw = None # trigger recompute
if raw is None:
start = time.monotonic()
value = compute_fn()
delta = time.monotonic() - start
payload = {**value, "_delta": delta}
set_with_jitter(key, json.dumps(payload), ttl)
return value
return {k: v for k, v in json.loads(raw).items() if k != "_delta"}
When Redis runs out of memory, it applies an eviction policy to free space. The policy is configured in redis.conf or via CONFIG SET maxmemory-policy:
# redis.conf — eviction policy configuration
maxmemory 4gb # hard memory cap — Redis will not allocate beyond this
maxmemory-policy allkeys-lru # evict any key using LRU approximation when memory is full
# Policy options:
# noeviction — return errors on writes when memory is full (default — bad for caches)
# allkeys-lru — evict least recently used key across ALL keys (best for pure caches)
# volatile-lru — LRU eviction of keys WITH a TTL only (good for mixed use)
# allkeys-lfu — evict least frequently used key (better than LRU for skewed access)
# volatile-lfu — LFU of keys with TTL only
# allkeys-random — evict random key (rarely the right choice)
# volatile-ttl — evict the key with the shortest TTL first
# For a pure cache: allkeys-lru or allkeys-lfu
# For a mixed cache+data store: volatile-lru (ensures data-without-TTL is never evicted)
# For a queue/session store: noeviction + monitoring to stay under the limit
Pub/Sub vs Streams — Choosing the Right Messaging Model
Redis offers two messaging primitives with very different semantics. Choosing the wrong one is a common mistake that leads to lost messages in production.
Pub/Sub — fire-and-forget fanout
PUBLISH sends a message to all currently subscribed clients. There is no persistence — if no subscriber is connected, the message is lost. If a subscriber is slow or disconnects, messages are lost during the downtime window. Pub/Sub is appropriate for real-time notifications (chat messages, live dashboard updates) where occasional loss is acceptable. Not appropriate for anything that requires at-least-once delivery.
Streams — durable, consumer-group log
XADD appends a message to a persistent log. XREADGROUP delivers messages to a consumer group, tracking which messages have been delivered and which have been acknowledged (XACK). Unacknowledged messages are automatically re-delivered after a timeout (XAUTOCLAIM). Streams provide at-least-once delivery semantics, message history, consumer group fan-out, and the ability for slow consumers to catch up. Use Streams whenever you cannot afford to lose messages.
# pubsub_example.py — Redis Pub/Sub publisher and subscriber
import redis
import threading
import json
r_pub = redis.Redis(host="redis", port=6379, decode_responses=True)
# Publisher — sends a live price update to all connected subscribers
def publish_price_update(product_id: str, new_price: float) -> None:
payload = json.dumps({"product_id": product_id, "price": new_price})
subscriber_count = r_pub.publish("price_updates", payload)
print(f"Published to {subscriber_count} subscribers")
# Subscriber — listens for price updates in a background thread
def subscribe_price_updates() -> None:
r_sub = redis.Redis(host="redis", port=6379, decode_responses=True)
pubsub = r_sub.pubsub()
pubsub.subscribe("price_updates")
for message in pubsub.listen():
if message["type"] != "message":
continue
data = json.loads(message["data"])
print(f"Price update received: {data['product_id']} -> {data['price']}")
# Pattern subscribe — receive all events matching a glob pattern
def subscribe_all_events() -> None:
r_sub = redis.Redis(host="redis", port=6379, decode_responses=True)
pubsub = r_sub.pubsub()
pubsub.psubscribe("*_updates") # matches price_updates, inventory_updates, etc.
for message in pubsub.listen():
if message["type"] != "pmessage":
continue
print(f"[{message['channel']}] {message['data']}")Redis Streams — Consumer Groups and At-Least-Once Delivery
Redis Streams are the right tool for any workload that cannot lose messages. The consumer group model mirrors Kafka consumer groups: multiple consumers share a group, each message is delivered to exactly one consumer in the group, and consumers must acknowledge messages explicitly. Unacknowledged messages are tracked in the Pending Entries List (PEL) and re-delivered to other consumers after a configurable timeout.
# streams_producer.py — Producing messages to a Redis Stream
import redis
import json
import time
r = redis.Redis(host="redis", port=6379, decode_responses=True)
STREAM_KEY = "orders:events"
MAX_STREAM_LEN = 100_000 # trim stream to avoid unbounded growth
def publish_order_event(order_id: str, event_type: str, payload: dict) -> str:
"""
Append an event to the orders stream.
Returns the auto-generated message ID (millisecond-timestamp-sequence).
"""
message_id = r.xadd(
STREAM_KEY,
{
"order_id": order_id,
"event_type": event_type,
"payload": json.dumps(payload),
"published_at": str(int(time.time() * 1000)),
},
maxlen=MAX_STREAM_LEN, # MAXLEN with ~ for approximate trimming (fast)
approximate=True,
)
return message_id
# Example usage
if __name__ == "__main__":
mid = publish_order_event(
order_id="ord-abc123",
event_type="OrderPlaced",
payload={"customer_id": "cust-xyz", "total": 149.99},
)
print(f"Published: {mid}")# streams_consumer.py — Consumer group worker with acknowledgement and claim
import redis
import json
import time
import socket
r = redis.Redis(host="redis", port=6379, decode_responses=True)
STREAM_KEY = "orders:events"
GROUP_NAME = "order-processor"
CONSUMER_NAME = socket.gethostname() # unique per pod/instance
BLOCK_MS = 2000 # block up to 2s waiting for new messages
CLAIM_IDLE_MS = 30_000 # reclaim messages idle for 30s (dead consumer recovery)
BATCH_SIZE = 10
def ensure_consumer_group() -> None:
try:
r.xgroup_create(STREAM_KEY, GROUP_NAME, id="0", mkstream=True)
except redis.exceptions.ResponseError as e:
if "BUSYGROUP" not in str(e):
raise
def process_message(message_id: str, fields: dict) -> None:
event_type = fields.get("event_type")
payload = json.loads(fields.get("payload", "{}"))
order_id = fields.get("order_id")
print(f"Processing [{event_type}] order={order_id}")
# ... your business logic here
def run_consumer() -> None:
ensure_consumer_group()
while True:
# --- Phase 1: Claim any idle messages from dead consumers ---
claimed = r.xautoclaim(
STREAM_KEY,
GROUP_NAME,
CONSUMER_NAME,
min_idle_time=CLAIM_IDLE_MS,
start_id="0-0",
count=BATCH_SIZE,
)
# xautoclaim returns (next_start_id, messages, deleted_ids)
for message_id, fields in claimed[1]:
try:
process_message(message_id, fields)
r.xack(STREAM_KEY, GROUP_NAME, message_id)
except Exception as exc:
print(f"Failed to process {message_id}: {exc}")
# --- Phase 2: Read new messages ---
messages = r.xreadgroup(
GROUP_NAME,
CONSUMER_NAME,
{STREAM_KEY: ">"}, # ">" means: only undelivered messages
count=BATCH_SIZE,
block=BLOCK_MS,
)
if not messages:
continue
for stream_key, entries in messages:
for message_id, fields in entries:
try:
process_message(message_id, fields)
r.xack(STREAM_KEY, GROUP_NAME, message_id)
except Exception as exc:
print(f"Failed to process {message_id}: {exc}")
# Do NOT ack — message stays in PEL for re-delivery
if __name__ == "__main__":
    run_consumer()
Note
The XAUTOCLAIM command (available since Redis 6.2) is the modern replacement for the XPENDING + XCLAIM pattern. It atomically finds and claims messages that have been idle (not acknowledged) for longer than min_idle_time milliseconds — recovering work from crashed consumers without a separate maintenance job.
Node.js Patterns — Rate Limiting and Distributed Locks
Two of the most valuable Redis use cases beyond caching are rate limiting and distributed locking. Both require atomic operations that span multiple Redis commands — the right tool for this is Lua scripting, which executes atomically on the Redis server without a round-trip between commands.
// rate-limiter.ts — Sliding window rate limiter using Redis Sorted Sets
import Redis from "ioredis";
const redis = new Redis({ host: "redis", port: 6379 });
interface RateLimitResult {
allowed: boolean;
remaining: number;
resetAt: number;
}
/**
* Sliding window rate limiter.
* Stores each request as a member of a Sorted Set with the timestamp as score.
* Atomically counts requests in the current window and removes expired entries.
*/
const RATE_LIMIT_SCRIPT = `
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local window_start = now - window
-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
-- Count requests in the current window
local count = redis.call('ZCARD', key)
if count < limit then
-- Allow: add this request with the current timestamp as score
redis.call('ZADD', key, now, now .. ':' .. math.random(1000000))
redis.call('EXPIRE', key, math.ceil(window / 1000))
return {1, limit - count - 1, now + window}
else
-- Deny: return the timestamp of the oldest entry (when the window resets)
local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
local reset_at = oldest[2] and (tonumber(oldest[2]) + window) or (now + window)
return {0, 0, reset_at}
end
`;
export async function checkRateLimit(
identifier: string,
limitPerWindow: number,
windowMs: number
): Promise<RateLimitResult> {
const key = `ratelimit:${identifier}`;
const now = Date.now();
const result = await redis.eval(
RATE_LIMIT_SCRIPT,
1,
key,
String(now),
String(windowMs),
String(limitPerWindow)
) as [number, number, number];
return {
allowed: result[0] === 1,
remaining: result[1],
resetAt: result[2],
};
}
// Express middleware
export function rateLimitMiddleware(limit: number, windowMs: number) {
return async (req: any, res: any, next: any) => {
const identifier = req.ip;
const { allowed, remaining, resetAt } = await checkRateLimit(
identifier,
limit,
windowMs
);
res.setHeader("X-RateLimit-Limit", limit);
res.setHeader("X-RateLimit-Remaining", remaining);
res.setHeader("X-RateLimit-Reset", Math.ceil(resetAt / 1000));
if (!allowed) {
return res.status(429).json({ error: "Too Many Requests" });
}
next();
};
}

// distributed-lock.ts — Redlock-style distributed lock
import Redis from "ioredis";
import crypto from "crypto";
const redis = new Redis({ host: "redis", port: 6379 });
const LOCK_RELEASE_SCRIPT = `
if redis.call('get', KEYS[1]) == ARGV[1] then
return redis.call('del', KEYS[1])
else
return 0
end
`;
interface Lock {
key: string;
token: string;
release: () => Promise<boolean>;
}
/**
* Acquire a distributed lock with a TTL.
* Returns null if the lock is already held.
* The token ensures only the lock holder can release it.
*/
export async function acquireLock(
resource: string,
ttlMs: number
): Promise<Lock | null> {
const key = `lock:${resource}`;
const token = crypto.randomUUID();
// SET NX PX: set only if not exists, with millisecond expiry — atomic
const result = await redis.set(key, token, "NX", "PX", ttlMs);
if (result !== "OK") {
return null; // Lock is held by another process
}
return {
key,
token,
release: async () => {
// Lua script ensures we only delete our own lock token
const released = await redis.eval(LOCK_RELEASE_SCRIPT, 1, key, token);
return released === 1;
},
};
}
// Usage example
async function processOrder(orderId: string): Promise<void> {
const lock = await acquireLock(`order:${orderId}`, 30_000);
if (!lock) {
throw new Error(`Order ${orderId} is already being processed`);
}
try {
// Exclusive processing — no other instance can enter this block
console.log(`Processing order ${orderId}`);
await new Promise((resolve) => setTimeout(resolve, 1000));
} finally {
await lock.release();
}
}
Redis Cluster — Horizontal Sharding and High Availability
A single Redis instance is limited to the memory and CPU of one machine. Redis Cluster solves this by sharding keys across multiple primary nodes using a fixed hash slot space of 16,384 slots — each key maps to a slot via CRC16(key) mod 16384. Each primary node owns a range of slots. When nodes are added or removed, slots are migrated between them with an explicit reshard operation (redis-cli --cluster reshard); the cluster does not rebalance automatically.
Redis Cluster also provides high availability: each primary has one or more replicas. If a primary fails, the cluster promotes a replica automatically within a few seconds. Applications connect to any cluster node and are redirected (via MOVED and ASK redirects) to the correct shard.
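To see slot assignment concretely, any node will report which slot a key hashes to via the CLUSTER KEYSLOT command — a minimal sketch (the node address and keys are illustrative):

# keyslot_demo.py — inspect hash slot assignment (sketch; addresses illustrative)
import redis

# A plain client pointed at any single cluster node can run CLUSTER KEYSLOT
node = redis.Redis(host="redis-primary-1", port=7001, decode_responses=True)

slot_a = node.execute_command("CLUSTER", "KEYSLOT", "user:42:session")
slot_b = node.execute_command("CLUSTER", "KEYSLOT", "user:42:cart")
print(slot_a, slot_b)  # usually different slots — MGET across them fails CROSSSLOT

slot_c = node.execute_command("CLUSTER", "KEYSLOT", "{user:42}:session")
slot_d = node.execute_command("CLUSTER", "KEYSLOT", "{user:42}:cart")
assert slot_c == slot_d  # the {user:42} hash tag pins both keys to one slot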
# cluster_config.yaml — Redis Cluster configuration (6 nodes: 3 primary + 3 replica)
# redis-primary-1.conf
port 7001
cluster-enabled yes
cluster-config-file nodes-7001.conf
cluster-node-timeout 5000
appendonly yes
appendfsync everysec
save 900 1
save 300 10
# redis-primary-2.conf
port 7002
cluster-enabled yes
cluster-config-file nodes-7002.conf
cluster-node-timeout 5000
appendonly yes
appendfsync everysec
# redis-primary-3.conf
port 7003
cluster-enabled yes
cluster-config-file nodes-7003.conf
cluster-node-timeout 5000
appendonly yes
appendfsync everysec
# Create the cluster (run once)
# redis-cli --cluster create \
# 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 \
# 127.0.0.1:7004 127.0.0.1:7005 127.0.0.1:7006 \
# --cluster-replicas 1

# cluster_client.py — Redis Cluster client with hash tags for co-location
from redis.cluster import RedisCluster, ClusterNode
startup_nodes = [
ClusterNode("redis-primary-1", 7001),
ClusterNode("redis-primary-2", 7002),
ClusterNode("redis-primary-3", 7003),
]
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
# Hash tags force multiple keys to the same slot
# Keys with the same {tag} hash to the same slot, enabling multi-key operations
def get_user_session(user_id: str) -> dict:
# {user_id} is the hash tag — all three keys land on the same slot
session_key = f"{{user:{user_id}}}:session"
cart_key = f"{{user:{user_id}}}:cart"
prefs_key = f"{{user:{user_id}}}:prefs"
    # MGET in cluster mode only works if all keys hash to the same slot.
    # Without the hash tag, this call would fail with a CROSSSLOT error.
session, cart, prefs = rc.mget([session_key, cart_key, prefs_key])
return {
"session": session,
"cart_items": int(cart or 0),
"preferences": prefs,
}
# Pipeline in cluster mode — commands are batched per shard automatically
def batch_set_user_data(user_data: list[dict]) -> None:
pipe = rc.pipeline()
for user in user_data:
uid = user["id"]
pipe.setex(f"user:{uid}:profile", 3600, str(user))
    pipe.execute()
Note
MGET, MSET, and Lua scripts are only valid in Redis Cluster if all keys hash to the same slot. Use hash tags — the {tag} portion of the key — to force related keys onto the same shard. Without this, you will hit CROSSSLOT errors at runtime.
Persistence — RDB Snapshots vs AOF Logging
Redis is an in-memory database, but it offers two persistence mechanisms that can be used independently or together. Choosing the right durability model depends on how much data loss your system can tolerate after a crash.
RDB (Redis Database) — point-in-time snapshots
Redis forks a child process that writes the entire dataset to disk as a compact binary file. The parent continues serving requests without blocking. RDB snapshots are fast to restore (Redis reads one file on startup), small on disk, and good for disaster recovery. The tradeoff is that you lose all writes since the last snapshot — typically minutes of data. Configure with SAVE directives: "save 900 1" (snapshot if at least 1 change in 900 seconds), "save 300 10" (at least 10 changes in 300 seconds).
AOF (Append Only File) — write-ahead log
Every write command is appended to the AOF log. On restart, Redis replays the log to reconstruct state. AOF provides much better durability than RDB — with appendfsync everysec you lose at most 1 second of data; with appendfsync always you lose no data but take a performance hit. AOF files grow indefinitely and are compacted periodically via BGREWRITEAOF. AOF is slower to restore than RDB for large datasets because the entire command log must be replayed.
RDB + AOF together (recommended for production)
When both are enabled, Redis uses the AOF for startup (better durability). RDB provides a compact backup for disaster recovery. Use appendfsync everysec for the AOF (good balance of durability and performance) and periodic RDB snapshots for fast restores. This is the recommended production configuration for any Redis deployment where data loss is unacceptable.
# redis.conf — Production persistence configuration
# RDB snapshots
save 900 1 # snapshot if >= 1 key changed in 900 seconds
save 300 10 # snapshot if >= 10 keys changed in 300 seconds
save 60 10000 # snapshot if >= 10000 keys changed in 60 seconds
dbfilename dump.rdb
dir /var/lib/redis
# AOF
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec # fsync every second — good balance
# appendfsync always # fsync on every write — max durability, lower throughput
# appendfsync no # let OS flush — fast but can lose up to 30s of data
no-appendfsync-on-rewrite yes # don't fsync during BGREWRITEAOF
auto-aof-rewrite-percentage 100 # rewrite AOF when it grows 100% larger than base
auto-aof-rewrite-min-size 64mb # but only if AOF is at least 64MB
# Memory
maxmemory 3gb
maxmemory-policy allkeys-lru
# Slow log — log commands slower than 10ms
slowlog-log-slower-than 10000
slowlog-max-len 1000
Production Operations Checklist
Set maxmemory and an eviction policy — always
A Redis instance without a memory cap and eviction policy will grow until the OS kills it. Set maxmemory to 75-80% of the machine's available RAM (leave headroom for fork during RDB/AOF rewrite). For pure caches, use allkeys-lru. For mixed stores, use volatile-lru. Never run production Redis with noeviction unless you are monitoring memory obsessively.
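A quick runtime check of the cap and policy from redis-py (a sketch; the printed values are illustrative):

# memcheck.py — verify memory cap and eviction policy at runtime (sketch)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

print(r.config_get("maxmemory"))         # e.g. {'maxmemory': '4294967296'}
print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'allkeys-lru'}

mem = r.info("memory")
print(mem["used_memory_human"])          # current usage vs. the configured cap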
Monitor the keyspace hit rate
The hit rate (keyspace_hits / (keyspace_hits + keyspace_misses)) tells you how effective your cache is. A hit rate below 80% usually means TTLs are too short, the cache is too small, or keys are being evicted too aggressively. Monitor with: redis-cli INFO stats | grep keyspace. Alert if the hit rate drops more than 10 percentage points from baseline.
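The same calculation from redis-py, suitable for a periodic metrics job (a sketch):

# hitrate.py — compute keyspace hit rate from INFO stats (sketch)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
total = hits + misses
hit_rate = hits / total if total else 0.0
print(f"hit rate: {hit_rate:.1%} ({hits} hits, {misses} misses)")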
Use connection pooling, not individual connections
Each Redis connection uses a file descriptor and memory on both client and server. Thousands of short-lived connections degrade performance significantly. Use a connection pool (redis-py's ConnectionPool, ioredis's built-in pooling) sized to your application's concurrency. For serverless functions (Lambda, Cloud Functions), initialize the client outside the handler so that warm invocations reuse one persistent TCP connection, rather than opening a new connection per invocation.
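A redis-py pooling sketch — one pool per process, created at import time (the sizing is illustrative):

# pool.py — shared connection pool (sketch; sizing illustrative)
import redis

# ONE pool per process, sized to your concurrency (threads/coroutines)
pool = redis.ConnectionPool(
    host="redis",
    port=6379,
    max_connections=50,
    decode_responses=True,
)

# Client objects are cheap wrappers that borrow connections from the pool
r = redis.Redis(connection_pool=pool)

def handler(key: str) -> str | None:
    # Each command checks a connection out and returns it when done
    return r.get(key)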
Avoid KEYS in production
KEYS pattern scans the entire keyspace and blocks the Redis event loop for the duration. On a large dataset, this can stall all other operations for seconds. Use SCAN instead — it iterates in small batches (cursor-based), allowing other commands to interleave. Similarly, avoid SMEMBERS on large Sets (use SSCAN) and HGETALL on large Hashes (use HSCAN).
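The cursor-based alternatives in redis-py (a sketch; key patterns are illustrative):

# scan_not_keys.py — non-blocking keyspace iteration (sketch)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# scan_iter wraps cursor-based SCAN — small batches, never a long stall
for key in r.scan_iter(match="product:*", count=500):
    print(key)

# Same idea for large collections:
for member in r.sscan_iter("visitors:2026-01-15", count=500):
    ...
for field, value in r.hscan_iter("session:abc123", count=500):
    ...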
Design for single-threaded command execution
Redis processes commands in a single thread (with I/O threading available from Redis 6.0 for reading/writing). A single slow command (SORT on a large list, KEYS, a badly written Lua script) blocks ALL other clients. Keep command complexity O(1) or O(log N) where possible. Profile with SLOWLOG RESET followed by SLOWLOG GET after a load test to find problematic commands.
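The profiling loop from redis-py looks like this (a sketch; entry fields follow redis-py's parsed SLOWLOG format):

# slowlog_check.py — find slow commands after a load test (sketch)
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

r.slowlog_reset()  # clear the log before the load test
# ... run the load test ...
for entry in r.slowlog_get(25):
    # duration is in microseconds; command is the full offending invocation
    print(entry["duration"], entry["command"])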
Test failover before you need it
Redis Sentinel and Redis Cluster both handle primary failure automatically, but the failover process takes 5-30 seconds. During this window, writes may fail or be lost. Test failover regularly by killing a primary node: verify that your application handles ECONNREFUSED gracefully, that clients reconnect to the new primary, and that the failover completes within your SLA. Document the expected data loss window from your persistence configuration.
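On the client side, redis-py ships retry support that can ride out a short failover window — a sketch, assuming redis-py 4.x+ (Retry and ExponentialBackoff live in redis.retry and redis.backoff; the backoff numbers are illustrative):

# resilient_client.py — client-side retries across failover (sketch; redis-py 4.x+)
import redis
from redis.backoff import ExponentialBackoff
from redis.retry import Retry

r = redis.Redis(
    host="redis",
    port=6379,
    socket_timeout=2.0,
    # Retry connection-level failures with exponential backoff — enough
    # headroom to survive a typical Sentinel/Cluster primary promotion
    retry=Retry(ExponentialBackoff(cap=2.0, base=0.1), retries=5),
    retry_on_error=[redis.exceptions.ConnectionError, redis.exceptions.TimeoutError],
)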
Further Reading
- Redis — Common Application Patterns — official docs for caching, pub/sub, leaderboards, and distributed locks
- Redis Streams — Introduction — complete guide to the Streams data type and consumer groups
- Redis Cluster Specification — the definitive reference for hash slots, failover, and resharding
- Redis Persistence — RDB and AOF — official durability guide with configuration recommendations
- Martin Kleppmann — How to do distributed locking — a thorough analysis of Redlock and the limits of Redis-based distributed locks
Work with us
Building a Redis-backed system or hitting performance and reliability limits with your current caching setup?
We design and implement production Redis architectures — from caching strategy selection and eviction policy tuning to Redis Streams consumer group pipelines, Cluster sharding with hash tag design, distributed locking, rate limiting, and persistence configuration for your durability requirements. Let’s talk.