Why ClickHouse Outperforms OLTP Databases for Analytics
The fundamental mismatch between OLTP databases and analytics workloads comes down to storage layout. PostgreSQL, MySQL, and most relational databases store data row by row: all columns for a given row are colocated on disk. This is optimal for transactional queries that read or write a complete record — a user lookup, an order insert — but catastrophically inefficient for analytics queries that aggregate a single column across millions of rows. Reading SELECT sum(revenue) FROM events in a row-oriented store requires loading every byte of every row even though you only need one column.
ClickHouse stores data column by column. Each column is a separate file on disk, compressed independently. An aggregation query reads only the columns it references, achieving 10–100x lower I/O compared to row-oriented stores. The columnar layout also enables far more effective compression: a column of event types (e.g., “click”, “view”, “purchase”) compresses dramatically better than mixed-type rows. Typical ClickHouse datasets compress to 5–10% of their raw size.
On top of columnar storage, ClickHouse applies vectorized query execution: queries process data in batches of 8,192 rows (a “block”), applying CPU SIMD instructions across the entire batch simultaneously. Combined with a multithreaded query engine that uses all available CPU cores by default, ClickHouse can scan and aggregate billions of rows per second on commodity hardware. Production deployments regularly achieve sub-second query response on datasets with hundreds of billions of rows.
Note
MergeTree Engine Family — The Core of Every ClickHouse Table
Every production ClickHouse table uses a variant of the MergeTreeengine family. MergeTree is ClickHouse's primary storage engine: data is written in sorted parts, which are then merged in the background — similar to LSM trees but with additional semantics for deduplication, aggregation, and replication. The ORDER BY clause defines the sorting key and determines which primary index ClickHouse builds for fast range lookups. The PARTITION BY clause splits data into physical partitions (typically by date) that can be dropped, detached, or attached atomically.
Choosing the right ORDER BY key is the single most impactful schema decision you will make. The primary index — a sparse index with one entry per 8,192-row granule — is built on the ORDER BY columns. Queries that filter on a prefix of the ORDER BY key skip entire granules and partitions, dramatically reducing the data scanned. Cardinality matters: low-cardinality columns (tenant_id, event_type) should come before high-cardinality ones (user_id, session_id) in the ORDER BY to maximize granule skipping for filtered aggregations.
-- schema_events.sql
-- Analytics event table using MergeTree
-- Designed for queries like:
-- SELECT event_type, count() FROM events
-- WHERE tenant_id = 'acme' AND toDate(timestamp) BETWEEN '2026-01-01' AND '2026-01-31'
-- GROUP BY event_type
CREATE TABLE events
(
-- Partition by month for efficient time-range pruning and partition drops
timestamp DateTime64(3, 'UTC'),
tenant_id LowCardinality(String), -- low cardinality: ~1000 tenants
event_type LowCardinality(String), -- low cardinality: ~50 event types
user_id UUID,
session_id String,
page_url String,
referrer String,
-- Flexible payload stored as JSON (ClickHouse 22.6+ JSON type, or String)
properties String,
-- Numeric metrics
duration_ms UInt32,
revenue_cents UInt32 DEFAULT 0,
-- Codec: ZSTD(3) for best compression on string columns;
-- Delta + ZSTD for monotonically increasing timestamps
INDEX idx_user_id user_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, event_type, toDate(timestamp), user_id)
-- TTL: auto-expire data older than 90 days
TTL toDate(timestamp) + INTERVAL 90 DAY DELETE
SETTINGS
index_granularity = 8192,
-- Merge parts more aggressively for high-insert workloads
merge_max_block_size = 8192,
-- Store deleted rows as tombstones until next merge (for GDPR deletes)
allow_experimental_lightweight_delete = 1;
-- Projection: pre-aggregated daily revenue per tenant
-- ClickHouse picks this automatically for matching queries
ALTER TABLE events ADD PROJECTION proj_daily_revenue
(
SELECT
tenant_id,
toDate(timestamp) AS date,
sum(revenue_cents) AS total_revenue,
count() AS event_count
ORDER BY (tenant_id, date)
);
ALTER TABLE events MATERIALIZE PROJECTION proj_daily_revenue;MergeTree Variants
The MergeTree family includes specialized variants for common analytics patterns. Choose the right engine at table creation — changing it requires a full table rebuild.
| Engine | Merge behavior | Best for |
|---|---|---|
| MergeTree | Sorts and merges parts; no dedup | Raw event storage, immutable logs |
| ReplacingMergeTree | Keeps latest version by ORDER BY key | Mutable entities (user profiles, order state) |
| SummingMergeTree | Sums numeric columns during merge | Pre-aggregated counters (page views, clicks) |
| AggregatingMergeTree | Merges AggregateFunction states | Materialized views with complex aggregations |
| CollapsingMergeTree | Cancels pairs of rows with sign +1/-1 | Streams of state changes (inventory, balances) |
Note
LowCardinality(String)is one of the most impactful type choices available. It encodes values as a dictionary — internally storing an integer ID instead of the full string — which reduces storage and dramatically speeds up GROUP BY on low-cardinality columns. Use it for any string column with fewer than ~10,000 distinct values: tenant IDs, event types, country codes, status fields. The breakeven is roughly >50% compression gain vs. raw String storage.Materialized Views — Pre-Aggregation, Kafka Ingestion, and AggregatingMergeTree
ClickHouse materialized views are triggers: they fire on every INSERT into the source table and write transformed or aggregated rows to a target table. Unlike PostgreSQL materialized views — which are snapshots that require manual refresh — ClickHouse materialized views are kept current in real-time as data arrives. This makes them the primary tool for pre-aggregating analytics results that would otherwise require scanning billions of rows at query time.
The recommended pattern is to combine a materialized view with an AggregatingMergeTree target table using partial aggregation states. Rather than storing the final aggregated value (which cannot be re-aggregated across parts), you store intermediate aggregation states using AggregateFunction types. These states are merged correctly during background merges, producing exact results even when parts are not yet fully merged. Query the result using -Merge combinator functions.
-- materialized_views.sql
-- Step 1: Target table using AggregatingMergeTree
-- Stores partial aggregation states — merged correctly across parts
CREATE TABLE events_hourly_agg
(
tenant_id LowCardinality(String),
event_type LowCardinality(String),
hour DateTime,
-- AggregateFunction types store intermediate states, not final values
total_events AggregateFunction(count),
unique_users AggregateFunction(uniq, UUID),
total_revenue AggregateFunction(sum, UInt32),
p95_duration AggregateFunction(quantile(0.95), UInt32)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (tenant_id, event_type, hour);
-- Step 2: Materialized view — fires on every INSERT to events
-- Writes partial aggregation states to events_hourly_agg
CREATE MATERIALIZED VIEW mv_events_hourly
TO events_hourly_agg
AS
SELECT
tenant_id,
event_type,
toStartOfHour(timestamp) AS hour,
countState() AS total_events,
uniqState(user_id) AS unique_users,
sumState(revenue_cents) AS total_revenue,
quantileState(0.95)(duration_ms) AS p95_duration
FROM events
GROUP BY tenant_id, event_type, hour;
-- Step 3: Query using -Merge combinators
-- ClickHouse merges partial states from all parts on the fly
SELECT
tenant_id,
event_type,
hour,
countMerge(total_events) AS events,
uniqMerge(unique_users) AS unique_users,
sumMerge(total_revenue) / 100.0 AS revenue_usd,
quantileMerge(0.95)(p95_duration) AS p95_duration_ms
FROM events_hourly_agg
WHERE tenant_id = 'acme'
AND hour >= now() - INTERVAL 7 DAY
GROUP BY tenant_id, event_type, hour
ORDER BY hour DESC;Kafka Engine — Real-Time Ingestion Pipeline
The Kafka table engine lets ClickHouse consume messages directly from Kafka topics without an external connector. The canonical pattern uses three objects: a Kafka engine table (the consumer), a landing MergeTree table (the durable store), and a materialized view that moves data between them. ClickHouse manages offsets using its internal consumer group, providing at-least-once delivery semantics.
-- kafka_ingestion.sql
-- Pattern: Kafka engine → Materialized view → Landing table
-- 1. Kafka consumer table — reads raw JSON messages from the topic
CREATE TABLE kafka_events_raw
(
raw String -- read as raw JSON; parse in the materialized view
)
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'kafka-broker:9092',
kafka_topic_list = 'analytics.events',
kafka_group_name = 'clickhouse-consumer-events',
kafka_format = 'JSONAsString',
-- Tune for throughput: consume up to 10k messages per poll
kafka_max_block_size = 10000,
kafka_skip_broken_messages = 100;
-- 2. Landing table — durable MergeTree storage
CREATE TABLE events_raw_landing
(
timestamp DateTime64(3, 'UTC') DEFAULT now64(),
tenant_id LowCardinality(String),
event_type LowCardinality(String),
user_id UUID,
session_id String,
page_url String,
duration_ms UInt32,
revenue_cents UInt32 DEFAULT 0,
raw_json String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, event_type, timestamp);
-- 3. Materialized view — parses JSON and inserts into landing table
-- Fires automatically for every batch consumed from Kafka
CREATE MATERIALIZED VIEW mv_kafka_to_landing
TO events_raw_landing
AS
SELECT
parseDateTimeBestEffort(
JSONExtractString(raw, 'timestamp')
) AS timestamp,
JSONExtractString(raw, 'tenant_id') AS tenant_id,
JSONExtractString(raw, 'event_type') AS event_type,
toUUID(JSONExtractString(raw, 'user_id')) AS user_id,
JSONExtractString(raw, 'session_id') AS session_id,
JSONExtractString(raw, 'page_url') AS page_url,
JSONExtractUInt(raw, 'duration_ms') AS duration_ms,
JSONExtractUInt(raw, 'revenue_cents') AS revenue_cents,
raw AS raw_json
FROM kafka_events_raw;Note
kafka_num_consumers equal to the number of Kafka partitions. Each consumer runs in its own thread and consumes from a dedicated partition assignment. Do not create multiple Kafka engine tables pointing to the same topic and consumer group — this will split messages across tables unpredictably.Query Optimization — PREWHERE, Skip Indexes, and Projections
ClickHouse's query engine is fast by default, but three mechanisms — PREWHERE, skip indexes, and projections — can reduce scanned data by additional orders of magnitude for specific access patterns.
PREWHERE
PREWHERE is a ClickHouse-specific extension to WHERE. Filters in PREWHERE are evaluated before reading non-filtered columns — ClickHouse reads only the PREWHERE columns, builds a row mask of matching rows, then reads the remaining columns only for rows that pass. This is especially effective when a selective filter column is small (e.g., a flag or status) and the table has many wide columns. ClickHouse automatically moves eligible WHERE conditions to PREWHERE, but explicit PREWHERE gives you control when the optimizer's choice is suboptimal.
-- query_optimization.sql
-- PREWHERE example: filter on tenant_id first (small column),
-- then read the wide payload column only for matching rows
SELECT
event_type,
count() AS events,
uniq(user_id) AS unique_users,
avg(duration_ms) AS avg_duration_ms
FROM events
PREWHERE tenant_id = 'acme' -- evaluated first, minimal I/O
WHERE toDate(timestamp) >= today() - 7 -- further filtered after PREWHERE
GROUP BY event_type
ORDER BY events DESC
LIMIT 20;
-- Skip index: bloom filter on user_id
-- Skips granules that definitely don't contain the queried user_id
-- Useful for needle-in-haystack lookups on non-ORDER-BY columns
CREATE TABLE events_with_skip_idx
(
timestamp DateTime64(3, 'UTC'),
tenant_id LowCardinality(String),
event_type LowCardinality(String),
user_id UUID,
-- bloom_filter: false positive rate 1%; GRANULARITY 4 = index one entry per 4 granules
INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,
-- minmax: skip granules where duration_ms is outside the queried range
INDEX idx_dur duration_ms TYPE minmax GRANULARITY 1,
-- set: for columns with few distinct values per granule (e.g., country_code)
INDEX idx_event event_type TYPE set(50) GRANULARITY 2
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, toDate(timestamp));
-- Using EXPLAIN to verify index usage and granule skipping
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 'acme'
AND user_id = '550e8400-e29b-41d4-a716-446655440000';
-- Look for "Granules: X/Y" — X granules read out of Y total.
-- A well-tuned query should read <5% of granules.Projections
A projectionis a stored pre-aggregation embedded inside the same table. Unlike a materialized view, a projection shares the table's INSERT path — data written to the table is automatically projected without additional application changes. ClickHouse's query planner automatically picks projections when they satisfy a query's GROUP BY and ORDER BY requirements, reducing query latency from seconds to milliseconds for matching aggregation patterns.
-- projections.sql
-- Add a projection to the events table for fast daily revenue queries
ALTER TABLE events ADD PROJECTION proj_daily_revenue
(
SELECT
tenant_id,
toDate(timestamp) AS date,
sum(revenue_cents) AS total_revenue,
count() AS total_events,
uniq(user_id) AS unique_users
ORDER BY (tenant_id, date)
);
-- Materialize the projection for existing data
ALTER TABLE events MATERIALIZE PROJECTION proj_daily_revenue
SETTINGS mutations_sync = 2; -- wait for all replicas to apply
-- This query will now use the projection automatically (verify with EXPLAIN)
SELECT
toDate(timestamp) AS date,
sum(revenue_cents) AS revenue,
count() AS events
FROM events
WHERE tenant_id = 'acme'
AND toDate(timestamp) >= today() - 30
GROUP BY date
ORDER BY date;Cluster Setup — ReplicatedMergeTree, Distributed Tables, and ClickHouse Keeper
ClickHouse achieves high availability through ReplicatedMergeTree, which synchronizes table parts across replica nodes using a coordination service. ClickHouse Keeper — a native re-implementation of ZooKeeper that ships with ClickHouse since version 22.4 — manages replication metadata, distributed DDL queues, and leader election without requiring a separate ZooKeeper cluster. For new deployments, always use ClickHouse Keeper over ZooKeeper.
Horizontal scaling uses the Distributed table engine: a virtual table that shards queries across multiple shards. Each shard is typically a pair of ReplicatedMergeTree replicas. INSERT into the Distributed table distributes rows across shards by the sharding key; SELECT queries fan out to all shards and merge results. The sharding key should distribute data evenly and align with your most common GROUP BY keys to minimize cross-shard data exchange.
-- config_keeper.xml
<!-- ClickHouse Keeper configuration — embed in /etc/clickhouse-server/config.d/ -->
<!-- Run 3 Keeper nodes for quorum-based HA -->
<clickhouse>
<keeper_server>
<tcp_port>9181</tcp_port>
<!-- Unique server ID for each Keeper node (1, 2, 3) -->
<server_id>1</server_id>
<log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
<snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
<coordination_settings>
<operation_timeout_ms>10000</operation_timeout_ms>
<session_timeout_ms>30000</session_timeout_ms>
<raft_logs_level>warning</raft_logs_level>
</coordination_settings>
<raft_configuration>
<server>
<id>1</id><hostname>ch-keeper-1</hostname><port>9234</port>
</server>
<server>
<id>2</id><hostname>ch-keeper-2</hostname><port>9234</port>
</server>
<server>
<id>3</id><hostname>ch-keeper-3</hostname><port>9234</port>
</server>
</raft_configuration>
</keeper_server>
</clickhouse>-- cluster_tables.sql
-- Create replicated tables with ON CLUSTER for automatic DDL propagation
-- ReplicatedMergeTree: uses Keeper for part synchronization
-- {shard} and {replica} macros are resolved per-node from config.xml
CREATE TABLE events ON CLUSTER 'analytics_cluster'
(
timestamp DateTime64(3, 'UTC'),
tenant_id LowCardinality(String),
event_type LowCardinality(String),
user_id UUID,
revenue_cents UInt32 DEFAULT 0
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events', -- Keeper path (unique per shard)
'{replica}' -- replica name from macros
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, event_type, timestamp);
-- Distributed table: the query layer across all shards
-- Sits on top of the ReplicatedMergeTree tables on each shard
CREATE TABLE events_dist ON CLUSTER 'analytics_cluster'
(
timestamp DateTime64(3, 'UTC'),
tenant_id LowCardinality(String),
event_type LowCardinality(String),
user_id UUID,
revenue_cents UInt32 DEFAULT 0
)
ENGINE = Distributed(
'analytics_cluster', -- cluster name from config
'default', -- database
'events', -- local table name on each shard
-- Sharding key: murmur hash of tenant_id distributes tenants evenly
-- Use cityHash64(tenant_id) to co-locate all a tenant's data on one shard
cityHash64(tenant_id)
);
-- cluster config fragment (config.d/cluster.xml)
-- 2 shards, 2 replicas each (total 4 ClickHouse nodes)
/*
<remote_servers>
<analytics_cluster>
<shard>
<replica><host>ch-1-1</host><port>9000</port></replica>
<replica><host>ch-1-2</host><port>9000</port></replica>
</shard>
<shard>
<replica><host>ch-2-1</host><port>9000</port></replica>
<replica><host>ch-2-2</host><port>9000</port></replica>
</shard>
</analytics_cluster>
</remote_servers>
*/Note
ON CLUSTER for all DDL operations (CREATE TABLE, ALTER TABLE, DROP TABLE) in a cluster. Without it, the DDL only runs on the node you are connected to; the other nodes will not have the table. ClickHouse distributes ON CLUSTER DDL via the Keeper DDL queue — the operation is applied asynchronously on each node. Monitor the system.distributed_ddl_queue table to track DDL execution status across the cluster.Python Integration — clickhouse-connect, Batch Inserts, and Monitoring
The official Python client is clickhouse-connect, which uses ClickHouse's native HTTP interface with binary column-oriented serialization. It supports both synchronous and async patterns, automatic compression, and integration with pandas and PyArrow. For high-throughput inserts, use the streaming insert API to send data in columnar batches rather than row by row.
# clickhouse_client.py
from __future__ import annotations
import os
from datetime import datetime, timezone
from typing import Any
import clickhouse_connect
from clickhouse_connect.driver import Client
CH_HOST = os.environ["CLICKHOUSE_HOST"]
CH_PORT = int(os.environ.get("CLICKHOUSE_PORT", "8123"))
CH_USER = os.environ["CLICKHOUSE_USER"]
CH_PASSWORD = os.environ["CLICKHOUSE_PASSWORD"]
CH_DATABASE = os.environ.get("CLICKHOUSE_DATABASE", "default")
def get_client() -> Client:
"""
Create a clickhouse-connect client.
connect_timeout: initial TCP handshake timeout
send_receive_timeout: query execution timeout
compress: enable LZ4 compression for HTTP transport
"""
return clickhouse_connect.get_client(
host=CH_HOST,
port=CH_PORT,
username=CH_USER,
password=CH_PASSWORD,
database=CH_DATABASE,
connect_timeout=10,
send_receive_timeout=120,
compress=True,
)
def insert_events(
client: Client,
events: list[dict[str, Any]],
table: str = "events",
) -> None:
"""
Batch insert events using columnar format.
clickhouse-connect converts the list of dicts to columnar format internally.
For maximum throughput, batch at least 10k rows per insert call.
Aim for ~1MB–10MB per insert to stay within ClickHouse's recommended range.
"""
if not events:
return
column_names = list(events[0].keys())
data = [[row[col] for col in column_names] for row in events]
client.insert(
table=table,
data=data,
column_names=column_names,
settings={
# Async inserts buffer on the server side — useful for high-frequency
# small inserts (e.g., per-request event tracking)
"async_insert": 0,
"wait_for_async_insert": 1,
},
)
def query_hourly_metrics(
client: Client,
tenant_id: str,
days: int = 7,
) -> list[dict[str, Any]]:
"""
Query pre-aggregated hourly metrics from the AggregatingMergeTree view.
Returns list of dicts with hour, events, unique_users, revenue_usd, p95_duration_ms.
"""
result = client.query(
"""
SELECT
hour,
countMerge(total_events) AS events,
uniqMerge(unique_users) AS unique_users,
round(sumMerge(total_revenue) / 100, 2) AS revenue_usd,
round(quantileMerge(0.95)(p95_duration), 0) AS p95_duration_ms
FROM events_hourly_agg
WHERE tenant_id = {tenant_id:String}
AND hour >= now() - INTERVAL {days:UInt32} DAY
GROUP BY hour
ORDER BY hour DESC
""",
parameters={"tenant_id": tenant_id, "days": days},
)
return [dict(zip(result.column_names, row)) for row in result.result_rows]
def get_cluster_health(client: Client) -> list[dict[str, Any]]:
"""
Query system tables to check replication lag and part counts.
Critical for monitoring ReplicatedMergeTree cluster health.
"""
result = client.query(
"""
SELECT
database,
table,
replica_name,
queue_size,
inserts_in_queue,
merges_in_queue,
log_pointer,
log_max_index,
log_max_index - log_pointer AS replication_lag
FROM system.replicas
WHERE replication_lag > 0
OR queue_size > 100
ORDER BY replication_lag DESC
LIMIT 20
"""
)
return [dict(zip(result.column_names, row)) for row in result.result_rows]Async Inserts for High-Frequency Small Batches
ClickHouse's async insert feature buffers small inserts server-side and flushes them as a single part when the buffer reaches a size or time threshold. This avoids the part explosion that results from thousands of tiny inserts creating thousands of small parts — a common performance killer for high-frequency event tracking workloads where each API request inserts one or a few rows.
# async_insert_config.sql
-- Enable async inserts at the user or session level
-- Server buffers inserts for up to 200ms or until 1MB accumulates
ALTER USER analytics_writer SETTINGS
async_insert = 1,
wait_for_async_insert = 0, -- fire-and-forget (check via system.async_insert_log)
async_insert_max_data_size = 1000000, -- flush when buffer reaches 1MB
async_insert_busy_timeout_ms = 200; -- flush after 200ms regardless of size
-- Monitor async insert queue
SELECT
database,
table,
format,
status,
rows,
exception
FROM system.async_insert_log
WHERE event_time > now() - INTERVAL 1 HOUR
AND status != 'Ok'
ORDER BY event_time DESC
LIMIT 50;Production Patterns — TTL, Tiered Storage, and Monitoring
TTL-Based Data Lifecycle
Use TABLE-level TTL to automatically delete or move partitions as data ages. ClickHouse evaluates TTL during merges and can move old partitions to slower, cheaper storage tiers (e.g., from NVMe to S3 via storage policies) or delete them entirely. This eliminates manual partition management for time-series event data. Define TTL rules at table creation — adding TTL to an existing table requires a background mutation that rewrites all affected parts.
Part Monitoring and Merge Health
ClickHouse's write performance degrades when a table accumulates too many unmerged parts ('Too many parts' error fires at 300 parts per partition by default). Monitor system.parts and alert when active_parts > 150 for any partition. Common causes: ingesting too frequently in tiny batches, background merge threads falling behind, or mutations blocking merges. Increase background_pool_size for write-heavy tables and use async inserts to batch small writes server-side.
Query Profiling with system.query_log
Every query executed by ClickHouse is logged to system.query_log with full metrics: read_rows, read_bytes, memory_usage, query_duration_ms, and the query plan. Query this table to identify slow queries, high-memory queries, and queries reading excessive data. Set up a Grafana dashboard on system.query_log for p95 latency per user and query_kind, and alert on queries exceeding 10 seconds or reading more than 1 billion rows.
Prometheus Metrics via /metrics
ClickHouse exposes a Prometheus-compatible /metrics endpoint (port 8001 by default, configurable). Scrape it with Prometheus and dashboard the critical signals: ClickHouseMetrics_BackgroundPoolTask (background merge queue depth), ClickHouseProfileEvents_InsertedRows (insert throughput), ClickHouseMetrics_ReplicatedChecks (replication health), and ClickHouseAsyncMetrics_UncompressedCacheCells (cache utilization). Add to your standard infrastructure monitoring alongside node CPU and disk I/O.
-- monitoring_queries.sql
-- Key system table queries for production ClickHouse monitoring
-- 1. Parts health: alert if any partition has > 150 active parts
SELECT
database,
table,
partition_id,
count() AS part_count,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
HAVING part_count > 50
ORDER BY part_count DESC;
-- 2. Replication lag: alert if any replica is > 100 entries behind
SELECT
database,
table,
replica_name,
log_max_index - log_pointer AS lag,
queue_size,
last_queue_update
FROM system.replicas
WHERE log_max_index - log_pointer > 10
ORDER BY lag DESC;
-- 3. Slow queries in the last hour
SELECT
query_start_time,
query_duration_ms,
formatReadableQuantity(read_rows) AS read_rows,
formatReadableSize(read_bytes) AS read_bytes,
formatReadableSize(memory_usage) AS memory_usage,
user,
-- Truncate long queries for display
substring(query, 1, 200) AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_start_time >= now() - INTERVAL 1 HOUR
AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC
LIMIT 20;
-- 4. Insert throughput per table (last 10 minutes)
SELECT
table,
count() AS insert_count,
formatReadableQuantity(sum(read_rows)) AS total_rows_inserted,
formatReadableSize(sum(result_bytes)) AS total_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_kind = 'Insert'
AND query_start_time >= now() - INTERVAL 10 MINUTE
GROUP BY table
ORDER BY total_rows_inserted DESC;Further Reading
- ClickHouse MergeTree Documentation — complete reference for MergeTree settings, ORDER BY key design, partition functions, TTL syntax, and all engine variants
- Kafka Table Engine — official guide for Kafka consumer configuration, offset management, consumer group behavior, and high-throughput tuning
- ClickHouse Keeper — deployment guide for the native Keeper coordination service, Raft quorum configuration, monitoring, and migration from ZooKeeper
- clickhouse-connect Python Client — async inserts, query parameters, PyArrow and pandas integration, streaming queries, and connection pooling patterns
- EXPLAIN Statement Reference — understanding query plans, index usage, granule skipping, and projection selection to diagnose and fix slow queries
Work with us
Running analytics at scale and hitting the limits of your current database?
We design and implement ClickHouse analytics infrastructure — from MergeTree schema design and AggregatingMergeTree materialized views to Kafka ingestion pipelines, projection and skip index optimization, ReplicatedMergeTree cluster setup with ClickHouse Keeper, and Python clickhouse-connect integration. Let’s talk.
Get in touch