Why Streamlit Fits Data Engineering Work
Data engineers spend most of their time in Python — writing pipeline code, running analytical queries, and debugging data quality issues. Sharing results typically means either a Jupyter notebook (not reproducible without the right kernel), a static CSV (no interactivity), or a Jira comment with a screenshot. Streamlit closes this gap: it converts a plain Python script into a live web application by re-running the script top-to-bottom on every widget interaction. No HTML, no JavaScript, no deployment ceremony for prototypes.
The model is intentionally simple. A Streamlit app is a Python file, and running streamlit run app.py starts a local server. Every slider, selectbox, or button the user interacts with causes the script to re-execute, and the framework compares the new output with the previous run so that only the elements that changed are updated in the browser. This execution model means you write procedural Python (no callbacks, no reactive graph, no component lifecycle) and the framework handles reactivity. Where Evidence.dev targets code-driven analytics reports with a SQL-first notebook model, Streamlit targets Python-native teams that need full programmatic control over layout, data transformations, and integrations.
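To make the rerun model concrete, here is a plain-Python analogy. It is not Streamlit's implementation: the hypothetical script function stands in for app.py, and simulate_session re-executes it from the top after every widget change, which is the behaviour described above.

```python
from typing import Any


def script(widget_values: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for app.py: runs top to bottom and returns what it would render."""
    # A widget call returns the widget's current value (its default on the first run).
    threshold = widget_values.get("threshold", 10)
    durations = [5, 12, 20, 7]
    slow_runs = [d for d in durations if d >= threshold]
    return {"threshold": threshold, "slow_runs": slow_runs}


def simulate_session(interactions: list[tuple[str, Any]]) -> list[dict[str, Any]]:
    """Run the whole script once at page load and once per widget interaction."""
    widget_values: dict[str, Any] = {}
    frames = [script(widget_values)]  # initial page load
    for widget_key, new_value in interactions:
        widget_values[widget_key] = new_value
        frames.append(script(widget_values))  # full rerun, no callbacks involved
    return frames


frames = simulate_session([("threshold", 15)])
print(frames[0]["slow_runs"], frames[1]["slow_runs"])  # [12, 20] [20]
```

The point of the analogy is that the script never registers handlers: it simply reads the current widget values each time it runs.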
Python-Native
Write apps with the same libraries you use in pipelines — pandas, polars, DuckDB, scikit-learn. No DSL, no template engine, no framework lock-in beyond Streamlit itself.
Incremental
Start with st.write(df) and add interactivity incrementally. Every component follows the same pattern: call a function, get a value, use it in the next line.
Deployable
Deploy to Streamlit Community Cloud for free with a GitHub URL, or containerize with Docker and run on Kubernetes for production workloads requiring authentication and resource controls.
Installation and Project Structure
Streamlit is a single PyPI package. Python 3.9+ is required. The recommended project structure separates app logic from data access and keeps secrets out of version control via .streamlit/secrets.toml.
pip install streamlit
# Verify installation
streamlit hello
# Run your app (auto-reloads on file save)
streamlit run app.py
# Run on a specific port (useful in containers)
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
# Recommended project layout
myapp/
├── app.py                  # entry point
├── pages/                  # multi-page apps (auto-discovered)
│   ├── 1_Overview.py
│   ├── 2_Pipeline_Health.py
│   └── 3_Data_Quality.py
├── components/             # reusable chart/widget helpers
│   ├── charts.py
│   └── filters.py
├── data/                   # data access layer
│   └── queries.py
├── .streamlit/
│   ├── config.toml         # theme and server settings (committed)
│   └── secrets.toml        # credentials (gitignored)
├── requirements.txt
└── Dockerfile
# .streamlit/config.toml (committed to version control)
[theme]
base = "dark"
primaryColor = "#22d3ee" # accent colour matching your brand
backgroundColor = "#0A0A0A"
secondaryBackgroundColor = "#111111"
textColor = "#e5e7eb"
font = "monospace"
[server]
headless = true # required for containerised deployments
enableCORS = false
port = 8501
[runner]
magicEnabled = false # disable implicit st.write() for explicit control
Core Data Components: DataFrames, Charts, and Widgets
Streamlit's data display primitives work directly with pandas DataFrames, numpy arrays, and dictionaries. st.dataframe() renders an interactive table that users can sort, search, and resize; st.table() renders a static version. For charts, the recommended path for data engineers is Plotly (full control) or Altair (declarative grammar). The built-in st.line_chart() and st.bar_chart() are useful for quick iteration but offer far less control over axes, annotations, and theming.
import numpy as np
import pandas as pd
import plotly.express as px
import streamlit as st
st.set_page_config(
    page_title="Pipeline Health Dashboard",
    page_icon="📊",
    layout="wide",
    initial_sidebar_state="expanded",
)
# --- Sidebar filters ---
with st.sidebar:
    st.header("Filters")
    date_range = st.date_input("Date range", value=[])
    pipeline_names = st.multiselect(
        "Pipelines",
        options=["ingest_orders", "transform_users", "load_analytics", "export_reports"],
        default=["ingest_orders", "transform_users"],
    )
    status_filter = st.radio("Status", ["All", "Failed", "Success", "Running"], index=0)
# --- DataFrame display ---
@st.cache_data(ttl=300)
def load_pipeline_runs(pipelines: list[str]) -> pd.DataFrame:
    # Synthetic demo data; replace with your actual data source
    rng = np.random.default_rng(42)
    rows = []
    for p in pipelines:
        for i in range(20):
            rows.append({
                "pipeline": p,
                "run_id": f"{p}_{i:04d}",
                "status": str(rng.choice(["success", "failed", "running"], p=[0.8, 0.15, 0.05])),
                "duration_s": int(rng.exponential(120)),
                "rows_processed": int(rng.exponential(50_000)),
                "started_at": pd.Timestamp("2026-06-01") + pd.Timedelta(hours=int(i * 6)),
            })
    return pd.DataFrame(rows)
# With no pipeline selected the frame would have no columns, so stop early
if not pipeline_names:
    st.info("Select at least one pipeline in the sidebar.")
    st.stop()
df = load_pipeline_runs(pipeline_names)
# Apply status filter
if status_filter != "All":
    df = df[df["status"] == status_filter.lower()]
# Metrics row
col1, col2, col3, col4 = st.columns(4)
col1.metric("Total Runs", len(df))
col2.metric("Success Rate", f"{(df['status'] == 'success').mean():.1%}")
col3.metric("Avg Duration", f"{df['duration_s'].mean():.0f}s")
col4.metric("Rows Processed", f"{df['rows_processed'].sum():,.0f}")
st.divider()
# Interactive DataFrame with column configuration
st.subheader("Pipeline Runs")
st.dataframe(
    df,
    column_config={
        "status": st.column_config.SelectboxColumn(
            "Status",
            options=["success", "failed", "running"],
        ),
        "duration_s": st.column_config.NumberColumn("Duration (s)", format="%d s"),
        # printf-style formats have no thousands-separator flag, so use plain %d
        "rows_processed": st.column_config.NumberColumn("Rows", format="%d"),
        "started_at": st.column_config.DatetimeColumn("Started", format="YYYY-MM-DD HH:mm"),
    },
    use_container_width=True,
    hide_index=True,
)
# Plotly chart
fig = px.histogram(
    df,
    x="duration_s",
    color="status",
    color_discrete_map={"success": "#22d3ee", "failed": "#f87171", "running": "#a3a3a3"},
    nbins=30,
    title="Run Duration Distribution",
    labels={"duration_s": "Duration (seconds)", "count": "Runs"},
)
fig.update_layout(
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font_color="#e5e7eb",
)
st.plotly_chart(fig, use_container_width=True)
Session State: Persisting Values Across Reruns
Because Streamlit re-runs the entire script on every widget interaction, local variables do not persist between runs. st.session_state is a dictionary-like object scoped to a browser session that survives reruns. It is the correct place to accumulate user selections, track multi-step wizard state, or store the results of expensive one-shot operations like an initial data load triggered by a button click.
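One way to keep wizard navigation easy to reason about is to express the step transitions as a pure function and let the Streamlit script only call it. The sketch below is illustrative: advance and the step names are hypothetical, and a plain dict stands in for st.session_state so the logic can be tested without a running app.

```python
STEPS = ["Select Source", "Configure Transform", "Preview", "Confirm"]


def advance(state: dict, action: str) -> dict:
    """Return a new state with wizard_step moved by 'next' or 'back', clamped to valid steps."""
    step = state.get("wizard_step", 0)
    if action == "next":
        step = min(step + 1, len(STEPS) - 1)
    elif action == "back":
        step = max(step - 1, 0)
    else:
        raise ValueError(f"unknown action: {action!r}")
    return {**state, "wizard_step": step}


state = {}                      # plain dict standing in for st.session_state
state = advance(state, "next")  # step 0 -> 1
state = advance(state, "back")  # step 1 -> 0
state = advance(state, "back")  # stays at 0 because the index is clamped
print(STEPS[state["wizard_step"]])  # Select Source
```

In an app, the result would be written back to st.session_state before calling st.rerun(), so the next run renders the new step.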
import pandas as pd
import streamlit as st
# --- Pattern 1: Initialise with a default ---
if "page" not in st.session_state:
    st.session_state.page = "overview"
if "selected_run_id" not in st.session_state:
    st.session_state.selected_run_id = None
if "query_history" not in st.session_state:
    st.session_state.query_history = []
# --- Pattern 2: Button-triggered one-shot operation ---
# The button returns True only on the click rerun; the result is stored in session_state
if st.button("Run Full Scan"):
    with st.spinner("Scanning pipeline metadata..."):
        # This only executes once per button click
        result = run_expensive_scan()  # placeholder for your actual function
    st.session_state.scan_result = result
    st.session_state.scan_ran_at = pd.Timestamp.now()
if "scan_result" in st.session_state:
    st.success(f"Scan completed at {st.session_state.scan_ran_at}")
    st.dataframe(st.session_state.scan_result)
# --- Pattern 3: Multi-step wizard ---
STEPS = ["Select Source", "Configure Transform", "Preview", "Confirm"]
step_idx = st.session_state.get("wizard_step", 0)
st.progress((step_idx + 1) / len(STEPS), text=f"Step {step_idx + 1}: {STEPS[step_idx]}")
if step_idx == 0:
    source = st.selectbox("Source table", ["orders", "customers", "products"])
    if st.button("Next →"):
        st.session_state.wizard_source = source
        st.session_state.wizard_step = 1
        st.rerun()
elif step_idx == 1:
    st.write(f"Configuring transform for: **{st.session_state.wizard_source}**")
    agg_col = st.selectbox("Aggregate by", ["day", "week", "month"])
    col1, col2 = st.columns(2)
    if col1.button("← Back"):
        st.session_state.wizard_step = 0
        st.rerun()
    if col2.button("Next →"):
        st.session_state.wizard_agg = agg_col
        st.session_state.wizard_step = 2
        st.rerun()
# The Preview and Confirm steps (indices 2 and 3) follow the same pattern
Note
st.rerun() (formerly st.experimental_rerun()) immediately stops the current script execution and queues a fresh run from the top. Use it sparingly: every widget call and all uncached code execute again, while cached functions return their stored results. It is the correct tool for wizard navigation and post-form-submission redirects, but calling it after every state mutation leads to double reruns and confusing behavior.
Caching: @st.cache_data and @st.cache_resource
Because every widget interaction triggers a full script rerun, expensive operations — database queries, file reads, model inference — will execute on every interaction without caching. Streamlit provides two caching decorators that cover almost all production use cases: @st.cache_data for functions that return data (DataFrames, lists, dicts) and @st.cache_resource for functions that return shared resources like database connections and ML models that should not be copied between sessions.
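The difference between the two decorators comes down to copy versus share. The following plain-Python sketch mimics those semantics with two hypothetical classes; it is an analogy under simplified assumptions, not Streamlit's caching code (no TTL, no argument hashing, no thread safety).

```python
import pickle


class DataCacheSketch:
    """Mimics a data cache: store serialised bytes, return a fresh object each time."""

    def __init__(self):
        self._store = {}

    def get(self, key, compute):
        if key not in self._store:
            self._store[key] = pickle.dumps(compute())  # computed once per key
        return pickle.loads(self._store[key])           # new copy on every call


class ResourceCacheSketch:
    """Mimics a resource cache: create once, return the very same object."""

    def __init__(self):
        self._store = {}

    def get(self, key, compute):
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]


data_cache, resource_cache = DataCacheSketch(), ResourceCacheSketch()

rows_a = data_cache.get("runs", lambda: [1, 2, 3])
rows_a.append(99)                                   # mutating the copy...
rows_b = data_cache.get("runs", lambda: [1, 2, 3])  # ...leaves the cached value intact

pool_a = resource_cache.get("db", lambda: {"open_connections": 0})
pool_a["open_connections"] += 1                     # mutating the shared object...
pool_b = resource_cache.get("db", lambda: {"open_connections": 0})  # ...is visible to everyone
print(rows_b, pool_b)  # [1, 2, 3] {'open_connections': 1}
```

This is why data results are safe to mutate per session, while shared resources should be treated as read-only or be designed for concurrent use.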
import duckdb
import pandas as pd
import streamlit as st
# --- @st.cache_data ---
# Cache functions that transform and return data.
# Results are serialised (pickled) and stored per unique set of arguments.
# Each call with the same arguments returns a fresh copy, so it is safe to mutate.
# ttl= controls expiry: "10m", "1h", 3600 (seconds), or timedelta.
@st.cache_data(ttl="10m", show_spinner="Loading pipeline metrics...")
def get_pipeline_metrics(
    start_date: str,
    end_date: str,
    pipeline_names: tuple[str, ...],  # pass a sorted tuple so one selection maps to one cache entry
) -> pd.DataFrame:
    if not pipeline_names:
        return pd.DataFrame()  # avoid an empty IN () clause
    # Bind values as query parameters instead of formatting them into the SQL text
    placeholders = ", ".join("?" for _ in pipeline_names)
    conn = duckdb.connect("metrics.ddb", read_only=True)
    try:
        return conn.execute(f"""
            SELECT
                pipeline_name,
                date_trunc('day', run_at) AS run_date,
                count(*) AS total_runs,
                countif(status = 'success') AS success_runs,
                avg(duration_s) AS avg_duration_s,
                sum(rows_processed) AS total_rows
            FROM pipeline_runs
            WHERE run_at BETWEEN CAST(? AS TIMESTAMP) AND CAST(? AS TIMESTAMP)
              AND pipeline_name IN ({placeholders})
            GROUP BY 1, 2
            ORDER BY 2 DESC
        """, [start_date, end_date, *pipeline_names]).df()
    finally:
        conn.close()
# --- @st.cache_resource ---
# Cache functions that return shared resources.
# The resource is created ONCE and shared across all sessions and reruns.
# NOT copied between calls: return a connection pool or read-only model.
# max_entries= controls how many unique resources are kept in memory.
@st.cache_resource(max_entries=3)
def get_db_connection(db_path: str) -> duckdb.DuckDBPyConnection:
    return duckdb.connect(db_path, read_only=True)
@st.cache_resource(show_spinner="Loading ML model...")
def load_anomaly_detector(model_path: str):
    import joblib
    return joblib.load(model_path)
# Usage in app
conn = get_db_connection("warehouse.ddb")
model = load_anomaly_detector("models/anomaly_detector_v3.pkl")
# Parameters from sidebar, normalised so equal selections share a cache entry
start = st.sidebar.date_input("Start").isoformat()
end = st.sidebar.date_input("End").isoformat()
pipelines = tuple(sorted(st.sidebar.multiselect("Pipelines", ["a", "b", "c"])))
df = get_pipeline_metrics(start, end, pipelines)
st.dataframe(df)
# Manual cache invalidation
if st.button("Refresh Data"):
    get_pipeline_metrics.clear()  # clear all cached results for this function
    st.rerun()
Note
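@st.cache_data builds its cache key by hashing the function's arguments. Streamlit's hasher handles the common types, including strings, numbers, dates, lists, tuples, dicts, and pandas DataFrames, so these can be passed directly. Hashing a large DataFrame on every call is slow, though, so prefer passing a path or query parameters and loading the data inside the function. Arguments that Streamlit cannot hash, such as a database connection, raise UnhashableParamError: prefix the parameter name with an underscore to exclude it from the cache key, or supply a hash_funcs mapping. Normalising selections, for example with tuple(sorted(...)), stops the same choice in a different order from creating a second cache entry.
DuckDB Integration: In-Process Analytical Queries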
DuckDB and Streamlit are a natural pair for data engineering dashboards. DuckDB's in-process columnar engine can scan Parquet files, S3 paths, and existing pandas DataFrames with full SQL, without a database server, connection pooling overhead, or import/export steps. Combined with @st.cache_resource for the connection and @st.cache_data for query results, repeated queries are served from cache and first-time aggregations over large Parquet datasets stay interactive on a single app server; actual latency depends on data volume, partitioning, and network distance to the object store.
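When a dashboard assembles SQL from widget values, column names cannot be bound as query parameters, so a common safeguard is to check them against an allow-list and bind only the values. The sketch below is a minimal illustration: build_summary_query, the events table, and the column sets are assumptions made for the example, not part of any library.

```python
ALLOWED_GROUP_BY = {"event_type", "country", "platform", "user_segment"}
ALLOWED_METRICS = {"revenue", "session_duration_s", "page_views"}


def build_summary_query(group_by: str, metric: str, limit: int) -> tuple[str, list]:
    """Return (sql, params): identifiers are allow-listed, values are bound with '?'."""
    if group_by not in ALLOWED_GROUP_BY:
        raise ValueError(f"unsupported group-by column: {group_by!r}")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric column: {metric!r}")
    sql = (
        f"SELECT {group_by}, count(*) AS record_count, sum({metric}) AS total_{metric} "
        f"FROM events GROUP BY {group_by} ORDER BY total_{metric} DESC LIMIT ?"
    )
    return sql, [int(limit)]


sql, params = build_summary_query("country", "revenue", 50)
print(sql)
print(params)
```

The returned pair is meant for a DB-API style call such as conn.execute(sql, params); anything outside the allow-list fails before a query is ever built.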
import duckdb
import pandas as pd
import plotly.express as px
import streamlit as st
@st.cache_resource
def get_duckdb() -> duckdb.DuckDBPyConnection:
    conn = duckdb.connect()
    # Install and load the httpfs extension for S3/HTTP Parquet access
    conn.execute("INSTALL httpfs; LOAD httpfs;")
    # Shell-style ${VAR} placeholders are not expanded inside Python strings,
    # so read the credentials from the [aws] section of secrets.toml instead
    aws = st.secrets["aws"]
    conn.execute(f"SET s3_region = '{aws['region']}'")
    conn.execute(f"SET s3_access_key_id = '{aws['access_key_id']}'")
    conn.execute(f"SET s3_secret_access_key = '{aws['secret_access_key']}'")
    return conn
@st.cache_data(ttl="5m")
def query_parquet(
    s3_path: str,
    group_by: str,
    metric: str,
    limit: int = 50,
) -> pd.DataFrame:
    # group_by and metric are interpolated as identifiers: pass only values from
    # a fixed list (the selectboxes below), never free-form user text
    conn = get_duckdb()
    base_path = s3_path.rstrip("/")  # avoid a double slash before the glob
    return conn.execute(f"""
        SELECT
            {group_by},
            count(*) AS record_count,
            sum({metric}) AS total_{metric},
            avg({metric}) AS avg_{metric},
            percentile_cont(0.95) WITHIN GROUP (ORDER BY {metric}) AS p95_{metric}
        FROM read_parquet('{base_path}/**/*.parquet', hive_partitioning = true)
        GROUP BY {group_by}
        ORDER BY total_{metric} DESC
        LIMIT {int(limit)}
    """).df()
# App UI
st.title("Data Lake Explorer")
with st.sidebar:
    s3_path = st.text_input("S3 Path", "s3://my-bucket/events/")
    group_by_col = st.selectbox("Group By", ["event_type", "country", "platform", "user_segment"])
    metric_col = st.selectbox("Metric", ["revenue", "session_duration_s", "page_views"])
    row_limit = st.slider("Max rows", 10, 200, 50)
# A single button: two widgets with the same label and no key raise a duplicate-ID error
if st.button("Run Query"):
    with st.spinner("Querying Parquet files..."):
        result = query_parquet(s3_path, group_by_col, metric_col, row_limit)
    # Store the columns with the result so the chart matches the data it was built from
    st.session_state.last_result = (group_by_col, metric_col, result)
if "last_result" in st.session_state:
    shown_group, shown_metric, df = st.session_state.last_result
    col1, col2 = st.columns([2, 1])
    with col1:
        fig = px.bar(
            df.head(20),
            x=shown_group,
            y=f"total_{shown_metric}",
            color=f"avg_{shown_metric}",
            color_continuous_scale="Teal",
            title=f"Top 20 {shown_group} by total {shown_metric}",
        )
        fig.update_layout(paper_bgcolor="rgba(0,0,0,0)", plot_bgcolor="rgba(0,0,0,0)", font_color="#e5e7eb")
        st.plotly_chart(fig, use_container_width=True)
    with col2:
        st.dataframe(df, use_container_width=True, hide_index=True)
Forms: Batch Widget Submissions
By default, every widget interaction triggers an immediate rerun. For expensive operations like a database write or an API call, this means the operation could fire each time a single field changes, before the user has finished filling in the rest. st.form batches all widget interactions inside the form and only triggers a rerun when the submit button is clicked, the same pattern as an HTML form.
import streamlit as st
with st.form("pipeline_config_form"):
    st.subheader("New Pipeline Configuration")
    pipeline_name = st.text_input("Pipeline name", placeholder="ingest_orders_v2")
    schedule = st.selectbox("Schedule", ["@hourly", "@daily", "@weekly", "custom"])
    # Widgets inside a form do not trigger reruns, so conditional or disabled fields
    # cannot react before submit; show them unconditionally and validate afterwards
    cron_expr = st.text_input("Cron expression (used when schedule is custom)", placeholder="0 2 * * *")
    source_conn = st.selectbox("Source connection", ["postgres_prod", "mysql_dwh", "bigquery_export"])
    target_schema = st.text_input("Target schema", placeholder="analytics")
    enable_alerts = st.checkbox("Enable failure alerts", value=True)
    alert_channels = st.multiselect(
        "Alert channels",
        ["slack-data-eng", "pagerduty", "email-oncall"],
    )
    submitted = st.form_submit_button("Create Pipeline", type="primary")
if submitted:
    if not pipeline_name:
        st.error("Pipeline name is required.")
    elif not pipeline_name.replace("_", "").isalnum():
        st.error("Pipeline name must be alphanumeric with underscores only.")
    elif schedule == "custom" and not cron_expr.strip():
        st.error("A cron expression is required for a custom schedule.")
    else:
        with st.spinner("Creating pipeline..."):
            # create_pipeline is a placeholder for your actual pipeline creation logic
            result = create_pipeline(
                name=pipeline_name,
                schedule=cron_expr if schedule == "custom" else schedule,
                source=source_conn,
                target_schema=target_schema,
            )
        st.success(f"Pipeline '{pipeline_name}' created with ID: {result['id']}")
        st.json(result)
Multi-Page Apps
Streamlit auto-discovers Python files in a pages/ directory and adds them to the sidebar navigation. File names control display order (numeric prefix) and page title (underscores become spaces). Session state is shared across pages — a filter set on the Overview page is visible on the Detail page. Each page file is a standard Streamlit script with access to the same st.* functions and cached resources.
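The naming convention can be approximated in a few lines. The helper below, page_label, is a hypothetical illustration of the rule described above (numeric prefix for order, underscores to spaces for the title); it is not Streamlit's own parsing code and may differ from it in edge cases.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def page_label(filename: str) -> Tuple[Optional[int], str]:
    """Split 'N_Some_Title.py' into (sort order, sidebar label)."""
    stem = Path(filename).stem                      # drop directory and .py suffix
    match = re.match(r"^(\d+)[_\- ]+(.*)$", stem)   # leading number plus separator
    order = int(match.group(1)) if match else None
    title = match.group(2) if match else stem
    return order, title.replace("_", " ")


pages = ["3_Data_Quality.py", "1_Overview.py", "2_Pipeline_Health.py"]
for name in sorted(pages, key=lambda p: page_label(p)[0]):
    print(page_label(name))
```

Renaming a file is therefore enough to reorder or retitle a page; no registration code is involved.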
# pages/2_Pipeline_Health.py
import streamlit as st
from data.queries import get_db_connection, get_pipeline_metrics
# Page config applies per page and overrides app.py defaults for this page only
st.set_page_config(page_title="Pipeline Health", layout="wide")
# Session state set in app.py or other pages is accessible here
if "date_range" not in st.session_state or "pipelines" not in st.session_state:
    st.warning("Please set filters on the Overview page first.")
    st.stop()  # halt script execution; nothing below runs
start, end = st.session_state.date_range
# Cached resources are shared, so this returns the connection other pages already created
conn = get_db_connection("warehouse.ddb")
df = get_pipeline_metrics(start.isoformat(), end.isoformat(), tuple(sorted(st.session_state.pipelines)))
# df has one row per pipeline and day: pipeline_name, run_date, total_runs,
# success_runs, avg_duration_s, total_rows
# Use st.tabs for sub-sections within a page
tab_overview, tab_errors, tab_sla = st.tabs(["Overview", "Error Analysis", "SLA Tracking"])
with tab_overview:
    st.dataframe(df, use_container_width=True)
with tab_errors:
    # Days on which at least one run did not finish successfully
    failed = df[df["success_runs"] < df["total_runs"]]
    if failed.empty:
        st.success("No failures in the selected period.")
    else:
        st.error(f"{len(failed)} pipeline-days with unsuccessful runs detected.")
        st.dataframe(failed[["pipeline_name", "run_date", "total_runs", "success_runs"]], use_container_width=True)
with tab_sla:
    sla_target_s = st.number_input("SLA target (seconds)", value=300, min_value=60)
    sla_df = df.assign(meets_sla=df["avg_duration_s"] <= sla_target_s)
    st.metric("SLA Compliance", f"{sla_df['meets_sla'].mean():.1%}")
Real-Time Updates with st.empty and Auto-Refresh
For live monitoring dashboards, st.empty creates a single-element placeholder whose content can be overwritten on each refresh, and time.sleep() controls the polling interval. Use st.fragment (Streamlit 1.37+; introduced as st.experimental_fragment in 1.33) to re-run only a portion of the page on a timer without re-executing the full script, which reduces CPU load and latency for live metric cards.
import time
import pandas as pd
import streamlit as st
# get_latest_metrics() and load_historical_trends() are placeholders for your own queries
# --- st.empty pattern: fill a placeholder, sleep, then rerun ---
auto_refresh = st.toggle("Live mode", value=False)
refresh_interval = st.slider("Refresh interval (s)", 5, 60, 15, disabled=not auto_refresh)
placeholder = st.empty()
# st.rerun() ends the current run, so this is an `if`, not a `while` loop;
# while live mode is on, nothing below this block executes
if auto_refresh:
    with placeholder.container():
        df = get_latest_metrics()
        col1, col2, col3 = st.columns(3)
        col1.metric("Active pipelines", df["active"].iloc[0], delta=df["active_delta"].iloc[0])
        col2.metric("Failed last hour", df["failed_1h"].iloc[0], delta_color="inverse")
        col3.metric("Avg latency", f"{df['avg_latency_s'].iloc[0]:.1f}s")
        st.caption(f"Last updated: {pd.Timestamp.now().strftime('%H:%M:%S')}")
    time.sleep(refresh_interval)
    st.rerun()
# --- Alternative, st.fragment: re-run a section independently (Streamlit >= 1.37) ---
@st.fragment(run_every="15s")
def live_metric_cards():
    df = get_latest_metrics()
    col1, col2, col3 = st.columns(3)
    col1.metric("Active pipelines", df["active"].iloc[0])
    col2.metric("Failed last hour", df["failed_1h"].iloc[0])
    col3.metric("Avg latency", f"{df['avg_latency_s'].iloc[0]:.1f}s")
# Call the fragment; it refreshes every 15s independently
live_metric_cards()
# This part of the page does NOT re-run every 15s
st.dataframe(load_historical_trends(), use_container_width=True)
ML Model Dashboards
Streamlit is widely used to surface ML model performance and enable exploratory inference. MLflow experiment tracking integrates with Streamlit through the MLflow Python client: you can query runs, compare metrics, and display artifacts in the same app that serves predictions, without exporting data to a separate BI tool. The pattern below loads a model with @st.cache_resource, accepts user input through widgets, runs inference, and displays the predicted churn probability with a colour-coded risk label.
import mlflow
import pandas as pd
import streamlit as st
@st.cache_resource
def load_model(model_uri: str):
    return mlflow.pyfunc.load_model(model_uri)
@st.cache_data(ttl="1h")
def get_experiment_runs(experiment_name: str) -> pd.DataFrame:
    client = mlflow.tracking.MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        return pd.DataFrame()  # unknown experiment: show an empty table instead of crashing
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.val_f1 DESC"],
        max_results=20,
    )
    return pd.DataFrame([{
        "run_id": r.info.run_id[:8],
        "model_type": r.data.params.get("model_type", "unknown"),
        "val_f1": r.data.metrics.get("val_f1", 0),
        "val_precision": r.data.metrics.get("val_precision", 0),
        "val_recall": r.data.metrics.get("val_recall", 0),
        "run_at": pd.Timestamp(r.info.start_time, unit="ms"),
    } for r in runs])
st.title("Churn Prediction — Model Dashboard")
col_model, col_runs = st.columns([1, 2])
with col_model:
    st.subheader("Run Inference")
    model_uri = st.text_input("Model URI", "models:/churn-predictor/Production")
    model = load_model(model_uri)
    with st.form("inference_form"):
        tenure_months = st.number_input("Tenure (months)", 0, 120, 12)
        monthly_charges = st.number_input("Monthly charges ($)", 0.0, 500.0, 65.0)
        num_products = st.selectbox("Products subscribed", [1, 2, 3, 4])
        has_support = st.checkbox("Has support contract")
        predict_btn = st.form_submit_button("Predict Churn Risk")
    if predict_btn:
        features = pd.DataFrame([{
            "tenure_months": tenure_months,
            "monthly_charges": monthly_charges,
            "num_products": num_products,
            "has_support_contract": int(has_support),
        }])
        # Assumes the logged model's predict() returns one churn probability per row
        proba = float(model.predict(features)[0])
        risk_label = "High" if proba > 0.7 else "Medium" if proba > 0.4 else "Low"
        # Streamlit's coloured-text markdown takes colour names, not hex codes
        color = "red" if proba > 0.7 else "orange" if proba > 0.4 else "green"
        st.markdown(f"### Churn Risk: :{color}[{risk_label}]")
        st.metric("Churn Probability", f"{proba:.1%}")
with col_runs:
    st.subheader("Experiment History")
    runs_df = get_experiment_runs("churn-prediction")
    st.dataframe(runs_df, use_container_width=True, hide_index=True)
Docker and Kubernetes Deployment
For production deployments beyond Streamlit Community Cloud, containerize the app and run it on Kubernetes behind an ingress with authentication. The key configuration points are: mounting secrets via Kubernetes Secrets (not environment variables in the Deployment spec), setting server.headless = true to disable the browser-open behavior, and using a non-root user in the Dockerfile to satisfy the cluster's pod security requirements.
# Dockerfile: multi-stage build for a lean production image
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
FROM python:3.12-slim AS runtime
# Non-root user for pod security compliance; UID 1000 matches runAsUser in the Deployment
RUN useradd --uid 1000 --create-home --shell /bin/bash appuser
WORKDIR /home/appuser/app
# Copy installed packages from builder
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
ENV PATH=/home/appuser/.local/bin:${PATH}
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 8501
# python:3.12-slim ships without curl, so probe the health endpoint with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8501/_stcore/health', timeout=3)" || exit 1
ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.headless=true", "--server.fileWatcherType=none"]
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pipeline-dashboard
  namespace: data-apps
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pipeline-dashboard
  template:
    metadata:
      labels:
        app: pipeline-dashboard
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: streamlit
          image: registry.example.com/pipeline-dashboard:v1.4.2
          ports:
            - containerPort: 8501
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            - name: STREAMLIT_SERVER_HEADLESS
              value: "true"
          volumeMounts:
            - name: streamlit-secrets
              mountPath: /home/appuser/app/.streamlit/secrets.toml
              subPath: secrets.toml
              readOnly: true
          readinessProbe:
            httpGet:
              path: /_stcore/health
              port: 8501
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /_stcore/health
              port: 8501
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: streamlit-secrets
          secret:
            secretName: pipeline-dashboard-secrets
---
apiVersion: v1
kind: Service
metadata:
  name: pipeline-dashboard
  namespace: data-apps
spec:
  selector:
    app: pipeline-dashboard
  ports:
    - port: 80
      targetPort: 8501
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pipeline-dashboard
  namespace: data-apps
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2proxy.internal/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://oauth2proxy.internal/oauth2/sign_in"
spec:
  rules:
    - host: pipelines.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: pipeline-dashboard
                port:
                  number: 80
# Create the Kubernetes Secret from .streamlit/secrets.toml
kubectl create secret generic pipeline-dashboard-secrets --from-file=secrets.toml=.streamlit/secrets.toml --namespace data-apps --dry-run=client -o yaml | kubectl apply -f -
# .streamlit/secrets.toml structure (gitignored)
[database]
host = "postgres.internal"
port = 5432
database = "warehouse"
username = "dashboard_ro"
password = "..."
[aws]
access_key_id = "..."
secret_access_key = "..."
region = "eu-west-1"
# Access in app.py:
# db_host = st.secrets["database"]["host"]
# Or with attribute access: st.secrets.database.password
Production Checklist
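Keep arguments to @st.cache_data functions hashable and stable. Streamlit hashes common types (strings, numbers, dates, lists, tuples, dicts, DataFrames) itself, so no conversion is required for them, but normalising selections with tuple(sorted(...)) avoids duplicate cache entries for the same choice in a different order. Arguments Streamlit cannot hash raise UnhashableParamError at runtime: either prefix the parameter with an underscore to exclude it from the cache key, or use st.cache_data(hash_funcs={SomeType: custom_hash}) with a function that produces a stable value from the object's contents.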
Use @st.cache_resource for singleton objects: database connections, ML models, and Elasticsearch clients. These objects are created once per worker process and shared across all sessions — creating a new connection per rerun (or per session) will exhaust connection pools within minutes under any meaningful concurrent load.
Set ttl= on @st.cache_data for all data-fetching functions. Without a TTL, cached results live until the app restarts or the cache is manually cleared. A dashboard showing yesterday's pipeline health because the cache never expired is a common production incident. Default to conservative TTLs (5–15 minutes) and make them configurable via st.secrets.
Never put secrets in .streamlit/config.toml or hard-code them in the app. Use .streamlit/secrets.toml locally (gitignored) and a Kubernetes Secret volume mount in production. Access via st.secrets['section']['key'] — Streamlit raises a clear error if a required secret is missing, preventing silent fallback to insecure defaults.
Use st.form for any widget group that triggers a write operation (database insert, API call, pipeline trigger). Without st.form, each committed widget change causes a rerun, so a write placed in the script body can fire while the user is still filling in the remaining fields. st.form batches all widget values and only submits on button click, matching the user's mental model of a form.
Profile memory usage before deploying. @st.cache_data keeps one serialised copy per unique argument set, but every call hands the caller its own deserialised copy, so 20 concurrent sessions each holding a 500 MB DataFrame need roughly 20 × 500 MB = 10 GB on top of the cache itself. Use per-query column projection to load only needed columns, Parquet predicate pushdown to reduce scanned rows, and consider serving aggregated data instead of raw records to the dashboard.
Run at least 2 replicas in Kubernetes for zero-downtime rolling deploys. Session state is in-process and not shared between replicas — sticky sessions via nginx.ingress.kubernetes.io/upstream-hash-by: '$remote_addr' ensure a user's session state stays on the same pod throughout their session, avoiding broken wizard state or lost filter selections mid-interaction.
Add a health check endpoint and configure readinessProbe on /_stcore/health. Kubernetes will not route traffic to a pod that fails the readiness check, preventing users from hitting a pod that is still initialising cached resources (model loading, first DB connection). The liveness probe on the same endpoint restarts pods that become unresponsive after startup.
Use st.fragment (Streamlit 1.37+; st.experimental_fragment in 1.33 to 1.36) for live metric sections that update frequently. Without fragments, auto-refresh reruns the entire script: every widget call runs again, every chart is re-rendered, and the page can visibly flicker. Fragment reruns re-execute only the decorated function body, which can cut CPU usage substantially for dashboards that mix static and live content; measure the effect on your own app rather than assuming a fixed saving.
Log user interactions for usage analytics. Wrap key actions in try/except and emit structured events: what query was run, what filters were selected, how long the cache miss took. Streamlit has no built-in analytics — without instrumentation you cannot distinguish which dashboard sections are used versus which exist as technical debt.
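As a starting point for that instrumentation, the sketch below emits one JSON line per action using only the standard library. The names build_event and track, the logger name, and the event fields are illustrative assumptions; adapt them to whatever log pipeline you already run.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("dashboard.usage")


def build_event(action: str, **fields) -> str:
    """Serialise one usage event as a JSON line."""
    return json.dumps({"action": action, "ts": time.time(), **fields}, sort_keys=True, default=str)


@contextmanager
def track(action: str, **fields):
    """Time a block of work and log one event; errors are recorded, then re-raised."""
    started = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        elapsed_ms = round((time.perf_counter() - started) * 1000, 1)
        logger.info(build_event(action, status=status, elapsed_ms=elapsed_ms, **fields))


with track("run_query", group_by="country", metric="revenue"):
    sum(range(1000))  # placeholder for the real query
```

Wrapping the body of a button handler or a cached query call in track(...) is usually enough to see which sections are used and how long cache misses take.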
Does your data engineering team share pipeline health as Jupyter notebooks that only run on the author's machine? Do analysts wait for CSV exports to answer ad-hoc questions? Does your Streamlit app crash under concurrent sessions because it loads raw DataFrames into memory without caching?
We build and deploy production Streamlit data applications — from project structure and DuckDB query layer design with st.cache_resource connection singletons and st.cache_data TTL configuration, through multi-page app layout with shared session state across pages, Plotly and Altair chart integration with dark theme overrides, st.form patterns for safe database write operations, st.fragment live monitoring sections with per-section refresh intervals, MLflow experiment and model registry integration for inference dashboards, Docker multi-stage builds with non-root users and health endpoints, Kubernetes Deployment manifests with Secret volume mounts for credentials, readiness and liveness probe configuration, nginx sticky session ingress annotations for multi-replica deployments, and production monitoring for memory usage and cache hit rates. Let’s talk.