Streamlit for Data Engineers — Interactive Dashboards, Caching, and Deployment Patterns

Why Streamlit Fits Data Engineering Work

Data engineers spend most of their time in Python — writing pipeline code, running analytical queries, and debugging data quality issues. Sharing results typically means either a Jupyter notebook (not reproducible without the right kernel), a static CSV (no interactivity), or a Jira comment with a screenshot. Streamlit closes this gap: it converts a plain Python script into a live web application by re-running the script top-to-bottom on every widget interaction. No HTML, no JavaScript, no deployment ceremony for prototypes.

The model is intentionally simple. A Streamlit app is a Python file. Running streamlit run app.py starts a local server. Every slider, selectbox, or button that the user interacts with causes the script to re-execute, and the framework diffs the output to update only the changed elements. This execution model means you write procedural Python (no required callbacks, no reactive graph, no component lifecycle) and the framework handles reactivity. Where Evidence.dev targets code-driven analytics reports with a SQL-first notebook model, Streamlit targets Python-native teams that need full programmatic control over layout, data transformations, and integrations.

Python-Native

Write apps with the same libraries you use in pipelines — pandas, polars, DuckDB, scikit-learn. No DSL, no template engine, no framework lock-in beyond Streamlit itself.

Incremental

Start with st.write(df) and add interactivity incrementally. Every component follows the same pattern: call a function, get a value, use it in the next line.

Deployable

Deploy to Streamlit Community Cloud for free with a GitHub URL, or containerize with Docker and run on Kubernetes for production workloads requiring authentication and resource controls.

Installation and Project Structure

Streamlit is a single PyPI package. Python 3.9+ is required. The recommended project structure separates app logic from data access and keeps secrets out of version control via .streamlit/secrets.toml.

pip install streamlit

# Verify installation
streamlit hello

# Run your app (auto-reloads on file save)
streamlit run app.py

# Run on a specific port (useful in containers)
streamlit run app.py --server.port 8501 --server.address 0.0.0.0

# Recommended project layout
myapp/
├── app.py                    # entry point
├── pages/                    # multi-page apps (auto-discovered)
│   ├── 1_Overview.py
│   ├── 2_Pipeline_Health.py
│   └── 3_Data_Quality.py
├── components/               # reusable chart/widget helpers
│   ├── charts.py
│   └── filters.py
├── data/                     # data access layer
│   └── queries.py
├── .streamlit/
│   ├── config.toml           # theme and server settings (committed)
│   └── secrets.toml          # credentials (gitignored)
├── requirements.txt
└── Dockerfile

# .streamlit/config.toml — committed to version control
[theme]
base = "dark"
primaryColor = "#22d3ee"      # accent colour matching your brand
backgroundColor = "#0A0A0A"
secondaryBackgroundColor = "#111111"
textColor = "#e5e7eb"
font = "monospace"

[server]
headless = true               # required for containerised deployments
enableCORS = false
port = 8501

[runner]
magicEnabled = false          # disable implicit st.write() for explicit control

Core Data Components — DataFrames, Charts, and Widgets

Streamlit's data display primitives work directly with pandas DataFrames, NumPy arrays, and dictionaries. st.dataframe() renders an interactive, sortable table; st.table() renders a static version. For charts, the recommended path for data engineers is Plotly (full control) or Altair (declarative grammar). The built-in st.line_chart() and st.bar_chart() are useful for quick iteration but offer only limited control over axes and theming.

import streamlit as st
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

st.set_page_config(
    page_title="Pipeline Health Dashboard",
    page_icon="📊",
    layout="wide",
    initial_sidebar_state="expanded",
)

# --- Sidebar filters ---
with st.sidebar:
    st.header("Filters")
    date_range = st.date_input("Date range", value=[])
    pipeline_names = st.multiselect(
        "Pipelines",
        options=["ingest_orders", "transform_users", "load_analytics", "export_reports"],
        default=["ingest_orders", "transform_users"],
    )
    status_filter = st.radio("Status", ["All", "Failed", "Success", "Running"], index=0)

# --- DataFrame display ---
@st.cache_data(ttl=300)
def load_pipeline_runs(pipelines: list[str]) -> pd.DataFrame:
    # Replace with your actual data source
    import numpy as np
    rng = np.random.default_rng(42)
    rows = []
    for p in pipelines:
        for i in range(20):
            rows.append({
                "pipeline": p,
                "run_id": f"{p}_{i:04d}",
                "status": rng.choice(["success", "failed", "running"], p=[0.8, 0.15, 0.05]),
                "duration_s": int(rng.exponential(120)),
                "rows_processed": int(rng.exponential(50_000)),
                "started_at": pd.Timestamp("2026-06-01") + pd.Timedelta(hours=int(i * 6)),
            })
    return pd.DataFrame(rows)

df = load_pipeline_runs(pipeline_names)

# Apply status filter
if status_filter != "All":
    df = df[df["status"] == status_filter.lower()]

# Metrics row
col1, col2, col3, col4 = st.columns(4)
col1.metric("Total Runs", len(df))
col2.metric("Success Rate", f"{(df['status'] == 'success').mean():.1%}")
col3.metric("Avg Duration", f"{df['duration_s'].mean():.0f}s")
col4.metric("Rows Processed", f"{df['rows_processed'].sum():,.0f}")

st.divider()

# Interactive DataFrame with column configuration
st.subheader("Pipeline Runs")
st.dataframe(
    df,
    column_config={
        "status": st.column_config.SelectboxColumn(
            "Status",
            options=["success", "failed", "running"],
        ),
        "duration_s": st.column_config.NumberColumn("Duration (s)", format="%d s"),
        "rows_processed": st.column_config.NumberColumn("Rows", format="%,d"),
        "started_at": st.column_config.DatetimeColumn("Started", format="YYYY-MM-DD HH:mm"),
    },
    use_container_width=True,
    hide_index=True,
)

# Plotly chart
fig = px.histogram(
    df,
    x="duration_s",
    color="status",
    color_discrete_map={"success": "#22d3ee", "failed": "#f87171", "running": "#a3a3a3"},
    nbins=30,
    title="Run Duration Distribution",
    labels={"duration_s": "Duration (seconds)", "count": "Runs"},
)
fig.update_layout(
    paper_bgcolor="rgba(0,0,0,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    font_color="#e5e7eb",
)
st.plotly_chart(fig, use_container_width=True)

Session State — Persisting Values Across Reruns

Because Streamlit re-runs the entire script on every widget interaction, local variables do not persist between runs. st.session_state is a dictionary-like object scoped to a browser session that survives reruns. It is the correct place to accumulate user selections, track multi-step wizard state, or store the results of expensive one-shot operations like an initial data load triggered by a button click.

import streamlit as st
import pandas as pd  # used for pd.Timestamp.now() below

# --- Pattern 1: Initialise with a default ---
if "page" not in st.session_state:
    st.session_state.page = "overview"
if "selected_run_id" not in st.session_state:
    st.session_state.selected_run_id = None
if "query_history" not in st.session_state:
    st.session_state.query_history = []

# --- Pattern 2: Button-triggered one-shot operation ---
# The button returns True only on the click rerun; the result is stored in session_state
if st.button("Run Full Scan"):
    with st.spinner("Scanning pipeline metadata..."):
        # This only executes once per button click
        result = run_expensive_scan()  # your actual function
        st.session_state.scan_result = result
        st.session_state.scan_ran_at = pd.Timestamp.now()

if "scan_result" in st.session_state:
    st.success(f"Scan completed at {st.session_state.scan_ran_at}")
    st.dataframe(st.session_state.scan_result)

# --- Pattern 3: Multi-step wizard ---
STEPS = ["Select Source", "Configure Transform", "Preview", "Confirm"]

step_idx = st.session_state.get("wizard_step", 0)
st.progress((step_idx + 1) / len(STEPS), text=f"Step {step_idx + 1}: {STEPS[step_idx]}")

if step_idx == 0:
    source = st.selectbox("Source table", ["orders", "customers", "products"])
    if st.button("Next →"):
        st.session_state.wizard_source = source
        st.session_state.wizard_step = 1
        st.rerun()

elif step_idx == 1:
    st.write(f"Configuring transform for: **{st.session_state.wizard_source}**")
    agg_col = st.selectbox("Aggregate by", ["day", "week", "month"])
    col1, col2 = st.columns(2)
    if col1.button("← Back"):
        st.session_state.wizard_step = 0
        st.rerun()
    if col2.button("Next →"):
        st.session_state.wizard_agg = agg_col
        st.session_state.wizard_step = 2
        st.rerun()
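
The wizard above writes wizard_step directly in each button handler. As more steps are added, a small pure helper can keep the index in range. The sketch below is one way to do that; advance_step is a hypothetical name, not part of Streamlit.

```python
STEPS = ["Select Source", "Configure Transform", "Preview", "Confirm"]


def advance_step(current: int, delta: int, total: int = len(STEPS)) -> int:
    """Return the new step index, clamped to the range [0, total - 1]."""
    return max(0, min(total - 1, current + delta))


# Assumed usage in the app: st.session_state.wizard_step = advance_step(step_idx, +1)
print(advance_step(0, +1))  # forward from the first step
print(advance_step(0, -1))  # "Back" on the first step stays on the first step
print(advance_step(3, +1))  # "Next" on the last step stays on the last step
```

Keeping the transition logic outside the button handlers also makes it testable without starting a Streamlit server.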

Note

st.rerun() (formerly st.experimental_rerun()) immediately stops the current script execution and triggers a fresh run. Use it sparingly: it causes a full rerun, which re-evaluates every widget call (cached functions return their stored results rather than executing again). It is the correct tool for wizard navigation and post-form-submission redirects, but using it after every state mutation leads to double reruns and confusing behavior.

Caching — `@st.cache_data` and `@st.cache_resource`

Because every widget interaction triggers a full script rerun, expensive operations — database queries, file reads, model inference — will execute on every interaction without caching. Streamlit provides two caching decorators that cover almost all production use cases: @st.cache_data for functions that return data (DataFrames, lists, dicts) and @st.cache_resource for functions that return shared resources like database connections and ML models that should not be copied between sessions.

import streamlit as st
import pandas as pd
import duckdb

# --- @st.cache_data ---
# Cache functions that transform and return data.
# Results are serialised (pickled) and stored per unique set of arguments.
# Each call with the same arguments returns a deep copy — safe for mutation.
# ttl= controls expiry: "10m", "1h", 3600 (seconds), or timedelta.

@st.cache_data(ttl="10m", show_spinner="Loading pipeline metrics...")
def get_pipeline_metrics(
    start_date: str,
    end_date: str,
    pipeline_names: tuple[str, ...],  # a sorted tuple keeps the cache key stable; lists are also accepted
) -> pd.DataFrame:
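    if not pipeline_names:
        return pd.DataFrame()  # nothing selected; an empty IN () would be invalid SQL
    conn = duckdb.connect("metrics.ddb", read_only=True)
    try:
        # Bind values as query parameters instead of interpolating them into the SQL text
        placeholders = ", ".join("?" for _ in pipeline_names)
        return conn.execute(f"""
            SELECT
                pipeline_name,
                date_trunc('day', run_at) AS run_date,
                count(*) AS total_runs,
                count_if(status = 'success') AS success_runs,
                avg(duration_s) AS avg_duration_s,
                sum(rows_processed) AS total_rows
            FROM pipeline_runs
            WHERE run_at BETWEEN CAST(? AS TIMESTAMP) AND CAST(? AS TIMESTAMP)
              AND pipeline_name IN ({placeholders})
            GROUP BY 1, 2
            ORDER BY 2 DESC
        """, [start_date, end_date, *pipeline_names]).df()
    finally:
        conn.close()  # runs even if the query raises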

# --- @st.cache_resource ---
# Cache functions that return shared resources.
# The resource is created ONCE and shared across all sessions and reruns.
# NOT copied between calls — return a connection pool or read-only model.
# max_entries= controls how many unique resources are kept in memory.

@st.cache_resource(max_entries=3)
def get_db_connection(db_path: str) -> duckdb.DuckDBPyConnection:
    return duckdb.connect(db_path, read_only=True)

@st.cache_resource(show_spinner="Loading ML model...")
def load_anomaly_detector(model_path: str):
    import joblib
    return joblib.load(model_path)

# Usage in app
conn = get_db_connection("warehouse.ddb")
model = load_anomaly_detector("models/anomaly_detector_v3.pkl")

# Parameters from sidebar (sorting the selection keeps the cache key independent of click order)
start = st.sidebar.date_input("Start").isoformat()
end = st.sidebar.date_input("End").isoformat()
pipelines = tuple(sorted(st.sidebar.multiselect("Pipelines", ["a", "b", "c"])))

df = get_pipeline_metrics(start, end, pipelines)
st.dataframe(df)

# Manual cache invalidation
if st.button("Refresh Data"):
    get_pipeline_metrics.clear()  # clear all cached results for this function
    st.rerun()

Note

@st.cache_data builds its cache key by hashing the function's arguments with Streamlit's own hasher, which handles strings, numbers, dates, tuples, lists, dicts, pandas DataFrames, and NumPy arrays. Arguments it cannot hash, such as arbitrary custom objects or database connections, raise UnhashableParamError; prefix such a parameter with an underscore (for example _conn) to exclude it from the key, or supply hash_funcs. Hashing a large DataFrame on every call is expensive, so prefer passing a path or query key and loading the data inside the cached function.
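
Separately from which types can be hashed, two calls share a cache entry only when their arguments hash to the same value. The sketch below shows one way to normalise sidebar selections so that click order and duplicates do not create separate entries. normalise_filters is a hypothetical helper, not part of Streamlit.

```python
from datetime import date


def normalise_filters(
    pipelines: list[str], start: date, end: date
) -> tuple[tuple[str, ...], str, str]:
    """Build an order-independent, duplicate-free argument set for a cached query."""
    if start > end:
        start, end = end, start  # tolerate a reversed date range
    return tuple(sorted(set(pipelines))), start.isoformat(), end.isoformat()


a = normalise_filters(["load", "ingest"], date(2026, 6, 1), date(2026, 6, 7))
b = normalise_filters(["ingest", "load", "ingest"], date(2026, 6, 1), date(2026, 6, 7))
print(a == b)  # the same selection in a different order maps to the same arguments
```

The returned values can be passed straight to a cached function such as get_pipeline_metrics.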

DuckDB Integration — In-Process Analytical Queries

DuckDB and Streamlit are a natural pair for data engineering dashboards. DuckDB's in-process columnar engine can scan Parquet files, S3 paths, and existing pandas DataFrames with full SQL, without a database server, connection pooling overhead, or import/export steps. Combined with @st.cache_resource for the connection and @st.cache_data for query results, this supports interactive analytics over large Parquet datasets on a single app server; actual latency depends on data volume, file layout, and hardware.

import streamlit as st
import duckdb
import pandas as pd
import plotly.express as px

@st.cache_resource
def get_duckdb() -> duckdb.DuckDBPyConnection:
    conn = duckdb.connect()
    # Install and load the httpfs extension for S3/HTTP Parquet access
    conn.execute("INSTALL httpfs; LOAD httpfs;")
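    # Read credentials from st.secrets; shell-style ${VAR} placeholders are not
    # expanded inside a Python string and would be sent to DuckDB literally
    aws = st.secrets["aws"]
    conn.execute(f"SET s3_region = '{aws['region']}'")
    conn.execute(f"SET s3_access_key_id = '{aws['access_key_id']}'")
    conn.execute(f"SET s3_secret_access_key = '{aws['secret_access_key']}'")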
    return conn

@st.cache_data(ttl="5m")
def query_parquet(
    s3_path: str,
    group_by: str,
    metric: str,
    limit: int = 50,
) -> pd.DataFrame:
    conn = get_duckdb()
    s3_path = s3_path.rstrip("/")  # the default input ends with "/", which would put "//" in the glob
    return conn.execute(f"""
        SELECT
            {group_by},
            count(*) AS record_count,
            sum({metric}) AS total_{metric},
            avg({metric}) AS avg_{metric},
            percentile_cont(0.95) WITHIN GROUP (ORDER BY {metric}) AS p95_{metric}
        FROM read_parquet('{s3_path}/**/*.parquet', hive_partitioning = true)
        GROUP BY {group_by}
        ORDER BY total_{metric} DESC
        LIMIT {limit}
    """).df()

# App UI
st.title("Data Lake Explorer")

with st.sidebar:
    s3_path = st.text_input("S3 Path", "s3://my-bucket/events/")
    group_by_col = st.selectbox("Group By", ["event_type", "country", "platform", "user_segment"])
    metric_col = st.selectbox("Metric", ["revenue", "session_duration_s", "page_views"])
    row_limit = st.slider("Max rows", 10, 200, 50)

# Run the query only on click. The result and the columns it was built from are
# kept in session state so later reruns can redraw it without re-querying.
if st.button("Run Query"):
    with st.spinner("Querying Parquet files..."):
        st.session_state.last_result = query_parquet(s3_path, group_by_col, metric_col, row_limit)
        st.session_state.last_query = (group_by_col, metric_col)

if "last_result" in st.session_state:
    df = st.session_state.last_result
    shown_group, shown_metric = st.session_state.last_query
    col1, col2 = st.columns([2, 1])

    with col1:
        fig = px.bar(
            df.head(20),
            x=shown_group,
            y=f"total_{shown_metric}",
            color=f"avg_{shown_metric}",
            color_continuous_scale="Teal",
            title=f"Top 20 {shown_group} by total {shown_metric}",
        )
        fig.update_layout(paper_bgcolor="rgba(0,0,0,0)", plot_bgcolor="rgba(0,0,0,0)", font_color="#e5e7eb")
        st.plotly_chart(fig, use_container_width=True)

    with col2:
        st.dataframe(df, use_container_width=True, hide_index=True)
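
The query above interpolates the group-by and metric column names into the SQL text. Column identifiers cannot be bound as query parameters, so one common safeguard is to check them against an allow-list first. The sketch below assumes the allowed sets mirror the selectbox options; checked_identifier is a hypothetical helper.

```python
ALLOWED_GROUP_BY = {"event_type", "country", "platform", "user_segment"}
ALLOWED_METRICS = {"revenue", "session_duration_s", "page_views"}


def checked_identifier(name: str, allowed: set[str]) -> str:
    """Return name unchanged if it is an allowed column, otherwise raise."""
    if name not in allowed:
        raise ValueError(f"Unsupported column: {name!r}")
    return name


print(checked_identifier("country", ALLOWED_GROUP_BY))

try:
    checked_identifier("revenue; DROP TABLE events", ALLOWED_METRICS)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Calling the helper at the top of query_parquet would reject any value that did not come from the sidebar options, even if the function is later reused elsewhere.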

Forms — Batch Widget Submissions

By default, every widget interaction triggers an immediate rerun. For expensive operations like a database write or an API call, this means the operation could fire each time any input changes, for example whenever a text input loses focus or the user presses Enter. st.form batches all widget interactions inside the form and only triggers a rerun when the submit button is clicked, the same pattern as an HTML form.

import streamlit as st

with st.form("pipeline_config_form"):
    st.subheader("New Pipeline Configuration")

    pipeline_name = st.text_input("Pipeline name", placeholder="ingest_orders_v2")
    schedule = st.selectbox("Schedule", ["@hourly", "@daily", "@weekly", "custom"])

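    # Widgets inside a form do not trigger reruns, so a field shown only when
    # schedule == "custom" would never appear before submit. Always render it.
    cron_expr = st.text_input("Cron expression (only used for custom schedules)", placeholder="0 2 * * *")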

    source_conn = st.selectbox("Source connection", ["postgres_prod", "mysql_dwh", "bigquery_export"])
    target_schema = st.text_input("Target schema", placeholder="analytics")
    enable_alerts = st.checkbox("Enable failure alerts", value=True)
    alert_channels = st.multiselect(
        "Alert channels",
        ["slack-data-eng", "pagerduty", "email-oncall"],
        disabled=not enable_alerts,
    )

    submitted = st.form_submit_button("Create Pipeline", type="primary")

if submitted:
    if not pipeline_name:
        st.error("Pipeline name is required.")
    elif not pipeline_name.replace("_", "").isalnum():
        st.error("Pipeline name must be alphanumeric with underscores only.")
    else:
        with st.spinner("Creating pipeline..."):
            # Your actual pipeline creation logic here
            result = create_pipeline(
                name=pipeline_name,
                schedule=schedule,
                source=source_conn,
                target_schema=target_schema,
            )
        st.success(f"Pipeline '{pipeline_name}' created with ID: {result['id']}")
        st.json(result)
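
The submit handler above validates only the pipeline name. A sketch of collecting all checks in one function, including the cron field, could look like this. validate_pipeline_config is a hypothetical helper, and the five-field test is a simplification rather than a full cron parser.

```python
def validate_pipeline_config(name: str, schedule: str, cron_expr: str = "") -> list[str]:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors = []
    if not name:
        errors.append("Pipeline name is required.")
    elif not name.replace("_", "").isalnum():
        errors.append("Pipeline name must be alphanumeric with underscores only.")
    # Simplified check: a standard cron expression has five space-separated fields
    if schedule == "custom" and len(cron_expr.split()) != 5:
        errors.append("Custom schedules need a five-field cron expression.")
    return errors


print(validate_pipeline_config("ingest_orders_v2", "@daily"))
print(validate_pipeline_config("ingest orders", "custom", "0 2 * *"))
```

In the app, each returned message could be passed to st.error, with the write operation running only when the list is empty.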

Multi-Page Apps

Streamlit auto-discovers Python files in a pages/ directory and adds them to the sidebar navigation. File names control display order (numeric prefix) and page title (underscores become spaces). Session state is shared across pages — a filter set on the Overview page is visible on the Detail page. Each page file is a standard Streamlit script with access to the same st.* functions and cached resources.

# pages/2_Pipeline_Health.py
import streamlit as st
import pandas as pd

# Page config applies per-page — overrides app.py defaults for this page only
st.set_page_config(page_title="Pipeline Health", layout="wide")

# Session state set in app.py or other pages is accessible here
if "date_range" not in st.session_state:
    st.warning("Please set filters on the Overview page first.")
    st.stop()  # halt script execution — nothing below runs

start, end = st.session_state.date_range

# Cached helpers defined in data/queries.py (cache entries are shared across pages and sessions)
from data.queries import get_db_connection, get_pipeline_metrics
conn = get_db_connection("warehouse.ddb")
df = get_pipeline_metrics(start.isoformat(), end.isoformat(), tuple(st.session_state.pipelines))

# Use st.tabs for sub-sections within a page
tab_overview, tab_errors, tab_sla = st.tabs(["Overview", "Error Analysis", "SLA Tracking"])

with tab_overview:
    st.dataframe(df, use_container_width=True)

with tab_errors:
    # get_pipeline_metrics returns daily aggregates, so failures are derived from the run counts
    failed = df[df["success_runs"] < df["total_runs"]]
    if failed.empty:
        st.success("No failures in the selected period.")
    else:
        st.error(f"{len(failed)} pipeline-days with at least one non-successful run.")
        st.dataframe(failed[["pipeline_name", "run_date", "total_runs", "success_runs"]], use_container_width=True)

with tab_sla:
    sla_target_s = st.number_input("SLA target (seconds)", value=300, min_value=60)
    # Per-run durations are not in the aggregate, so compare the daily average instead
    sla_df = df.assign(meets_sla=df["avg_duration_s"] <= sla_target_s)
    st.metric("SLA Compliance", f"{sla_df['meets_sla'].mean():.1%}")
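
The file-naming convention for pages/ (numeric prefix for order, underscores rendered as spaces) can be previewed with a small helper before files are renamed. This is a sketch that approximates the convention described above, not Streamlit's own implementation; page_label is a hypothetical name, and placing unnumbered files last is an assumption.

```python
import re
from pathlib import Path


def page_label(filename: str) -> tuple[float, str]:
    """Split '2_Pipeline_Health.py' into a sort key and a display label."""
    stem = Path(filename).stem
    match = re.match(r"^(\d+)[_ ]+(.*)$", stem)
    if match:
        return int(match.group(1)), match.group(2).replace("_", " ")
    # No numeric prefix: assumed to sort after the numbered pages
    return float("inf"), stem.replace("_", " ")


files = ["3_Data_Quality.py", "1_Overview.py", "2_Pipeline_Health.py"]
for order, label in sorted(page_label(f) for f in files):
    print(order, label)
```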

Real-Time Updates with `st.empty` and Auto-Refresh

For live monitoring dashboards, st.empty creates a single-element placeholder that can be overwritten in a loop, and time.sleep() controls the polling interval. Use st.fragment (Streamlit 1.37+; available as st.experimental_fragment from 1.33) to re-run only a portion of the page on a timer without re-executing the full script, which reduces CPU load and latency for live metric cards.

import streamlit as st
import pandas as pd  # used for the "Last updated" timestamp
import time

# --- st.empty pattern: overwrite a placeholder in a polling loop ---
auto_refresh = st.toggle("Live mode", value=False)
refresh_interval = st.slider("Refresh interval (s)", 5, 60, 15, disabled=not auto_refresh)

placeholder = st.empty()

while auto_refresh:
    with placeholder.container():
        df = get_latest_metrics()  # your actual query
        col1, col2, col3 = st.columns(3)
        col1.metric("Active pipelines", df["active"].iloc[0], delta=df["active_delta"].iloc[0])
        col2.metric("Failed last hour", df["failed_1h"].iloc[0], delta_color="inverse")
        col3.metric("Avg latency", f"{df['avg_latency_s'].iloc[0]:.1f}s")
        st.caption(f"Last updated: {pd.Timestamp.now().strftime('%H:%M:%S')}")
    # Sleep, then loop again; any widget interaction stops this run and starts a new one
    time.sleep(refresh_interval)

# --- st.fragment: re-run a section independently (Streamlit >= 1.37) ---
@st.fragment(run_every="15s")
def live_metric_cards():
    df = get_latest_metrics()
    col1, col2, col3 = st.columns(3)
    col1.metric("Active pipelines", df["active"].iloc[0])
    col2.metric("Failed last hour", df["failed_1h"].iloc[0])
    col3.metric("Avg latency", f"{df['avg_latency_s'].iloc[0]:.1f}s")

# Call the fragment — it refreshes every 15s independently
live_metric_cards()

# This part of the page does NOT re-run every 15s
st.dataframe(load_historical_trends(), use_container_width=True)

ML Model Dashboards

Streamlit is widely used to surface ML model performance and enable exploratory inference. MLflow experiment tracking integrates directly with Streamlit through the MLflow Python client: query runs, compare metrics, and display artifacts in the same app that serves predictions, without exporting data to a separate BI tool. The pattern below loads a model with @st.cache_resource, accepts user input through widgets, runs inference, and displays the predicted churn probability with a risk label alongside recent experiment runs.

import streamlit as st
import mlflow
import pandas as pd

@st.cache_resource
def load_model(model_uri: str):
    return mlflow.pyfunc.load_model(model_uri)

@st.cache_data(ttl="1h")
def get_experiment_runs(experiment_name: str) -> pd.DataFrame:
    client = mlflow.tracking.MlflowClient()
    experiment = client.get_experiment_by_name(experiment_name)
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.val_f1 DESC"],
        max_results=20,
    )
    return pd.DataFrame([{
        "run_id": r.info.run_id[:8],
        "model_type": r.data.params.get("model_type", "unknown"),
        "val_f1": r.data.metrics.get("val_f1", 0),
        "val_precision": r.data.metrics.get("val_precision", 0),
        "val_recall": r.data.metrics.get("val_recall", 0),
        "run_at": pd.Timestamp(r.info.start_time, unit="ms"),
    } for r in runs])

st.title("Churn Prediction — Model Dashboard")

col_model, col_runs = st.columns([1, 2])

with col_model:
    st.subheader("Run Inference")
    model_uri = st.text_input("Model URI", "models:/churn-predictor/Production")
    model = load_model(model_uri)

    with st.form("inference_form"):
        tenure_months = st.number_input("Tenure (months)", 0, 120, 12)
        monthly_charges = st.number_input("Monthly charges ($)", 0.0, 500.0, 65.0)
        num_products = st.selectbox("Products subscribed", [1, 2, 3, 4])
        has_support = st.checkbox("Has support contract")
        predict_btn = st.form_submit_button("Predict Churn Risk")

    if predict_btn:
        features = pd.DataFrame([{
            "tenure_months": tenure_months,
            "monthly_charges": monthly_charges,
            "num_products": num_products,
            "has_support_contract": int(has_support),
        }])
        proba = model.predict(features)[0]
        risk_label = "High" if proba > 0.7 else "Medium" if proba > 0.4 else "Low"
        # Streamlit's :color[text] markdown accepts named colours, not hex codes
        color = "red" if proba > 0.7 else "orange" if proba > 0.4 else "green"
        st.markdown(f"### Churn Risk: :{color}[{risk_label}]")
        st.metric("Churn Probability", f"{proba:.1%}")

with col_runs:
    st.subheader("Experiment History")
    runs_df = get_experiment_runs("churn-prediction")
    st.dataframe(runs_df, use_container_width=True, hide_index=True)
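
The risk thresholds in the inference branch are written as nested conditional expressions. Moving them into a small function makes the cut-offs easier to test and change. The sketch below reuses the 0.7 and 0.4 thresholds from the example and returns colour names that Streamlit's markdown colour syntax accepts; risk_bucket is a hypothetical helper.

```python
def risk_bucket(proba: float) -> tuple[str, str]:
    """Map a churn probability to (label, colour name) using the 0.7 / 0.4 cut-offs."""
    if not 0.0 <= proba <= 1.0:
        raise ValueError(f"Probability out of range: {proba}")
    if proba > 0.7:
        return "High", "red"
    if proba > 0.4:
        return "Medium", "orange"
    return "Low", "green"


for p in (0.85, 0.55, 0.10):
    print(p, risk_bucket(p))
```

The range check also surfaces the case where the loaded model returns class labels or raw scores instead of probabilities.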

Docker and Kubernetes Deployment

For production deployments beyond Streamlit Community Cloud, containerize the app and run it on Kubernetes behind an ingress with authentication. The key configuration points are: mounting secrets via Kubernetes Secrets (not environment variables in the Deployment spec), setting server.headless = true to disable the browser-open behavior, and using a non-root user in the Dockerfile to satisfy pod security policies.

# Dockerfile — multi-stage build for a lean production image
FROM python:3.12-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.12-slim AS runtime

# Non-root user for pod security compliance
RUN useradd --create-home --shell /bin/bash appuser
USER appuser
WORKDIR /home/appuser/app

# Copy installed packages from builder
COPY --from=builder /root/.local /home/appuser/.local
ENV PATH=/home/appuser/.local/bin:${PATH}

COPY --chown=appuser:appuser . .

EXPOSE 8501
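# python:3.12-slim does not ship curl, so probe the health endpoint with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8501/_stcore/health', timeout=3)" || exit 1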

ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.headless=true", "--server.fileWatcherType=none"]

# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pipeline-dashboard
  namespace: data-apps
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pipeline-dashboard
  template:
    metadata:
      labels:
        app: pipeline-dashboard
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: streamlit
          image: registry.example.com/pipeline-dashboard:v1.4.2
          ports:
            - containerPort: 8501
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            - name: STREAMLIT_SERVER_HEADLESS
              value: "true"
          volumeMounts:
            - name: streamlit-secrets
              mountPath: /home/appuser/app/.streamlit/secrets.toml
              subPath: secrets.toml
              readOnly: true
          readinessProbe:
            httpGet:
              path: /_stcore/health
              port: 8501
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /_stcore/health
              port: 8501
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: streamlit-secrets
          secret:
            secretName: pipeline-dashboard-secrets
---
apiVersion: v1
kind: Service
metadata:
  name: pipeline-dashboard
  namespace: data-apps
spec:
  selector:
    app: pipeline-dashboard
  ports:
    - port: 80
      targetPort: 8501
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pipeline-dashboard
  namespace: data-apps
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2proxy.internal/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://oauth2proxy.internal/oauth2/sign_in"
spec:
  rules:
    - host: pipelines.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: pipeline-dashboard
                port:
                  number: 80

# Create the Kubernetes Secret from .streamlit/secrets.toml
kubectl create secret generic pipeline-dashboard-secrets \
  --from-file=secrets.toml=.streamlit/secrets.toml \
  --namespace data-apps \
  --dry-run=client -o yaml | kubectl apply -f -

# .streamlit/secrets.toml structure (gitignored)
[database]
host = "postgres.internal"
port = 5432
database = "warehouse"
username = "dashboard_ro"
password = "..."

[aws]
access_key_id = "..."
secret_access_key = "..."
region = "eu-west-1"

# Access in app.py:
# db_host = st.secrets["database"]["host"]
# Or with attribute access: st.secrets.database.password
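
A missing secret otherwise surfaces only when the code path that reads it first runs. The sketch below checks all required keys once at startup. It works on any nested mapping; in the app one would pass st.secrets, on the assumption that it supports the same key lookups used above. REQUIRED_SECRETS and missing_secrets are hypothetical names.

```python
from collections.abc import Mapping

REQUIRED_SECRETS = {
    "database": ["host", "port", "database", "username", "password"],
    "aws": ["access_key_id", "secret_access_key", "region"],
}


def missing_secrets(secrets: Mapping, required: dict[str, list[str]]) -> list[str]:
    """Return 'section.key' names that are absent from the secrets mapping."""
    missing = []
    for section, keys in required.items():
        present = secrets[section] if section in secrets else {}
        missing.extend(f"{section}.{key}" for key in keys if key not in present)
    return missing


# A plain dict stands in for st.secrets here
example = {"database": {"host": "postgres.internal", "port": 5432}}
print(missing_secrets(example, REQUIRED_SECRETS))
```

In app.py, a non-empty result could be shown with st.error followed by st.stop() so the app fails visibly instead of partway through a query.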

Production Checklist

Keep @st.cache_data arguments simple and stable. Streamlit hashes built-in types, including lists and dicts, as well as pandas DataFrames and NumPy arrays, but hashing a large object on every call is slow and custom classes may raise UnhashableParamError. Prefer small scalar or tuple arguments, prefix a parameter with an underscore to exclude it from the cache key, or pass hash_funcs={SomeType: custom_hash} to st.cache_data with a function that derives a stable value from the object's contents.

Use @st.cache_resource for singleton objects: database connections, ML models, and Elasticsearch clients. These objects are created once per server process and shared across all sessions; creating a new connection per rerun (or per session) can quickly exhaust connection pools under concurrent load.

Set ttl= on @st.cache_data for all data-fetching functions. Without a TTL, cached results live until the app restarts or the cache is manually cleared. A dashboard showing yesterday's pipeline health because the cache never expired is a common production incident. Default to conservative TTLs (5–15 minutes) and make them configurable via st.secrets.

Never put secrets in .streamlit/config.toml or hard-code them in the app. Use .streamlit/secrets.toml locally (gitignored) and a Kubernetes Secret volume mount in production. Access via st.secrets['section']['key'] — Streamlit raises a clear error if a required secret is missing, preventing silent fallback to insecure defaults.

Use st.form for any widget group that triggers a write operation (database insert, API call, pipeline trigger). Without st.form, each widget change causes a rerun, so the write logic can fire before the user has finished filling in the other fields. st.form batches all widget values and only submits on button click, matching the user's mental model of a form.

Profile memory usage before deploying. @st.cache_data stores one entry per distinct set of arguments and hands each caller its own copy, so 20 concurrent sessions that each work with a different 500 MB DataFrame need roughly 20 × 500 MB = 10 GB. Use per-query column projection to load only needed columns, Parquet predicate pushdown to reduce scanned rows, and consider serving aggregated data instead of raw records to the dashboard.
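
The memory arithmetic can be kept as a small function for capacity planning. This is a rough upper-bound sketch that assumes every session holds its own copy of the data. estimated_memory_gb is a hypothetical helper, and the 5 MB figure for aggregated data is illustrative only.

```python
def estimated_memory_gb(sessions: int, dataframe_mb: float, copies_per_session: int = 1) -> float:
    """Rough upper bound in decimal GB: every session holds its own copies."""
    return sessions * copies_per_session * dataframe_mb / 1000


print(estimated_memory_gb(20, 500))  # raw records, one copy per session → 10.0
print(estimated_memory_gb(20, 5))    # hypothetical pre-aggregated data → 0.1
```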

Run at least 2 replicas in Kubernetes for zero-downtime rolling deploys. Session state is in-process and not shared between replicas — sticky sessions via nginx.ingress.kubernetes.io/upstream-hash-by: '$remote_addr' ensure a user's session state stays on the same pod throughout their session, avoiding broken wizard state or lost filter selections mid-interaction.

Configure a readinessProbe on Streamlit's built-in health endpoint, /_stcore/health. Kubernetes will not route traffic to a pod that fails the readiness check, preventing users from hitting a pod that is still initialising (model loading, first DB connection). A liveness probe on the same endpoint restarts pods that become unresponsive after startup.

Use st.fragment (introduced as st.experimental_fragment in Streamlit 1.33 and stable as st.fragment since 1.37) for live metric sections that update frequently. Without fragments, auto-refresh reruns the entire script: it re-evaluates every cached function call, re-renders all charts, and causes visible flicker. A fragment rerun re-executes only the decorated function body, which can cut CPU usage substantially (on the order of 60–90%) for dashboards that mix static and live content.

Log user interactions for usage analytics. Wrap key actions in try/except and emit structured events: what query was run, what filters were selected, how long the cache miss took. Streamlit has no built-in analytics — without instrumentation you cannot distinguish which dashboard sections are used versus which exist as technical debt.
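One possible shape for such instrumentation is a context manager that emits one JSON event per action. This sketch uses only the standard library; the logger name and the track helper are hypothetical.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("dashboard.usage")  # hypothetical logger name


@contextmanager
def track(action: str, **fields):
    """Emit one structured event per action, including duration and outcome."""
    start = time.perf_counter()
    event = {"action": action, **fields, "status": "ok"}
    try:
        yield event
    except Exception as exc:
        event["status"] = "error"
        event["error"] = type(exc).__name__
        raise  # the caller still sees the original exception
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps(event, sort_keys=True))
```

A call site might look like `with track("run_query", filters={"region": "eu"}): ...`, wrapping the query or cache-miss path whose usage you want to measure.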

Does your data engineering team share pipeline health as Jupyter notebooks that only run on the author’s machine? Do analysts wait for CSV exports to answer ad-hoc questions? Does your Streamlit app crash under concurrent sessions because it loads raw DataFrames into memory without caching?

We build and deploy production Streamlit data applications end to end. That covers project structure and DuckDB query layer design with st.cache_resource connection singletons and st.cache_data TTL configuration; multi-page app layout with shared session state across pages; Plotly and Altair chart integration with dark theme overrides; st.form patterns for safe database write operations; st.fragment live monitoring sections with per-section refresh intervals; MLflow experiment and model registry integration for inference dashboards; Docker multi-stage builds with non-root users and health endpoints; Kubernetes Deployment manifests with Secret volume mounts for credentials; readiness and liveness probe configuration; nginx sticky session ingress annotations for multi-replica deployments; and production monitoring for memory usage and cache hit rates. Let’s talk.
