Monte Carlo, Data Observability, Data Quality, Anomaly Detection, Data Lineage, Data Engineering, Monitoring, SLA, dbt, Airflow

Data Observability with Monte Carlo — Anomaly Detection, Lineage, and SLAs

A practical guide to data observability with Monte Carlo. It covers the five pillars of data health (freshness, volume, distribution, schema, and lineage) and ML-based anomaly detection that trains per-table time-series models accounting for daily and weekly seasonality, with sensitivity controls from low (1.5σ) to high (0.25σ) for tuning alert precision against recall. It walks through custom metric monitors built with the pycarlo SDK for threshold rules such as minimum row counts, and field health monitors for non-negative revenue columns. Field-level lineage is constructed from warehouse query history by parsing SQL SELECT expressions across CTEs, subqueries, UNION statements, and window functions, linking raw source columns through transformation layers to BI dashboards and reverse ETL targets. Data SLAs are configured with ISO 8601 breach windows for freshness and volume agreements, with PagerDuty and Slack routing. Integrations include dbt artifact ingestion via the pycarlo dbt helper, which uploads manifest.json and run_results.json to enrich the lineage graph with model owner, tags, and test outcomes; an Airflow MonteCarloIncidentCallback on_failure_callback that creates incidents linked to downstream table lineage when DAG tasks fail; and circuit breaker sensors that block downstream DAG execution by polling Monte Carlo table health status until freshness, volume, and field health checks all pass. Warehouse setup covers Snowflake role creation with IMPORTED PRIVILEGES on the SNOWFLAKE database for metadata-only access and a BigQuery Terraform service account with resourceViewer and dataViewer IAM bindings. The guide closes with a 10-point production checklist: minimum-permission service account configuration, field-level lineage rollout strategy for high-query-volume warehouses, per-table sensitivity calibration, business-aligned SLA window configuration independent of pipeline schedules, unconditional dbt artifact upload, domain-based notification routing with severity tiers, an incident response runbook anchored on lineage-first root cause analysis, circuit breaker timeout modeling against end-to-end SLA windows, monthly coverage review for newly created tables, and MTTD and MTTR tracking as primary observability program KPIs.

2026-07-02

Why Traditional Data Monitoring Falls Short

Infrastructure monitoring — CPU, memory, query latency, job success rates — tells you whether your pipelines ran, not whether the data they produced is correct. A pipeline can complete successfully and write 40% fewer rows than yesterday, silently drop a dimension value, or calculate revenue figures an order of magnitude too high due to a fanout join. None of these anomalies trigger an infrastructure alert.

Data observability closes this gap by monitoring the data itself: its freshness, volume, statistical distributions, schema structure, and lineage. Monte Carlo operationalizes these five pillars with ML-based anomaly detection that learns normal behavior per table rather than requiring you to hand-configure thresholds for thousands of tables. The data quality observability guide covers complementary approaches including dbt tests, Great Expectations contracts, and freshness SLAs that integrate alongside Monte Carlo — the two layers catch different failure modes and are most effective together.

Anomaly Detection

ML models trained per table detect unusual row counts, null rates, and value distributions without manual threshold configuration. Sensitivity is tunable from low (fewer alerts) to high (catch subtle shifts).

End-to-End Lineage

Automated lineage traverses SQL query logs to build a directed acyclic graph from raw sources through transformation layers to dashboards and reverse ETL destinations, enabling root cause analysis in seconds.

Data SLAs

Define freshness, volume, and quality SLAs per table or domain. Breach events route to on-call via PagerDuty or Slack with the full lineage impact scope pre-attached to the incident.

The Five Pillars of Data Observability

Monte Carlo's observability model covers five dimensions that together provide complete coverage of data health. Each pillar is monitored automatically after warehouse connection, with no manual schema annotation required.

Freshness

Detects when a table stops being updated on its expected schedule. Monte Carlo learns the update cadence of each table from historical metadata — a table that normally updates every hour triggers an alert if it hasn't been written to in 90 minutes. Freshness monitoring does not require query access — it reads information_schema.tables last_modified timestamps and BigQuery INFORMATION_SCHEMA.TABLE_METADATA_TIMELINE, making it zero-cost and zero-permission beyond metadata access.
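The cadence check described above can be sketched in a few lines of Python. This is an illustrative approximation, not Monte Carlo's actual model: it assumes the expected interval is the median gap between past updates and that a table counts as stale once the current gap exceeds 1.5 times that interval. The `is_stale` function and its `tolerance` parameter are hypothetical names.

```python
from datetime import datetime, timedelta
from statistics import median


def is_stale(update_times, now, tolerance=1.5):
    """Return True when the gap since the last update exceeds
    tolerance x the median interval between past updates."""
    if len(update_times) < 2:
        raise ValueError("need at least two update timestamps to learn a cadence")
    ordered = sorted(update_times)
    intervals = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    expected = median(intervals)
    gap = (now - ordered[-1]).total_seconds()
    return gap > tolerance * expected


# A table that has been written once per hour for a day (synthetic timestamps)
start = datetime(2026, 7, 1, 0, 0)
history = [start + timedelta(hours=h) for h in range(24)]
last = history[-1]

print(is_stale(history, last + timedelta(minutes=60)))  # False: still on schedule
print(is_stale(history, last + timedelta(minutes=95)))  # True: past the 90-minute bound
```

Using the median rather than the mean keeps one unusually long gap in the history from stretching the learned cadence.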

Volume

Tracks row count per table update and alerts when the count deviates beyond learned bounds. Volume anomalies catch full-load jobs that only partially completed, upstream source truncations, or partition filter bugs that cause double-counting. The detector is trained on rolling 14-day history and accounts for day-of-week seasonality — a table that always has fewer rows on weekends won't trigger weekend alerts.

Distribution

Monitors statistical properties of each column: null rate, cardinality (number of distinct values), numeric mean and standard deviation for numeric fields, and top value frequency shifts for categorical fields. Distribution monitors catch upstream data source changes — a partner changing an enum value, an ETL bug setting a column to null, or a PII scrubbing job running against the wrong environment.
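To make the column statistics concrete, here is a minimal sketch that profiles a list of sampled values using only the Python standard library. The `profile_column` function and the keys it returns are hypothetical; Monte Carlo computes its own metrics inside the platform, and this only shows what each statistic means.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values):
    """Compute null rate and cardinality, plus mean/stddev for numeric
    columns or top-value frequency for categorical ones."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    profile = {
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "cardinality": len(set(non_null)),
    }
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if non_null and len(numeric) == len(non_null):
        profile["mean"] = mean(numeric)
        profile["stddev"] = pstdev(numeric)
    elif non_null:
        top_value, top_count = Counter(non_null).most_common(1)[0]
        profile["top_value"] = top_value
        profile["top_frequency"] = top_count / len(non_null)
    return profile


print(profile_column([10, 20, None, 30]))
print(profile_column(["paid", "paid", "refunded", None]))
```

Comparing two such profiles from consecutive loads is enough to spot a jump in null rate or a new dominant enum value.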

Schema

Tracks DDL changes: column additions, removals, and type changes. Schema alerts fire before downstream consumers break — a downstream dashboard or dbt model referencing a renamed column gets an incident before it surfaces as a dashboard error. Schema lineage shows exactly which downstream assets reference each changed column.
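A schema change check reduces to comparing two snapshots of column names and types. The sketch below assumes each snapshot is a plain `{column: type}` dictionary; the `diff_schema` helper and the example column names are hypothetical.

```python
def diff_schema(before, after):
    """Compare two {column: type} snapshots and report DDL-level changes."""
    shared = before.keys() & after.keys()
    return {
        "added": {c: t for c, t in after.items() if c not in before},
        "removed": {c: t for c, t in before.items() if c not in after},
        "type_changed": {
            c: (before[c], after[c]) for c in shared if before[c] != after[c]
        },
    }


yesterday = {"order_id": "STRING", "amount": "INT64", "created_at": "TIMESTAMP"}
today = {"order_id": "STRING", "amount_cents": "INT64", "created_at": "DATETIME"}

print(diff_schema(yesterday, today))
```

A snapshot diff cannot distinguish a rename from a drop followed by an add, so the renamed `amount` column shows up once under `removed` and once under `added`.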

Lineage

Constructs the full data dependency graph by parsing SQL query logs from the warehouse query history. No instrumentation is required — Monte Carlo reads BigQuery INFORMATION_SCHEMA.JOBS, Snowflake QUERY_HISTORY, or Redshift STL_QUERY to extract table references and builds directed edges. The resulting graph links raw source tables → staging transforms → mart tables → BI dashboards and reverse ETL targets.
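Once table references have been extracted from query logs, the lineage graph is a set of directed edges, and impact analysis is a graph traversal. The sketch below assumes the edges are already available as `(source, target)` pairs and skips the SQL parsing step entirely; the helper names and the dashboard and reverse ETL asset names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Turn (source, target) pairs into an adjacency map."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def downstream(graph, start):
    """Breadth-first walk returning every asset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


edges = [
    ("raw.stripe_charges", "stg_orders"),
    ("stg_orders", "fct_orders"),
    ("fct_orders", "dashboard:revenue_overview"),
    ("fct_orders", "reverse_etl:salesforce_sync"),
]
graph = build_graph(edges)
print(sorted(downstream(graph, "raw.stripe_charges")))
```

Reversing the edge direction before calling `downstream` gives the upstream walk used for root cause analysis.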

Warehouse Connection and Permissions

Monte Carlo connects to your warehouse via a dedicated service account with read-only metadata permissions. For Snowflake, this means granting access to ACCOUNT_USAGE and INFORMATION_SCHEMA. For BigQuery, the service account needs bigquery.jobs.listAll, bigquery.tables.getIamPolicy, and bigquery.tables.list at the project level. The collector agent runs inside your VPC or cloud environment, and Monte Carlo does not pull raw data: only metadata and query logs transit to Monte Carlo's SaaS backend.

-- Snowflake: create the Monte Carlo service role
CREATE ROLE montecarlo_role;

-- Metadata access for freshness and schema monitoring
GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE montecarlo_role;

-- Table sampling access for distribution monitoring
-- (only needed if enabling field-level health monitors)
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE montecarlo_role;
GRANT USAGE ON DATABASE analytics TO ROLE montecarlo_role;
GRANT USAGE ON ALL SCHEMAS IN DATABASE analytics TO ROLE montecarlo_role;
GRANT SELECT ON ALL TABLES IN DATABASE analytics TO ROLE montecarlo_role;
GRANT SELECT ON FUTURE TABLES IN DATABASE analytics TO ROLE montecarlo_role;

-- Assign to service account user
CREATE USER montecarlo_svc
  PASSWORD = '<rotate_via_vault>'
  DEFAULT_ROLE = montecarlo_role
  MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE montecarlo_role TO USER montecarlo_svc;

# BigQuery: Terraform service account and IAM bindings
resource "google_service_account" "montecarlo" {
  account_id   = "montecarlo-collector"
  display_name = "Monte Carlo Observability Collector"
}

# Metadata and lineage (query log) access
resource "google_project_iam_member" "montecarlo_jobs" {
  project = var.project_id
  role    = "roles/bigquery.resourceViewer"
  member  = "serviceAccount:${google_service_account.montecarlo.email}"
}

resource "google_project_iam_member" "montecarlo_jobs_list" {
  project = var.project_id
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.montecarlo.email}"
}

# Field-level health monitoring (table sampling)
resource "google_project_iam_member" "montecarlo_data_viewer" {
  project = var.project_id
  role    = "roles/bigquery.dataViewer"
  member  = "serviceAccount:${google_service_account.montecarlo.email}"
}

Note

Monte Carlo uses sampling, not full table scans, for distribution monitoring. For a 500M-row table, the collector samples 10,000–100,000 rows proportionally across partitions. This keeps warehouse query costs negligible — typically under $10/month in BigQuery slot-seconds for a 200-table deployment. You can configure the sampling rate and exclude sensitive tables from field-level health monitoring entirely without losing freshness or volume coverage.
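Proportional sampling across partitions can be illustrated with a small allocation function. How the collector actually distributes its sample is not specified here, so this sketch assumes a largest-remainder allocation; the `allocate_sample` name and the partition keys are hypothetical.

```python
def allocate_sample(partition_rows, sample_size):
    """Split sample_size across partitions in proportion to their row
    counts, using the largest-remainder method so the total is exact."""
    total = sum(partition_rows.values())
    if total == 0:
        return {p: 0 for p in partition_rows}
    sample_size = min(sample_size, total)
    exact = {p: n * sample_size / total for p, n in partition_rows.items()}
    alloc = {p: int(x) for p, x in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand out the rows lost to rounding, largest fractional part first
    by_remainder = sorted(exact, key=lambda p: exact[p] - alloc[p], reverse=True)
    for p in by_remainder[:leftover]:
        alloc[p] += 1
    return alloc


partitions = {
    "2026-06-29": 200_000_000,
    "2026-06-30": 250_000_000,
    "2026-07-01": 50_000_000,
}
print(allocate_sample(partitions, 10_000))
```

For the 500M-row example this yields 4,000, 5,000, and 1,000 sampled rows per partition, so small partitions are still represented without dominating the sample.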

ML-Based Anomaly Detection and Custom Monitors

Monte Carlo's automated monitors train a time-series model per table per metric (row count, null rate, mean, etc.) using 14 days of history. The model accounts for daily and weekly seasonality, trending growth curves, and known incident periods that should be excluded from training. Sensitivity controls the width of the anomaly band: low sensitivity (1.5σ equivalent) produces fewer alerts with higher confidence; high sensitivity (0.25σ equivalent) catches subtle shifts but generates more noise for tables with inherently variable data.
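The relationship between sensitivity and band width can be shown with a deliberately simple stand-in for the real model. The sketch below assumes a per-weekday mean and standard deviation learned from 14 days of synthetic row counts, and flags a value when it falls outside `sigma` standard deviations. Monte Carlo's actual detector is a trained time-series model; `weekday_baselines` and `is_anomalous` are hypothetical names.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, pstdev


def weekday_baselines(history):
    """history: list of (date, row_count). Returns {weekday: (mean, stddev)}."""
    buckets = defaultdict(list)
    for day, count in history:
        buckets[day.weekday()].append(count)
    return {wd: (mean(c), pstdev(c)) for wd, c in buckets.items()}


def is_anomalous(baselines, day, value, sigma):
    """Flag value when it lies more than sigma stddevs from that weekday's mean."""
    center, spread = baselines[day.weekday()]
    return abs(value - center) > sigma * spread


# 14 days of synthetic history: weekdays around 1,000 rows, weekends around 400
start = date(2026, 6, 15)  # a Monday
history = []
for offset in range(14):
    day = start + timedelta(days=offset)
    base = 400 if day.weekday() >= 5 else 1000
    bump = 0 if offset < 7 else 40
    history.append((day, base + bump))

baselines = weekday_baselines(history)
monday = date(2026, 6, 29)
saturday = date(2026, 7, 4)

print(is_anomalous(baselines, monday, 1045, sigma=1.5))   # False: inside the wide band
print(is_anomalous(baselines, monday, 1045, sigma=0.5))   # True: outside a tighter band
print(is_anomalous(baselines, saturday, 430, sigma=1.5))  # False: normal for a weekend
```

The same 1,045-row Monday passes with a wide band and alerts with a narrow one, which is the precision versus recall trade-off the sensitivity setting controls.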

Custom monitors supplement automated detection for business-logic rules the ML model cannot infer — a revenue column that must never go negative, an orders table that must have at least 1000 rows by 9 AM UTC, or a referential integrity check between two tables. Data pipeline testing covers the complementary layer of contract-based validation with Great Expectations and schema registries that run at pipeline execution time rather than after the fact — combining both gives you shift detection and rule enforcement.

# Monte Carlo Python SDK — create custom monitors programmatically
# pip install pycarlo

from pycarlo.core import Client, Session

session = Session(mcd_id="your-mcd-id", mcd_token="your-mcd-token")
client = Client(session=session)

# --- Metric-based monitor: row count must be >= 1000 by 9 AM UTC ---
create_monitor_mutation = """
mutation CreateCustomMetricMonitor(
  $dwId: UUID!,
  $fullTableId: String!,
  $description: String!,
  $monitorType: String!,
  $aggTimeInterval: String,
  $aggSelect: String,
  $whereCondition: String,
  $comparisons: [MetricMonitorComparisonInput!]!
) {
  createOrUpdateMonitor(
    dwId: $dwId,
    fullTableId: $fullTableId,
    description: $description,
    monitorType: $monitorType,
    aggTimeInterval: $aggTimeInterval,
    aggSelect: $aggSelect,
    whereCondition: $whereCondition,
    comparisons: $comparisons
  ) {
    monitor { uuid }
  }
}
"""

response = client(
    create_monitor_mutation,
    variables={
        "dwId": "your-warehouse-uuid",
        "fullTableId": "analytics:production.orders",
        "description": "Orders table must have >= 1000 rows by 09:00 UTC",
        "monitorType": "CUSTOM_METRIC",
        "aggTimeInterval": "DAY",
        "aggSelect": "COUNT(*)",
        "whereCondition": "created_at >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)",
        "comparisons": [
            {
                "comparisonType": "THRESHOLD",
                "operator": "GTE",
                "threshold": 1000,
                "isThresholdRelative": False,
            }
        ],
    },
)
print(response["createOrUpdateMonitor"]["monitor"]["uuid"])

# --- Field health monitor: revenue must never be negative ---
field_monitor_mutation = """
mutation {
  createOrUpdateMonitor(
    dwId: "your-warehouse-uuid",
    fullTableId: "analytics:production.orders",
    description: "Revenue column must be non-negative",
    monitorType: "FIELD_HEALTH",
    fields: ["revenue_usd"],
    comparisons: [{
      comparisonType: "THRESHOLD",
      metric: "MIN",
      operator: "GTE",
      threshold: 0
    }]
  ) { monitor { uuid } }
}
"""
client(field_monitor_mutation)

# Query active incidents via the Monte Carlo API (reuses the client created above)
list_incidents_query = """
query ListIncidents($first: Int, $after: String) {
  getIncidents(first: $first, after: $after) {
    edges {
      node {
        id
        status
        severity
        title
        createdTime
        tables {
          fullTableId
        }
        anomalies {
          dimension
          metric
          value
          expectedValue
        }
      }
    }
    pageInfo { endCursor hasNextPage }
  }
}
"""

page_cursor = None
all_incidents = []

while True:
    response = client(
        list_incidents_query,
        variables={"first": 50, "after": page_cursor},
    )
    data = response["getIncidents"]
    all_incidents.extend(data["edges"])
    if not data["pageInfo"]["hasNextPage"]:
        break
    page_cursor = data["pageInfo"]["endCursor"]

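# Filter to open, high-severity incidents touching the revenue domain.
# Assumption: severity values follow the SEV1/SEV2/SEV3 tiers used in this guide;
# adjust the set if your account reports different labels.
HIGH_SEVERITY = {"SEV1", "SEV2"}
open_critical = [
    e["node"] for e in all_incidents
    if e["node"]["status"] == "ACTIVE"
    and e["node"]["severity"] in HIGH_SEVERITY
    and any("revenue" in t["fullTableId"] for t in e["node"]["tables"])
]
print(f"Open high-severity incidents in revenue domain: {len(open_critical)}")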

Field-Level Lineage — Tracing Data from Source to Dashboard

Table-level lineage tells you that Table B depends on Table A. Field-level lineage tells you that the revenue_usd column in the fct_orders mart is computed from amount_cents / 100 in stg_orders, which originates from raw.stripe_charges.amount. When a schema change alert fires on the raw source, Monte Carlo surfaces the full downstream blast radius — exactly which dashboards, dbt models, and reverse ETL syncs depend on the changed field — within the incident card.

Monte Carlo extracts field-level lineage by parsing SQL SELECT expressions from query history. It handles CTEs, subqueries, UNION statements, and window functions. For dbt projects, Monte Carlo additionally ingests the dbt manifest to link dbt model names to their compiled SQL and warehouse table identities, giving you lineage annotations that include the dbt model name, owner, and test coverage alongside the SQL path. OpenTelemetry for Data Pipelines covers how distributed tracing complements lineage by capturing pipeline execution spans, enabling cross-system correlation between orchestrator run IDs and warehouse query IDs — linking Monte Carlo incidents to the specific Airflow DAG run that produced the anomalous data.

# Query field-level lineage for a specific column
lineage_query = """
query FieldLineage($fullTableId: String!, $field: String!) {
  getFieldLineage(fullTableId: $fullTableId, field: $field) {
    upstream {
      tableId
      field
      queryHash
    }
    downstream {
      tableId
      field
      assetType   # TABLE, BI_DASHBOARD, REVERSE_ETL
    }
  }
}
"""

result = client(
    lineage_query,
    variables={
        "fullTableId": "analytics:production.fct_orders",
        "field": "revenue_usd",
    },
)

lineage = result["getFieldLineage"]
print("Upstream sources:")
for node in lineage["upstream"]:
    print(f"  {node['tableId']}.{node['field']}")

print("\nDownstream consumers:")
for node in lineage["downstream"]:
    print(f"  [{node['assetType']}] {node['tableId']}.{node['field']}")

Data SLAs — Freshness, Volume, and Quality Agreements

Data SLAs in Monte Carlo define the expected state of a table at a specific time — the orders mart must be fresh by 07:00 UTC or the revenue table must have at least 500 rows per daily partition. SLA breaches generate incidents independently of anomaly detection, ensuring that pipeline delays surface as actionable alerts even when the eventual data is statistically normal.

# Monte Carlo SLA configuration via the pycarlo SDK
create_sla_mutation = """
mutation CreateSla(
  $dwId: UUID!,
  $fullTableId: String!,
  $slaType: String!,
  $slaConfig: SlaConfigInput!
) {
  createSla(
    dwId: $dwId,
    fullTableId: $fullTableId,
    slaType: $slaType,
    slaConfig: $slaConfig
  ) { sla { uuid slaType } }
}
"""

# Freshness SLA: orders mart must update by 07:00 UTC daily
client(
    create_sla_mutation,
    variables={
        "dwId": "your-warehouse-uuid",
        "fullTableId": "analytics:production.fct_orders",
        "slaType": "FRESHNESS",
        "slaConfig": {
            "breachRules": [
                {
                    "windowDuration": "PT7H",   # breach if not updated by 07:00 UTC
                    "windowOffset": "PT0H",      # relative to midnight UTC
                    "timezone": "UTC",
                }
            ],
            "notificationChannels": [
                {"channelType": "SLACK", "channelId": "C0123DATATEAM"},
                {"channelType": "PAGERDUTY", "integrationKey": "pd-orders-oncall"},
            ],
        },
    },
)

# Volume SLA: at least 10,000 rows per daily partition
client(
    create_sla_mutation,
    variables={
        "dwId": "your-warehouse-uuid",
        "fullTableId": "analytics:production.fct_orders",
        "slaType": "VOLUME",
        "slaConfig": {
            "breachRules": [
                {
                    "threshold": 10000,
                    "operator": "GTE",
                    "windowDuration": "P1D",
                }
            ],
            "notificationChannels": [
                {"channelType": "SLACK", "channelId": "C0123DATATEAM"},
            ],
        },
    },
)

Note

SLA windows use ISO 8601 duration strings. PT7H means 7 hours, P1D means 1 day. For tables that update on a business-day schedule only, configure the SLA with weekday-only breach windows by setting activeDays: ["MONDAY","TUESDAY","WEDNESDAY","THURSDAY","FRIDAY"] in the breach rule to avoid weekend false positives.
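To see how a duration string turns into a concrete deadline, the sketch below parses a subset of ISO 8601 durations and evaluates a daily freshness window. It rests on two assumptions: the breach deadline is midnight UTC plus the offset plus the window duration, and a breach means the table has not been updated since midnight once the deadline has passed. The function names are hypothetical and this is not how Monte Carlo evaluates SLAs internally.

```python
import re
from datetime import date, datetime, timedelta, timezone

# Supports days, hours, minutes, and seconds only (e.g. P1D, PT7H, PT30M)
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)
WEEKDAYS = ("MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
            "FRIDAY", "SATURDAY", "SUNDAY")


def parse_duration(text):
    """Convert an ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(text)
    parts = {}
    if match:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"unsupported ISO 8601 duration: {text!r}")
    return timedelta(**parts)


def breach_deadline(day, window_duration, window_offset="PT0H"):
    """Deadline = midnight UTC on `day` + offset + window duration."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return midnight + parse_duration(window_offset) + parse_duration(window_duration)


def is_freshness_breached(last_update, now, window_duration, active_days=WEEKDAYS):
    """True when the deadline has passed on an active day with no update today."""
    now = now.astimezone(timezone.utc)
    if WEEKDAYS[now.weekday()] not in active_days:
        return False
    midnight = datetime(now.year, now.month, now.day, tzinfo=timezone.utc)
    return now >= breach_deadline(now.date(), window_duration) and last_update < midnight


business_days = ("MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY")
last_update = datetime(2026, 7, 1, 4, 0, tzinfo=timezone.utc)
thursday = datetime(2026, 7, 2, 7, 30, tzinfo=timezone.utc)
saturday = datetime(2026, 7, 4, 7, 30, tzinfo=timezone.utc)

print(breach_deadline(date(2026, 7, 2), "PT7H"))
print(is_freshness_breached(last_update, thursday, "PT7H", business_days))  # True
print(is_freshness_breached(last_update, saturday, "PT7H", business_days))  # False
```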

dbt Integration — Model Lineage, Tests, and Freshness Bridging

Monte Carlo's dbt integration ingests the compiled manifest.json and run_results.json artifacts from your CI/CD pipeline. This enriches the lineage graph with dbt model metadata — model name, owner (from meta.owner), tags, description, and dbt test results — and links dbt model identities to their warehouse table identities for unified incident attribution.

# Upload dbt artifacts after each dbt run in CI
# Install: pip install 'pycarlo[dbt]'

from pycarlo.features.dbt import DbtHelper

helper = DbtHelper(
    mcd_id="your-mcd-id",
    mcd_token="your-mcd-token",
)

# Upload manifest and run_results after dbt run
helper.upload(
    manifest_path="target/manifest.json",
    run_results_path="target/run_results.json",
    project_name="analytics",       # matches your dbt_project.yml name
    environment="production",       # prod vs staging namespace
)

print("dbt artifacts uploaded to Monte Carlo")

# GitHub Actions: upload dbt artifacts to Monte Carlo after production run
name: dbt Production Run

on:
  schedule:
    - cron: "0 4 * * *"   # 04:00 UTC daily
  workflow_dispatch:

jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install dbt-bigquery "pycarlo[dbt]"

      - name: Run dbt
        env:
          DBT_PROFILES_DIR: .
          GOOGLE_APPLICATION_CREDENTIALS_JSON: ${{ secrets.GCP_SA_KEY }}
        run: |
          echo "$GOOGLE_APPLICATION_CREDENTIALS_JSON" > /tmp/sa.json
          export GOOGLE_APPLICATION_CREDENTIALS=/tmp/sa.json
          dbt run --target prod --select tag:daily
          dbt test --target prod --select tag:daily

      - name: Upload artifacts to Monte Carlo
        if: always()   # run even when dbt test fails, so failed test results are uploaded too
        env:
          MCD_ID: ${{ secrets.MONTE_CARLO_MCD_ID }}
          MCD_TOKEN: ${{ secrets.MONTE_CARLO_MCD_TOKEN }}
        run: |
          python - <<'EOF'
          from pycarlo.features.dbt import DbtHelper
          import os
          helper = DbtHelper(
              mcd_id=os.environ["MCD_ID"],
              mcd_token=os.environ["MCD_TOKEN"],
          )
          helper.upload(
              manifest_path="target/manifest.json",
              run_results_path="target/run_results.json",
              project_name="analytics",
              environment="production",
          )
          EOF

Airflow Integration — Tagging Failed Runs as Incidents

Monte Carlo's Airflow provider package installs an on_failure_callback that creates a Monte Carlo incident when a DAG task fails, linking the incident to all downstream tables in the lineage graph. This bridges the pipeline execution layer and the data observability layer — when an operator queries the incident list, they see both the Airflow task failure and any resulting data anomalies as a unified incident thread.

# pip install apache-airflow-providers-monte-carlo

from datetime import datetime
from airflow.decorators import dag, task
from monte_carlo_provider.operators.monte_carlo import MonteCarloIncidentCallback

# Attach the callback at DAG level — applies to all tasks
@dag(
    schedule="0 3 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args={
        "on_failure_callback": MonteCarloIncidentCallback(
            mcd_id="{{ var.value.monte_carlo_mcd_id }}",
            mcd_token="{{ var.value.monte_carlo_mcd_token }}",
            incident_severity="SEV2",
            affected_table_ids=[
                "analytics:production.fct_orders",
                "analytics:production.fct_revenue",
            ],
        ),
    },
)
def daily_revenue_pipeline():

    @task
    def extract_stripe_charges():
        # ... extract logic
        return {"rows": 12500}

    @task
    def transform_orders(extract_result: dict):
        # ... transformation logic
        pass

    @task
    def load_revenue_mart(transform_result):
        # ... load logic
        pass

    load_revenue_mart(transform_orders(extract_stripe_charges()))

daily_revenue_pipeline()

Note

The Monte Carlo Airflow callback creates incidents asynchronously — it does not block task retry logic. If a task fails and retries successfully, Monte Carlo receives both the failure event and, optionally, a resolution event if you configure the on_success_callback to call the resolution endpoint. Without resolution callbacks, incidents must be closed manually or auto-close after the downstream table next passes its freshness and volume checks.

Circuit Breakers — Stopping Downstream Consumption of Bad Data

A circuit breaker is a monitor that pauses downstream pipeline execution when an anomaly is detected, preventing bad data from propagating to BI tools, reverse ETL targets, or ML feature stores. In the pattern shown here, an Airflow sensor polls Monte Carlo's table health status and blocks the downstream DAG until the upstream table passes its checks or the sensor times out.

# Airflow circuit breaker pattern using the Monte Carlo table health sensor
from monte_carlo_provider.sensors.monte_carlo import MonteCarloTableHealthSensor
from airflow.decorators import dag, task
from datetime import datetime

@dag(
    schedule="0 8 * * *",   # runs after the upstream revenue pipeline (03:00 UTC)
    start_date=datetime(2026, 1, 1),
    catchup=False,
)
def downstream_revenue_export():

    # Block until Monte Carlo confirms fct_orders is healthy
    wait_for_healthy_orders = MonteCarloTableHealthSensor(
        task_id="wait_for_fct_orders_health",
        mcd_id="{{ var.value.monte_carlo_mcd_id }}",
        mcd_token="{{ var.value.monte_carlo_mcd_token }}",
        full_table_id="analytics:production.fct_orders",
        poke_interval=300,    # check every 5 minutes
        timeout=7200,         # fail after 2 hours if still unhealthy
        mode="reschedule",    # release worker between checks
        # health_check_types defines which monitors must pass
        health_check_types=["FRESHNESS", "VOLUME", "FIELD_HEALTH"],
    )

    @task
    def export_to_salesforce():
        # ... Salesforce reverse ETL
        pass

    wait_for_healthy_orders >> export_to_salesforce()

downstream_revenue_export()
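The sensor's timeout has to leave room for the downstream work to finish before its own deadline. A rough way to check this, assuming the worst case is the sensor waiting out its full timeout before the export starts, is to subtract timeout plus runtime from the SLA window. The `worst_case_slack` helper and the runtime figures below are hypothetical.

```python
from datetime import timedelta


def worst_case_slack(sensor_timeout, downstream_runtime, sla_window):
    """Time left before the SLA deadline if the sensor waits its full timeout.

    A negative result means the downstream run cannot finish in time.
    """
    return sla_window - (sensor_timeout + downstream_runtime)


# 2-hour timeout (as in the sensor above), 45-minute export, 8-hour window
safe = worst_case_slack(
    timedelta(hours=2), timedelta(minutes=45), timedelta(hours=8)
)
# 4-hour timeout, 90-minute export, 5-hour window
risky = worst_case_slack(
    timedelta(hours=4), timedelta(minutes=90), timedelta(hours=5)
)

print(safe.total_seconds() / 60)   # 315.0 minutes of headroom
print(risky.total_seconds() / 60)  # -30.0: the deadline is missed by half an hour
```

Running this check whenever a sensor timeout or SLA window changes catches configurations where a healthy-but-late upstream table still causes a downstream breach.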

Production Checklist

1

Grant only the minimum required permissions to the Monte Carlo service account and rotate credentials quarterly. For Snowflake, create a dedicated role with IMPORTED PRIVILEGES on the SNOWFLAKE database for metadata access and SELECT on monitored databases for field-level health. Avoid granting ACCOUNTADMIN — Monte Carlo requires no DDL or DML permissions. Store the service account credentials in your secrets manager (Vault, AWS Secrets Manager) and supply them to Monte Carlo via the API rather than pasting them in the UI permanently.

2

Enable field-level lineage on your most critical tables first, not the entire warehouse. Field-level lineage parsing is CPU-intensive for Monte Carlo's backend — if your warehouse query history has millions of queries per day (common in Snowflake and BigQuery enterprise accounts), start with the 20–30 tables that generate the most incident volume and expand coverage incrementally. You can configure table-level lineage only for less critical paths to reduce processing overhead.

3

Calibrate anomaly detector sensitivity per table based on inherent variability. Tables with high natural variance — event tables, clickstream data, external API syncs — benefit from low sensitivity (0.5–1.5σ) to avoid alert fatigue. Tables with stable row counts and strict business rules — dimension tables, billing records, compliance exports — should use high sensitivity (0.25–0.5σ) to catch subtle shifts early. The default medium sensitivity is a reasonable starting point but will miss real anomalies on stable tables and generate noise on volatile ones.

4

Configure SLA breach windows to match business SLA agreements, not technical pipeline schedules. If the data team's SLA to the business is 'revenue data is available by 08:00 UTC', set the Monte Carlo SLA breach window to 08:00 UTC regardless of whether the pipeline runs at 04:00. This aligns observability with the commitment rather than the implementation, and ensures that pipeline delays that still meet the business SLA do not generate unnecessary incidents.

5

Upload dbt artifacts on every production run, not just failed runs. Monte Carlo uses run_results.json to track dbt test outcomes and model execution times alongside warehouse data health — if you only upload on failure, Monte Carlo loses the test history needed to correlate data anomalies with specific dbt test regressions. Integrate the pycarlo dbt upload step as the final step of your production dbt CI job, running unconditionally regardless of dbt test outcome.

6

Set up notification routing by domain rather than broadcasting all incidents to a single channel. Use Monte Carlo's domain concept to group tables by owning team (revenue, growth, platform, data-infra) and route incidents to the team's Slack channel and PagerDuty service. A revenue anomaly should page the revenue analytics team, not wake up the platform engineering on-call. Configure severity tiers: SEV1 (breach of revenue SLA) → PagerDuty immediate, SEV2 (anomaly on critical mart) → Slack alert, SEV3 (anomaly on staging or internal tooling table) → digest only.

7

Define an incident response runbook that includes the Monte Carlo lineage view as the first step. When a data incident is declared, the on-call engineer should open the Monte Carlo incident card, navigate to the lineage graph, and identify the furthest upstream anomaly before touching any pipeline code. Root-causing at the wrong layer — fixing a transformation when the source itself is broken — wastes time and can introduce regressions. Monte Carlo's lineage view collapses the typical root-cause investigation from 30–60 minutes to under 5 minutes for most incidents.

8

Test circuit breaker sensor timeouts to ensure downstream pipelines fail safely within SLA windows. A circuit breaker sensor configured with a 2-hour timeout on an 8-hour downstream SLA window is safe. A 4-hour timeout on a 5-hour downstream window is risky: if the sensor waits out most of its timeout before passing, the downstream DAG may have less than an hour left to finish before its own SLA deadline. Model the worst-case scenario: upstream anomaly detected at T+0, circuit breaker holds for full timeout, downstream starts at T+timeout, downstream completes at T+timeout+runtime. Verify the total is within your end-to-end SLA.

9

Review the Monte Carlo coverage report monthly to identify tables with zero monitors. As your warehouse grows with new pipelines and EL tools (Fivetran, Airbyte, custom Python jobs), new tables appear that are not automatically enrolled in any custom monitors or SLAs. Monte Carlo's UI surfaces tables with no incident history and no configured monitors — these are blind spots in your observability coverage. Schedule a monthly review to enroll critical new tables in at least freshness and volume monitoring.

10

Track mean-time-to-detect (MTTD) and mean-time-to-resolve (MTTR) as the primary KPIs for your data observability program, not alert count. A low MTTD means Monte Carlo is catching anomalies quickly; a low MTTR means your team has the lineage context and runbooks to resolve incidents efficiently. Alert count is a lagging and misleading metric — high alert count may indicate detector miscalibration rather than more data issues. Export incident timestamps from the Monte Carlo API monthly and calculate MTTD and MTTR per domain to track program health over time.
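The MTTD and MTTR calculation from the last checklist item can be sketched as follows. It assumes each exported incident has been reduced to a dictionary with `domain`, `occurred_at`, `detected_at`, and `resolved_at` fields, which are hypothetical names rather than fields returned by the Monte Carlo API, and it measures MTTR from detection to resolution.

```python
from datetime import datetime
from statistics import mean


def program_kpis(incidents):
    """Return {domain: {"mttd_minutes": ..., "mttr_minutes": ...}}."""
    by_domain = {}
    for incident in incidents:
        by_domain.setdefault(incident["domain"], []).append(incident)

    report = {}
    for domain, items in by_domain.items():
        detect = [
            (i["detected_at"] - i["occurred_at"]).total_seconds() / 60 for i in items
        ]
        resolve = [
            (i["resolved_at"] - i["detected_at"]).total_seconds() / 60 for i in items
        ]
        report[domain] = {
            "mttd_minutes": mean(detect),
            "mttr_minutes": mean(resolve),
        }
    return report


# Synthetic incidents for illustration only
incidents = [
    {
        "domain": "revenue",
        "occurred_at": datetime(2026, 6, 3, 2, 0),
        "detected_at": datetime(2026, 6, 3, 2, 20),
        "resolved_at": datetime(2026, 6, 3, 3, 20),
    },
    {
        "domain": "revenue",
        "occurred_at": datetime(2026, 6, 17, 5, 0),
        "detected_at": datetime(2026, 6, 17, 5, 40),
        "resolved_at": datetime(2026, 6, 17, 6, 10),
    },
]
print(program_kpis(incidents))
```

Tracking these two numbers per domain month over month shows whether detector calibration and runbooks are improving, independent of how many alerts fired.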

Discovering data quality issues from angry Slack messages by stakeholders instead of automated alerts, unable to answer which dashboards are affected when a source table schema changes, or spending hours tracing a revenue discrepancy through five transformation layers because you have no data lineage?

We design and implement data observability programs with Monte Carlo — from warehouse service account and permission setup for BigQuery, Snowflake, Redshift, and Databricks, through anomaly detector sensitivity calibration per table domain, custom metric and field health monitor creation via the pycarlo SDK, data SLA configuration aligned to business SLA commitments with PagerDuty and Slack on-call routing, dbt artifact upload pipeline integration in GitHub Actions CI, Airflow on_failure_callback wiring for incident attribution, circuit breaker sensor implementation for downstream pipeline protection, incident response runbook authoring with lineage-first root cause workflows, domain-based notification routing configuration, and monthly coverage audits to ensure all critical tables are enrolled in observability coverage. Let’s talk.

