Apache Nessie, Apache Iceberg, Delta Lake, Data Lake, Lakehouse, Data Versioning, Data Engineering, Spark, Trino, GitOps

Apache Nessie — Git-Like Version Control for Data Lakes with Iceberg and Delta Lake

A practical guide to Apache Nessie for data lake version control. It covers the catalog-as-code model, where every table change is a commit on a Merkle hash tree and named references (branches and tags) point to independent table metadata states across Iceberg and Delta Lake tables; version store backends, from RocksDB for development to DynamoDB and MongoDB for production HA deployments; and a Kubernetes Deployment with IRSA for DynamoDB access and OIDC authentication for per-role branch permissions.

On the client side it walks through the pynessie client for programmatic branch creation, merge with dry-run conflict detection, tag-based rollback via assign_branch, and cross-branch table metadata inspection; PySpark session configuration with the Nessie Iceberg catalog connector targeting specific branches, the CREATE BRANCH and SHOW LOG SQL extensions, and per-read branch overrides for cross-branch quality comparisons without session switching; and a Trino dual-catalog setup with separate iceberg_prod and iceberg_staging catalog properties files pointing at the main and staging branches, used for a pre-merge quality gate in SQL that compares row counts, null rates, and aggregate drift between branches.

It then covers a GitHub Actions data lake CI/CD pipeline that creates a PR-scoped Nessie branch, runs dbt models and dbt tests on it, executes the cross-branch quality gate, and atomically merges to main only on a full pass; the PyIceberg catalog API for safe and unsafe schema evolution on feature branches (add_column, rename_column, delete_column, and partition spec evolution that does not rewrite existing data files); a dbt profiles.yml with branch-aware Spark configuration that reads the active branch from vars, with incremental dbt models running unchanged across branch environments; and a 10-point production checklist covering the DynamoDB version store, OIDC RBAC for main branch protection, pre-deploy tagging, branch lifecycle management, GC policy configuration, Prometheus metrics, library version pinning, domain-per-server isolation, conflict detection for long-lived branches, and write-to-main enforcement via access control.

2026-06-20

What Is Apache Nessie and Why Data Lakes Need Git Semantics

Apache Nessie is an open-source transactional catalog for data lakes that brings Git-style branching, tagging, and merging to table metadata. It sits between your compute engines — Spark, Flink, Trino, Dremio — and your object storage, tracking table states as a Merkle tree of content hashes. Every table change (schema evolution, partition add, file append) is a commit. Every branch is an independent head pointer. Merging a branch is an atomic multi-table operation that either succeeds entirely or not at all.

The core problem Nessie solves is the lack of isolation in traditional Hive Metastore environments. When a data engineer runs a backfill job on the production catalog, every downstream query sees partially-rewritten partitions until the job completes. There is no way to develop new table schemas in isolation, validate data quality before publishing, or roll back a bad migration without restoring from S3 versioning. Nessie makes these operations first-class: branch off main, do your work, run your checks, then merge — just like you would with code. This is especially powerful when combined with Apache Iceberg's snapshot-based table format, where table metadata files are immutable and a branch is simply a different pointer to a different snapshot chain.

Isolated Environments

Dev, staging, and prod share the same physical S3 files but have independent table metadata branches. No duplication of data, no environment sprawl.

Multi-Table Transactions

Atomic commits touching dozens of tables at once. Either all tables move to the new state or none do — no partial migrations leaking into downstream queries.

Instant Rollback

Every merge to main is a commit with its own hash, and you can tag the state you want to return to. Rolling back a bad deploy is a pointer update that takes milliseconds, not hours of S3 restore operations.

Nessie Architecture — Catalog, Content Store, and Version Store

Nessie exposes a REST API that any engine with a Nessie catalog connector can call. The server has three internal layers: the REST layer (Quarkus HTTP server), the version store (the Merkle DAG of named references and content hashes), and the content store (the mapping from content IDs to Iceberg table metadata JSON locations or Delta Lake transaction log pointers).

The version store backend is pluggable. For development and small deployments, RocksDB works out of the box. For production, you want DynamoDB or MongoDB for durability and horizontal scaling. Each named reference (branch or tag) stores a HEAD hash. A commit stores a parent hash plus a set of content key-to-hash mappings representing the tables that changed. Conceptually, looking up a table on a branch means walking the commit chain from HEAD until a commit that touched the key is found. Nessie avoids that walk for current-state reads by keeping a key index with its commits, so resolving the latest state of a table is effectively O(1) in the length of the history, while history traversal (for example, reading the commit log) is O(depth).
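To make the reference and commit model concrete, here is a minimal in-memory sketch. It is not Nessie's storage code: `ToyVersionStore`, `Commit`, and the metadata paths are hypothetical names chosen for illustration, and the lookup uses the naive parent walk rather than any key index.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    parent: Optional[str]      # hash of the parent commit (None for the first commit)
    changes: Dict[str, str]    # content key -> metadata pointer, only for tables changed here

    def digest(self) -> str:
        # The digest covers the parent hash, so it identifies the whole history behind it.
        payload = json.dumps({"parent": self.parent, "changes": self.changes}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyVersionStore:
    """Hypothetical in-memory model of named references over a commit chain."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.refs: Dict[str, Optional[str]] = {"main": None}

    def commit(self, branch: str, changes: Dict[str, str]) -> str:
        new_commit = Commit(parent=self.refs[branch], changes=dict(changes))
        commit_hash = new_commit.digest()
        self.commits[commit_hash] = new_commit
        self.refs[branch] = commit_hash      # advance the branch HEAD
        return commit_hash

    def create_branch(self, name: str, source: str) -> None:
        self.refs[name] = self.refs[source]  # copies a pointer, not data

    def lookup(self, ref: str, key: str) -> Optional[str]:
        # Naive resolution: walk parents from HEAD until a commit touched the key.
        commit_hash = self.refs[ref]
        while commit_hash is not None:
            found = self.commits[commit_hash]
            if key in found.changes:
                return found.changes[key]
            commit_hash = found.parent
        return None


store = ToyVersionStore()
store.commit("main", {"analytics.orders": "orders/metadata/v1.metadata.json"})
store.commit("main", {"analytics.customers": "customers/metadata/v1.metadata.json"})
store.create_branch("dev", "main")
store.commit("dev", {"analytics.orders": "orders/metadata/v2.metadata.json"})

print(store.lookup("main", "analytics.orders"))  # orders/metadata/v1.metadata.json
print(store.lookup("dev", "analytics.orders"))   # orders/metadata/v2.metadata.json
```

Creating the branch copies one pointer, which is why branching is cheap regardless of how much data the tables hold.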

# Deploy Nessie with Docker Compose — development setup
# nessie-compose.yaml

version: "3.8"

services:
  nessie:
    image: ghcr.io/projectnessie/nessie:0.99.0
    ports:
      - "19120:19120"
    environment:
      # Version store backend: IN_MEMORY | ROCKSDB | DYNAMODB | MONGODB | JDBC
      NESSIE_VERSION_STORE_TYPE: ROCKSDB
      NESSIE_VERSION_STORE_PERSIST_ROCKS_DATABASE_PATH: /nessie/data/rocks
      # Authentication — disable for local dev, enable in production
      NESSIE_SERVER_AUTHENTICATION_ENABLED: "false"
    volumes:
      - nessie-data:/nessie/data

  # MinIO as local S3-compatible object store
  minio:
    image: minio/minio:RELEASE.2026-01-16T01-44-05Z
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio-data:/data

volumes:
  nessie-data:
  minio-data:

# Start: docker compose -f nessie-compose.yaml up -d
# Nessie UI: http://localhost:19120/tree/main
# API docs: http://localhost:19120/q/swagger-ui

# Production Nessie on Kubernetes with DynamoDB version store
# nessie-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nessie
  namespace: data-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nessie
  template:
    metadata:
      labels:
        app: nessie
    spec:
      serviceAccountName: nessie-sa  # IRSA for DynamoDB access
      containers:
        - name: nessie
          image: ghcr.io/projectnessie/nessie:0.99.0
          ports:
            - containerPort: 19120
          env:
            - name: NESSIE_VERSION_STORE_TYPE
              value: DYNAMODB
            - name: QUARKUS_DYNAMODB_AWS_REGION
              value: us-east-1
            - name: NESSIE_VERSION_STORE_PERSIST_DYNAMODB_TABLE_PREFIX
              value: nessie-prod
            # OIDC authentication — verify tokens against your IdP
            - name: NESSIE_SERVER_AUTHENTICATION_ENABLED
              value: "true"
            - name: QUARKUS_OIDC_AUTH_SERVER_URL
              value: "https://sso.company.com/realms/data-platform"
            - name: QUARKUS_OIDC_CLIENT_ID
              value: nessie
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
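          # Recent Nessie releases serve health checks on the Quarkus
          # management port (9000 by default), not on the API port.
          readinessProbe:
            httpGet:
              path: /q/health/ready
              port: 9000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /q/health/live
              port: 9000
            initialDelaySeconds: 30
            periodSeconds: 10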
---
apiVersion: v1
kind: Service
metadata:
  name: nessie
  namespace: data-platform
spec:
  selector:
    app: nessie
  ports:
    - port: 19120
      targetPort: 19120
  type: ClusterIP

Note

In multi-replica Nessie deployments, all server instances share the same DynamoDB or MongoDB backend, and there is no distributed consensus between Nessie pods. Each request is stateless at the server level; the version store is the source of truth. This makes horizontal scaling straightforward: add replicas behind a load balancer and point them at the same DynamoDB table. Read throughput then scales roughly with the number of replicas until the backend itself becomes the bottleneck, and write consistency is enforced by DynamoDB conditional expressions, so two servers racing to advance the same branch cannot both succeed.
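The conditional-write idea can be sketched as a compare-and-swap on the branch HEAD. This is an illustration only, not Nessie's implementation: `ToyRefTable`, `conditional_put`, and `commit_with_retry` are hypothetical names, and the dictionary stands in for the shared backend table.

```python
from typing import Callable, Dict, Optional


class ConflictError(Exception):
    """The stored HEAD no longer matches the HEAD the writer read."""


class ToyRefTable:
    """Hypothetical stand-in for the shared backend table of reference -> HEAD hash."""

    def __init__(self) -> None:
        self._heads: Dict[str, str] = {}

    def get_head(self, ref: str) -> Optional[str]:
        return self._heads.get(ref)

    def conditional_put(self, ref: str, expected: Optional[str], new_head: str) -> None:
        # Succeeds only if the stored HEAD still equals `expected` (compare-and-swap).
        current = self._heads.get(ref)
        if current != expected:
            raise ConflictError(f"{ref}: expected {expected!r}, found {current!r}")
        self._heads[ref] = new_head


def commit_with_retry(table: ToyRefTable, ref: str,
                      build_commit: Callable[[Optional[str]], str],
                      max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        head = table.get_head(ref)
        new_head = build_commit(head)
        try:
            table.conditional_put(ref, head, new_head)
            return new_head
        except ConflictError:
            continue  # another server won the race; re-read HEAD and try again
    raise ConflictError(f"{ref}: gave up after {max_attempts} attempts")


table = ToyRefTable()
table.conditional_put("main", None, "c1")

stale_head = table.get_head("main")        # server A reads HEAD = c1
table.conditional_put("main", "c1", "c2")  # server B commits first
try:
    table.conditional_put("main", stale_head, "c3")  # server A's write is rejected
    outcome = "accepted"
except ConflictError:
    outcome = "rejected"

retried = commit_with_retry(table, "main", lambda head: f"{head}+a")
```

Because the check and the write happen as one operation in the backend, the servers need no coordination among themselves.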

Branching, Merging, and Multi-Table Transactions

The Nessie Python client (pynessie) exposes the full catalog API: create branches, list tables, read table metadata, commit changes, and merge branches. For programmatic workflows — data pipeline CI, schema migration scripts, quality gate automation — pynessie is the primary interface. For interactive exploration, the nessie CLI mirrors git commands closely enough that any engineer familiar with git can navigate the catalog without reading documentation.

# pip install pynessie pyiceberg

import pynessie
from pynessie import NessieClient

# Connect to Nessie
client = NessieClient(config={"endpoint": "http://nessie:19120/api/v2"})

# ── Branch operations ────────────────────────────────────────────────────────

# List all branches
for ref in client.list_references().references:
    print(f"{ref.type}: {ref.name} @ {ref.hash_[:12]}")
# BRANCH: main @ a3f8b2c1d4e5
# BRANCH: dev-feature-x @ b7c9d1e2f3a4

# Create a new branch from main
client.create_branch("feature/order-schema-v2", "main")

# Create a tag (immutable snapshot) for the current main state
client.create_tag("release/2026-06-20", "main")

# ── Table inspection on a specific branch ──────────────────────────────────

# List all tables on a branch
entries = client.get_all_entries(ref="feature/order-schema-v2")
for entry in entries.entries:
    print(f"{entry.type}: {'.'.join(entry.name.elements)}")

# Get Iceberg table metadata location for a specific table
content = client.get_content(
    ref="feature/order-schema-v2",
    key="analytics.orders",
)
print(f"Metadata location: {content.metadata_location}")
# s3://datalake/analytics/orders/metadata/v3.metadata.json

# ── Merge branch into main ──────────────────────────────────────────────────

# Dry-run merge to check for conflicts before committing
merge_result = client.merge(
    from_branch="feature/order-schema-v2",
    onto_branch="main",
    dry_run=True,
)
for conflict in merge_result.details:
    if conflict.conflict_type:
        print(f"CONFLICT on {conflict.key}: {conflict.conflict_type}")

# Actual merge (atomic — all table metadata updates land together)
if not any(d.conflict_type for d in merge_result.details):
    client.merge(
        from_branch="feature/order-schema-v2",
        onto_branch="main",
        dry_run=False,
        message="feat: expand orders schema with fulfillment fields",
    )
    print("Merge successful — deleting branch")
    client.delete_branch("feature/order-schema-v2")

# ── Rollback by reassigning main to a previous tag ─────────────────────────

# Get the hash of the pre-deployment tag
tag = client.get_reference("release/2026-06-19")

# Assign main to that hash (instant rollback)
client.assign_branch(
    branch="main",
    old_hash=client.get_reference("main").hash_,
    new_hash=tag.hash_,
)
print(f"Rolled back main to {tag.hash_[:12]}")
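What a dry-run merge has to decide can be sketched without a server. The following is a simplified model, assuming conflicts are detected per content key relative to the common ancestor; Nessie's actual merge rules may be more nuanced, and `changed_keys`, `find_conflicts`, and `merge` here are hypothetical helpers, not pynessie calls.

```python
from typing import Dict, List, Set

Snapshot = Dict[str, str]  # content key -> metadata pointer


def changed_keys(base: Snapshot, head: Snapshot) -> Set[str]:
    # Keys added, removed, or repointed between the common ancestor and a branch head.
    return {k for k in set(base) | set(head) if base.get(k) != head.get(k)}


def find_conflicts(base: Snapshot, source: Snapshot, target: Snapshot) -> List[str]:
    # A key conflicts when both branches changed it since the common ancestor.
    return sorted(changed_keys(base, source) & changed_keys(base, target))


def merge(base: Snapshot, source: Snapshot, target: Snapshot) -> Snapshot:
    conflicts = find_conflicts(base, source, target)
    if conflicts:
        raise ValueError(f"conflicting keys: {conflicts}")
    merged = dict(target)
    for key in changed_keys(base, source):
        if key in source:
            merged[key] = source[key]   # table added or repointed on the source branch
        else:
            merged.pop(key, None)       # table dropped on the source branch
    return merged                       # all keys move together or not at all


base = {"analytics.orders": "orders/v1", "analytics.customers": "customers/v1"}
feature = {"analytics.orders": "orders/v2", "analytics.customers": "customers/v1"}
main_ok = {"analytics.orders": "orders/v1", "analytics.customers": "customers/v2"}
main_clash = {"analytics.orders": "orders/v1b", "analytics.customers": "customers/v1"}

print(find_conflicts(base, feature, main_ok))     # []
print(find_conflicts(base, feature, main_clash))  # ['analytics.orders']
```

Note that this sketch treats a key changed on both sides as a conflict even when both sides arrive at the same pointer, which is the conservative choice.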

Spark Integration — Reading and Writing Iceberg Tables via Nessie

Spark uses the Iceberg catalog API to talk to Nessie. The Nessie catalog implementation translates Iceberg metadata operations (load table, commit snapshot, register table) into Nessie REST calls. From a Spark perspective, tables live at nessie.database.table and the active branch is set via a Spark session config. This means you can switch branches mid-session for cross-branch queries — useful for data quality comparisons before merging.

This pairs naturally with a comparison of open table formats: Nessie works with both Iceberg and Delta Lake, but the Iceberg integration is the most mature and widely deployed. Iceberg's metadata layer maps cleanly to Nessie's content model: each committed table state is described by an immutable metadata JSON file on S3, and a Nessie commit stores the pointer to the latest one.

# PySpark session configured for Nessie catalog
# Requires: org.apache.iceberg:iceberg-spark-runtime-3.5_2.12
#           org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nessie-pipeline")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.99.0,"
            "org.apache.hadoop:hadoop-aws:3.3.4")
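    # Enable the Iceberg and Nessie SQL extensions (needed for CREATE BRANCH, USE REFERENCE, SHOW LOG)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register the Nessie catalog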
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")       # default branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://datalake/warehouse")
    # S3 credentials via AWS credentials provider chain
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
    # Nessie authentication (bearer token from OIDC)
    .config("spark.sql.catalog.nessie.authentication.type", "BEARER")
    .config("spark.sql.catalog.nessie.authentication.token", "your-oidc-token")
    .getOrCreate()
)

# ── Work on main branch ─────────────────────────────────────────────────────

spark.sql("CREATE DATABASE IF NOT EXISTS nessie.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS nessie.analytics.orders (
        order_id    STRING NOT NULL,
        customer_id STRING NOT NULL,
        status      STRING,
        amount      DECIMAL(15,2),
        created_at  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
    TBLPROPERTIES (
        'write.format.default' = 'parquet',
        'write.parquet.compression-codec' = 'zstd'
    )
""")

# ── Switch to a feature branch for safe development ─────────────────────────

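# Create the branch and switch the session to it (both are Nessie SQL extensions).
# USE REFERENCE takes effect immediately; changing the catalog conf after the
# catalog has been initialized may not.
spark.sql("CREATE BRANCH feature_fulfillment IN nessie FROM main")
spark.sql("USE REFERENCE feature_fulfillment IN nessie")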

# Schema evolution on the feature branch — does NOT affect main
spark.sql("""
    ALTER TABLE nessie.analytics.orders
    ADD COLUMNS (
        fulfillment_id STRING,
        shipped_at     TIMESTAMP,
        warehouse_code STRING
    )
""")

# Write test data to the branch. SQL literals match the DECIMAL and TIMESTAMP
# column types and satisfy the NOT NULL constraints without extra casts.
spark.sql("""
    INSERT INTO nessie.analytics.orders VALUES (
        'ord-001', 'cust-abc', 'shipped', 99.99,
        TIMESTAMP '2026-06-20 10:00:00',
        'ff-001', TIMESTAMP '2026-06-20 12:00:00', 'WH-EU-01'
    )
""")

# ── Cross-branch comparison ─────────────────────────────────────────────────

# Read from main (original schema) and the feature branch side by side.
# Nessie addresses a table on a specific reference as `table@reference`.
main_df = spark.table("nessie.analytics.`orders@main`")
feature_df = spark.table("nessie.analytics.`orders@feature_fulfillment`")

print(f"main row count: {main_df.count()}")
print(f"feature_fulfillment row count: {feature_df.count()}")

Note

The spark.sql.catalog.nessie.ref config sets the default reference for the SparkSession, and USE REFERENCE <name> IN nessie switches it at runtime. To read a single table from another reference without touching the session, qualify the table name with the reference, for example SELECT * FROM nessie.analytics.`orders@feature_fulfillment`. This makes it straightforward to run data quality checks comparing a feature branch to main without changing any session config.

Trino Integration — Query Federation Across Branches

Trino supports Nessie as an Iceberg catalog backend through the iceberg connector with iceberg.catalog.type=nessie. Unlike Spark, Trino does not support dynamic branch switching mid-session. Instead, you configure separate catalog entries per branch, typically one for production pointing at main and one for staging pointing at a staging branch.

# Trino catalog configuration — /etc/trino/catalog/

# catalog/iceberg_prod.properties: points at the Nessie main branch
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=http://nessie:19120/api/v2
iceberg.nessie-catalog.ref=main
iceberg.nessie-catalog.default-warehouse-dir=s3://datalake/warehouse
iceberg.nessie-catalog.authentication.type=BEARER
iceberg.nessie-catalog.authentication.token=${ENV:NESSIE_PROD_TOKEN}
iceberg.file-format=PARQUET
fs.native-s3.enabled=true
s3.region=us-east-1

# catalog/iceberg_staging.properties: points at the Nessie staging branch
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=http://nessie:19120/api/v2
iceberg.nessie-catalog.ref=staging
iceberg.nessie-catalog.default-warehouse-dir=s3://datalake/warehouse
iceberg.nessie-catalog.authentication.type=BEARER
iceberg.nessie-catalog.authentication.token=${ENV:NESSIE_STAGING_TOKEN}
iceberg.file-format=PARQUET
fs.native-s3.enabled=true
s3.region=us-east-1

-- Cross-catalog quality gate query in Trino
-- Compare row counts and null rates between staging and production
-- before merging the staging branch into main

WITH prod_stats AS (
    SELECT
        COUNT(*)                                        AS row_count,
        SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)  AS null_order_id,
        SUM(CASE WHEN amount   IS NULL THEN 1 ELSE 0 END)  AS null_amount,
        AVG(CAST(amount AS DOUBLE))                     AS avg_amount
    FROM iceberg_prod.analytics.orders
    WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)
),
staging_stats AS (
    SELECT
        COUNT(*)                                        AS row_count,
        SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)  AS null_order_id,
        SUM(CASE WHEN amount   IS NULL THEN 1 ELSE 0 END)  AS null_amount,
        AVG(CAST(amount AS DOUBLE))                     AS avg_amount
    FROM iceberg_staging.analytics.orders
    WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)
)
SELECT
    p.row_count                                         AS prod_rows,
    s.row_count                                         AS staging_rows,
    ABS(p.row_count - s.row_count)                     AS row_diff,
    ABS(p.avg_amount - s.avg_amount)                   AS avg_amount_diff,
    CASE
        WHEN ABS(p.row_count - s.row_count) > p.row_count * 0.01 THEN 'FAIL_ROW_COUNT'
        WHEN s.null_order_id > 0                        THEN 'FAIL_NULL_KEYS'
        WHEN ABS(p.avg_amount - s.avg_amount) > 0.01   THEN 'FAIL_AMOUNT_DRIFT'
        ELSE 'PASS'
    END                                                 AS gate_result
FROM prod_stats p, staging_stats s;
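If the gate runs from a script rather than inside Trino, the same decision logic can live in Python. The sketch below mirrors the CASE expression above, including its order of checks and its thresholds; `TableStats` and `gate_result` are hypothetical names, and the statistics are assumed to have been fetched already by whatever client you use.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    row_count: int
    null_order_id: int
    null_amount: int
    avg_amount: float


def gate_result(prod: TableStats, staging: TableStats,
                max_row_drift: float = 0.01, max_avg_drift: float = 0.01) -> str:
    # Checks run in the same order as the SQL CASE, so the first failure wins.
    if abs(prod.row_count - staging.row_count) > prod.row_count * max_row_drift:
        return "FAIL_ROW_COUNT"
    if staging.null_order_id > 0:
        return "FAIL_NULL_KEYS"
    if abs(prod.avg_amount - staging.avg_amount) > max_avg_drift:
        return "FAIL_AMOUNT_DRIFT"
    return "PASS"


prod = TableStats(row_count=1000, null_order_id=0, null_amount=0, avg_amount=50.0)

print(gate_result(prod, TableStats(1005, 0, 0, 50.004)))  # PASS
print(gate_result(prod, TableStats(1020, 0, 0, 50.0)))    # FAIL_ROW_COUNT
print(gate_result(prod, TableStats(1000, 2, 0, 50.0)))    # FAIL_NULL_KEYS
print(gate_result(prod, TableStats(1000, 0, 0, 50.5)))    # FAIL_AMOUNT_DRIFT
```

Keeping the thresholds as parameters makes it easier to tune them per table instead of editing the SQL.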

Data Lake CI/CD — Gating Table Changes with GitHub Actions

The real power of Nessie emerges when you wire branching into your data pipeline CI/CD. The pattern mirrors software CI: every pull request creates a Nessie branch, runs transformations and quality checks on that branch, and only merges to main — atomically — when all gates pass. This is the data engineering equivalent of a feature branch workflow, and it eliminates the class of incidents where a broken migration lands in production before anyone notices.
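The control flow of that pattern is small enough to sketch in a few lines. This is a toy model under stated assumptions: `ToyCatalog` is a hypothetical in-memory catalog, and the steps are plain callables standing in for dbt runs, dbt tests, and quality gates.

```python
from typing import Callable, Dict, List


class ToyCatalog:
    """Hypothetical in-memory catalog: each branch is a list of commit labels."""

    def __init__(self) -> None:
        self.branches: Dict[str, List[str]] = {"main": ["c0"]}

    def create_branch(self, name: str, source: str) -> None:
        self.branches[name] = list(self.branches[source])

    def delete_branch(self, name: str) -> None:
        self.branches.pop(name, None)

    def commit(self, branch: str, label: str) -> None:
        self.branches[branch].append(label)

    def merge(self, source: str, target: str) -> None:
        new_commits = [c for c in self.branches[source] if c not in self.branches[target]]
        self.branches[target].extend(new_commits)


Step = Callable[[ToyCatalog, str], bool]


def run_gated_change(catalog: ToyCatalog, branch: str, steps: List[Step]) -> bool:
    catalog.delete_branch(branch)        # drop leftovers from an earlier run
    catalog.create_branch(branch, "main")
    try:
        for step in steps:
            if not step(catalog, branch):
                return False             # gate failed: main is never touched
        catalog.merge(branch, "main")
        return True
    finally:
        catalog.delete_branch(branch)    # cleanup on success and on failure


def transform(catalog: ToyCatalog, branch: str) -> bool:
    catalog.commit(branch, "orders_daily rebuilt")
    return True


def passing_gate(catalog: ToyCatalog, branch: str) -> bool:
    return True


def failing_gate(catalog: ToyCatalog, branch: str) -> bool:
    return False


catalog = ToyCatalog()
merged = run_gated_change(catalog, "pr-1", [transform, passing_gate])
rejected = run_gated_change(catalog, "pr-2", [transform, failing_gate])
```

The `finally` clause is the part that tends to get lost when the same flow is spread across separate CI steps.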

This approach complements Unity Catalog's governance model — while Unity Catalog excels at column-level access control and cross-workspace data sharing, Nessie is the better choice for teams that prioritize open standards, self-managed deployments, and branching-first development workflows without vendor lock-in.

# .github/workflows/data-lake-ci.yml
# Runs on every PR that touches dbt models or pipeline code

name: Data Lake CI

on:
  pull_request:
    paths:
      - "dbt/**"
      - "pipelines/**"
      - "schemas/**"

env:
  NESSIE_URI: http://nessie.data-platform.svc:19120/api/v2
  WAREHOUSE: s3a://datalake/warehouse
  PR_BRANCH: pr-${{ github.event.pull_request.number }}

jobs:
  data-lake-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install pynessie and dependencies
        run: pip install pynessie dbt-spark dbt-core

      - name: Create Nessie branch for this PR
        env:
          NESSIE_TOKEN: ${{ secrets.NESSIE_TOKEN }}
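        run: |
          python - <<'PYEOF'
          import os
          import pynessie

          # The heredoc delimiter is quoted, so the shell does not expand $VARS
          # inside the script; read them from the environment instead.
          branch = os.environ["PR_BRANCH"]
          client = pynessie.NessieClient(config={
              "endpoint": os.environ["NESSIE_URI"],
              "auth": {"type": "bearer", "token": os.environ["NESSIE_TOKEN"]},
          })
          # Delete branch if it exists from a previous run
          try:
              client.delete_branch(branch)
          except Exception:
              pass
          client.create_branch(branch, "main")
          print(f"Created branch: {branch}")
          PYEOF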

      - name: Run dbt on PR branch
        env:
          NESSIE_TOKEN: ${{ secrets.NESSIE_TOKEN }}
        run: |
          dbt run \
            --target ci \
            --vars '{"nessie_branch": "${{ env.PR_BRANCH }}", "nessie_uri": "${{ env.NESSIE_URI }}"}' \
            --select state:modified+

      - name: Run dbt tests
        run: |
          dbt test \
            --target ci \
            --vars '{"nessie_branch": "${{ env.PR_BRANCH }}", "nessie_uri": "${{ env.NESSIE_URI }}"}'

      - name: Run quality gate comparison (Trino)
        env:
          TRINO_HOST: trino.data-platform.svc
          NESSIE_TOKEN: ${{ secrets.NESSIE_TOKEN }}
        run: python scripts/quality_gate.py --branch "$PR_BRANCH" --baseline main

      - name: Merge to main on success
        if: success()
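        env:
          NESSIE_TOKEN: ${{ secrets.NESSIE_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          PR_TITLE: ${{ github.event.pull_request.title }}
        run: |
          python - <<'PYEOF'
          import os
          import pynessie

          # Values arrive via the environment: the quoted heredoc blocks shell
          # expansion, and the PR title is never spliced into the script source.
          branch = os.environ["PR_BRANCH"]
          client = pynessie.NessieClient(config={
              "endpoint": os.environ["NESSIE_URI"],
              "auth": {"type": "bearer", "token": os.environ["NESSIE_TOKEN"]},
          })
          result = client.merge(
              from_branch=branch,
              onto_branch="main",
              dry_run=False,
              message=f"ci: merge PR #{os.environ['PR_NUMBER']}: {os.environ['PR_TITLE']}",
          )
          print(f"Merged {len(result.details)} table(s) to main")
          client.delete_branch(branch)
          PYEOF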

      - name: Cleanup branch on failure
        if: failure()
        env:
          NESSIE_TOKEN: ${{ secrets.NESSIE_TOKEN }}
        run: |
          python -c "
          import pynessie
          c = pynessie.NessieClient(config={'endpoint': '$NESSIE_URI', 'auth': {'type': 'bearer', 'token': '$NESSIE_TOKEN'}})
          c.delete_branch('$PR_BRANCH')
          print('Cleaned up branch $PR_BRANCH')
          "

Schema Evolution Patterns — Safe and Unsafe Changes

Iceberg's schema evolution rules and Nessie's branching model work together to make schema changes auditable and reversible. Iceberg tracks every schema version by ID; old readers can continue reading old snapshots with their schema even after a column is dropped. Nessie records the metadata pointer that references a specific schema version, so every branch sees a consistent schema even if main has evolved.
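The reason renames and drops are cheap is that columns are resolved by field ID rather than by name. A toy illustration of that idea follows; it is a simplified model, not how Iceberg readers are implemented, and `read_row` and the dictionaries are hypothetical.

```python
from typing import Any, Dict

Schema = Dict[int, str]     # field ID -> current column name
StoredRow = Dict[int, Any]  # field ID -> value, as written in a data file


def read_row(stored: StoredRow, schema: Schema) -> Dict[str, Any]:
    # Names come from the schema, values are matched by field ID;
    # fields that the file does not contain read as None.
    return {name: stored.get(field_id) for field_id, name in schema.items()}


schema_v1 = {1: "order_id", 2: "order_ref"}
schema_v2 = {1: "order_id", 2: "external_ref", 3: "fulfillment_id"}  # rename + add
schema_v3 = {1: "order_id", 3: "fulfillment_id"}                     # drop field 2

old_row = {1: "ord-001", 2: "ref-9"}  # written while schema_v1 was current

print(read_row(old_row, schema_v2))
# {'order_id': 'ord-001', 'external_ref': 'ref-9', 'fulfillment_id': None}
print(read_row(old_row, schema_v3))
# {'order_id': 'ord-001', 'fulfillment_id': None}
```

The stored row never changes; only the schema used to project it does, which is why a branch pinned to an older metadata pointer keeps seeing the older column set.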

# Schema evolution via PyIceberg on a Nessie-backed catalog
# pip install "pyiceberg[s3fs,pyarrow]"

from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import LongType, StringType

# PyIceberg has no dedicated Nessie catalog type, so connect through Nessie's
# Iceberg REST endpoint. Assumption: the server has Iceberg REST enabled with a
# warehouse configured, and the branch is selected by appending its URL-encoded
# name to the base URI.
catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://nessie:19120/iceberg/feature%2Fschema-v2",
        "s3.region": "us-east-1",
    }
)

table = catalog.load_table("analytics.orders")

# ── SAFE schema changes ──────────────────────────────────────────────────────
# Add an optional column (backward-compatible: old readers ignore it)
with table.update_schema() as update:
    update.add_column(
        path="fulfillment_id",
        field_type=StringType(),
        doc="External fulfillment system reference ID",
    )

# Rename a column (Iceberg tracks by ID, not name, so this is backward-compatible)
with table.update_schema() as update:
    update.rename_column("order_ref", "external_ref")

# Widen a numeric type: int → long (backward-compatible)
with table.update_schema() as update:
    update.update_column("item_count", field_type=LongType())

# ── UNSAFE schema changes (require a major version decision) ────────────────
# Dropping a column removes it from the current schema, but existing data files
# still contain the column bytes. Readers on the new schema won't see the data.
# This is safe to do on a branch and then gate behind quality checks before
# merging to main.
with table.update_schema() as update:
    update.delete_column("internal_debug_flag")

# ── Partition evolution (Iceberg only) ─────────────────────────────────────
# Add an identity partition field; existing data files are not rewritten
with table.update_spec() as update:
    update.add_field(
        source_column_name="warehouse_code",
        transform=IdentityTransform(),
        partition_field_name="warehouse_code",
    )

# Verify the schema looks correct on the feature branch before merging
print(f"Schema ID: {table.schema().schema_id}")
for field in table.schema().fields:
    print(f"  [{field.field_id}] {field.name}: {field.field_type} {'(required)' if field.required else '(optional)'}")

Note

Never drop columns by running schema changes directly on the main branch in production. Always branch, make the change, run a Trino quality gate comparing row counts and null rates between the branch and main for the affected columns, then merge atomically. If the quality gate fails, delete the branch and nothing in production is affected. This is the same safety guarantee you get from a feature branch in software engineering — but for data.

dbt Integration — Profiles and Branch-Aware Models

The dbt-spark and dbt-trino adapters support Nessie-backed Iceberg catalogs natively — the catalog is just another Spark or Trino catalog. The branch-awareness comes from passing the active branch as a variable in the profile or via --vars at runtime. This makes dbt an ideal transformation layer in the data lake CI/CD workflow described above.

# profiles.yml — dbt-spark profile with Nessie catalog
# Supports branch overrides via --vars '{"nessie_branch": "feature/x"}'

nessie_warehouse:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: spark-thrift-server
      port: 10001
      schema: analytics
      # Iceberg via Nessie — branch set from vars or defaults to "dev"
      server_side_parameters:
        spark.sql.catalog.nessie.catalog-impl: "org.apache.iceberg.nessie.NessieCatalog"
        spark.sql.catalog.nessie.uri: "http://nessie:19120/api/v2"
        spark.sql.catalog.nessie.ref: "{{ var('nessie_branch', 'dev') }}"
        spark.sql.catalog.nessie.warehouse: "s3a://datalake/warehouse"
      catalog: nessie

    ci:
      type: spark
      method: thrift
      host: spark-thrift-server
      port: 10001
      schema: analytics
      server_side_parameters:
        spark.sql.catalog.nessie.catalog-impl: "org.apache.iceberg.nessie.NessieCatalog"
        spark.sql.catalog.nessie.uri: "http://nessie:19120/api/v2"
        spark.sql.catalog.nessie.ref: "{{ var('nessie_branch', 'main') }}"
        spark.sql.catalog.nessie.warehouse: "s3a://datalake/warehouse"
      catalog: nessie

    prod:
      type: spark
      method: thrift
      host: spark-thrift-server
      port: 10001
      schema: analytics
      server_side_parameters:
        spark.sql.catalog.nessie.catalog-impl: "org.apache.iceberg.nessie.NessieCatalog"
        spark.sql.catalog.nessie.uri: "http://nessie:19120/api/v2"
        spark.sql.catalog.nessie.ref: "main"
        spark.sql.catalog.nessie.warehouse: "s3a://datalake/warehouse"
      catalog: nessie

-- models/analytics/orders_daily.sql
-- Branch-aware dbt model: runs on whatever branch is active in the profile.
-- In CI, this runs against the PR branch; in prod, against main.
-- The model emits one row per (order_date, status), so both columns form the merge key.

{{ config(
    materialized='incremental',
    unique_key=['order_date', 'status'],
    incremental_strategy='merge',
    file_format='iceberg',
    tblproperties={
        "write.format.default": "parquet",
        "write.parquet.compression-codec": "zstd",
    }
) }}

SELECT
    DATE_TRUNC('day', created_at)  AS order_date,
    status,
    COUNT(*)                       AS order_count,
    SUM(amount)                    AS total_revenue,
    AVG(amount)                    AS avg_order_value,
    COUNT(DISTINCT customer_id)    AS unique_customers
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE created_at >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2

-- dbt test in schema.yml:
-- - dbt_utils.expression_is_true:
--     expression: "total_revenue >= 0"
-- - not_null:
--     column_name: order_date
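When a script rather than a shell one-liner launches dbt, building the --vars payload with a JSON encoder avoids quoting mistakes. A minimal sketch, assuming the `nessie_branch` and `nessie_uri` variable names used in the profile above; `dbt_command` is a hypothetical helper.

```python
import json
from typing import List


def dbt_command(action: str, target: str, nessie_branch: str, nessie_uri: str) -> List[str]:
    # json.dumps handles quoting, so branch names such as "feature/x" stay valid.
    dbt_vars = json.dumps({"nessie_branch": nessie_branch, "nessie_uri": nessie_uri})
    return ["dbt", action, "--target", target, "--vars", dbt_vars]


cmd = dbt_command("run", "ci", "pr-42", "http://nessie:19120/api/v2")
print(cmd)
# To execute it: subprocess.run(cmd, check=True)
```

Passing the command as a list also sidesteps shell interpolation of the branch name.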

Production Checklist

1. Use DynamoDB or MongoDB as the Nessie version store in production. RocksDB is single-node and not HA: losing the pod's local storage loses your branch history.

2. Enable OIDC authentication on the Nessie server and assign fine-grained roles: read-only for BI consumers, write on feature branches for engineers, merge-to-main restricted to CI service accounts.

3. Create a release tag on main before every production deployment. If the deployment fails quality gates, rollback is a single assign_branch call to the pre-deploy tag, a sub-second pointer update.

4. Delete PR branches immediately after merge or failure. Stale branches accumulate and make the reference list unmanageable. Automate cleanup in your CI pipeline with a step that always runs (in GitHub Actions, a step with if: always()).

5. Schedule Nessie GC to expire snapshots older than your retention window. Nessie GC runs as a separate tool (nessie-gc) driven by cutoff policies: it marks the content still reachable from any reference and then removes files that nothing can reach from S3.

6. Instrument the Nessie server with Prometheus metrics (quarkus.micrometer.export.prometheus.enabled=true). Alert on high commit latency: a slow DynamoDB table or a high conflict rate often precedes merge failures.

7. Pin the Nessie server, Spark Nessie catalog extension, and PyIceberg versions together. API compatibility between the server and client libraries is strict across minor versions.

8. Use separate Nessie servers for separate data domains (analytics, ml-features, raw-ingest) rather than one global server. This limits blast radius during outages and makes branch naming conventions manageable.

9. Inspect the commit log on both sides (SHOW LOG <branch> IN nessie) before merging long-lived feature branches, so you know which tables each side has touched. Conflicts on high-write tables like append-only event streams can be tricky to resolve.

10. Never write directly to main in production. Enforce this via Nessie's access control: service accounts for production pipelines should only have write access to their own named branches, not to main.
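Branch lifecycle management (item 4) is easy to automate once you have a last-commit timestamp per branch. The sketch below only selects candidates; how the timestamps are fetched and how branches are deleted depends on your client, and `stale_branches` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List


def stale_branches(last_commit_at: Dict[str, datetime], now: datetime, max_age: timedelta,
                   protected: Iterable[str] = ("main", "staging")) -> List[str]:
    # Protected branches are never returned, however old their last commit is.
    keep = set(protected)
    return sorted(name for name, committed in last_commit_at.items()
                  if name not in keep and now - committed > max_age)


now = datetime(2026, 6, 20, tzinfo=timezone.utc)
branches = {
    "main": now - timedelta(days=90),
    "pr-101": now - timedelta(days=10),
    "pr-117": now - timedelta(hours=3),
    "feature/order-schema-v2": now - timedelta(days=30),
}

to_delete = stale_branches(branches, now, max_age=timedelta(days=7))
print(to_delete)  # ['feature/order-schema-v2', 'pr-101']
```

Running a job like this on a schedule catches branches that a cancelled or crashed CI run left behind.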
