Back to Blog
DatabricksUnity CatalogData GovernanceData LineageAccess ControlDelta LakeData EngineeringCloud

Databricks Unity Catalog — Unified Data Governance, Lineage, and Access Control

A comprehensive guide to Databricks Unity Catalog in production: the three-level namespace (metastore → catalog → schema → table), external location and storage credential setup for S3/ADLS/GCS, fine-grained access control with GRANT/REVOKE privilege hierarchies and inheritance model, column masking policies with Python UDFs and role-based access, row filtering for attribute-based data access, automated column-level data lineage via system.lineage tables, audit logging with system.access.audit for compliance reporting, Delta Sharing for zero-copy cross-organization data exchange, Unity Catalog Terraform provider for IaC governance, PySpark integration with managed/external tables and Unity Catalog Volumes, Hive Metastore migration with the UCX toolkit, and production patterns covering catalog isolation, tag-based classification, and workspace binding.

2026-06-05

The Governance Gap — Why Hive Metastore Breaks at Organizational Scale

For years, the Hive Metastore (HMS) was the de-facto metadata layer for Databricks workspaces. It tracked table schemas and locations, let Spark resolve schema.table references, and worked well enough when one team owned one workspace. At organisational scale — multiple workspaces, dozens of teams, shared data products, regulatory audit requirements — HMS shows its structural limits: access control is workspace-local (no cross-workspace ACLs), table-level permissions only (no column masking, no row filters), no built-in data lineage, no audit log, and no mechanism for securely sharing data with external organizations without copying it.

Databricks Unity Catalog replaces HMS with a purpose-built governance layer that solves all of these problems. It introduces a three-level object namespace, fine-grained access control that inherits downward, automated column-level data lineage, built-in audit logging via system tables, column masking and row filtering policies, and Delta Sharing for zero-copy cross-organization data exchange. This guide covers the full picture from initial setup to production governance patterns.

Unity Catalog Architecture — The Three-Level Namespace

Unity Catalog organises all data assets in a strict four-level hierarchy: metastore → catalog → schema → table (or volume, function, model). Every object reference uses three-part notation: catalog.schema.table. The metastore is the top-level container and is tied to an account-level identity, not a workspace. A single metastore can serve all workspaces in a Databricks account, making permissions set at the catalog level visible across all attached workspaces simultaneously.

  • Metastore — account-scoped; stores storage credentials, external locations, Delta Sharing providers, and the catalog registry. Typically one per cloud region.
  • Catalog — top-level namespace container, analogous to an HMS database prefix. Use separate catalogs for environments (prod, staging, dev) or domains (finance, logistics, ML).
  • Schema — logical grouping inside a catalog; maps to a Hive database. Organise by subject area: raw, curated, analytics, pii_restricted.
  • Table / Volume / Function— the leaf objects. Managed tables store data in the metastore's default storage; external tables point to data in registered external locations. Volumes manage non-tabular files (CSVs, JSONs, ML model artifacts).

Note

Unity Catalog is account-level infrastructure, not workspace-level. You configure the metastore once in the Databricks Account Console (or via Terraform), then assign it to workspaces. Workspace admins can manage catalogs and schemas within the metastore, but cannot create or delete the metastore itself. This separation is intentional: the governance boundary is the account, not individual workspaces.

Provisioning Unity Catalog with Terraform — Metastore, Credentials, and Catalogs

The Databricks Terraform provider covers the full Unity Catalog resource surface. The recommended IaC approach is to split Unity Catalog provisioning into two Terraform roots: one for account-level resources (metastore, storage credentials, external locations) run by a Databricks account admin, and one for catalog-level resources (catalogs, schemas, grants) delegated to workspace admins.

# ─── account-level/main.tf ───────────────────────────────────────────────────
# Requires provider configured with account-level auth (account_id + client_id/secret)

terraform {
  required_providers {
    databricks = { source = "databricks/databricks", version = "~> 1.40" }
    aws        = { source = "hashicorp/aws",         version = "~> 5.0"  }
  }
}

# 1. Unity Catalog metastore — one per AWS region
resource "databricks_metastore" "main" {
  name          = "prod-metastore-us-east-1"
  storage_root  = "s3://my-unity-catalog-root/metastore"
  owner         = "account-admins"
  region        = "us-east-1"
  force_destroy = false
}

# 2. Storage credential — IAM role Unity Catalog uses to access S3
resource "databricks_storage_credential" "data_lake" {
  name = "s3-data-lake-cred"
  aws_iam_role {
    role_arn = "arn:aws:iam::123456789012:role/unity-catalog-data-lake-role"
  }
  comment       = "Credential for production data lake bucket"
  metastore_id  = databricks_metastore.main.id
}

# 3. External location — registered S3 path Unity Catalog can access
resource "databricks_external_location" "data_lake" {
  name            = "s3-data-lake"
  url             = "s3://my-data-lake-bucket"
  credential_name = databricks_storage_credential.data_lake.name
  comment         = "Primary data lake — Delta tables and raw ingestion"
  metastore_id    = databricks_metastore.main.id
}

# 4. Attach metastore to workspace
resource "databricks_metastore_assignment" "prod_workspace" {
  metastore_id         = databricks_metastore.main.id
  workspace_id         = var.prod_workspace_id
  default_catalog_name = "prod"
}

# ─── workspace-level/main.tf ──────────────────────────────────────────────────
# Configured with workspace-level auth

# 5. Catalogs — one per environment
resource "databricks_catalog" "prod" {
  metastore_id = var.metastore_id
  name         = "prod"
  comment      = "Production data assets"
  properties = {
    environment = "production"
    owner_team  = "data-platform"
  }
}

resource "databricks_catalog" "dev" {
  metastore_id = var.metastore_id
  name         = "dev"
  comment      = "Development sandbox"
  isolation_mode = "ISOLATED"   # prevents dev from reading prod data
}

# 6. Schemas
resource "databricks_schema" "analytics" {
  catalog_name = databricks_catalog.prod.name
  name         = "analytics"
  comment      = "Curated analytics models (dbt output)"
}

resource "databricks_schema" "pii_restricted" {
  catalog_name = databricks_catalog.prod.name
  name         = "pii_restricted"
  comment      = "PII data — restricted to data-engineers and compliance team"
}

# 7. Catalog-level grants
resource "databricks_grants" "prod_catalog_analysts" {
  catalog = databricks_catalog.prod.name
  grant {
    principal  = "data-analysts"
    privileges = ["USE CATALOG"]
  }
}

resource "databricks_grants" "analytics_schema" {
  schema = "prod.analytics"
  grant {
    principal  = "data-analysts"
    privileges = ["USE SCHEMA", "SELECT"]
  }
  grant {
    principal  = "data-engineers"
    privileges = ["USE SCHEMA", "SELECT", "MODIFY", "CREATE TABLE"]
  }
}

Fine-Grained Access Control — Privilege Hierarchies and GRANT/REVOKE

Unity Catalog uses a privilege inheritance model: privileges granted on a catalog propagate downward to all schemas and tables within it, unless explicitly overridden. This allows you to design access tiers — grant USE CATALOG broadly, then selectively grant SELECT on specific schemas or tables, and REVOKE on sensitive objects.

-- ── Environment access tier ─────────────────────────────────────────────────

-- All analysts can browse the prod catalog
GRANT USE CATALOG ON CATALOG prod TO `data-analysts`;

-- Analysts can read everything in analytics schema
GRANT USE SCHEMA ON SCHEMA prod.analytics TO `data-analysts`;
GRANT SELECT    ON SCHEMA prod.analytics TO `data-analysts`;

-- Engineers get full write access to raw ingestion schema
GRANT ALL PRIVILEGES ON SCHEMA prod.raw TO `data-engineers`;

-- ── Table-level granularity ───────────────────────────────────────────────────

-- Restrict access to a specific high-value table
GRANT SELECT ON TABLE prod.analytics.revenue_summary
  TO `finance-team`;

-- Service principal used by Airflow ETL gets MODIFY but not DDL
GRANT USE CATALOG  ON CATALOG prod              TO `airflow-sp`;
GRANT USE SCHEMA   ON SCHEMA prod.raw           TO `airflow-sp`;
GRANT SELECT, MODIFY ON TABLE prod.raw.events   TO `airflow-sp`;

-- ── Sensitive schema — restrict entirely, then add back selectively ───────────
REVOKE ALL PRIVILEGES ON SCHEMA prod.pii_restricted FROM `data-analysts`;

-- Compliance team gets read access to PII schema
GRANT USE SCHEMA, SELECT ON SCHEMA prod.pii_restricted
  TO `compliance-team`;

-- ── Ownership transfers ───────────────────────────────────────────────────────
-- Owner can grant any privilege they hold
ALTER SCHEMA prod.analytics OWNER TO `data-engineering`;
ALTER TABLE  prod.analytics.orders OWNER TO `orders-team`;

-- ── Show effective permissions ────────────────────────────────────────────────
SHOW GRANTS ON TABLE prod.analytics.orders;
SHOW GRANTS TO `data-analysts`;

-- ── Dynamic group membership (via Databricks account groups) ─────────────────
-- Groups sync from your IdP (Entra ID / Okta SCIM) — no manual management
-- Grant to group; membership changes take effect immediately
GRANT SELECT ON SCHEMA prod.analytics TO `group:emea-analysts`;

Note

Unity Catalog privileges are additive — a principal receives the union of all grants applied to them directly, through groups they belong to, and through inheritance from parent objects. REVOKE only removes a specific grant; it does not override a grant from a parent object or a different group. If you need to block access that would otherwise be inherited, use catalog isolation mode (set isolation_mode = ISOLATED in Terraform) or structure your schema hierarchy so that sensitive data lives in a schema with no inherited privileges from the catalog.

Column Masking and Row Filtering — Attribute-Based Data Access Policies

Unity Catalog's column masking and row filtering features let you attach policy functions to tables that transform or restrict what different users see — without query-time configuration, view proliferation, or asking users to use specific query patterns. The policies are enforced at the engine level for every query that touches the table, including queries from BI tools, notebooks, and APIs.

Masking functions are SQL or Python UDFs registered in Unity Catalog. They receive the column value and return the masked version. The function can call current_user() and is_member() to implement role-based masking — engineers see the raw value; analysts see a truncated or hashed version.

-- ── Column masking: PII email address ───────────────────────────────────────

CREATE OR REPLACE FUNCTION prod.masking.mask_email(email STRING)
  RETURNS STRING
  LANGUAGE PYTHON
  COMMENT 'Mask email to first char + *** @ domain for non-PII-approved users'
AS $$
  if email is None:
    return None
  parts = email.split('@', 1)
  if len(parts) != 2:
    return '***'
  local, domain = parts
  # Full PII team members see the real email
  from databricks.sdk.runtime import spark as _spark
  is_pii_team = _spark.sql(
    "SELECT is_member('pii-data-team')"
  ).collect()[0][0]
  if is_pii_team:
    return email
  return local[0] + '***@' + domain
$$;

-- Apply column mask to the customers table
ALTER TABLE prod.pii_restricted.customers
  ALTER COLUMN email
  SET MASK prod.masking.mask_email
  USING COLUMNS (email);

-- ── Column masking: credit card last four digits only ─────────────────────────
CREATE OR REPLACE FUNCTION prod.masking.mask_card(card_number STRING)
  RETURNS STRING
  LANGUAGE SQL
  COMMENT 'Return last 4 digits; full number only for payment-team members'
AS $$
  CASE WHEN is_member('payment-team')
    THEN card_number
    ELSE CONCAT('****-****-****-', RIGHT(card_number, 4))
  END
$$;

ALTER TABLE prod.pii_restricted.payments
  ALTER COLUMN card_number
  SET MASK prod.masking.mask_card
  USING COLUMNS (card_number);

-- ── Row filter: users only see rows for their assigned region ─────────────────
CREATE OR REPLACE FUNCTION prod.security.user_region_filter(region STRING)
  RETURNS BOOLEAN
  LANGUAGE SQL
  COMMENT 'Return TRUE only for rows in the requesting user region'
AS $$
  -- Ops and data-engineers bypass the filter entirely
  is_member('data-engineers') OR is_member('ops-team')
  OR EXISTS (
    SELECT 1
    FROM   prod.security.user_region_map m
    WHERE  m.user_email = current_user()
    AND    m.region     = region
  )
$$;

-- Attach the row filter to the regional sales table
ALTER TABLE prod.analytics.sales
  SET ROW FILTER prod.security.user_region_filter ON (region);

-- Verify: run as a regular analyst — only EMEA rows visible
SELECT COUNT(*), region FROM prod.analytics.sales GROUP BY region;

-- Remove a mask or filter
ALTER TABLE prod.pii_restricted.customers
  ALTER COLUMN email DROP MASK;

ALTER TABLE prod.analytics.sales DROP ROW FILTER;

Automated Data Lineage — Column-Level Tracking via System Tables

Unity Catalog automatically captures column-level data lineage for all Delta tables — no instrumentation required. Every INSERT, MERGE, CREATE TABLE AS SELECT, or notebook/job write operation that touches a Unity Catalog table is analyzed to extract source-to-target column mappings. The lineage is surfaced in the Databricks UI Catalog Explorer and queryable from the system.lineage system tables.

-- ── Query column-level lineage for a downstream table ────────────────────────
SELECT
  source_table_full_name,
  source_column_name,
  target_table_full_name,
  target_column_name,
  transformation_type          -- DIRECT, TRANSFORMATION, or UNKNOWN
FROM system.lineage.column_lineage
WHERE target_table_full_name = 'prod.analytics.revenue_daily'
ORDER BY target_column_name, source_table_full_name;

-- ── Find all downstream dependents of a source table (impact analysis) ────────
WITH RECURSIVE downstream AS (
  SELECT DISTINCT target_table_full_name AS table_name
  FROM   system.lineage.table_lineage
  WHERE  source_table_full_name = 'prod.raw.orders'

  UNION ALL

  SELECT tl.target_table_full_name
  FROM   system.lineage.table_lineage tl
  JOIN   downstream d ON tl.source_table_full_name = d.table_name
)
SELECT DISTINCT table_name AS dependent_table
FROM   downstream
ORDER  BY 1;

-- ── Trace lineage of a specific column upstream ───────────────────────────────
WITH RECURSIVE upstream AS (
  SELECT source_table_full_name, source_column_name, transformation_type
  FROM   system.lineage.column_lineage
  WHERE  target_table_full_name = 'prod.analytics.revenue_daily'
  AND    target_column_name     = 'revenue_usd'

  UNION ALL

  SELECT cl.source_table_full_name, cl.source_column_name, cl.transformation_type
  FROM   system.lineage.column_lineage cl
  JOIN   upstream u
    ON   cl.target_table_full_name = u.source_table_full_name
    AND  cl.target_column_name     = u.source_column_name
)
SELECT DISTINCT
  source_table_full_name,
  source_column_name,
  transformation_type
FROM upstream
ORDER BY 1, 2;

-- ── Notebook / job lineage: what wrote to a table in the last 30 days? ────────
SELECT
  entity_type,    -- NOTEBOOK or JOB
  entity_name,
  created_by,
  event_time
FROM system.lineage.table_lineage
WHERE target_table_full_name = 'prod.analytics.orders_daily'
  AND event_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP())
ORDER BY event_time DESC
LIMIT 50;

Note

Unity Catalog lineage is captured on a best-effort basis for Spark SQL and Python DataFrame operations. Complex Python transformations that do not go through a Unity Catalog write (e.g. writing directly to DBFS or external paths outside registered external locations) will not generate lineage events. For complete lineage coverage, ensure all persistent data writes target Unity Catalog tables, and that Spark sessions run on Databricks Runtime 11.3 LTS or later with Unity Catalog enabled.

Audit Logging and System Tables — Access Monitoring and Compliance Queries

Unity Catalog writes all data access and permission events to the system.access schema as system tables. The system.access.audit table captures every SQL statement, REST API call, and permission change with the requesting identity, timestamp, and request parameters. This is the authoritative source for compliance reporting and anomaly detection.

-- ── Top data consumers by table in the last 7 days ──────────────────────────
SELECT
  request_params.full_name                      AS table_name,
  user_identity.email                           AS user_email,
  COUNT(*)                                      AS query_count,
  MAX(event_time)                               AS last_access
FROM system.access.audit
WHERE action_name   IN ('commandSubmit', 'selectFromTable', 'runCommand')
  AND service_name  = 'databricks'
  AND event_time   >= DATEADD(DAY, -7, CURRENT_TIMESTAMP())
  AND request_params.full_name IS NOT NULL
GROUP BY 1, 2
ORDER BY query_count DESC
LIMIT 100;

-- ── Permission change audit: who changed what and when ───────────────────────
SELECT
  event_time,
  user_identity.email      AS changed_by,
  action_name,             -- grantPrivilege, revokePrivilege, createTable, etc.
  request_params.securable_type,
  request_params.securable_full_name,
  request_params.changes
FROM system.access.audit
WHERE action_name IN (
  'grantPrivilege', 'revokePrivilege',
  'createTable', 'dropTable',
  'createCatalog', 'dropCatalog',
  'alterTable', 'createFunction'
)
  AND event_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP())
ORDER BY event_time DESC;

-- ── Detect unusual access: queries outside business hours (UTC) ──────────────
SELECT
  DATE(event_time)                AS access_date,
  HOUR(event_time)                AS hour_utc,
  user_identity.email             AS user_email,
  request_params.full_name        AS table_name,
  COUNT(*)                        AS query_count
FROM system.access.audit
WHERE action_name IN ('commandSubmit', 'selectFromTable')
  AND HOUR(event_time) NOT BETWEEN 7 AND 19   -- outside 07:00-19:00 UTC
  AND event_time >= DATEADD(DAY, -14, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3, 4
HAVING query_count > 5
ORDER BY query_count DESC;

-- ── Active users per catalog (license / chargeback reporting) ─────────────────
SELECT
  request_params.catalog_name     AS catalog,
  COUNT(DISTINCT user_identity.email) AS unique_users,
  COUNT(*)                        AS total_queries
FROM system.access.audit
WHERE action_name = 'commandSubmit'
  AND event_time >= DATE_TRUNC('month', CURRENT_DATE())
GROUP BY 1
ORDER BY total_queries DESC;

Delta Sharing — Zero-Copy Cross-Organization Data Sharing

Delta Sharing is an open protocol for sharing live Delta Lake data with external recipients — other cloud providers, external organizations, or teams on separate Databricks accounts — without copying or moving the underlying data. The sharing side creates a share (a logical bundle of tables or table partitions), then creates a recipient and grants them access. The recipient gets a credential file they can use with any Delta Sharing-compatible client (Databricks, PySpark, pandas, Power BI, Tableau, or the open-source delta-sharing Python client).

-- ── Provider side: create and configure a share ─────────────────────────────

-- Create a named share
CREATE SHARE q2_sales_share
  COMMENT 'Q2 2026 regional sales data for partner analysis';

-- Add a full table
ALTER SHARE q2_sales_share
  ADD TABLE prod.analytics.sales_summary;

-- Add a partitioned view: only EMEA rows, only 2026 data
ALTER SHARE q2_sales_share
  ADD TABLE prod.analytics.sales_transactions
  PARTITION BY (region = 'EMEA')
  AS sales_transactions_emea;     -- alias exposed to recipient

-- Create a recipient (open sharing — generates activation link)
CREATE RECIPIENT acme_corp_partner
  COMMENT 'ACME Corp BI team';

-- Grant the share to the recipient
GRANT SELECT ON SHARE q2_sales_share TO RECIPIENT acme_corp_partner;

-- Retrieve the activation link for the recipient
DESCRIBE RECIPIENT acme_corp_partner;

-- ── Recipient side (Python): read a shared table ──────────────────────────────
# pip install delta-sharing

import delta_sharing

# profile_file is the .share file downloaded from the activation link
client = delta_sharing.SharingClient(profile_file="acme_partner.share")

# List available shares and tables
shares = client.list_shares()
tables = client.list_all_tables()

# Load a shared table as a pandas DataFrame
df = delta_sharing.load_as_pandas(
    url="acme_partner.share#q2_sales_share.sales_transactions_emea",
    limit=10_000,
)

# Load as a Spark DataFrame (on Databricks)
df_spark = delta_sharing.load_as_spark(
    url="acme_partner.share#q2_sales_share.sales_transactions_emea"
)
df_spark.createOrReplaceTempView("shared_sales")

-- ── Monitor sharing activity ──────────────────────────────────────────────────
SELECT
  share_name,
  recipient_name,
  COUNT(*)                AS access_count,
  MAX(event_time)         AS last_access,
  SUM(num_files)          AS total_files_read
FROM system.sharing.recipient_access_events
WHERE event_time >= DATEADD(DAY, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY access_count DESC;

PySpark Integration — Managed Tables, External Tables, and Volumes

Unity Catalog tables are first-class citizens in PySpark. Once a cluster is attached to a Unity Catalog-enabled workspace, all Spark SQL and DataFrame operations use the three-part catalog.schema.table namespace. The key operational difference is between managed tables(data stored in the metastore's default storage path, deleted with the table) and external tables (data stored at a registered external location, preserved if the table is dropped). For production data lakes, use external tables to decouple data lifecycle from catalog lifecycle.

Volumes extend Unity Catalog to non-tabular files. Mount a registered external location path as a Volume and access it at /Volumes/catalog/schema/volume-name/. This replaces DBFS mounts with a governed, lineage-tracked file path.

# ── Managed table: data stored in metastore default storage ──────────────────
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, to_date
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.getOrCreate()

# Create a managed Delta table using three-part naming
spark.sql("""
  CREATE TABLE IF NOT EXISTS prod.analytics.orders (
    order_id    STRING  NOT NULL,
    customer_id STRING  NOT NULL,
    amount_usd  DOUBLE,
    status      STRING,
    region      STRING,
    order_date  DATE
  )
  USING DELTA
  COMMENT 'Cleaned daily orders — managed by analytics team'
  TBLPROPERTIES (
    'delta.enableChangeDataFeed' = 'true',
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")

# Write a DataFrame to a managed UC table
df = spark.read.parquet("s3://my-lake/raw/orders/date=2026-06-05/")
df.select(
    col("id").alias("order_id"),
    col("cust_id").alias("customer_id"),
    (col("amount") / 100).alias("amount_usd"),
    col("status"),
    col("region"),
    to_date(col("created_at")).alias("order_date"),
).write.format("delta").mode("append").saveAsTable("prod.analytics.orders")

# ── External table: data at a registered external location path ───────────────
spark.sql("""
  CREATE TABLE IF NOT EXISTS prod.raw.events_external
  USING DELTA
  LOCATION 's3://my-data-lake-bucket/raw/events'   -- path under registered external location
  COMMENT 'External Delta table — data persists if table is dropped'
""")

# ── Time travel on a Unity Catalog table ──────────────────────────────────────
# Read as of a specific timestamp
df_yesterday = spark.read     .option("timestampAsOf", "2026-06-04T00:00:00Z")     .table("prod.analytics.orders")

# Read as of a specific version
df_v10 = spark.read     .option("versionAsOf", "10")     .table("prod.analytics.orders")

# ── Unity Catalog Volumes — non-tabular file access ───────────────────────────
# Create a volume backed by an external location
spark.sql("""
  CREATE EXTERNAL VOLUME prod.analytics.raw_uploads
  LOCATION 's3://my-data-lake-bucket/uploads'
""")

# Read a file from a volume
df_csv = spark.read.format("csv")     .option("header", "true")     .option("inferSchema", "true")     .load("/Volumes/prod/analytics/raw_uploads/daily_export.csv")

# Write a model artifact to a volume
import mlflow
mlflow.set_tracking_uri("databricks")
with mlflow.start_run():
    mlflow.log_artifact(
        "/Volumes/prod/ml/artifacts/feature_matrix.parquet",
        artifact_path="features",
    )

Hive Metastore Migration — Upgrading to Unity Catalog Without Downtime

Databricks ships UCX (Unity Catalog Accelerator), an open-source migration toolkit that automates the assessment, mapping, and upgrade of Hive Metastore tables to Unity Catalog. The migration is in-place for managed tables (data moves to the metastore storage root) and a metadata-only operation for external tables (data stays in place, Unity Catalog registers the path under an external location).

# Install UCX via Databricks CLI (version 0.25+)
pip install databricks-cli

# Authenticate to your workspace
databricks configure --token

# Install UCX labs extension
databricks labs install ucx

# ── Step 1: Assessment — inventory all HMS tables and identify blockers ────────
databricks labs ucx run-assessment
# Generates: assessment dashboard in a new Databricks workspace folder
# Reports: tables with unsupported features, missing permissions, DBFS mounts

# ── Step 2: Create table mapping ─────────────────────────────────────────────
# Maps HMS database.table → catalog.schema.table
databricks labs ucx create-table-mapping
# Edit the generated CSV if you need custom catalog/schema assignments:
#   hive_metastore.default.orders → prod.analytics.orders
#   hive_metastore.raw.events     → prod.raw.events

# ── Step 3: Migrate external locations ───────────────────────────────────────
# Register DBFS paths and S3 mounts as external locations
databricks labs ucx migrate-locations

# ── Step 4: Migrate tables ────────────────────────────────────────────────────
databricks labs ucx migrate-tables   --target-catalog prod

# For specific schemas only:
databricks labs ucx migrate-tables   --target-catalog prod   --source-schemas default analytics raw

# ── Step 5: Post-migration verification ──────────────────────────────────────
# In a notebook — verify table accessible via UC naming
# %sql
SELECT COUNT(*) FROM prod.analytics.orders;

# Check lineage was captured during migration
SELECT * FROM system.lineage.table_lineage
WHERE target_table_full_name = 'prod.analytics.orders'
ORDER BY event_time DESC LIMIT 5;

# ── Step 6: Migrate cluster and job permissions ───────────────────────────────
databricks labs ucx migrate-permissions

# Validate no HMS references remain in production jobs
databricks labs ucx validate-groups-membership

Note

After migrating a table, both the original HMS path (hive_metastore.analytics.orders) and the UC path (prod.analytics.orders) may coexist temporarily. UCX creates a shadow table in HMS that redirects reads to the UC table, giving you a rollback window without breaking existing notebooks or jobs. Remove the HMS shadow tables only after verifying all consumers have been updated to the three-part UC naming.

Production Patterns — Catalog Design, Tag-Based Governance, and Monitoring

Governance at scale requires systematic patterns, not one-off grants. The most impactful patterns for Unity Catalog production deployments are: environment isolation via catalog structure, metadata tagging for data classification, automated grant management via Terraform, and system table dashboards for ongoing governance monitoring.

-- ── Tag-based data classification ───────────────────────────────────────────

-- Tag a table as containing PII data
ALTER TABLE prod.pii_restricted.customers
  SET TAGS ('pii' = 'true', 'classification' = 'confidential', 'owner' = 'data-privacy-team');

-- Tag individual columns with their sensitivity level
ALTER TABLE prod.pii_restricted.customers
  ALTER COLUMN email    SET TAGS ('sensitivity' = 'PII');
ALTER TABLE prod.pii_restricted.customers
  ALTER COLUMN phone    SET TAGS ('sensitivity' = 'PII');
ALTER TABLE prod.pii_restricted.customers
  ALTER COLUMN name     SET TAGS ('sensitivity' = 'quasi-identifier');

-- Query all PII-tagged tables across all catalogs
SELECT
  table_catalog,
  table_schema,
  table_name,
  tag_name,
  tag_value
FROM system.information_schema.table_tags
WHERE tag_name = 'pii' AND tag_value = 'true'
ORDER BY table_catalog, table_schema, table_name;

-- ── Catalog design for multi-environment, multi-domain orgs ──────────────────
-- Recommended namespace structure:
--   prod.raw.*            ← landing zone (append-only, data-engineers write)
--   prod.curated.*        ← cleansed data (dbt models, data-engineers write)
--   prod.analytics.*      ← business aggregates (dbt models, analysts read/write)
--   prod.pii_restricted.* ← PII tables (compliance team and data-engineers write)
--   prod.sharing.*        ← tables explicitly shared via Delta Sharing
--   staging.raw.*         ← staging environment (mirrors prod structure)
--   dev.*                 ← development catalog (ISOLATED mode — no prod access)
--   sandbox.*             ← analyst sandboxes (auto-expires via lifecycle policy)

-- ── Data quality contract: enforce NOT NULL via table constraints ─────────────
ALTER TABLE prod.analytics.orders
  ADD CONSTRAINT orders_order_id_not_null CHECK (order_id IS NOT NULL);

ALTER TABLE prod.analytics.orders
  ADD CONSTRAINT orders_amount_positive   CHECK (amount_usd > 0);

-- ── Monitor constraint violations (Delta CHECK constraints raise errors on write)
-- Test with an intentional bad row:
-- INSERT INTO prod.analytics.orders VALUES (NULL, 'c001', 10.0, 'placed', 'EU', '2026-06-05')
-- → AnalysisException: CHECK constraint violation

-- ── Workspace binding: restrict which workspaces can access a catalog ─────────
-- Prevents dev workspace from reading prod catalog by default
-- Configured via Account Console or Terraform:
-- resource "databricks_catalog_workspace_binding" ...
--   workspace_id = var.dev_workspace_id
--   catalog_name = "prod"
--   binding_type = "BINDING_TYPE_READ_ONLY"   -- or BINDING_TYPE_READ_WRITE

Work with us

Managing data governance across Databricks workspaces and struggling with access control, lineage visibility, or cross-team data sharing?

We design and implement Databricks Unity Catalog governance architectures — from external location setup and storage credential configuration to fine-grained GRANT/REVOKE privilege hierarchies, column masking policies, row filters, Delta Sharing catalogs for cross-organization data sharing, Unity Catalog Terraform provisioning, system table monitoring for audit and lineage, and Hive Metastore migration playbooks with UCX. Let’s talk.

Get in touch

Related Articles

DataSOps Consulting

Need help implementing this in production?

We build and operate data pipelines, AI systems, and observability stacks for engineering teams. Reach out for a free 30-minute architecture review.