Tags: AWS, Lake Formation, Data Governance, Data Catalog, Cloud, IAM, Data Engineering, Security, S3, Data Lake

AWS Lake Formation — Fine-Grained Access Control, Column Masking, and Data Catalog

A practical guide to AWS Lake Formation for centralized data lake governance: the Glue Data Catalog as the metadata layer that Lake Formation extends with fine-grained access control enforced consistently across Athena, Redshift Spectrum, and Glue ETL, the Lake Formation permission model versus S3-only IAM policies and why column-level and row-level restrictions cannot be implemented at the S3 level alone, IAM service role registration of S3 locations as Lake Formation-managed paths and the put_data_lake_settings admin designation that bootstraps the permission system, LF-Tag creation with create_lf_tag for classification (public, internal, confidential, restricted) and domain (sales, marketing, finance, hr) attribute keys, add_lf_tags_to_resource assigning multiple tags to databases, tables, and individual columns to build an attribute-based access control taxonomy, grant_permissions with LFTagPolicy expressions that apply to all current and future tables matching the tag combination eliminating per-table grants as the catalog grows, column-level security via ColumnWildcard.ExcludedColumnNames removing PII columns from the visible schema for Athena queries by BI tool service accounts, create_data_cells_filter with RowFilter.FilterExpression SQL WHERE clauses that are injected transparently into every query against the table, granting data cell filters to principals so regional analysts only see their region’s rows without any query modification on their part, Terraform resources aws_lakeformation_lf_tag, aws_lakeformation_resource, and aws_lakeformation_permissions for version-controlled governance-as-code, cross-account data sharing combining AWS RAM for Glue catalog database sharing and Lake Formation permission grants to consumer account IDs followed by glue.create_database with TargetDatabase pointing at the producer account for federated querying, CloudTrail data events for lakeformation.amazonaws.com with CloudWatch Logs Insights queries auditing denied 
access, column access frequency per principal, and cross-account grant detection, and a 10-point production checklist covering the IAMAllowedPrincipals revocation step, LF-Tag ABAC strategy over direct resource grants, row filter partition alignment for scan performance, PermissionsWithGrantOption scope restriction, service role S3 path scoping, Glue ETL job permissions, Terraform state management, cross-account RAM plus LF dual grant requirement, strict governance mode default permission settings, and CloudTrail data events for column-level audit logging.

2026-06-27

What is AWS Lake Formation and Why It Exists

Traditional S3-based data lakes rely on IAM bucket policies and S3 prefix grants for access control. This works at the storage level, but it cannot restrict which columns an analyst sees in a Parquet file, which rows a consumer can query, or enforce governance consistently across Athena, Redshift Spectrum, and Glue ETL jobs simultaneously. Each engine has its own access model and there is no central audit trail of who queried which columns.

AWS Lake Formation sits on top of the AWS Glue Data Catalog and intercepts metadata requests made by query engines. When Athena, Redshift Spectrum, or a Glue ETL job queries a Lake Formation-registered table, Lake Formation enforces column-level visibility and injects row-filter WHERE clauses before returning credentials to the query engine. The result is consistent governance across all engines from a single control plane. If your workloads run on Databricks rather than S3 and Athena, Databricks Unity Catalog provides equivalent fine-grained governance with column masking and row filters enforced across Databricks SQL, Spark, and ML — Lake Formation is the native choice when your query layer is Athena and your storage is S3.

Column-Level Security

Restrict which columns are visible to each IAM principal. Exclude PII columns (email, SSN, credit card) from analyst roles while allowing data engineers full schema access — enforced transparently at query time.

Row-Level Filters

SQL WHERE expressions injected into every query against a table. Regional analysts only see their region's rows. Multi-tenant tables filter by tenant ID. No query modifications required on the consumer side.

LF-Tag ABAC

Attribute-based access control via LF-Tags. Tag tables with data_classification=confidential and domain=finance, then grant a role access to all tables matching those tags. Current and future tables are included automatically.

Architecture — Catalog, Location Registry, and the Permission Model

Lake Formation builds a governance layer out of five interacting components. Understanding how they fit together explains why, for registered locations, data access flows through credentials vended by Lake Formation rather than through the querying principal's own S3 permissions.

1

AWS Glue Data Catalog

The metadata store for databases, tables, and schemas. Lake Formation manages permissions on Glue Catalog resources — databases and tables — not directly on S3 objects. Every Athena and Redshift Spectrum query resolves table metadata through the catalog before touching S3.

2

Lake Formation Location Registry

S3 prefixes registered as Lake Formation-managed locations. When an S3 prefix is registered, Lake Formation issues short-lived, scoped credentials to query engines rather than relying on the calling principal's IAM role to have S3 GetObject directly. This is what enables column and row filtering at the metadata layer.

3

Lake Formation Permission Model

A separate permission layer on top of IAM. Lake Formation permissions are resource-based grants (database, table, columns, data filters) applied to IAM principals. Both Lake Formation permissions AND IAM permissions must allow an action — Lake Formation is additive on top of IAM, not a replacement.

4

Query Engine Integration

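Before reading data, Athena, Redshift Spectrum, and Glue ETL ask Lake Formation for temporary credentials. For integrated AWS services this request appears in CloudTrail as GetDataAccess; GetTemporaryGlueTableCredentials is the equivalent public API for third-party engines. Lake Formation returns S3 credentials scoped to the table's registered location, and the engine applies the column and row restrictions that Lake Formation reports for the principal. The querying principal never needs direct S3 access.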

5

Data Lake Admin

IAM principals designated as Lake Formation admins have superuser access to all catalog resources and can grant permissions to others. Admin designation is separate from AWS account root and IAM admin — you explicitly designate admins via the Lake Formation console or put_data_lake_settings API.
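The two-layer check described above (IAM and Lake Formation must both allow the action) can be pictured with a small model. This is a simplified sketch, not how AWS evaluates requests internally: the `Principal` class, the `can_select` helper, and the choice of `lakeformation:GetDataAccess` plus `glue:GetTable` as the minimum IAM actions are illustrative assumptions.

```python
from dataclasses import dataclass

# IAM actions a querying principal typically needs in its identity policy
# (assumed minimum for this sketch).
REQUIRED_IAM_ACTIONS = frozenset({"lakeformation:GetDataAccess", "glue:GetTable"})


@dataclass(frozen=True)
class Principal:
    arn: str
    iam_actions: frozenset  # actions allowed by IAM identity policies
    lf_grants: frozenset    # (table_name, permission) pairs granted in Lake Formation


def can_select(principal: Principal, table: str) -> bool:
    """Return True only when BOTH layers allow the read."""
    iam_ok = REQUIRED_IAM_ACTIONS <= principal.iam_actions
    lf_ok = (table, "SELECT") in principal.lf_grants
    return iam_ok and lf_ok


analyst = Principal(
    arn="arn:aws:iam::123456789012:role/DataAnalystRole",
    iam_actions=frozenset({"lakeformation:GetDataAccess", "glue:GetTable"}),
    lf_grants=frozenset({("sales_db.customer_orders", "SELECT")}),
)
```

A principal with the Lake Formation grant but without the IAM actions is denied, and so is one with the IAM actions but no grant, which is the practical meaning of "additive on top of IAM".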

Initial Setup — Enabling Lake Formation and Registering Locations

Before Lake Formation can enforce permissions, you register S3 locations with a dedicated service role and designate Lake Formation admins. The Terraform configuration below establishes the minimum viable governance foundation — service role, admin designation, and S3 location registration.

# terraform/lakeformation-foundation.tf

# Service role that Lake Formation uses to access registered S3 locations
resource "aws_iam_role" "lakeformation_service" {
  name = "LakeFormationServiceRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lakeformation.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "lakeformation_s3_access" {
  name = "LakeFormationS3DataLakeAccess"
  role = aws_iam_role.lakeformation_service.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject",
          "s3:ListBucket"
        ]
        Resource = [
          "arn:aws:s3:::my-datalake-prod",
          "arn:aws:s3:::my-datalake-prod/*"
        ]
      }
    ]
  })
}

# Designate Lake Formation admins
# Only these principals can grant permissions to others
resource "aws_lakeformation_data_lake_settings" "main" {
  admins = [
    "arn:aws:iam::123456789012:role/DataPlatformAdminRole",
    "arn:aws:iam::123456789012:user/terraform-deployer",
  ]

  # Empty default permissions: new databases and tables receive no implicit
  # IAMAllowedPrincipals grant, so access always requires an explicit grant
  create_database_default_permissions {
    permissions = []
  }
  create_table_default_permissions {
    permissions = []
  }
}

# Register the S3 data lake location with the service role
resource "aws_lakeformation_resource" "datalake_bucket" {
  arn      = "arn:aws:s3:::my-datalake-prod"
  role_arn = aws_iam_role.lakeformation_service.arn
}

# Grant database-level access (required before table-level grants)
resource "aws_lakeformation_permissions" "analyst_describe_sales_db" {
  principal = "arn:aws:iam::123456789012:role/DataAnalystRole"

  database {
    name = "sales_db"
  }

  permissions             = ["DESCRIBE"]
  permissions_with_grant_option = []
}

Note

After enabling Lake Formation, disable the "IAMAllowedPrincipals" group on existing Glue databases — otherwise any IAM principal with Glue read permissions bypasses Lake Formation entirely. Run aws lakeformation batch-revoke-permissions to revoke the default IAMAllowedPrincipals grants that Glue created before Lake Formation was enabled. This is the most common reason Lake Formation permissions appear to have no effect.
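A minimal sketch of that revocation in boto3 follows. The helper names (`build_revoke_entries`, `revoke_iam_allowed_principals`) and the entry `Id` format are hypothetical; the sketch assumes the default grant is `ALL` to the special principal `IAM_ALLOWED_PRINCIPALS`, and it leaves batching of large table lists to the caller.

```python
def build_revoke_entries(database, tables):
    """Build BatchRevokePermissions entries for one database and its tables."""
    principal = {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}
    entries = [{
        "Id": f"db-{database}",
        "Principal": principal,
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["ALL"],
    }]
    for table in tables:
        entries.append({
            "Id": f"tbl-{database}-{table}",
            "Principal": principal,
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "Permissions": ["ALL"],
        })
    return entries


def revoke_iam_allowed_principals(database, tables, region="us-east-1"):
    """Send the revocation; returns the API response, including any Failures."""
    import boto3  # imported lazily so the builder above works without AWS access

    lf = boto3.client("lakeformation", region_name=region)
    return lf.batch_revoke_permissions(Entries=build_revoke_entries(database, tables))
```

Inspect the `Failures` list in the response: entries for resources that never had the default grant are reported there rather than raising an exception.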

LF-Tags — Attribute-Based Access Control at Scale

Direct resource grants — granting a role access to a specific table by name — require a new grant every time a table is added. With hundreds of tables across dozens of domains, this becomes unmanageable. LF-Tags solve this with attribute-based access control (ABAC): tag tables with key-value attributes like data_classification=confidential and domain=finance, then grant roles access to all tables matching a tag expression. LF-Tags pair naturally with data mesh domain ownership models — each domain team tags their own tables, and platform-level LF-Tag policies enforce cross-domain access boundaries without central per-table permission management.

# boto3 — Create and assign LF-Tags

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Create classification tag with allowed values
lf.create_lf_tag(
    TagKey="data_classification",
    TagValues=["public", "internal", "confidential", "restricted"],
)

# Create domain tag
lf.create_lf_tag(
    TagKey="domain",
    TagValues=["sales", "marketing", "finance", "hr", "engineering"],
)

# Create PII indicator tag (used at column level)
lf.create_lf_tag(
    TagKey="contains_pii",
    TagValues=["true", "false"],
)

# Assign tags to a table
lf.add_lf_tags_to_resource(
    Resource={
        "Table": {
            "DatabaseName": "sales_db",
            "Name": "customer_orders",
        }
    },
    LFTags=[
        {"TagKey": "data_classification", "TagValues": ["confidential"]},
        {"TagKey": "domain", "TagValues": ["sales"]},
    ],
)

# Assign column-level PII tag to sensitive columns
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "customer_orders",
            "ColumnNames": ["customer_email", "customer_phone", "billing_address"],
        }
    },
    LFTags=[
        {"TagKey": "contains_pii", "TagValues": ["true"]},
    ],
)

# Grant access via LF-Tag policy — works for ALL matching tables,
# including tables created in the future
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "data_classification", "TagValues": ["public", "internal"]},
                {"TagKey": "domain", "TagValues": ["sales", "marketing"]},
            ],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=[],
)

# Senior analysts access confidential finance data
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/SeniorAnalystRole"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "data_classification", "TagValues": ["confidential"]},
                {"TagKey": "domain", "TagValues": ["finance"]},
            ],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=[],
)
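It helps to be precise about how a tag expression matches: every key in the expression must be present on the resource (AND across keys), and the resource's value for that key must be one of the listed values (OR within a key). The sketch below models that rule locally; `matches_expression` and the `web_sessions` table are hypothetical and not part of the Lake Formation API.

```python
def matches_expression(resource_tags, expression):
    """Return True if a resource's LF-Tags satisfy an LFTagPolicy expression.

    resource_tags: dict of tag key -> single assigned value
    expression:    list of {"TagKey": ..., "TagValues": [...]} conditions
    """
    return all(
        resource_tags.get(cond["TagKey"]) in cond["TagValues"]
        for cond in expression
    )


analyst_expression = [
    {"TagKey": "data_classification", "TagValues": ["public", "internal"]},
    {"TagKey": "domain", "TagValues": ["sales", "marketing"]},
]

customer_orders = {"data_classification": "confidential", "domain": "sales"}
web_sessions = {"data_classification": "internal", "domain": "marketing"}
```

Under these rules the DataAnalystRole grant covers `web_sessions` but not `customer_orders`, because `confidential` is not among the classification values in the analyst expression.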

Column-Level Security — Restricting Which Columns Are Visible

Column-level security lets you grant SELECT on a table while hiding specific columns. This is the standard pattern for PII protection: data engineers and senior analysts get all columns, while BI tool service accounts and external analysts only see non-sensitive fields. Lake Formation enforces this at the metadata layer — excluded columns do not appear in DESCRIBE TABLE output and any Athena query referencing them fails with a column-not-found error.

# Grant SELECT with excluded PII columns — BI tool service account
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/BiToolServiceRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "customer_orders",
            "ColumnWildcard": {
                # All columns EXCEPT these are visible to the principal
                "ExcludedColumnNames": [
                    "customer_email",
                    "customer_phone",
                    "billing_address",
                    "credit_card_last4",
                    "ssn_hash",
                ]
            },
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)

# Grant full table access to data engineers.
# Column-level resources (TableWithColumns) accept only SELECT, so the
# write permissions are granted on the Table resource instead.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataEngineerRole"
    },
    Resource={
        "Table": {
            "DatabaseName": "sales_db",
            "Name": "customer_orders",
        }
    },
    Permissions=["SELECT", "INSERT", "DELETE", "ALTER"],
    PermissionsWithGrantOption=[],
)

# Alternatively — grant access to a specific list of allowed columns
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/ExternalPartnerRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "customer_orders",
            # Explicit allowlist — only these columns are visible
            "ColumnNames": [
                "order_id",
                "order_date",
                "product_id",
                "quantity",
                "amount",
                "region",
            ],
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)
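To reason about what each role ends up seeing, the two grant shapes above (exclusion list versus allowlist) can be modelled locally. This is an illustrative sketch only: `visible_columns` is a hypothetical helper and the `SCHEMA` list is an assumed column order for `customer_orders`.

```python
SCHEMA = [
    "order_id", "order_date", "product_id", "quantity", "amount", "region",
    "customer_email", "customer_phone", "billing_address",
    "credit_card_last4", "ssn_hash",
]


def visible_columns(schema, table_with_columns):
    """Return the columns a principal sees for a TableWithColumns grant."""
    if "ColumnNames" in table_with_columns:
        # Explicit allowlist: only the named columns are visible.
        allowed = set(table_with_columns["ColumnNames"])
        return [c for c in schema if c in allowed]
    # Wildcard: everything except the excluded columns is visible.
    wildcard = table_with_columns.get("ColumnWildcard", {})
    excluded = set(wildcard.get("ExcludedColumnNames", []))
    return [c for c in schema if c not in excluded]


bi_tool_grant = {
    "ColumnWildcard": {
        "ExcludedColumnNames": [
            "customer_email", "customer_phone", "billing_address",
            "credit_card_last4", "ssn_hash",
        ]
    }
}
partner_grant = {"ColumnNames": ["order_id", "order_date", "amount"]}
```

One design consequence is worth noting: with an exclusion list, a column added to the table later is visible by default, whereas an allowlist hides new columns until they are explicitly added.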

Row-Level Security — Data Filters and SQL Expressions

Data filters apply SQL WHERE expressions transparently to every query a principal runs against a table. The analyst writes a normal SELECT * FROM customer_orders — Lake Formation injects the filter before returning credentials to Athena. The analyst never sees the filter expression and cannot bypass it by reformulating the query. Apache Iceberg tables registered in the Glue Data Catalog work with Lake Formation data filters — row-level filtering operates on the Iceberg metadata layer, not the Parquet files directly, which preserves Iceberg partition pruning for performance.

# Create a row-level data filter — US region only
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",
        "DatabaseName": "sales_db",
        "TableName": "customer_orders",
        "Name": "us_region_only",
        "RowFilter": {
            # Standard SQL WHERE expression — injected into every query
            "FilterExpression": "region = 'US'",
        },
        # Also restrict to non-PII columns through the same filter
        "ColumnWildcard": {
            "ExcludedColumnNames": ["customer_email", "customer_phone"],
        },
    }
)

# Create filter for active orders — hide cancelled/draft rows
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",
        "DatabaseName": "sales_db",
        "TableName": "customer_orders",
        "Name": "active_orders_only",
        "RowFilter": {
            "FilterExpression": "status IN ('completed', 'processing', 'shipped')",
        },
        "ColumnWildcard": {},  # All columns through this filter
    }
)

# Multi-tenant filter: each tenant only sees their own data
# Filter expressions are static, so create one filter per tenant
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",
        "DatabaseName": "saas_db",
        "TableName": "events",
        "Name": "tenant_acme_only",
        "RowFilter": {
            "FilterExpression": "tenant_id = 'acme'",
        },
        "ColumnWildcard": {},
    }
)

# Grant the data filter to a principal
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/USRegionAnalystRole"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "123456789012",
            "DatabaseName": "sales_db",
            "TableName": "customer_orders",
            "Name": "us_region_only",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)
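For the multi-tenant case, one filter per tenant is usually generated rather than written by hand. The sketch below builds the `TableData` payload for a tenant and validates the tenant ID before embedding it in the expression. The `tenant_filter` helper, the allowed character set, and the `tenant_<id>_only` naming convention are assumptions for illustration.

```python
import re

# Conservative allowlist so a tenant ID can never break out of the SQL literal.
_TENANT_ID = re.compile(r"[A-Za-z0-9_]+")


def tenant_filter(catalog_id, database, table, tenant_id):
    """Build the TableData payload for a per-tenant data cells filter."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"unsupported tenant id: {tenant_id!r}")
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"tenant_{tenant_id}_only",
        "RowFilter": {"FilterExpression": f"tenant_id = '{tenant_id}'"},
        "ColumnWildcard": {},  # all columns, rows restricted to one tenant
    }


def create_tenant_filters(tenant_ids, catalog_id="123456789012",
                          database="saas_db", table="events",
                          region="us-east-1"):
    """Create one filter per tenant (requires AWS credentials)."""
    import boto3  # lazy import keeps tenant_filter usable offline

    lf = boto3.client("lakeformation", region_name=region)
    for tenant_id in tenant_ids:
        lf.create_data_cells_filter(
            TableData=tenant_filter(catalog_id, database, table, tenant_id)
        )
```

Each generated filter still needs its own `grant_permissions` call on the `DataCellsFilter` resource for the tenant's role, following the grant pattern shown above.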

Note

Align row filter expressions with table partition keys to avoid full table scans. A filter on region = 'US' benefits from partition pruning only if region is a partition column — Athena can skip non-matching partitions at the metadata level. A filter on a non-partitioned column causes Athena to scan all partitions and then discard non-matching rows at the query engine level, adding scan cost and latency.
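A rough way to check alignment before creating a filter is to compare the columns a filter expression references with the table's partition keys (available from the Glue `get_table` response under `PartitionKeys`). The helper below is a naive sketch: it tokenizes simple expressions with regular expressions, strips string literals, and ignores a small assumed keyword list, so it is not a SQL parser.

```python
import re

_KEYWORDS = {"and", "or", "not", "in", "is", "null", "like", "between", "true", "false"}


def referenced_columns(filter_expression):
    """Extract identifier-like tokens from a simple filter expression."""
    without_literals = re.sub(r"'[^']*'", "", filter_expression)
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_literals)
    return {t.lower() for t in tokens if t.lower() not in _KEYWORDS}


def unaligned_columns(filter_expression, partition_keys):
    """Columns used by the filter that are NOT partition keys."""
    return referenced_columns(filter_expression) - {k.lower() for k in partition_keys}
```

An empty result suggests the filter can be satisfied by partition pruning alone. Any returned column is one whose predicate the engine has to evaluate after reading the data.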

Column Masking — Protecting PII Without Removing Columns

Sometimes downstream consumers need to see that a column exists and has data, but the raw values must be protected. A common example is providing a masked email for display purposes while hiding the actual address, or computing aggregates on a salary column without exposing individual values. Lake Formation permissions hide columns and rows but do not transform values, so masking is layered on top. The patterns below use an Athena view with three masking styles (hash, partial, null) and a data filter that removes the most sensitive columns outright.

-- Pattern 1: Athena view-based masking — hash the email, null the SSN
-- Create a masking view on top of the raw table

CREATE OR REPLACE VIEW sales_db.customer_orders_masked AS
SELECT
    order_id,
    customer_id,
    order_date,
    amount,
    region,
    product_id,
    status,
    -- MASK_HASH: replace with deterministic hash (join-able but not reversible)
    TO_HEX(SHA256(TO_UTF8(customer_email)))       AS customer_email,
    -- MASK_PARTIAL: show last 4 digits, mask the rest (separators such as '-' are kept)
    REGEXP_REPLACE(customer_phone, '\d(?=(\D*\d){4})', 'X') AS customer_phone,
    -- MASK_NULL: replace sensitive field with NULL entirely
    NULL                                           AS billing_address,
    NULL                                           AS ssn_hash
FROM sales_db.customer_orders;

-- Caveat: a standard Athena view runs with the caller's permissions, so a
-- principal needs Lake Formation access to every column the view reads,
-- including the raw PII columns. To keep raw values hidden, materialize the
-- masked output as its own table (CTAS or a Glue ETL job) and grant
-- SELECT on that table instead of on sales_db.customer_orders.

# Pattern 2: Data filter with column subset + Athena view for masking logic
# Lake Formation handles row/column restriction; the view handles value masking

# Step 1 — create a filter that exposes only non-PII columns from the raw table
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",
        "DatabaseName": "sales_db",
        "TableName": "customers",
        "Name": "pii_columns_excluded",
        "RowFilter": {"AllRowsWildcard": {}},
        "ColumnWildcard": {
            "ExcludedColumnNames": ["ssn", "credit_card_number", "date_of_birth"],
        },
    }
)

# Step 2 — grant the filter to analyst role
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "123456789012",
            "DatabaseName": "sales_db",
            "TableName": "customers",
            "Name": "pii_columns_excluded",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)

# Step 3 — for the email column (keep visible but mask value),
# create a separate masked_email Glue column via a Glue catalog update
# and use a Glue ETL job or Athena view to populate it
# The analyst queries the view; raw table access requires DataEngineerRole
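If the masking runs in a Glue ETL job or another Python process rather than in a view, the same three styles (hash, partial, null) can be expressed as plain functions. This is a sketch under assumptions: the function names are hypothetical, and the hash output is not guaranteed to be byte-identical to what the SQL view produces (letter case of the hex string may differ, for example).

```python
import hashlib
import re


def mask_hash(value: str) -> str:
    """Deterministic SHA-256 hex digest: joinable across tables, not reversible."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_partial(value: str, keep: int = 4) -> str:
    """Replace every digit with 'X' except the last `keep` digits.

    The lookahead counts digits only, so separators such as '-' are preserved.
    """
    pattern = r"\d(?=(?:\D*\d){%d})" % keep
    return re.sub(pattern, "X", value)


def mask_null(_value):
    """Drop the value entirely."""
    return None


def mask_row(row: dict) -> dict:
    """Apply the three masking styles to one record."""
    masked = dict(row)
    masked["customer_email"] = mask_hash(row["customer_email"])
    masked["customer_phone"] = mask_partial(row["customer_phone"])
    masked["billing_address"] = mask_null(row["billing_address"])
    return masked
```

For example, `mask_partial("555-123-4567")` yields `XXX-XXX-4567`. Keep in mind that an unsalted hash of a low-entropy value such as an email address can be matched against a list of known addresses, so treat hashing as pseudonymization rather than anonymization.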

Cross-Account Data Sharing — RAM and Lake Formation Combined

Cross-account Lake Formation sharing requires two steps that are easy to miss: sharing the Glue Catalog database via AWS Resource Access Manager (RAM) so the consumer account can see the catalog entry, and granting Lake Formation permissions to the consumer account ID so it can actually query data. Without the RAM share, the consumer cannot find the database at all. Without the Lake Formation grant, the database is visible but every table query fails with AccessDeniedException.

# boto3 — Producer account: share a database cross-account

import boto3

lf_producer = boto3.client("lakeformation", region_name="us-east-1")
ram_producer = boto3.client("ram", region_name="us-east-1")

CONSUMER_ACCOUNT_ID = "987654321098"
PRODUCER_ACCOUNT_ID = "123456789012"

# Step 1: Grant Lake Formation permissions to the consumer account ID
lf_producer.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_ID},
    Resource={"Database": {"Name": "analytics_db"}},
    Permissions=["DESCRIBE"],
    PermissionsWithGrantOption=["DESCRIBE"],
)

lf_producer.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_ID},
    Resource={
        "Table": {
            "DatabaseName": "analytics_db",
            "TableWildcard": {},  # All current and future tables
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)

# Step 2: Share the Glue Catalog database resource via RAM
ram_producer.create_resource_share(
    name="analytics-db-consumer-share",
    resourceArns=[
        f"arn:aws:glue:us-east-1:{PRODUCER_ACCOUNT_ID}:database/analytics_db"
    ],
    principals=[CONSUMER_ACCOUNT_ID],
    allowExternalPrincipals=False,  # Keep within AWS Organization
)

# boto3 — Consumer account: accept invitation and query

import boto3

ram_consumer = boto3.client("ram", region_name="us-east-1")
glue_consumer = boto3.client("glue", region_name="us-east-1")
lf_consumer = boto3.client("lakeformation", region_name="us-east-1")

# Step 3: Accept the pending RAM invitation sent by the producer account
# (filtering on the sender avoids accepting unrelated shares)
invitations = ram_consumer.get_resource_share_invitations()[
    "resourceShareInvitations"
]
for invite in invitations:
    if invite["status"] == "PENDING" and invite["senderAccountId"] == "123456789012":
        ram_consumer.accept_resource_share_invitation(
            resourceShareInvitationArn=invite["resourceShareInvitationArn"]
        )

# Step 4: Create a local Glue database pointing at the producer catalog
glue_consumer.create_database(
    DatabaseInput={
        "Name": "shared_analytics_db",
        "TargetDatabase": {
            "CatalogId": "123456789012",  # Producer account
            "DatabaseName": "analytics_db",
        },
    }
)

# Step 5: Grant the shared data to local IAM roles in consumer account
lf_consumer.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::987654321098:role/ConsumerAnalystRole"
    },
    Resource={
        "Table": {
            "CatalogId": "123456789012",  # Producer catalog
            "DatabaseName": "analytics_db",
            "TableWildcard": {},
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=[],
)

# Consumer analyst can now query in Athena:
# SELECT * FROM shared_analytics_db.revenue_summary WHERE quarter = '2026-Q1'

Integration with Athena, Redshift Spectrum, and Glue ETL

Lake Formation permissions are enforced transparently by all three query engines. Athena requires no special configuration — it calls Lake Formation automatically for any table in a registered location. Glue ETL jobs need an explicit SUPER or table-level grant to the Glue service role. Redshift Spectrum requires an external schema pointing at the Glue catalog with create external schema ... from data catalog and the Redshift IAM role must be granted Lake Formation permissions.

-- Athena — Lake Formation enforcement is automatic
-- Query engines do NOT need special configuration

-- This SELECT respects column exclusions and row filters automatically
SELECT
    order_id,
    customer_id,
    amount,
    region,
    order_date
FROM sales_db.customer_orders
WHERE order_date >= DATE '2026-01-01'
  AND status = 'completed'
ORDER BY amount DESC
LIMIT 1000;

-- Attempting to access excluded columns fails with:
-- COLUMN_NOT_FOUND: Column 'customer_email' cannot be resolved
-- SELECT customer_email FROM sales_db.customer_orders;

-- Verify which columns are visible to the current role
DESCRIBE sales_db.customer_orders;
-- Hidden columns do not appear in DESCRIBE output

# Grant Glue ETL job role access — required separately from Athena access

# The Glue service role needs Lake Formation table permissions
# It does NOT inherit permissions from the account admin

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueETLRole"
    },
    Resource={
        "Table": {
            "DatabaseName": "sales_db",
            "Name": "customer_orders",
        }
    },
    Permissions=["SELECT", "INSERT", "DELETE"],
    PermissionsWithGrantOption=[],
)

# Also grant the Glue role access to the output database/table
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueETLRole"
    },
    Resource={
        "Database": {"Name": "curated_db"},
    },
    Permissions=["CREATE_TABLE", "DESCRIBE"],
    PermissionsWithGrantOption=[],
)

-- Redshift Spectrum — create external schema backed by Glue/Lake Formation

-- The Redshift cluster's IAM role must have Lake Formation SELECT grants
-- plus IAM permissions for Glue catalog reads and lakeformation:GetDataAccess
-- (prefer a least-privilege policy over broad managed policies)

CREATE EXTERNAL SCHEMA lakeformation_sales
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
REGION 'us-east-1';

-- After creation, query the schema normally
-- Lake Formation enforces column and row filters identically to Athena
SELECT
    product_id,
    SUM(amount) AS total_revenue,
    COUNT(*)    AS order_count
FROM lakeformation_sales.customer_orders
WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
GROUP BY product_id
ORDER BY total_revenue DESC;

Audit Logging — CloudTrail and CloudWatch Logs Insights

Lake Formation logs permission grants, revocations, and credential-vending calls to CloudTrail. Integrated services such as Athena and Glue appear as GetDataAccess events, while third-party engines appear as GetTemporaryGlueTableCredentials. Enabling CloudTrail data events for the Lake Formation API gives you a per-query audit trail of which principal accessed which table and columns, which is the kind of evidence auditors commonly request in SOC 2 and GDPR reviews.

# CloudWatch Logs Insights — Lake Formation access audit queries

# Find all Lake Formation AccessDeniedException errors in the last 7 days
# Log group: CloudTrail log group for the account

fields @timestamp, userIdentity.sessionContext.sessionIssuer.arn,
       requestParameters, errorCode, errorMessage
| filter eventSource = "lakeformation.amazonaws.com"
| filter errorCode = "AccessDeniedException"
| sort @timestamp desc
| limit 200

# Track table access frequency per principal (who queries what)
fields @timestamp, userIdentity.sessionContext.sessionIssuer.arn,
       requestParameters.tableArn
| filter eventSource = "lakeformation.amazonaws.com"
| filter eventName in ["GetDataAccess", "GetTemporaryGlueTableCredentials"]
| stats count() as access_count by
    userIdentity.sessionContext.sessionIssuer.arn,
    requestParameters.tableArn
| sort access_count desc

# Alert on cross-account permission grants — potential data exfiltration
fields @timestamp, userIdentity.arn, requestParameters
| filter eventSource = "lakeformation.amazonaws.com"
| filter eventName in ["GrantPermissions", "BatchGrantPermissions"]
| filter requestParameters.principal.dataLakePrincipalIdentifier not like "arn:aws:iam::123456789012"
| sort @timestamp desc

# Monitor LF-Tag assignment changes — catch unauthorized tag modifications
fields @timestamp, userIdentity.arn, eventName,
       requestParameters.lFTags, requestParameters.resource
| filter eventSource = "lakeformation.amazonaws.com"
| filter eventName in ["AddLFTagsToResource", "RemoveLFTagsFromResource",
                        "CreateLFTag", "DeleteLFTag", "UpdateLFTag"]
| sort @timestamp desc
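
These queries can also be run on a schedule instead of from the console. The sketch below submits the denied-access query through the CloudWatch Logs API and polls until it finishes. The function names, the one-second polling interval, and the log group argument are illustrative assumptions; the log group must be the one your CloudTrail trail delivers to.

```python
import time

DENIED_ACCESS_QUERY = """fields @timestamp, userIdentity.sessionContext.sessionIssuer.arn,
       requestParameters, errorCode, errorMessage
| filter eventSource = "lakeformation.amazonaws.com"
| filter errorCode = "AccessDeniedException"
| sort @timestamp desc
| limit 200"""


def query_window(days, now=None):
    """Return (start, end) epoch seconds covering the last `days` days."""
    end = int(time.time() if now is None else now)
    return end - days * 86400, end


def run_denied_access_report(log_group, days=7, region="us-east-1"):
    """Run the query and return the final get_query_results response."""
    import boto3  # lazy import so query_window can be used without AWS access

    logs = boto3.client("logs", region_name=region)
    start, end = query_window(days)
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=DENIED_ACCESS_QUERY,
    )["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response  # Complete, Failed, Cancelled, or Timeout
        time.sleep(1)
```

The `results` list in the response holds one list of field/value pairs per matching event, which can be forwarded to a ticketing or alerting system.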

Production Checklist

1

Revoke the IAMAllowedPrincipals default grants on all existing Glue databases immediately after enabling Lake Formation. Until these are revoked, any IAM principal with glue:GetTable can bypass Lake Formation entirely. Run `aws lakeformation batch-revoke-permissions` targeting the IAMAllowedPrincipals group on each database and table, or toggle this in the Lake Formation console under 'Settings > Edit' before registering any S3 locations.

2

Use LF-Tags for all governance policies except one-off exceptions. Direct table grants require a new grant per table as the catalog grows — LF-Tag policies automatically cover future tables that match the expression without any additional permission operations. Build a classification taxonomy (public / internal / confidential / restricted) and a domain taxonomy before tagging any tables.

3

Set create_database_default_permissions and create_table_default_permissions to empty arrays in aws_lakeformation_data_lake_settings. The AWS default grants new databases and tables to IAMAllowedPrincipals, re-creating the bypass vulnerability every time a new resource is created. Strict governance mode requires explicit grants for all resources.

4

Align row filter expressions with partition keys to avoid unnecessary full-table scans. A data filter on region='US' only prunes Athena partitions when region is a partition column. For unpartitioned tables, row filters incur a full scan cost even when matching rows are sparse — consider partitioning heavily-filtered tables by the most common filter predicate.

5

Never grant PermissionsWithGrantOption to service accounts or Glue ETL roles. Grant option means the principal can grant that permission to other principals — a compromised ETL role could escalate privileges to any principal it chooses. Restrict grant capability to human admin roles managed through IAM Identity Center.

6

Register S3 locations with a dedicated Lake Formation service role scoped to exactly the S3 prefixes needed. A service role with s3:GetObject on `arn:aws:s3:::*/*` would allow Lake Formation to generate credentials for any S3 path — use explicit bucket and prefix ARNs. Create one service role per data lake bucket with no cross-bucket access.

7

Test Glue ETL job permissions separately from Athena permissions under a non-admin test role. Glue jobs do not inherit from account-level admins — they need explicit Lake Formation grants to their service role. Permission errors in Glue ETL manifest as EntityNotFoundException on Glue catalog operations, not AccessDeniedException, which makes debugging non-obvious.

8

Use Terraform's aws_lakeformation_permissions resource with the principal-resource pair as the key. The resource manages the complete permission set for that pair — applying Terraform removes any permissions that exist in AWS but are absent from the config, making Lake Formation permissions fully declarative and auditable in version control.

9

Cross-account sharing requires both the RAM resource share acceptance AND the Lake Formation permission grant to the consumer account ID. The RAM share makes the Glue catalog database discoverable in the consumer account; the Lake Formation grant is what actually allows querying. One without the other produces 'database not found' or AccessDeniedException respectively — verify both are in place when debugging cross-account access.

10

Enable CloudTrail data events for lakeformation.amazonaws.com and store logs in a dedicated security S3 bucket with Object Lock enabled for tamper-proof audit retention. Without CloudTrail data events, you cannot audit which columns an Athena query accessed — the standard management events only log permission grants and revocations, not the query-level credential issuance that records column access.

Your S3 data lake relies entirely on IAM bucket policies so analysts see more columns than they should, you cannot enforce row-level access without maintaining separate tables per customer segment, or Lake Formation permissions are in place but queries still bypass them?

We design and implement Lake Formation governance platforms — from IAMAllowedPrincipals revocation and S3 location registration with dedicated service roles, through LF-Tag taxonomy design with classification and domain keys, LFTagPolicy grants eliminating per-table permission management as the catalog grows, column-level security with ColumnWildcard.ExcludedColumnNames for BI tool and analyst roles, row-level data filter design with partition-aligned SQL expressions for multi-tenant and regional access patterns, Terraform aws_lakeformation_permissions state management for version-controlled governance-as-code, cross-account RAM plus Lake Formation dual-grant wiring, Glue ETL job permission configuration separate from Athena access, and CloudTrail data events integration with CloudWatch Logs Insights for column-level access auditing and unauthorized grant detection. Let’s talk.


DataSOps Consulting
