Terraform · IaC · DevOps · Cloud Infrastructure · State Management

Terraform at Scale — Modules, State Management, and Drift Detection

A practical guide to running Terraform at scale: reusable module architecture with versioned registries, remote state backends with S3 and DynamoDB, state file granularity to reduce blast radius, drift detection in CI, Terragrunt for DRY configurations, and Atlantis for pull-request-driven apply workflows.

2026-04-22

Why Vanilla Terraform Breaks at Scale

A single terraform apply against a monolithic repository works fine for one team managing one environment. Add three teams, four environments, and fifty services and the cracks appear fast: blast radius from a single misconfiguration spans everything, state lock contention serializes work, and no one knows whether what Terraform thinks is deployed matches what is actually running.

The solutions are well-understood but require deliberate architectural choices: module versioning to decouple infrastructure building blocks from the teams that consume them, state file granularity to bound the blast radius of any single apply, drift detection to catch out-of-band changes before they cause incidents, and workflow automation to remove the footgun of manual applies from production.

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| One broken resource blocks all deploys | Monolithic state file | Split state by service/component |
| Teams step on each other's plans | Shared state lock | Separate state per team boundary |
| Copy-pasted module code across envs | No versioned modules | Publish modules to a registry |
| Production differs from what Terraform knows | Manual console changes | Drift detection in CI on a schedule |
| Repeated backend config in every module | No DRY tool | Terragrunt generate blocks |

Module Architecture: Root vs Child vs Published

Terraform modules exist at three levels of abstraction. Confusing these levels is the most common architectural mistake in large Terraform codebases.

Root modules

The entry point for terraform apply. Each root module manages one bounded context — a single service, a platform component, or a shared networking layer. Root modules call child and published modules but contain minimal logic of their own. They define the what (which modules, which versions, which inputs) rather than the how.

Child modules (local)

Internal modules in the same repository, referenced via relative path. Use for tightly coupled resources that are always deployed together (e.g., an ECS service alongside its IAM role and CloudWatch log group). Avoid reusing local modules across root module boundaries — that coupling creates hidden dependencies that break when either side changes.
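A local child module call is just a relative-path source — a minimal sketch (paths and input names illustrative):

```hcl
# live/prod/services/api/main.tf — tightly coupled resources via a local child module
module "log_group" {
  source = "./modules/log-group"   # relative path, same repo — no version pin possible

  name              = "/ecs/api"
  retention_in_days = 30
}
```

Note that local modules cannot be version-pinned — another reason to keep them inside a single root module boundary rather than sharing them across teams.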

Published modules (versioned)

Modules published to the Terraform Registry or a private registry (Terraform Cloud, Spacelift, or a Git tag-based convention). These are the reusable building blocks shared across teams. Pin to exact versions in all root modules — source = "terraform-aws-modules/vpc/aws" without a version constraint is a production incident waiting to happen.
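For registry-sourced modules, the pin goes in the version argument — for example (the constraint value here is illustrative):

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1"   # exact pin; upgrades are deliberate PRs, never implicit

  name = "prod"
  cidr = "10.0.0.0/16"
}
```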

A well-structured Terraform monorepo for a platform team might look like this:

infrastructure/
├── modules/                     # published/shared modules (versioned via git tags)
│   ├── aws-ecs-service/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf          # required_providers pinned here
│   ├── aws-rds-postgres/
│   └── aws-alb-https/
│
├── live/                        # root modules — one per environment × component
│   ├── prod/
│   │   ├── networking/          # root module: VPC, subnets, NAT
│   │   │   ├── main.tf
│   │   │   ├── terragrunt.hcl   # remote state config injected by terragrunt
│   │   │   └── variables.tf
│   │   ├── platform/            # root module: ECS cluster, ALB, ECR
│   │   └── services/
│   │       ├── api/             # root module: one ECS service + RDS
│   │       └── worker/
│   ├── staging/
│   │   └── ...                  # mirrors prod structure, different var values
│   └── dev/
│       └── ...
│
├── terragrunt.hcl               # root config: remote state backend template
└── common.hcl                   # shared inputs: account_id, region, tags

Module versioning uses Git tags. Every breaking change to a shared module increments the major version. Consumers pin to a version constraint and opt into upgrades deliberately:

# live/prod/services/api/main.tf

terraform {
  required_version = ">= 1.7"
}

module "ecs_service" {
  source  = "git::https://github.com/your-org/terraform-modules.git//aws-ecs-service?ref=v3.2.0"

  name           = "api"
  cluster_arn    = data.terraform_remote_state.platform.outputs.ecs_cluster_arn
  image          = var.container_image
  cpu            = 512
  memory         = 1024
  desired_count  = var.desired_count

  environment_variables = {
    APP_ENV    = var.env
    LOG_LEVEL  = "info"
  }

  # Secret values stay in SSM — the module passes parameter names, and ECS
  # injects the values at runtime, so they never land in state as plaintext
  secrets = {
    DATABASE_URL = "/prod/api/database_url"
    JWT_SECRET   = "/prod/api/jwt_secret"
  }
}

module "rds" {
  source  = "git::https://github.com/your-org/terraform-modules.git//aws-rds-postgres?ref=v2.1.0"

  identifier      = "api-prod"
  instance_class  = "db.t4g.medium"
  engine_version  = "16.2"
  multi_az        = true
  deletion_protection = true
}

Note

Never use ?ref=main or ?ref=HEAD as a module source in production root modules. Branch references are mutable — the same configuration can produce different infrastructure on different days if a module author pushes changes to main. Always pin to an immutable tag or commit SHA.

Remote State Backends: S3 + DynamoDB

Local state is fine for personal projects. At any team scale you need remote state: a shared, durable store with locking so concurrent plans cannot corrupt the state file. The AWS standard is S3 for storage (versioned, encrypted) and DynamoDB for locking (a lock record is written at the start of each plan/apply and deleted on completion).

# Bootstrap: create the S3 bucket and DynamoDB table once, manually or with a separate "bootstrap" root module.
# These resources must exist before any other Terraform can run.

# bootstrap/main.tf
resource "aws_s3_bucket" "tfstate" {
  bucket = "your-org-terraform-state"

  lifecycle {
    prevent_destroy = true   # never let terraform destroy the state bucket
  }
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tfstate_lock" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true
  }
}

With the bootstrap resources in place, each root module declares its backend. The key path encodes the full hierarchy — environment, component, and service — so every root module gets a unique, human-readable state file location:

# live/prod/services/api/backend.tf
# (or generated automatically by Terragrunt — see below)

terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "prod/services/api/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"

    # For cross-account setups: assume a role in the management account
    # role_arn = "arn:aws:iam::123456789012:role/TerraformStateAccess"
  }
}

State Granularity: Splitting to Reduce Blast Radius

The single most impactful scaling decision in Terraform is choosing the right state granularity. Too coarse: a misconfigured security group in a 500-resource state file triggers a plan review that touches your VPC, your databases, and your entire ECS fleet simultaneously. Too fine: you have a separate root module for each tiny IAM policy and the overhead of cross-state references becomes unmanageable.

The practical rule: split state at natural change-rate and ownership boundaries. Networking changes rarely — a separate root module for VPC, subnets, and NAT. Platform infrastructure changes occasionally — ECS cluster, ALB, and ECR in one state. Application services change frequently — one root module per service.

1. Networking layer

VPC, subnets, route tables, NAT gateways, VPC peering, Transit Gateway attachments. Changes once per quarter. Blast radius: everything in the account if you get this wrong. Keep it isolated and gate changes tightly.

2. Platform layer

ECS/EKS cluster, ALB, ECR, shared IAM roles, service discovery namespaces, KMS keys. Changes when platform teams roll out new capabilities. Services depend on outputs from this layer via terraform_remote_state or SSM parameters.

3. Service layer

One root module per application service. Contains the ECS task definition, service, target group, listener rules, CloudWatch alarms, and service-specific IAM role. Changes on every deploy. Fast plans because the state is small (<50 resources).

4. Data layer

RDS instances, ElastiCache clusters, S3 buckets for application data, DynamoDB tables. Managed separately because data resources have the highest blast radius if destroyed. prevent_destroy = true lifecycle guards on all stateful resources.

Cross-layer references use terraform_remote_state data sources (or SSM Parameter Store for looser coupling). The networking layer publishes subnet IDs and VPC ID as outputs; the platform layer reads them without owning the networking resources:

# live/prod/platform/main.tf — read networking outputs without coupling state

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "your-org-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "eu-west-1"
  }
}

resource "aws_ecs_cluster" "main" {
  name = "prod"
}

resource "aws_lb" "main" {
  name               = "prod-alb"
  internal           = false
  load_balancer_type = "application"

  # subnet IDs come from the networking state — platform team doesn't manage them
  subnets = data.terraform_remote_state.networking.outputs.public_subnet_ids

  security_groups = [aws_security_group.alb.id]
}

# Publish outputs for service layer to consume
output "ecs_cluster_arn" {
  value = aws_ecs_cluster.main.arn
}

output "alb_listener_arn" {
  value = aws_lb_listener.https.arn
}

Note

Prefer SSM Parameter Store over terraform_remote_state for cross-team references when the teams own separate repositories. Remote state gives the consumer read access to the entire state file — including any sensitive outputs. SSM lets you expose exactly what you want to share, with IAM-controlled read access per parameter.
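A sketch of the SSM pattern (parameter names are illustrative):

```hcl
# Producer (platform root module): publish exactly one value
resource "aws_ssm_parameter" "ecs_cluster_arn" {
  name  = "/prod/platform/ecs_cluster_arn"
  type  = "String"
  value = aws_ecs_cluster.main.arn
}

# Consumer (service root module): read it with a narrowly scoped IAM policy,
# with no access to the platform team's state file
data "aws_ssm_parameter" "ecs_cluster_arn" {
  name = "/prod/platform/ecs_cluster_arn"
}
```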

Drift Detection: Catching Out-of-Band Changes

Drift is the gap between what Terraform's state file believes is running and what is actually running in your cloud account. It accumulates from emergency console fixes, automated AWS services modifying resources, and manual CLI operations that bypass the IaC workflow. Left undetected, drift causes "works in staging, breaks in production" failures — because staging was applied via Terraform but production was patched manually three months ago.

The detection mechanism is simple: terraform plan -detailed-exitcode exits with code 2 when there are pending changes, 0 when the state matches reality, and 1 on error. Run this on a schedule across all production root modules and alert when exit code 2 appears without a corresponding open pull request.

#!/usr/bin/env bash
# scripts/drift-check.sh — run terraform plan against all root modules and report drift
# Assumes: AWS credentials available via OIDC/instance role, terraform and terragrunt installed

set -euo pipefail

MODULES_DIR="live/prod"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"  # set as CI secret
DRIFT_FOUND=0

while read -r tg_file; do
  module_dir="$(dirname "${tg_file}")"
  module_name="${module_dir#live/prod/}"

  echo "--- Checking ${module_name} ---"

  # Run plan and capture terragrunt's exit code — PIPESTATUS, because $? after
  # a pipeline would give us tail's exit code instead
  set +e
  terragrunt plan \
    --terragrunt-working-dir "${module_dir}" \
    -detailed-exitcode \
    -refresh=true \
    2>&1 | tail -20
  EXIT_CODE="${PIPESTATUS[0]}"
  set -e

  case "${EXIT_CODE}" in
    0) echo "  OK: no drift in ${module_name}" ;;
    1) echo "  ERROR: plan failed for ${module_name}"
       DRIFT_FOUND=1 ;;
    2) echo "  DRIFT DETECTED: ${module_name} has pending changes"
       DRIFT_FOUND=1
       # Post to Slack (JSON quotes escaped inside the double-quoted payload)
       curl -s -X POST "${SLACK_WEBHOOK}" \
         -H "Content-Type: application/json" \
         -d "{\"text\": \":warning: Terraform drift detected in *${module_name}* — run \`terragrunt plan\` to review.\"}"
       ;;
  esac
done < <(find "${MODULES_DIR}" -name "terragrunt.hcl" | sort)
# Process substitution (not a pipe into while) keeps the loop in the current
# shell, so DRIFT_FOUND survives past "done".

exit "${DRIFT_FOUND}"

Schedule this script as a GitHub Actions workflow on a cron trigger (daily in non-peak hours). It runs read-only plan operations — no changes are applied — so it is safe to run against production with minimal IAM permissions (read-only + state read).

# .github/workflows/drift-detection.yml

name: Drift Detection

on:
  schedule:
    - cron: "0 6 * * 1-5"   # 6 AM UTC, weekdays
  workflow_dispatch:          # allow manual trigger

permissions:
  id-token: write             # required for OIDC
  contents: read

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DRIFT_ROLE_ARN }}
          aws-region: eu-west-1

      - name: Install Terraform and Terragrunt
        run: |
          curl -Lo terraform.zip https://releases.hashicorp.com/terraform/1.8.5/terraform_1.8.5_linux_amd64.zip
          unzip terraform.zip && sudo mv terraform /usr/local/bin/
          curl -Lo terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.67.0/terragrunt_linux_amd64
          chmod +x terragrunt && sudo mv terragrunt /usr/local/bin/

      - name: Run drift detection
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_DRIFT_WEBHOOK }}
        run: bash scripts/drift-check.sh

Note

Use OIDC authentication (the configure-aws-credentials action with role-to-assume) instead of long-lived AWS access keys in GitHub Actions. OIDC tokens are short-lived and scoped to the specific workflow run — no credentials to rotate, no secrets to leak from CI logs. The IAM role needs ReadOnlyAccess plus S3 and DynamoDB access for the state backend.
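The trust policy on the drift-detection role restricts which repository can assume it — a minimal sketch (the account ID and repo name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::333333333333:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:your-org/infrastructure:*"
        }
      }
    }
  ]
}
```

The sub condition is what prevents any other repository in your GitHub org (or anyone else's) from assuming the role.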

Terragrunt: DRY Infrastructure Across Environments

Terraform's backend configuration cannot use variables — the backend block is static HCL. This forces repetition: every root module must hardcode the S3 bucket name, DynamoDB table, and key prefix. Across 30 root modules in 3 environments that is 90 near-identical backend blocks to keep in sync.

Terragrunt solves this with generate blocks that write the backend configuration before terraform init, and include blocks that merge configuration from parent terragrunt.hcl files. The result: backend config defined once at the root, each module inherits it with a computed key based on its directory path.

# terragrunt.hcl — root config at repo root, inherited by all child modules

locals {
  # Parse account and environment from the directory path:
  # live/prod/services/api → env = prod
  path_parts  = split("/", path_relative_to_include())
  environment = local.path_parts[1]

  # Read shared config from a central file
  common = read_terragrunt_config(find_in_parent_folders("common.hcl"))

  account_id = local.common.locals.aws_account_ids[local.environment]
  region     = local.common.locals.aws_region
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "your-org-terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.region
    encrypt        = true
    dynamodb_table = "terraform-state-locks"

    # Assume a role in the target account for cross-account deployments
    role_arn = "arn:aws:iam::${local.account_id}:role/TerraformDeployRole"
  }
}

# Inject common tags into every module's provider
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region}"
  default_tags {
    tags = {
      Environment = "${local.environment}"
      ManagedBy   = "terraform"
      Repository  = "your-org/infrastructure"
    }
  }
}
EOF
}

# common.hcl — shared values referenced by root terragrunt.hcl

locals {
  aws_region = "eu-west-1"

  aws_account_ids = {
    dev     = "111111111111"
    staging = "222222222222"
    prod    = "333333333333"
  }
}

# live/prod/services/api/terragrunt.hcl — per-module config (minimal)

include "root" {
  path = find_in_parent_folders()   # finds repo-root terragrunt.hcl
}

# Module-specific inputs — these override or extend root defaults
inputs = {
  env           = "prod"
  desired_count = 4
  cpu           = 1024
  memory        = 2048
}

# Declare dependencies: terragrunt apply-all respects this ordering
dependency "platform" {
  config_path = "../../platform"
  mock_outputs = {
    ecs_cluster_arn  = "arn:aws:ecs:eu-west-1:333333333333:cluster/mock"
    alb_listener_arn = "arn:aws:elasticloadbalancing:eu-west-1:333333333333:listener/mock"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

With this setup, deploying the entire production environment in dependency order becomes a single command:

# Deploy all modules in prod in dependency order (Terragrunt resolves the DAG)
terragrunt run-all apply --terragrunt-working-dir live/prod

# Or plan first to review all changes:
terragrunt run-all plan --terragrunt-working-dir live/prod

# Deploy a specific module and its dependencies:
terragrunt apply --terragrunt-working-dir live/prod/services/api

CI/CD: Atlantis for Pull-Request-Driven Workflows

Manual terraform apply from a developer's laptop is the most common source of production incidents in IaC workflows. Local state can be stale, credentials may have excessive permissions, and there is no audit trail. The solution is to move all applies to CI, where plans are visible in pull requests and applies happen only after approval.

Atlantis is a self-hosted Terraform automation server that integrates directly with GitHub (and GitLab/Bitbucket). It listens for PR events, runs terraform plan automatically, posts the plan output as a PR comment, and runs terraform apply after an authorized team member comments atlantis apply. The full audit trail lives in the PR.
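The day-to-day interaction happens entirely in PR comments. The core commands (the -p flag targets a named project from atlantis.yaml):

```
atlantis plan                        # re-run plan for all changed projects
atlantis plan -p prod-services-api   # plan a single named project
atlantis apply -p prod-services-api  # apply once approval requirements are met
```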

# atlantis.yaml — repository-level config for Atlantis

version: 3
automerge: false        # never auto-merge PRs after apply
delete_source_branch_on_merge: false

projects:
  # Atlantis auto-discovers changed directories and runs plan for each
  # But for explicit control, list all root modules:

  - name: prod-networking
    dir: live/prod/networking
    workspace: default
    workflow: terragrunt   # server-side custom workflow that shells out to terragrunt
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.hcl"
        - "../../modules/aws-vpc/**"   # re-plan if shared module changes
      enabled: true
    apply_requirements:
      - approved                        # require ≥1 PR approval
      - mergeable                       # no merge conflicts
      - undiverged                      # PR branch must be up to date

  - name: prod-platform
    dir: live/prod/platform
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.hcl"
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: prod-services-api
    dir: live/prod/services/api
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.hcl"
        - "../../../modules/aws-ecs-service/**"
      enabled: true
    apply_requirements:
      - approved

Atlantis runs as a deployment in your Kubernetes cluster or as an EC2 instance. It needs an IAM role with sufficient permissions and a GitHub webhook token. Here is a minimal Kubernetes deployment:

# kubernetes/atlantis-deployment.yaml (simplified)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: atlantis
  namespace: platform
spec:
  replicas: 1    # Atlantis is stateful — do not run multiple replicas
  selector:
    matchLabels:
      app: atlantis
  template:
    metadata:
      labels:
        app: atlantis            # must match spec.selector.matchLabels
    spec:
      serviceAccountName: atlantis   # bound to IAM role via IRSA
      containers:
        - name: atlantis
          image: ghcr.io/runatlantis/atlantis:v0.28.4
          args:
            - server
            - --config=/etc/atlantis/config.yaml
          env:
            - name: ATLANTIS_GH_TOKEN
              valueFrom:
                secretKeyRef:
                  name: atlantis-secrets
                  key: github-token
            - name: ATLANTIS_GH_WEBHOOK_SECRET
              valueFrom:
                secretKeyRef:
                  name: atlantis-secrets
                  key: webhook-secret
            - name: ATLANTIS_REPO_ALLOWLIST
              value: "github.com/your-org/infrastructure"
          volumeMounts:
            - name: config
              mountPath: /etc/atlantis
      volumes:
        - name: config
          configMap:
            name: atlantis-config

Note

Run Atlantis with exactly one replica. It maintains a local working directory for each active plan/apply — multiple replicas running the same operation simultaneously will corrupt state. For high availability, use a PersistentVolumeClaim for the working directory and rely on Kubernetes pod restarts rather than horizontal scaling.

Production Checklist

Enable state file versioning and point-in-time recovery

S3 versioning on the state bucket means every apply creates a new version. When an apply corrupts state (it happens), you recover by fetching the previous version with aws s3api get-object --version-id <id> and restoring it. Enable MFA Delete on the bucket to prevent accidental permanent deletion. Back up state to a secondary region via S3 replication.
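Recovery might look like this (bucket, key, and version ID are placeholders — substitute your own):

```shell
# List versions of the state object to find the last known-good one
aws s3api list-object-versions \
  --bucket your-org-terraform-state \
  --prefix prod/services/api/terraform.tfstate

# Download that version locally
aws s3api get-object \
  --bucket your-org-terraform-state \
  --key prod/services/api/terraform.tfstate \
  --version-id "<known-good-version-id>" \
  restored.tfstate

# Push it back as the current state (run from the root module directory)
terraform state push restored.tfstate
```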

Use prevent_destroy on all stateful resources

Set lifecycle { prevent_destroy = true } on every RDS instance, S3 bucket, DynamoDB table, and EFS filesystem. This makes Terraform refuse to destroy these resources even if they are removed from configuration — protecting against accidental data loss from a terraform apply of an incomplete PR. You must explicitly remove the lifecycle block and apply a second time to actually destroy.

Store secrets in SSM or Secrets Manager — never in state

Terraform state files are JSON and contain all resource attributes, including sensitive ones. If you create a database password as a random_password resource, that password is in the state file in plaintext. Use aws_ssm_parameter or aws_secretsmanager_secret_version and reference them in application code — never output raw secrets from Terraform modules.
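The pattern from the module example earlier — passing parameter references, not values — looks roughly like this inside an ECS container definition (the ARN is a placeholder):

```hcl
# Container definition fragment: ECS resolves the secret at task launch,
# so the value never passes through Terraform state
secrets = [
  {
    name      = "DATABASE_URL"
    valueFrom = "arn:aws:ssm:eu-west-1:333333333333:parameter/prod/api/database_url"
  }
]
```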

Pin all provider and module versions

A required_providers block with exact version constraints in every module (not just root modules) ensures reproducible builds. The AWS provider regularly introduces breaking changes in major versions. Commit your .terraform.lock.hcl file — it pins provider checksums and makes init deterministic. Update providers deliberately through a PR, not accidentally during a terraform init -upgrade.
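A typical versions.tf in a shared module (constraint values illustrative):

```hcl
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.40, < 6.0"   # module-compatible range; root modules narrow further
    }
  }
}
```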

Enforce plan-before-apply in CI with approval gates

No one should run terraform apply directly in production. Route all applies through Atlantis or a GitHub Actions workflow where the plan output is visible to reviewers before apply runs. For production environments, require at least one approval from a senior engineer. Log every apply with the operator identity, PR link, and timestamp in a CloudTrail-compatible audit stream.

Run terraform validate and tflint in every PR

terraform validate catches syntax errors and type mismatches before plan runs. Pair it with TFLint for provider-aware linting (catches deprecated resource arguments, invalid instance types, missing required tags) and tfsec or Terrascan for security policy checks. These run in seconds and catch entire classes of misconfigurations before a plan is ever requested.
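A minimal PR check stage, assuming the tools are already installed in CI:

```shell
terraform fmt -check -recursive   # formatting drift fails the build
terraform init -backend=false     # init providers without touching remote state
terraform validate                # syntax and type errors
tflint --recursive                # provider-aware lint rules
tfsec .                           # security policy checks
```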

Managing cloud infrastructure with Terraform and hitting scaling walls?

We design and implement production-grade IaC architectures — from module versioning and state management to drift detection pipelines, Terragrunt setups, and Atlantis automation. Let’s talk.
