Why Vanilla Terraform Breaks at Scale
A single terraform apply against a monolithic repository works fine for one team managing one environment. Add three teams, four environments, and fifty services and the cracks appear fast: blast radius from a single misconfiguration spans everything, state lock contention serializes work, and no one knows whether what Terraform thinks is deployed matches what is actually running.
The solutions are well-understood but require deliberate architectural choices: module versioning to decouple infrastructure building blocks from the teams that consume them, state file granularity to bound the blast radius of any single apply, drift detection to catch out-of-band changes before they cause incidents, and workflow automation to remove the footgun of manual applies from production.
| Symptom | Root Cause | Fix |
|---|---|---|
| One broken resource blocks all deploys | Monolithic state file | Split state by service/component |
| Teams step on each other's plans | Shared state lock | Separate state per team boundary |
| Copy-pasted module code across envs | No versioned modules | Publish modules to a registry |
| Production differs from what Terraform knows | Manual console changes | Drift detection in CI on a schedule |
| Repeated backend config in every module | No DRY tool | Terragrunt generate blocks |
Module Architecture: Root vs Child vs Published
Terraform modules exist at three levels of abstraction. Confusing these levels is the most common architectural mistake in large Terraform codebases.
Root modules
The entry point for terraform apply. Each root module manages one bounded context — a single service, a platform component, or a shared networking layer. Root modules call child and published modules but contain minimal logic of their own. They define the what (which modules, which versions, which inputs) rather than the how.
Child modules (local)
Internal modules in the same repository, referenced via relative path. Use for tightly coupled resources that are always deployed together (e.g., an ECS service alongside its IAM role and CloudWatch log group). Avoid reusing local modules across root module boundaries — that coupling creates hidden dependencies that break when either side changes.
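A local child module is consumed with a plain relative path — a minimal sketch (module name and inputs are illustrative):

```hcl
# Inside the same repository as the root module
module "service_logs" {
  source = "./modules/cloudwatch-log-group" # relative path: no version pin possible

  name              = "/ecs/api"
  retention_in_days = 30
}
```

Because there is no version pin, any change to a local module shows up in the next plan of every root module that references it — which is exactly why local modules should stay within one root module boundary.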
Published modules (versioned)
Modules published to the Terraform Registry or a private registry (Terraform Cloud, Spacelift, or a Git tag-based convention). These are the reusable building blocks shared across teams. Pin to exact versions in all root modules — source = "terraform-aws-modules/vpc/aws" without a version constraint is a production incident waiting to happen.
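For registry-hosted modules the pin lives in a dedicated version argument rather than in the source string — a sketch (the version values are illustrative):

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1" # exact pin — "~> 5.8" would also accept any later 5.x release

  name = "prod"
  cidr = "10.0.0.0/16"
}
```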
A well-structured Terraform monorepo for a platform team might look like this:
infrastructure/
├── modules/ # published/shared modules (versioned via git tags)
│ ├── aws-ecs-service/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── versions.tf # required_providers pinned here
│ ├── aws-rds-postgres/
│ └── aws-alb-https/
│
├── live/ # root modules — one per environment × component
│ ├── prod/
│ │ ├── networking/ # root module: VPC, subnets, NAT
│ │ │ ├── main.tf
│ │ │ ├── terragrunt.hcl # remote state config injected by terragrunt
│ │ │ └── variables.tf
│ │ ├── platform/ # root module: ECS cluster, ALB, ECR
│ │ └── services/
│ │ ├── api/ # root module: one ECS service + RDS
│ │ └── worker/
│ ├── staging/
│ │ └── ... # mirrors prod structure, different var values
│ └── dev/
│ └── ...
│
├── terragrunt.hcl # root config: remote state backend template
└── common.hcl # shared inputs: account_id, region, tags

Module versioning uses Git tags. Every breaking change to a shared module increments the major version. Consumers pin to a version constraint and opt into upgrades deliberately:
# live/prod/services/api/main.tf
terraform {
required_version = ">= 1.7"
}
module "ecs_service" {
source = "git::https://github.com/your-org/terraform-modules.git//aws-ecs-service?ref=v3.2.0"
name = "api"
cluster_arn = data.terraform_remote_state.platform.outputs.ecs_cluster_arn
image = var.container_image
cpu = 512
memory = 1024
desired_count = var.desired_count
environment_variables = {
APP_ENV = var.env
LOG_LEVEL = "info"
}
# Secrets from SSM (module reads them at apply time, never in state plaintext)
secrets = {
DATABASE_URL = "/prod/api/database_url"
JWT_SECRET = "/prod/api/jwt_secret"
}
}
module "rds" {
source = "git::https://github.com/your-org/terraform-modules.git//aws-rds-postgres?ref=v2.1.0"
identifier = "api-prod"
instance_class = "db.t4g.medium"
engine_version = "16.2"
multi_az = true
deletion_protection = true
}

Note
Never use ?ref=main or ?ref=HEAD as a module source in production root modules. Branch references are mutable — the same configuration can produce different infrastructure on different days if a module author pushes changes to main. Always pin to an immutable tag or commit SHA.

Remote State Backends: S3 + DynamoDB
Local state is fine for personal projects. At any team scale you need remote state: a shared, durable store with locking so concurrent plans cannot corrupt the state file. The AWS standard is S3 for storage (versioned, encrypted) and DynamoDB for locking (a lock record is written at the start of each plan/apply and deleted on completion).
# Bootstrap: create the S3 bucket and DynamoDB table once, manually or with a separate "bootstrap" root module.
# These resources must exist before any other Terraform can run.
# bootstrap/main.tf
resource "aws_s3_bucket" "tfstate" {
bucket = "your-org-terraform-state"
lifecycle {
prevent_destroy = true # never let terraform destroy the state bucket
}
}
resource "aws_s3_bucket_versioning" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_s3_bucket_public_access_block" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_dynamodb_table" "tfstate_lock" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
lifecycle {
prevent_destroy = true
}
}

With the bootstrap resources in place, each root module declares its backend. The key path encodes the full hierarchy — environment, component, and service — so every root module gets a unique, human-readable state file location:
# live/prod/services/api/backend.tf
# (or generated automatically by Terragrunt — see below)
terraform {
backend "s3" {
bucket = "your-org-terraform-state"
key = "prod/services/api/terraform.tfstate"
region = "eu-west-1"
encrypt = true
dynamodb_table = "terraform-state-locks"
# For cross-account setups: assume a role in the management account
# role_arn = "arn:aws:iam::123456789012:role/TerraformStateAccess"
}
}

State Granularity: Splitting to Reduce Blast Radius
The single most impactful scaling decision in Terraform is choosing the right state granularity. Too coarse: a misconfigured security group in a 500-resource state file triggers a plan review that touches your VPC, your databases, and your entire ECS fleet simultaneously. Too fine: you have a separate root module for each tiny IAM policy and the overhead of cross-state references becomes unmanageable.
The practical rule: split state at natural change-rate and ownership boundaries. Networking changes rarely — a separate root module for VPC, subnets, and NAT. Platform infrastructure changes occasionally — ECS cluster, ALB, and ECR in one state. Application services change frequently — one root module per service.
Networking layer
VPC, subnets, route tables, NAT gateways, VPC peering, Transit Gateway attachments. Changes once per quarter. Blast radius: everything in the account if you get this wrong. Keep it isolated and gate changes tightly.
Platform layer
ECS/EKS cluster, ALB, ECR, shared IAM roles, service discovery namespaces, KMS keys. Changes when platform teams roll out new capabilities. Services depend on outputs from this layer via terraform_remote_state or SSM parameters.
Service layer
One root module per application service. Contains the ECS task definition, service, target group, listener rules, CloudWatch alarms, and service-specific IAM role. Changes on every deploy. Fast plans because the state is small (<50 resources).
Data layer
RDS instances, ElastiCache clusters, S3 buckets for application data, DynamoDB tables. Managed separately because data resources have the highest blast radius if destroyed. prevent_destroy = true lifecycle guards on all stateful resources.
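The guard on a data-layer resource combines an AWS-side flag with a Terraform-side lifecycle block — a sketch with most required arguments (storage, credentials, networking) elided:

```hcl
resource "aws_db_instance" "api" {
  identifier     = "api-prod"
  engine         = "postgres"
  instance_class = "db.t4g.medium"
  # (required arguments such as allocated_storage and credentials elided)

  deletion_protection = true # AWS-side guard: the DeleteDBInstance API call is refused

  lifecycle {
    prevent_destroy = true # Terraform-side guard: any plan that would destroy this resource errors out
  }
}
```

The two guards are complementary: prevent_destroy stops Terraform at plan time, while deletion_protection stops anyone — console, CLI, or Terraform — at the API level.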
Cross-layer references use terraform_remote_state data sources (or SSM Parameter Store for looser coupling). The networking layer publishes subnet IDs and VPC ID as outputs; the platform layer reads them without owning the networking resources:
# live/prod/platform/main.tf — read networking outputs without coupling state
data "terraform_remote_state" "networking" {
backend = "s3"
config = {
bucket = "your-org-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "eu-west-1"
}
}
resource "aws_ecs_cluster" "main" {
name = "prod"
}
resource "aws_lb" "main" {
name = "prod-alb"
internal = false
load_balancer_type = "application"
# subnet IDs come from the networking state — platform team doesn't manage them
subnets = data.terraform_remote_state.networking.outputs.public_subnet_ids
security_groups = [aws_security_group.alb.id]
}
# Publish outputs for service layer to consume
output "ecs_cluster_arn" {
value = aws_ecs_cluster.main.arn
}
output "alb_listener_arn" {
value = aws_lb_listener.https.arn
}

Note
Prefer SSM Parameter Store over terraform_remote_state for cross-team references when the teams own separate repositories. Remote state gives the consumer read access to the entire state file — including any sensitive outputs. SSM lets you expose exactly what you want to share, with IAM-controlled read access per parameter.

Drift Detection: Catching Out-of-Band Changes
Drift is the gap between what Terraform's state file believes is running and what is actually running in your cloud account. It accumulates from emergency console fixes, automated AWS services modifying resources, and manual CLI operations that bypass the IaC workflow. Left undetected, drift causes "works in staging, breaks in production" failures — because staging was applied via Terraform but production was patched manually three months ago.
The detection mechanism is simple: terraform plan -detailed-exitcode exits with code 2 when there are pending changes, 0 when the state matches reality, and 1 on error. Run this on a schedule across all production root modules and alert when exit code 2 appears without a corresponding open pull request.
#!/usr/bin/env bash
# scripts/drift-check.sh — run terraform plan against all root modules and report drift
# Assumes: AWS credentials available via OIDC/instance role, terraform and terragrunt installed
set -euo pipefail
MODULES_DIR="live/prod"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}" # set as CI secret
DRIFT_FOUND=0
# Use process substitution (not a pipe) so DRIFT_FOUND survives the loop —
# piping `find` into `while` would run the loop body in a subshell and lose the flag.
while read -r tg_file; do
  module_dir="$(dirname "${tg_file}")"
  module_name="$(echo "${module_dir}" | sed "s|live/prod/||")"
  echo "--- Checking ${module_name} ---"
  # Run plan and capture terragrunt's exit code without failing the script
  set +e
  terragrunt plan \
    --terragrunt-working-dir "${module_dir}" \
    -detailed-exitcode \
    -refresh=true \
    -out=/dev/null \
    2>&1 | tail -20
  EXIT_CODE="${PIPESTATUS[0]}" # terragrunt's exit code, not tail's
  set -e
  case "${EXIT_CODE}" in
    0) echo "  OK: no drift in ${module_name}" ;;
    1) echo "  ERROR: plan failed for ${module_name}"
       DRIFT_FOUND=1 ;;
    2) echo "  DRIFT DETECTED: ${module_name} has pending changes"
       DRIFT_FOUND=1
       # Post to Slack — inner quotes escaped so the JSON payload survives the shell
       curl -s -X POST "${SLACK_WEBHOOK}" \
         -H "Content-Type: application/json" \
         -d "{\"text\": \":warning: Terraform drift detected in *${module_name}* — run \`terragrunt plan\` to review.\"}"
       ;;
  esac
done < <(find "${MODULES_DIR}" -name "terragrunt.hcl" | sort)
exit "${DRIFT_FOUND}"

Schedule this script as a GitHub Actions workflow on a cron trigger (daily, outside peak hours). It runs read-only plan operations — no changes are applied — so it is safe to run against production with minimal IAM permissions (read-only plus state read).
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: "0 6 * * 1-5" # 6 AM UTC, weekdays
  workflow_dispatch: # allow manual trigger
permissions:
  id-token: write # required for OIDC
  contents: read
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DRIFT_ROLE_ARN }}
          aws-region: eu-west-1
      - name: Install Terraform and Terragrunt
        run: |
          curl -Lo terraform.zip https://releases.hashicorp.com/terraform/1.8.5/terraform_1.8.5_linux_amd64.zip
          unzip terraform.zip && sudo mv terraform /usr/local/bin/
          curl -Lo terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.67.0/terragrunt_linux_amd64
          chmod +x terragrunt && sudo mv terragrunt /usr/local/bin/
      - name: Run drift detection
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_DRIFT_WEBHOOK }}
        run: bash scripts/drift-check.sh

Note
Use OIDC federation (the aws-actions/configure-aws-credentials action with role-to-assume) instead of long-lived AWS access keys in GitHub Actions. OIDC tokens are short-lived and scoped to the specific workflow run — no credentials to rotate, no secrets to leak from CI logs. The IAM role needs ReadOnlyAccess plus S3 and DynamoDB access for the state backend.

Terragrunt: DRY Infrastructure Across Environments
Terraform's backend configuration cannot use variables — the backend block is static HCL. This forces repetition: every root module must hardcode the S3 bucket name, DynamoDB table, and key prefix. Across 30 root modules in 3 environments that is 90 near-identical backend blocks to keep in sync.
Terragrunt solves this with generate blocks that write the backend configuration before terraform init, and include blocks that merge configuration from parent terragrunt.hcl files. The result: backend config defined once at the root, each module inherits it with a computed key based on its directory path.
# terragrunt.hcl — root config at repo root, inherited by all child modules
locals {
# Parse account and environment from the directory path:
# live/prod/services/api → env = prod
path_parts = split("/", path_relative_to_include())
environment = local.path_parts[1]
# Read shared config from a central file
common = read_terragrunt_config(find_in_parent_folders("common.hcl"))
account_id = local.common.locals.aws_account_ids[local.environment]
region = local.common.locals.aws_region
}
remote_state {
backend = "s3"
generate = {
path = "backend.tf"
if_exists = "overwrite_terragrunt"
}
config = {
bucket = "your-org-terraform-state-${local.account_id}"
key = "${path_relative_to_include()}/terraform.tfstate"
region = local.region
encrypt = true
dynamodb_table = "terraform-state-locks"
# Assume a role in the target account for cross-account deployments
role_arn = "arn:aws:iam::${local.account_id}:role/TerraformDeployRole"
}
}
# Inject common tags into every module's provider
generate "provider" {
path = "provider.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
provider "aws" {
region = "${local.region}"
default_tags {
tags = {
Environment = "${local.environment}"
ManagedBy = "terraform"
Repository = "your-org/infrastructure"
}
}
}
EOF
}# common.hcl — shared values referenced by root terragrunt.hcl
locals {
aws_region = "eu-west-1"
aws_account_ids = {
dev = "111111111111"
staging = "222222222222"
prod = "333333333333"
}
}
# live/prod/services/api/terragrunt.hcl — per-module config (minimal)
include "root" {
path = find_in_parent_folders() # finds repo-root terragrunt.hcl
}
# Module-specific inputs — these override or extend root defaults
inputs = {
env = "prod"
desired_count = 4
cpu = 1024
memory = 2048
}
# Declare dependencies: terragrunt run-all apply respects this ordering
dependency "platform" {
config_path = "../../platform"
mock_outputs = {
ecs_cluster_arn = "arn:aws:ecs:eu-west-1:333333333333:cluster/mock"
alb_listener_arn = "arn:aws:elasticloadbalancing:eu-west-1:333333333333:listener/mock"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}With this setup, deploying the entire production environment in dependency order becomes a single command:
# Deploy all modules in prod in dependency order (Terragrunt resolves the DAG)
terragrunt run-all apply --terragrunt-working-dir live/prod
# Or plan first to review all changes:
terragrunt run-all plan --terragrunt-working-dir live/prod
# Deploy a specific module and its dependencies:
terragrunt apply --terragrunt-working-dir live/prod/services/api

CI/CD: Atlantis for Pull-Request-Driven Workflows
Manual terraform apply from a developer's laptop is the most common source of production incidents in IaC workflows. Local state can be stale, credentials may have excessive permissions, and there is no audit trail. The solution is to move all applies to CI, where plans are visible in pull requests and applies happen only after approval.
Atlantis is a self-hosted Terraform automation server that integrates directly with GitHub (and GitLab/Bitbucket). It listens for PR events, runs terraform plan automatically, posts the plan output as a PR comment, and runs terraform apply after an authorized team member comments atlantis apply. The full audit trail lives in the PR.
# atlantis.yaml — repository-level config for Atlantis
version: 3
automerge: false # never auto-merge PRs after apply
delete_source_branch_on_merge: false
projects:
  # Atlantis auto-discovers changed directories and runs plan for each,
  # but for explicit control, list all root modules:
  - name: prod-networking
    dir: live/prod/networking
    workspace: default
    workflow: terragrunt # server-side custom workflow that runs terragrunt instead of terraform
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.hcl"
        - "../../modules/aws-vpc/**" # re-plan if shared module changes
      enabled: true
    apply_requirements:
      - approved # require ≥1 PR approval
      - mergeable # no merge conflicts
      - undiverged # PR branch must be up to date
  - name: prod-platform
    dir: live/prod/platform
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.hcl"
      enabled: true
    apply_requirements:
      - approved
      - mergeable
  - name: prod-services-api
    dir: live/prod/services/api
    workspace: default
    workflow: terragrunt
    autoplan:
      when_modified:
        - "**/*.tf"
        - "**/*.hcl"
        - "../../../modules/aws-ecs-service/**"
      enabled: true
    apply_requirements:
      - approved

Atlantis runs as a deployment in your Kubernetes cluster or as an EC2 instance. It needs an IAM role with sufficient permissions and a GitHub webhook token. Here is a minimal Kubernetes deployment:
# kubernetes/atlantis-deployment.yaml (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atlantis
  namespace: platform
spec:
  replicas: 1 # Atlantis is stateful — do not run multiple replicas
  selector:
    matchLabels:
      app: atlantis
  template:
    metadata:
      labels:
        app: atlantis # must match spec.selector.matchLabels
    spec:
      serviceAccountName: atlantis # bound to IAM role via IRSA
      containers:
        - name: atlantis
          image: ghcr.io/runatlantis/atlantis:v0.28.4
          args:
            - server
            - --config=/etc/atlantis/config.yaml
          env:
            - name: ATLANTIS_GH_TOKEN
              valueFrom:
                secretKeyRef:
                  name: atlantis-secrets
                  key: github-token
            - name: ATLANTIS_GH_WEBHOOK_SECRET
              valueFrom:
                secretKeyRef:
                  name: atlantis-secrets
                  key: webhook-secret
            - name: ATLANTIS_REPO_ALLOWLIST
              value: "github.com/your-org/infrastructure"
          volumeMounts:
            - name: config
              mountPath: /etc/atlantis
      volumes:
        - name: config
          configMap:
            name: atlantis-config
Production Checklist
Enable state file versioning and point-in-time recovery
S3 versioning on the state bucket means every apply creates a new version. When an apply corrupts state (it happens), you recover by fetching the previous version with aws s3api get-object --version-id <id> and restoring it. Enable MFA Delete on the bucket to prevent accidental permanent deletion. Back up state to a secondary region via S3 replication.
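The recovery sequence, sketched with the bucket and key names used earlier and a placeholder version ID:

```bash
# 1. List versions of the corrupted state object (newest first)
aws s3api list-object-versions \
  --bucket your-org-terraform-state \
  --prefix prod/services/api/terraform.tfstate

# 2. Download the last-known-good version
aws s3api get-object \
  --bucket your-org-terraform-state \
  --key prod/services/api/terraform.tfstate \
  --version-id "<VERSION_ID_FROM_STEP_1>" \
  recovered.tfstate

# 3. Sanity-check it, then upload it back as the current version
terraform state list -state=recovered.tfstate
aws s3 cp recovered.tfstate \
  s3://your-org-terraform-state/prod/services/api/terraform.tfstate
```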
Use prevent_destroy on all stateful resources
Set lifecycle { prevent_destroy = true } on every RDS instance, S3 bucket, DynamoDB table, and EFS filesystem. This makes Terraform refuse to destroy these resources even if they are removed from configuration — protecting against accidental data loss from a terraform apply of an incomplete PR. You must explicitly remove the lifecycle block and apply a second time to actually destroy.
Store secrets in SSM or Secrets Manager — never in state
Terraform state files are JSON and contain all resource attributes, including sensitive ones. If you create a database password as a random_password resource, that password is in the state file in plaintext. Use aws_ssm_parameter or aws_secretsmanager_secret_version and reference them in application code — never output raw secrets from Terraform modules.
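A sketch of the pattern — create the secret out of band, and let Terraform handle only the reference (names are illustrative):

```hcl
# Anti-pattern: random_password stores the generated value in state as plaintext.
# Instead, look up a secret that was created outside Terraform:
data "aws_secretsmanager_secret" "db_url" {
  name = "prod/api/database_url"
}

# Hand the ARN (not the value) to the runtime — ECS injects the secret at container start,
# so the state file only ever contains the ARN.
locals {
  container_secrets = [{
    name      = "DATABASE_URL"
    valueFrom = data.aws_secretsmanager_secret.db_url.arn
  }]
}
```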
Pin all provider and module versions
A required_providers block with exact version constraints in every module (not just root modules) ensures reproducible builds. The AWS provider regularly introduces breaking changes in major versions. Commit your .terraform.lock.hcl file — it pins provider checksums and makes init deterministic. Update providers deliberately through a PR, not accidentally during a terraform init -upgrade.
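In practice that means a versions.tf like this in every module (version numbers are illustrative):

```hcl
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.50" # accepts any 5.x release from 5.50 on; blocks 6.0 until you opt in
    }
  }
}
```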
Enforce plan-before-apply in CI with approval gates
No one should run terraform apply directly in production. Route all applies through Atlantis or a GitHub Actions workflow where the plan output is visible to reviewers before apply runs. For production environments, require at least one approval from a senior engineer. Log every apply with the operator identity, PR link, and timestamp in a CloudTrail-compatible audit stream.
Run terraform validate and tflint in every PR
terraform validate catches syntax errors and type mismatches before plan runs. Pair it with TFLint for provider-aware linting (catches deprecated resource arguments, invalid instance types, missing required tags) and tfsec or Terrascan for security policy checks. These run in seconds and catch entire classes of misconfigurations before a plan is ever requested.
Managing cloud infrastructure with Terraform and hitting scaling walls?
We design and implement production-grade IaC architectures — from module versioning and state management to drift detection pipelines, Terragrunt setups, and Atlantis automation. Let’s talk.
Get in Touch