There’s a point in every platform team’s journey where the basic Terraform setup stops working. Not “stops working” in the sense that it breaks—it just stops scaling with you. You end up with sprawling CI/CD configurations, copy-pasted variable blocks, and that creeping dread every time someone asks “can we spin up another environment?”

This post is about moving beyond that. Specifically, it’s about combining Terraform, Terragrunt, and Atlantis into a setup that handles real-world infrastructure complexity without drowning in configuration sprawl.

The Problem with Vanilla Terraform at Scale

Terraform is excellent. The provider ecosystem is unmatched, the state management is solid, and HCL strikes a reasonable balance between readability and expressiveness. But Terraform alone leaves you solving the same problems repeatedly:

  • How do you share configuration across environments without copy-pasting?
  • How do you express that module A must be applied before module B?
  • How do you ensure consistent variable hierarchies across dozens of modules?
  • How do you review infrastructure changes before they hit production?

The typical answer is “wrap it in CI/CD scripts.” You end up with bespoke GitHub Actions workflows, Jenkins pipelines with embedded Terraform commands, or some combination that works but requires tribal knowledge to maintain.

There’s a better way.

Terragrunt: Configuration Hierarchy Done Right

Terragrunt sits between your Terraform modules and your execution layer. Its primary value proposition is allowing you to define configuration hierarchies that cascade down through your infrastructure.

Consider a typical repository structure:

infrastructure/
├── root.hcl
├── root.yml
├── eu-west-1/
│   ├── region.yml
│   ├── production/
│   │   ├── environment.yml
│   │   ├── vpc/
│   │   │   └── terragrunt.hcl
│   │   ├── rds/
│   │   │   └── terragrunt.hcl
│   │   └── ecs-cluster/
│   │       └── terragrunt.hcl
│   └── staging/
│       ├── environment.yml
│       ├── vpc/
│       │   └── terragrunt.hcl
│       └── rds/
│           └── terragrunt.hcl
└── us-east-1/
    └── ...

The root.yml contains organisation-wide defaults:

# root.yml
region: eu-west-1
availability_zones:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

default_tags:
  managed_by: terraform
  repository: infrastructure-monorepo

# Common instance sizing
rds_instance_class: db.r6g.large
cache_node_type: cache.r6g.large

# Version pins
postgres_engine_version: "15.4"
redis_engine_version: "7.0"

Region-level configuration extends this:

# eu-west-1/region.yml
region: eu-west-1
availability_zones:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

Environment-level configuration adds specifics:

# eu-west-1/production/environment.yml
environment: production
rds_instance_class: db.r6g.xlarge  # Override for production
enable_deletion_protection: true
backup_retention_period: 30
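
A staging counterpart overrides only what differs, with everything else falling back to region- and root-level defaults (values here are illustrative):

# eu-west-1/staging/environment.yml
environment: staging
postgres_engine_version: "16.1"  # staging validates new engine versions first
enable_deletion_protection: false
backup_retention_period: 7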

The terragrunt.hcl files then consume this hierarchy:

# eu-west-1/production/rds/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}

locals {
  root_config   = yamldecode(file(find_in_parent_folders("root.yml")))
  region_config = yamldecode(file(find_in_parent_folders("region.yml")))
  env_config    = yamldecode(file(find_in_parent_folders("environment.yml")))
  
  # Merge with precedence: env > region > root
  config = merge(
    local.root_config,
    local.region_config,
    local.env_config
  )
}

dependency "vpc" {
  config_path = "../vpc"
  
  mock_outputs = {
    vpc_id             = "vpc-mock"
    private_subnet_ids = ["subnet-mock-1", "subnet-mock-2"]
  }
}

terraform {
  source = "git::git@github.com:your-org/terraform-modules.git//rds?ref=v2.3.0"
}

inputs = {
  identifier              = "main-${local.config.environment}"
  engine_version          = local.config.postgres_engine_version
  instance_class          = local.config.rds_instance_class
  vpc_id                  = dependency.vpc.outputs.vpc_id
  subnet_ids              = dependency.vpc.outputs.private_subnet_ids
  deletion_protection     = local.config.enable_deletion_protection
  backup_retention_period = local.config.backup_retention_period
  availability_zones      = local.config.availability_zones
  tags                    = local.config.default_tags
}

This approach provides several advantages:

Configuration DRYness: Change the PostgreSQL version in root.yml and it propagates everywhere. Need to override for a specific environment? Add it at that level and the merge handles precedence.

Explicit dependencies: The dependency block tells Terragrunt (and later, Atlantis) that the RDS module depends on the VPC module. This isn’t just documentation—it’s enforced ordering.

Module versioning: Each terragrunt.hcl pins to a specific module version. Rolling out a module update becomes a deliberate per-environment choice rather than an implicit “whatever’s in main.”
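
For example, a staged rollout might bump the pin in staging first and promote it once validated (the version numbers here are hypothetical):

# eu-west-1/staging/rds/terragrunt.hcl
terraform {
  # Staging moves to v2.4.0 first; production stays on v2.3.0 until validated
  source = "git::git@github.com:your-org/terraform-modules.git//rds?ref=v2.4.0"
}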

Orchestrating Multiple Dependencies

Real infrastructure has complex dependency graphs. Terragrunt handles this elegantly:

# eu-west-1/production/ecs-service/terragrunt.hcl
dependency "vpc" {
  config_path = "../vpc"
}

dependency "ecs_cluster" {
  config_path = "../ecs-cluster"
}

dependency "rds" {
  config_path = "../rds"
}

dependency "elasticache" {
  config_path = "../elasticache"
}

dependency "alb" {
  config_path = "../alb"
}

inputs = {
  cluster_arn         = dependency.ecs_cluster.outputs.cluster_arn
  vpc_id              = dependency.vpc.outputs.vpc_id
  subnet_ids          = dependency.vpc.outputs.private_subnet_ids
  database_endpoint   = dependency.rds.outputs.endpoint
  redis_endpoint      = dependency.elasticache.outputs.primary_endpoint
  target_group_arn    = dependency.alb.outputs.target_group_arn
}

When you run terragrunt run-all apply, it builds the dependency graph and executes in the correct order. When you run terragrunt run-all destroy, it reverses the order automatically.
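
Day-to-day usage is a matter of scoping the command to the directory tree you want to orchestrate:

# Plan or apply an entire environment in dependency order
cd eu-west-1/production
terragrunt run-all plan
terragrunt run-all apply

# Tear it down in reverse dependency order
terragrunt run-all destroy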

Atlantis: Shifting the Plan/Apply Cycle

Atlantis fundamentally changes how infrastructure changes flow through your team. The traditional workflow looks like:

  1. Engineer writes Terraform changes locally
  2. Engineer runs terraform plan locally (maybe)
  3. PR is opened, reviewed, merged
  4. CI/CD pipeline runs terraform plan then terraform apply
  5. Everyone hopes the apply matches what was reviewed

Atlantis inverts this:

  1. Engineer writes Terraform changes
  2. PR is opened
  3. Atlantis automatically runs terraform plan and posts results to the PR
  4. Reviewers see the actual plan output alongside the code
  5. After approval, atlantis apply is run (still pre-merge)
  6. PR is merged with the apply already complete

This is a significant mental shift. The infrastructure change happens before the merge, not after. The PR becomes a record of what was changed and the apply output, not a promise of what will change.
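
In practice the interaction happens through PR comments. A minimal sketch using Atlantis's default comment commands (the project names match the generated examples later in this post):

atlantis plan                               # re-plan everything the PR touches
atlantis plan -p eu-west-1_production_rds   # plan a single project
atlantis apply -p eu-west-1_production_rds  # apply one project after approval
atlantis apply                              # apply all planned projects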

The benefits are substantial:

Drift prevention: The state is updated immediately upon apply. There’s no window between “merge” and “apply” where the state can drift.

Review accuracy: Reviewers see the actual plan, not just the code. A one-line change that triggers a replacement is immediately visible.

Rollback clarity: If an apply causes issues, the PR is still open. You can see exactly what was applied and revert the code change knowing the state reflects reality.

Audit trail: Every infrastructure change has a corresponding PR with the full plan and apply output preserved.

The Secret Sauce: terragrunt-atlantis-config

Here’s where it gets interesting. Atlantis needs to know which Terraform projects exist and how they relate to each other. With vanilla Terraform, you maintain an atlantis.yaml manually:

# atlantis.yaml (manual approach - don't do this)
version: 3
projects:
  - name: vpc-production
    dir: eu-west-1/production/vpc
    workflow: terragrunt
  - name: rds-production
    dir: eu-west-1/production/rds
    workflow: terragrunt
    depends_on:
      - vpc-production
  # ... repeat for every project

This doesn’t scale. Every new module requires a manual update. Dependencies are easily forgotten. It’s exactly the kind of toil that leads to configuration rot.

terragrunt-atlantis-config solves this by generating the atlantis.yaml from your Terragrunt configuration. It reads the dependency blocks you’ve already defined and produces the correct Atlantis configuration automatically.

Run as a pre-workflow hook in Atlantis:

# Server-side repo config (repos.yaml)
repos:
  - id: /.*/
    workflow: terragrunt
    pre_workflow_hooks:
      - run: terragrunt-atlantis-config generate --output atlantis.yaml --autoplan --parallel --create-project-name

workflows:
  terragrunt:
    plan:
      steps:
        - env:
            name: TERRAGRUNT_TFPATH
            command: 'echo "terraform"'
        - run: terragrunt plan -no-color -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply -no-color $PLANFILE
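
You can also run the same generate command locally from the repository root to preview what Atlantis will see:

cd infrastructure
terragrunt-atlantis-config generate --output atlantis.yaml --autoplan --parallel --create-project-name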

The generated configuration captures all the relationships:

# Generated atlantis.yaml
version: 3
projects:
  - name: eu-west-1_production_vpc
    dir: eu-west-1/production/vpc
    workflow: terragrunt
    autoplan:
      when_modified:
        - "*.hcl"
        - "*.tf"
        - "*.yml"
      enabled: true
      
  - name: eu-west-1_production_rds
    dir: eu-west-1/production/rds
    workflow: terragrunt
    autoplan:
      when_modified:
        - "*.hcl"
        - "*.tf"
        - "*.yml"
      enabled: true
    depends_on:
      - eu-west-1_production_vpc

Expressing Manual Dependencies

The automatic dependency detection from dependency blocks handles most cases. But what about that root.yml at the top of your hierarchy? Changing the PostgreSQL version there should trigger plans for every RDS instance.

terragrunt-atlantis-config supports this through explicit configuration:

# root.hcl
locals {
  # This block tells terragrunt-atlantis-config about extra dependencies
  extra_atlantis_dependencies = [
    "root.yml",
    "common/*.yml"
  ]
}

Now any change to root.yml will trigger plans for all projects that include root.hcl. This is the hierarchical approach in action—common configuration at the top cascades automatically.

You can also express relationships at the project level:

# eu-west-1/production/rds/terragrunt.hcl
locals {
  extra_atlantis_dependencies = [
    find_in_parent_folders("root.yml"),
    find_in_parent_folders("region.yml"),
    "${get_parent_terragrunt_dir()}/environment.yml"
  ]
}

This ensures that changes to any configuration file in the hierarchy trigger the appropriate plans.

When root.yml Changes: The Cascade Effect

There’s an important trade-off to understand here. When root.yml is associated with many Terragrunt configurations, changing it triggers plans for all of them. Change the PostgreSQL version? Every RDS-related project gets planned. Change the default tags? Everything gets planned.

This is by design—it’s exactly the consistency guarantee you want. But it has implications:

Long plan times: A change to root.yml with 50 associated projects means 50 concurrent (or queued) Terraform plans. This can take significant time.

No line-level awareness: Atlantis doesn’t know that you only changed the redis_engine_version line. It sees “root.yml changed” and triggers all associated projects. This means you might plan 50 projects when only 5 would actually show changes.

Resource consumption: All those plans consume CPU, memory, and API calls to your cloud provider.

The trade-off is worth it for consistency. Knowing that a configuration change has been validated against all affected infrastructure before merge is powerful. But you need to tune for it.

Performance Tuning Atlantis for High Throughput

Running Atlantis at scale requires deliberate tuning. Here’s what I’ve found effective running on AWS EC2:

Storage Configuration

Terraform and Terragrunt are I/O heavy. Provider downloads, module caching, state file operations, and plan file generation all hit disk.

# Switch from ext4 to XFS for better parallel I/O
mkfs.xfs /dev/nvme1n1
mount -o noatime,nodiratime /dev/nvme1n1 /home/atlantis

NVMe instance storage (i3/i3en instances) dramatically outperforms EBS for this workload. The latency difference is noticeable when running dozens of concurrent plans.

Caching Configuration

Provider and module downloads are repetitive. For single-instance deployments, caching reduces init times significantly:

# terragrunt download directory - safe to cache
export TERRAGRUNT_DOWNLOAD="/home/atlantis/.terragrunt-cache"

For Terraform provider caching, there’s a caveat. TF_PLUGIN_CACHE_DIR works well for single-instance deployments:

# Only recommended for single-instance deployments
export TF_PLUGIN_CACHE_DIR="/home/atlantis/.terraform.d/plugin-cache"
mkdir -p $TF_PLUGIN_CACHE_DIR

This becomes problematic in horizontally scaled environments. Terraform has a known issue where the plugin cache interacts poorly with lock files when shared across multiple machines or pods. The cache creates symlinks to cached providers, and when different instances have different cache states, you encounter checksum mismatches and “provider binary not found” errors.

For horizontally scaled Atlantis deployments, skip TF_PLUGIN_CACHE_DIR and let each pod download providers independently. The download overhead is acceptable, and you avoid intermittent failures that are painful to debug.

Parallelization Trade-offs

Atlantis supports parallel plan/apply execution:

# atlantis.yaml
parallel_plan: true
parallel_apply: false  # Be cautious here

The --parallel-pool-size flag controls how many operations run concurrently. The right value depends on your instance size:

| Instance Type | vCPUs | Recommended Pool Size |
|---------------|-------|-----------------------|
| c6i.xlarge    | 4     | 4-6                   |
| c6i.2xlarge   | 8     | 8-12                  |
| c6i.4xlarge   | 16    | 12-20                 |

There’s a balance to strike. More parallelization means faster completion of many small plans. But Terraform plans can be memory-intensive—too much parallelization leads to OOM kills and swap thrashing.

For workloads with many large plans (lots of resources per state file), reduce parallelization:

atlantis server \
  --parallel-pool-size=8 \
  --repo-allowlist="github.com/your-org/*"

For workloads with many small plans, increase it:

atlantis server \
  --parallel-pool-size=20 \
  --repo-allowlist="github.com/your-org/*"

Scaling Architectures

Atlantis supports both vertical and horizontal scaling, depending on your throughput requirements.

Vertical scaling is the simplest starting point. A single well-tuned c6i.4xlarge handles substantial throughput. For many teams, this is sufficient—combined with the caching and storage optimisations above, a single instance can process dozens of concurrent plans.

Horizontal scaling becomes viable when you configure Atlantis to use Redis for external locking. By default, Atlantis uses BoltDB for its locking database, which only allows a single process to access it at a time—this is what historically limited horizontal scaling. However, with the --locking-db-type redis flag, Atlantis externalises lock management to a Redis cluster, enabling true horizontal scaling:

atlantis server \
  --locking-db-type=redis \
  --redis-host=your-redis-cluster.cache.amazonaws.com \
  --redis-port=6379 \
  --parallel-pool-size=10 \
  --repo-allowlist="github.com/your-org/*"

With Redis locking in place, you have several deployment options:

Kubernetes: Deploy Atlantis as a Deployment with multiple replicas. The Redis backend handles coordination between pods. This is particularly attractive if you’re already running Kubernetes—you get pod autoscaling, rolling deployments, and straightforward configuration management via Helm or Kustomize.

# Example Kubernetes deployment snippet
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atlantis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: atlantis
  template:
    metadata:
      labels:
        app: atlantis
    spec:
      containers:
        - name: atlantis
          image: runatlantis/atlantis:latest
          env:
            - name: REDIS_HOST
              value: your-redis-cluster.cache.amazonaws.com
          args:
            - server
            - --locking-db-type=redis
            - --redis-host=$(REDIS_HOST)
            - --parallel-pool-size=10

ECS with Fargate or EC2 capacity: Run Atlantis as an ECS service with desired count > 1. Use an Application Load Balancer in front with health checks. ECS handles container orchestration while Redis handles locking coordination.

EC2 Auto Scaling Groups: Deploy Atlantis instances behind an ALB with an ASG. This gives you the flexibility to scale based on CPU/memory metrics or custom CloudWatch metrics tracking plan queue depth.

The choice between these depends on your existing infrastructure and operational preferences. Kubernetes offers the most sophisticated autoscaling options; ECS provides a managed container experience without the Kubernetes overhead; EC2 ASGs give you full control over the compute layer.

Caching Considerations for Horizontal Deployments

One limitation to be aware of: as covered in the caching section above, sharing Terraform’s plugin cache (TF_PLUGIN_CACHE_DIR) across horizontally scaled instances is problematic. When different pods hold different cache states, the symlinks Terraform creates to cached providers lead to lock file checksum failures and “provider binary not found” errors.

For single-instance deployments, the provider cache works well. For horizontally scaled deployments, you have two options: accept that each pod downloads providers independently (which works fine—the download overhead is acceptable), or use a shared volume mount for the Terragrunt cache (TERRAGRUNT_DOWNLOAD) while letting each pod maintain its own Terraform provider cache.
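
A minimal sketch of that second option, extending the Deployment shown earlier and assuming a ReadWriteMany-capable volume (volume and claim names are illustrative):

# Pod template additions: shared Terragrunt cache, per-pod provider downloads
    spec:
      containers:
        - name: atlantis
          env:
            - name: TERRAGRUNT_DOWNLOAD
              value: /cache/terragrunt   # shared across pods via the PVC below
            # deliberately no TF_PLUGIN_CACHE_DIR; each pod downloads its own providers
          volumeMounts:
            - name: terragrunt-cache
              mountPath: /cache/terragrunt
      volumes:
        - name: terragrunt-cache
          persistentVolumeClaim:
            claimName: atlantis-terragrunt-cache   # e.g. an EFS-backed ReadWriteMany claim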

Even without shared provider caches, horizontal scaling provides meaningful throughput gains. The ability to process multiple PRs across multiple pods significantly reduces queue times during busy periods.

Trade-offs and Considerations

This setup isn’t without costs:

Learning curve: Terragrunt’s HCL extensions and Atlantis’s workflow model both require investment to learn. Teams comfortable with “just run terraform apply” need time to adapt.

Debugging complexity: When something fails, you’re debugging through multiple layers. Was it Terraform? Terragrunt? Atlantis? The pre-workflow hook? Each layer adds potential failure modes.

PR review cadence: Because plans run on PR open and apply runs pre-merge, the review process can feel slower. Reviewers need to wait for plans to complete before reviewing. Large cascading changes (root.yml modifications) can block other work.

Operational overhead: While Atlantis can be horizontally scaled with Redis locking, it still represents critical infrastructure that needs monitoring, maintenance, and capacity planning. The Redis cluster itself becomes a dependency. Plan for this operational burden.

Cost of consistency: The cascading plan behaviour that makes the system reliable also makes it expensive. A one-line change triggering 100 plans uses real compute resources—and with horizontal scaling, those resources multiply across pods.

Caching limitations at scale: As discussed in the performance tuning section, Terraform’s provider cache has known issues in distributed environments. Horizontally scaled deployments trade some caching efficiency for reliability.

The Operational Payoff

Despite the trade-offs, this setup provides substantial operational benefits:

Monorepo simplicity: All infrastructure lives in one repository with unified orchestration. No hunting through multiple repos to understand the full picture.

Cognitive load reduction: Engineers learn one workflow. Open PR, wait for plan, review, apply, merge. The same pattern whether you’re changing a security group rule or provisioning a new environment.

Configuration as data: The YAML hierarchy makes configuration inspectable. You can grep for every use of a specific instance type. You can validate configuration with standard YAML tools.
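
For instance, finding every place a given instance class is set or overridden is a one-liner:

grep -rn "db.r6g" infrastructure/ --include='*.yml'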

Reduced CI/CD sprawl: No per-project GitHub Actions workflows. No Jenkins jobs with embedded Terraform commands. The orchestration layer is centralised and consistent.

Auditable change history: Every infrastructure change has a PR with the full plan output. Six months later, you can answer “why is this configured this way?” by finding the relevant PR.

A Note on Dynamic Configuration

Not everything belongs in YAML files. Static configuration—regions, instance sizes, version pins, network layouts—fits this model well. But some configuration is inherently dynamic:

  • API keys and secrets
  • Values that change based on runtime conditions
  • Configuration managed by multiple teams

For these, continue using AWS Secrets Manager, SSM Parameter Store, or HashiCorp Vault. Reference them via data sources in your Terraform modules:

data "aws_secretsmanager_secret_version" "api_key" {
  secret_id = "my-service/api-key"
}
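
The module can then reference the resolved value through the secret_string attribute:

locals {
  api_key = data.aws_secretsmanager_secret_version.api_key.secret_string
}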

The line between “static configuration in YAML” and “dynamic configuration in secrets manager” is a useful architectural boundary. If it’s set once and changes rarely, it belongs in the hierarchy. If it’s rotated, managed by another team, or changes frequently, externalise it.

Conclusion

The combination of Terraform, Terragrunt, and Atlantis addresses real problems that emerge at scale. Terragrunt provides the configuration hierarchy and dependency orchestration that Terraform lacks. Atlantis shifts the plan/apply cycle to pre-merge, dramatically improving review accuracy and preventing drift. terragrunt-atlantis-config ties them together, ensuring that the dependency relationships you’ve already expressed in Terragrunt are honoured by Atlantis.

It’s not without complexity. The learning curve is real. The performance tuning is necessary. The cascading plan behaviour requires investment in infrastructure.

But the alternative—bespoke CI/CD pipelines, copy-pasted configurations, tribal knowledge about deployment ordering—scales worse. At some point, the complexity of “simple” approaches exceeds the complexity of a proper orchestration layer.

This is that orchestration layer. It’s worth the investment.