Keep live infrastructure aligned with Terraform desired state

Infrastructure drift detection and remediation with Terraform

Manual console edits and stale state let live infrastructure silently diverge from Terraform code. This guide shows how to run scheduled drift scans, classify low-risk changes for auto-remediation, and add backend locking, lifecycle rules, and policy guardrails.

Why infrastructure drift is an operational certainty

Production estates drift. Engineers patch security groups during incidents, providers change default behaviors, and attributes evolve outside IaC workflows. The gap between declared Terraform configuration and live cloud state drives mystery outages, audit failures, and apply plans that surprise on-call engineers. Drift falls into three buckets: configuration drift from console or CLI edits, state drift from concurrent applies or manual state edits, and external drift when provider defaults move underneath you. A scheduled terraform plan compares desired configuration plus state against refreshed provider data and only reports the difference; it does not mutate infrastructure. The goal is not zero drift but fast detection, an explicit resolution decision, and prevention that stops silent accumulation.

A three-layer strategy: detect, remediate, prevent

Effective drift management stacks three capabilities. Detection runs read-only terraform plan on a schedule, typically every four to six hours per environment. Use terraform plan -refresh-only when you only want to see how live objects differ from recorded state without proposing configuration changes; terraform apply -refresh-only is the command that writes those differences into state. Remediation reconciles only pre-approved low-risk resource types through a saved plan file, never blind auto-apply, and never when the plan includes delete or replace actions. Prevention routes mutations through CI/CD with remote state locking, lifecycle rules for legitimately mutable fields, and policy-as-code on plan JSON before apply. HCP Terraform (formerly Terraform Cloud) exposes managed drift detection for registered workspaces; Spacelift and env0 offer similar views. The GitHub Actions patterns below are portable to any orchestrator.
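The difference between the two refresh-only commands is easy to blur, so here is a minimal sketch that wraps each one. The function names are illustrative and not part of Terraform; only the CLI flags are real.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only the terraform flags are real.

# Report how live objects differ from recorded state.
# Read-only: neither infrastructure nor state is written.
show_state_drift() {
  terraform plan -refresh-only -input=false -no-color
}

# Write the refreshed values into state once a human has reviewed them.
# -auto-approve skips the prompt, so call this only from a gated pipeline.
accept_state_drift() {
  terraform apply -refresh-only -input=false -auto-approve
}
```

Running show_state_drift on a schedule is safe in any environment; accept_state_drift changes state and belongs behind the same review as any other apply.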

Scheduled drift detection with detailed exit codes

terraform plan -detailed-exitcode returns 0 when no changes are needed, 2 when the plan succeeded and contains changes (drift), and 1 on error. Persist human-readable output and a binary tfplan artifact. To classify changes in scripts, run terraform show -json tfplan. plan -out and plan -json serve different purposes: -out saves a binary plan that apply can consume, while -json streams machine-readable progress messages, so the saved plan file is the source of truth for both apply and JSON parsing. Prefer GitHub OIDC to AWS over long-lived access keys in scheduled workflows, and set terraform_wrapper: false in hashicorp/setup-terraform when you pipe plan output to files. Correlate findings with CloudTrail or equivalent audit logs. Deduplicate alerts by skipping new issues when an open drift issue already exists.
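As a small sketch of the exit-code contract, the helper below runs a plan and translates the detailed exit code into a word that later steps can branch on. The function name and the .txt output path are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a plan, keep the human-readable output, and
# print one of clean / drift / error based on terraform's exit code.
drift_status() {
  local plan_file="${1:-tfplan}" code=0
  # "|| code=$?" captures the exit code without tripping "set -e".
  terraform plan -detailed-exitcode -input=false -no-color \
    -out="$plan_file" > "${plan_file}.txt" 2>&1 || code=$?
  case "$code" in
    0) echo "clean" ;;   # no changes needed
    2) echo "drift" ;;   # plan succeeded and contains changes
    *) echo "error" ;;   # 1 (or anything unexpected): the plan failed
  esac
}
```

A caller might write status=$(drift_status) and open an issue only when the value is drift, while treating error as a failed run rather than as a clean one.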

GitHub Actions · scheduled drift detection

name: Infrastructure Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'
  workflow_dispatch:

permissions:
  contents: read
  issues: write
  id-token: write

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
          terraform_wrapper: false

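      # Assumption: AWS_DRIFT_ROLE_ARN is a hypothetical secret holding a
      # least-privilege IAM role that trusts GitHub's OIDC provider.
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DRIFT_ROLE_ARN }}
          aws-region: us-east-1

      - name: Terraform init
        run: terraform init -input=false

      - name: Terraform plan
        id: plan
        run: |
          # Capture terraform's own exit code rather than tee's.
          set +e
          terraform plan -detailed-exitcode -input=false -no-color -out=tfplan 2>&1 | tee plan_output.txt
          EXIT_CODE=${PIPESTATUS[0]}
          set -e
          echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          if [ "$EXIT_CODE" -eq 2 ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            terraform show -json tfplan > plan.json
          elif [ "$EXIT_CODE" -ne 0 ]; then
            # Exit code 1 means the plan itself failed; fail the job loudly.
            exit "$EXIT_CODE"
          fi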

      - name: Create issue on drift
        if: steps.plan.outputs.drift_detected == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const { data: openIssues } = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'open',
              labels: 'drift',
            });
            if (openIssues.length > 0) return;
            const fs = require('fs');
            const planOutput = fs.readFileSync('plan_output.txt', 'utf8');
            const truncated = planOutput.length > 60000
              ? planOutput.substring(0, 60000) + '\n\n... (truncated)'
              : planOutput;
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Infrastructure drift detected - ${new Date().toISOString().split('T')[0]}`,
              body: [
                'Scheduled drift scan found resources diverging from Terraform state.',
                '',
                '### Plan output',
                '```',
                truncated,
                '```',
                '',
                '### Next steps',
                '1. Decide: adopt in code, import, or revert live change',
                '2. Correlate with CloudTrail or audit logs',
                '3. Add ignore_changes only for documented exceptions'
              ].join('\n'),
              labels: ['drift', 'infrastructure']
            });

Risk-based auto-remediation for allowlisted resource types

Not all drift should auto-heal. Parse resource_changes[].type from plan JSON rather than inferring the type from the address string: an address such as module.vpc.aws_security_group.web begins with module path segments, so naive splitting picks the wrong token. Block any plan that includes delete or replace actions; in plan JSON a replacement is recorded as a delete plus a create, so checking for delete covers both. A changed CloudWatch alarm threshold may be low risk; a modified security group or IAM binding is not. Apply only the saved tfplan after allowlist and action checks pass.
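To make the address pitfall concrete, the sketch below contrasts a naive address split with reading the type field from plan JSON. Both function names are hypothetical, and jq is assumed to be installed, as it is in the scripts in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helpers contrasting two ways to find a resource type.

# Fragile: assumes the type is the first dot-separated token of the address.
naive_type_from_address() {
  echo "${1%%.*}"
}

# Reliable: read the type that Terraform records for each planned change,
# skipping resources whose only action is no-op.
changed_types() {
  jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .type' "$1" | sort -u
}
```

Calling naive_type_from_address with module.vpc.aws_security_group.web prints module, which would never match an allowlist entry, whereas changed_types plan.json reports aws_security_group for the same resource.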

Bash · allowlisted auto-remediation script

#!/bin/bash
set -euo pipefail

ALLOWED_TYPES=(
  aws_cloudwatch_metric_alarm
  aws_sns_topic
)

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

set +e
terraform plan -detailed-exitcode -input=false -out=tfplan
EXIT_CODE=$?
set -e

if [ "$EXIT_CODE" -eq 0 ]; then
  log "No drift detected."
  exit 0
fi

if [ "$EXIT_CODE" -ne 2 ]; then
  log "Plan failed with exit code $EXIT_CODE."
  exit 1
fi

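# Export the saved plan as JSON so jq can inspect it.
terraform show -json tfplan > plan.json

# Replacements are recorded as a delete plus a create, so matching "delete"
# blocks both outright deletions and replace actions.
if jq -e '.resource_changes[]? | select(.change.actions | index("delete"))' plan.json >/dev/null; then
  log "Delete or replace actions require manual review."
  exit 1
fi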

while IFS= read -r TYPE; do
  ALLOWED=false
  for allowed in "${ALLOWED_TYPES[@]}"; do
    if [ "$TYPE" = "$allowed" ]; then ALLOWED=true; break; fi
  done
  if [ "$ALLOWED" = false ]; then
    log "Manual review required for resource type $TYPE"
    exit 1
  fi
done < <(jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .type' plan.json | sort -u)

log "Applying allowlisted drift remediation from saved plan"
terraform apply -input=false tfplan
log "Auto-remediation completed"

Prevention guardrails: remote state, lifecycle, and policy

Centralize state in S3 with DynamoDB locking, or your cloud's equivalent, and block ungated applies from laptops. Use lifecycle ignore_changes only for documented operational exceptions, not as a blanket way to hide drift. Sentinel integrates with HCP Terraform and Terraform Enterprise; Conftest with Rego is a practical open-source gate on terraform show -json output in any CI system. Require plan-before-apply: terraform plan -out=tfplan, policy check, then terraform apply tfplan.
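The plan-before-apply sequence can be sketched as a single gate function. The function name and the policy directory are assumptions; the --namespace value matches the package name used by the Rego policy in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical gate: plan, evaluate policy on the plan JSON, then apply the
# exact saved plan. The default policy directory is an assumption.
gated_apply() {
  local plan_file="tfplan" policy_dir="${1:-policy}"
  terraform plan -input=false -out="$plan_file" || return 1
  terraform show -json "$plan_file" > plan.json || return 1
  # Conftest exits non-zero when any deny rule fires, which stops the apply.
  if ! conftest test plan.json --policy "$policy_dir" --namespace terraform.security; then
    echo "Policy check failed; not applying." >&2
    return 1
  fi
  # Apply the saved plan file, not a fresh plan, so what was checked is what runs.
  terraform apply -input=false "$plan_file"
}
```

Because the apply consumes the saved plan file, the changes that passed the policy check are exactly the changes that run.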

HCL · remote state backend with locking

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

HCL · lifecycle rules for intentional exceptions

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  lifecycle {
    ignore_changes = [
      instance_type,
      tags["LastManualUpdate"],
    ]
    prevent_destroy = true
  }
}

Rego · Conftest deny public SSH on security groups

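package terraform.security

# Conftest evaluates the "main" namespace by default; run this policy with
# --namespace terraform.security (or --all-namespaces).
# rego.v1 syntax keeps the rule valid on current OPA and Conftest releases.
import rego.v1

deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  ingress := rc.change.after.ingress[_]
  ingress.cidr_blocks[_] == "0.0.0.0/0"
  # Port 22 falls inside the rule's port range.
  ingress.from_port <= 22
  ingress.to_port >= 22
  msg := sprintf("public SSH denied on %s", [rc.address])
}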

How to resolve drift once it is confirmed

Detection tells you that state diverged; it does not choose the fix. Use four decision paths. Accidental console change: apply the saved plan to reconcile live infrastructure back to code, or manually revert the cloud change if apply risk is too high. Intentional operational change: update Terraform configuration in a pull request, pass normal review, then apply through CI. Resource exists in the cloud but not in state: import with terraform import or a code-first import block, then verify with plan. Provider-default or read-only attribute noise: review it with terraform plan -refresh-only and run terraform apply -refresh-only to align state, or add a narrow ignore_changes entry tied to a ticket. Never auto-remediate when the plan destroys or replaces networking, identity, or data resources; those require human review and an audit trail entry.
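The decision paths above can be written down as a small lookup, which is handy as a runbook helper or as text for an issue template. This is only a sketch: the category labels are invented for illustration, and the placeholders in angle brackets must be filled in by the operator.

```shell
#!/usr/bin/env bash
# Hypothetical lookup from a drift category to its resolution path.
# Category names are illustrative labels, not Terraform terminology.
resolution_for() {
  case "$1" in
    accidental)  echo "terraform apply tfplan (or revert the live change by hand)" ;;
    intentional) echo "update configuration in a pull request, then apply through CI" ;;
    unmanaged)   echo "terraform import <address> <id> (or an import block), then plan" ;;
    noise)       echo "terraform apply -refresh-only (or a narrow, ticketed ignore_changes)" ;;
    *)           echo "unknown category: $1" >&2; return 1 ;;
  esac
}
```

For example, resolution_for unmanaged prints the import path, and an unrecognized category returns a non-zero status so a script cannot silently fall through.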

End-to-end pipeline: classify drift and branch remediation

Combine detection and remediation in one workflow: save tfplan, export JSON with terraform show -json, classify by resource type, auto-apply only when no high-risk types change and no delete actions appear, and open a security-labeled issue otherwise. Log every run even when no drift is found to track drift rate and time-to-remediation as platform health indicators. Each drift event is a process signal: was the manual change necessary because Terraform was too rigid, or because CI bypass is too easy?
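Drift rate and time-to-remediation can be derived from a plain run log. The sketch below assumes a hypothetical log format that this guide does not prescribe: one line per scheduled run, holding an epoch timestamp and either clean or drift.

```shell
#!/usr/bin/env bash
# Hypothetical metrics helper. Assumed log format, one line per run:
#   <epoch_seconds> <clean|drift>
drift_metrics() {
  awk '
    { total++ }
    # Count every drifting run; remember when a drift episode started.
    $2 == "drift" { drift++; if (!in_drift) { in_drift = 1; start = $1 } }
    # The first clean run after drift closes the episode.
    $2 == "clean" && in_drift { in_drift = 0; ttr += $1 - start; fixed++ }
    END {
      if (total == 0) { print "no runs logged"; exit }
      printf "drift_rate=%.1f%%\n", 100 * drift / total
      if (fixed > 0) printf "mean_time_to_remediation_s=%d\n", ttr / fixed
    }
  ' "$1"
}
```

With four logged runs at 1000, 2000, 3000, and 4000 seconds where the middle two report drift, the function prints drift_rate=50.0% and mean_time_to_remediation_s=2000.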

GitHub Actions · classify and remediate drift

name: Drift Detection and Remediation
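on:
  schedule:
    - cron: '0 */4 * * *'

permissions:
  contents: read
  issues: write     # needed to open the high-risk drift issue
  id-token: write   # needed if cloud credentials come from OIDC

jobs:
  detect-and-remediate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
          terraform_wrapper: false

      # Cloud credential setup is omitted here.
      - name: Init and plan
        id: plan
        run: |
          # Let init fail the step; only the plan exit code needs capturing.
          terraform init -input=false
          set +e
          terraform plan -detailed-exitcode -input=false -out=tfplan
          EXIT_CODE=$?
          set -e
          echo "exit=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          if [ "$EXIT_CODE" -eq 2 ]; then
            terraform show -json tfplan > plan.json
          elif [ "$EXIT_CODE" -ne 0 ]; then
            # Exit code 1 is a plan error, not drift.
            exit "$EXIT_CODE"
          fi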

      - name: Classify drift risk
        if: steps.plan.outputs.exit == '2'
        id: classify
        run: |
          HIGH_RISK=false
          # Replacements appear as delete plus create, so this catches both.
          if jq -e '.resource_changes[]? | select(.change.actions | index("delete"))' plan.json >/dev/null; then
            HIGH_RISK=true
          fi
          # Any change to networking, KMS, or IAM types forces manual review.
          while IFS= read -r TYPE; do
            case "$TYPE" in
              aws_security_group|aws_vpc_security_group_*|aws_network_acl|aws_vpc|aws_subnet|aws_kms_key|aws_iam_*)
                HIGH_RISK=true
                ;;
            esac
          done < <(jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .type' plan.json | sort -u)
          echo "high_risk=$HIGH_RISK" >> "$GITHUB_OUTPUT"

      - name: Auto-remediate low-risk drift
        if: steps.plan.outputs.exit == '2' && steps.classify.outputs.high_risk == 'false'
        run: terraform apply -input=false tfplan

      - name: Create issue for high-risk drift
        if: steps.plan.outputs.exit == '2' && steps.classify.outputs.high_risk == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const plan = JSON.parse(require('fs').readFileSync('plan.json', 'utf8'));
            // resource_changes can be absent when only outputs changed.
            const drift = (plan.resource_changes || []).filter(r => !r.change.actions.includes('no-op'));
            const lines = drift.map(r => `- ${r.address} (${r.type}): ${r.change.actions.join(', ')}`);
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `High-risk infrastructure drift - ${new Date().toISOString().split('T')[0]}`,
              body: ['Manual review required:', '', ...lines].join('\n'),
              labels: ['drift', 'security', 'high-priority']
            });

Drift checks complement pre-merge validation from our Terraform and Kitchen-Terraform testing guide.

When drift signals process failure, track remediation latency alongside SLO, SLI, and error budget practices for platform teams.

Tags: terraform, infrastructure-as-code, drift-detection, devops, policy-as-code
