keep live infrastructure aligned with Terraform desired state
Infrastructure drift detection and remediation with Terraform
Manual console edits and stale state silently diverge from Terraform code. This guide runs scheduled drift scans, classifies low-risk changes for auto-remediation, and adds backend locking, lifecycle rules, and policy guardrails.
Why infrastructure drift is an operational certainty
Production estates drift. Engineers patch security groups during incidents, providers change default behaviors, and resource attributes get modified outside IaC workflows. The gap between declared Terraform configuration and live cloud state drives mystery outages, audit failures, and apply plans that surprise on-call engineers. Drift falls into three buckets: configuration drift from console or CLI edits, state drift from concurrent applies or manual state edits, and external drift when provider defaults move underneath you. A scheduled terraform plan compares desired configuration plus state against refreshed provider data; it does not mutate infrastructure. The goal is not zero drift, but fast detection, an explicit resolution decision, and prevention that stops silent accumulation.
A three-layer strategy: detect, remediate, prevent
Effective drift management stacks three capabilities. Detection runs a read-only terraform plan on a schedule, typically every four to six hours per environment. Use terraform plan -refresh-only to preview how state differs from the live APIs without proposing infrastructure changes; terraform apply -refresh-only is the command that actually writes those differences into state. Remediation reconciles only pre-approved low-risk resource types through a saved plan file, never blind auto-apply, and never when the plan includes delete or replace actions. Prevention routes mutations through CI/CD with remote state locking, lifecycle rules for legitimately mutable fields, and policy-as-code on plan JSON before apply. HCP Terraform (formerly Terraform Cloud) exposes managed drift detection for registered workspaces; Spacelift and env0 offer similar views. The GitHub Actions patterns below are portable to any orchestrator.
Scheduled drift detection with detailed exit codes
terraform plan -detailed-exitcode returns 0 when no changes are needed, 1 on error, and 2 when the plan contains changes, which on a scheduled scan means drift. Persist human-readable output and a binary tfplan artifact. To classify changes in scripts, run terraform show -json tfplan. plan -out and plan -json serve different purposes: -out saves the binary plan, while -json only switches the command's own streaming output to machine-readable log lines. The saved plan file is the source of truth for both apply and JSON parsing. Prefer GitHub OIDC to AWS over long-lived access keys in scheduled workflows, and set terraform_wrapper: false in hashicorp/setup-terraform when you pipe plan output to files, so the raw output and exit code reach your script. Correlate findings with CloudTrail or equivalent audit logs. Deduplicate alerts by skipping new issues when an open drift issue already exists.
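The exit-code contract is easy to get subtly wrong, for example by treating 2 as a failure or 1 as "no drift". As a minimal sketch, assuming a Node-based wrapper around the plan step, the three-way branch could look like this. The function name interpretPlanExit and the shape of its return value are illustrative, not part of Terraform or any library.

```javascript
// Hypothetical helper: maps the exit code of
// `terraform plan -detailed-exitcode` to a scan outcome.
// Only the codes 0, 1, and 2 come from Terraform; everything else is illustrative.
function interpretPlanExit(exitCode) {
  switch (exitCode) {
    case 0:
      // Plan succeeded and found nothing to change
      return { status: 'clean', driftDetected: false, failed: false };
    case 2:
      // Plan succeeded and contains changes, i.e. drift on a scheduled scan
      return { status: 'drift', driftDetected: true, failed: false };
    default:
      // 1 is the error code; any unexpected code is also treated as a failed scan
      return { status: 'error', driftDetected: false, failed: true };
  }
}

console.log(interpretPlanExit(2).status);
```

Treating every code other than 0 and 2 as a failure keeps a broken scan from being recorded as a clean one.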
name: Infrastructure Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'   # every six hours
  workflow_dispatch:
permissions:
  contents: read
  issues: write      # required to open drift issues
  id-token: write    # required for OIDC federation to the cloud provider
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Configure cloud credentials here (for example via OIDC) before terraform init.
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8.0"
          terraform_wrapper: false
      - name: Terraform init
        run: terraform init -input=false
      - name: Terraform plan
        id: plan
        run: |
          # Exit code 2 means "changes present", so errexit is suspended around the plan
          set +e
          terraform plan -detailed-exitcode -input=false -out=tfplan 2>&1 | tee plan_output.txt
          EXIT_CODE=${PIPESTATUS[0]}
          set -e
          echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          if [ "$EXIT_CODE" -eq 2 ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            terraform show -json tfplan > plan.json
          fi
          # Fail the job on a plan error so a broken scan is not mistaken for a clean one
          if [ "$EXIT_CODE" -ne 0 ] && [ "$EXIT_CODE" -ne 2 ]; then
            exit "$EXIT_CODE"
          fi
      - name: Create issue on drift
        if: steps.plan.outputs.drift_detected == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            // Skip if an open drift issue already exists (alert deduplication)
            const { data: openIssues } = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'open',
              labels: 'drift',
            });
            if (openIssues.length > 0) return;
            const fs = require('fs');
            const planOutput = fs.readFileSync('plan_output.txt', 'utf8');
            // Truncate long plans to stay under the issue body size limit
            const truncated = planOutput.length > 60000
              ? planOutput.substring(0, 60000) + '\n\n... (truncated)'
              : planOutput;
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Infrastructure drift detected - ${new Date().toISOString().split('T')[0]}`,
              body: [
                'Scheduled drift scan found resources diverging from Terraform state.',
                '',
                '### Plan output',
                '```',
                truncated,
                '```',
                '',
                '### Next steps',
                '1. Decide: adopt in code, import, or revert live change',
                '2. Correlate with CloudTrail or audit logs',
                '3. Add ignore_changes only for documented exceptions'
              ].join('\n'),
              labels: ['drift', 'infrastructure']
            });

Risk-based auto-remediation for allowlisted resource types
Not all drift should auto-heal. Parse resource_changes[].type from plan JSON. Never infer the type from the address string: for module.vpc.aws_security_group.web, splitting on the first dot yields module rather than aws_security_group. Block any plan that includes delete or replace actions; in plan JSON a replace appears as a delete paired with a create, so a check for delete covers both. A changed CloudWatch alarm threshold may be low risk; a modified security group or IAM binding is not. Apply only the saved tfplan after allowlist and action checks pass.
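To make the classification rules concrete, here is a minimal sketch that applies them to the parsed output of terraform show -json. The function name classifyPlan, the return shape, and the sample plan are assumptions made for illustration; only the resource_changes[].type, .address, and .change.actions fields come from the plan JSON itself.

```javascript
// Hypothetical allowlist; mirror whatever types your team has pre-approved.
const ALLOWED_TYPES = new Set(['aws_cloudwatch_metric_alarm', 'aws_sns_topic']);

// Decide whether a parsed plan (terraform show -json output) may be auto-applied.
function classifyPlan(plan, allowedTypes = ALLOWED_TYPES) {
  // Ignore entries whose only action is "no-op"
  const changes = (plan.resource_changes || []).filter(
    (rc) => !rc.change.actions.every((action) => action === 'no-op')
  );

  // A replace is recorded as a delete paired with a create,
  // so looking for "delete" catches both deletes and replaces.
  const destructive = changes.filter((rc) => rc.change.actions.includes('delete'));
  if (destructive.length > 0) {
    return { autoApply: false, reason: 'destructive', addresses: destructive.map((rc) => rc.address) };
  }

  // Use the explicit type field, never a substring of the address
  const blocked = changes.filter((rc) => !allowedTypes.has(rc.type));
  if (blocked.length > 0) {
    return { autoApply: false, reason: 'type-not-allowlisted', addresses: blocked.map((rc) => rc.address) };
  }

  return {
    autoApply: changes.length > 0,
    reason: changes.length > 0 ? 'allowlisted' : 'no-changes',
    addresses: changes.map((rc) => rc.address),
  };
}

// Illustrative plan fragment, not real output
const samplePlan = {
  resource_changes: [
    { address: 'module.vpc.aws_security_group.web', type: 'aws_security_group', change: { actions: ['update'] } },
    { address: 'aws_cloudwatch_metric_alarm.cpu', type: 'aws_cloudwatch_metric_alarm', change: { actions: ['update'] } },
  ],
};

console.log(classifyPlan(samplePlan));
```

Because the sample includes a security group inside a module, the decision is manual review even though the alarm alone would have qualified.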
#!/bin/bash
set -euo pipefail

# Resource types that are pre-approved for automatic reconciliation
ALLOWED_TYPES=(
  aws_cloudwatch_metric_alarm
  aws_sns_topic
)

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Exit code 2 means "changes present", so errexit is suspended around the plan
set +e
terraform plan -detailed-exitcode -input=false -out=tfplan
EXIT_CODE=$?
set -e

if [ "$EXIT_CODE" -eq 0 ]; then
  log "No drift detected."
  exit 0
fi
if [ "$EXIT_CODE" -ne 2 ]; then
  log "Plan failed with exit code $EXIT_CODE."
  exit 1
fi

terraform show -json tfplan > plan.json

# Replace actions are recorded as delete plus create, so this also blocks replaces
if jq -e '.resource_changes[]? | select(.change.actions | index("delete"))' plan.json >/dev/null; then
  log "Delete or replace actions require manual review."
  exit 1
fi

# Every changed resource type must be on the allowlist
while IFS= read -r TYPE; do
  ALLOWED=false
  for allowed in "${ALLOWED_TYPES[@]}"; do
    if [ "$TYPE" = "$allowed" ]; then ALLOWED=true; break; fi
  done
  if [ "$ALLOWED" = false ]; then
    log "Manual review required for resource type $TYPE"
    exit 1
  fi
done < <(jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .type' plan.json | sort -u)

log "Applying allowlisted drift remediation from saved plan"
terraform apply -input=false tfplan
log "Auto-remediation completed"

Prevention guardrails: remote state, lifecycle, and policy
Centralize state in S3 with DynamoDB locking, or your cloud equivalent, and block ungated applies from laptops. Use lifecycle ignore_changes only for documented operational exceptions, not as a blanket mechanism for hiding drift. Sentinel integrates with HCP Terraform and Terraform Enterprise; Conftest with Rego is a practical open-source gate on terraform show -json output in any CI system. Require plan-before-apply: terraform plan -out=tfplan, policy check, then terraform apply tfplan.
# Remote state with locking
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    # Newer Terraform releases (1.10+) also support S3-native locking via use_lockfile
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# Lifecycle rules for fields that are legitimately changed outside Terraform
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  lifecycle {
    ignore_changes = [
      instance_type,
      tags["LastManualUpdate"],
    ]
    prevent_destroy = true
  }
}

# Conftest policy evaluated against terraform show -json output.
# This is pre-1.0 Rego syntax; OPA 1.0+ expects `deny contains msg if { ... }`.
package terraform.security

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  ingress := rc.change.after.ingress[_]
  ingress.cidr_blocks[_] == "0.0.0.0/0"
  ingress.from_port <= 22
  ingress.to_port >= 22
  msg := sprintf("public SSH denied on %s", [rc.address])
}

How to resolve drift once it is confirmed
Detection tells you that state diverged; it does not choose the fix. Use four decision paths. Accidental console change: apply the saved plan to reconcile live infrastructure back to code, or manually revert the cloud change if apply risk is too high. Intentional operational change: update Terraform configuration in a pull request, pass normal review, then apply through CI. Resource exists in the cloud but not in state: import with terraform import or a code-first import block, then verify with plan. Provider-default or read-only attribute noise: review the differences with terraform plan -refresh-only and accept them into state with terraform apply -refresh-only, or add a narrow ignore_changes entry tied to a ticket. Never auto-remediate when the plan destroys or replaces networking, identity, or data resources; those require human review and an audit trail entry.
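The decision paths can be written down as a small lookup so that triage tooling and runbooks stay consistent. This is only a sketch: the cause keys, the chooseResolution function, and the destructive and sensitive flags are hypothetical names chosen for illustration.

```javascript
// Hypothetical mapping from a triaged drift cause to its resolution path.
const RESOLUTIONS = new Map([
  ['accidental_change', 'Apply the saved plan, or revert the live change by hand if apply risk is too high'],
  ['intentional_change', 'Update the Terraform configuration in a pull request and apply through CI'],
  ['unmanaged_resource', 'Import the resource (terraform import or an import block), then verify with plan'],
  ['provider_noise', 'Accept with terraform apply -refresh-only, or add a narrow ignore_changes entry tied to a ticket'],
]);

// destructive: the plan deletes or replaces something
// sensitive: the affected resource is networking, identity, or data
function chooseResolution(cause, { destructive = false, sensitive = false } = {}) {
  if (!RESOLUTIONS.has(cause)) {
    throw new Error(`unknown drift cause: ${cause}`);
  }
  return {
    action: RESOLUTIONS.get(cause),
    // Destructive changes to sensitive resources always need a human and an audit entry
    humanReviewRequired: destructive && sensitive,
  };
}

console.log(chooseResolution('accidental_change', { destructive: true, sensitive: true }));
```

Throwing on an unknown cause forces someone to classify the event explicitly rather than letting it fall through to a default path.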
End-to-end pipeline: classify drift and branch remediation
Combine detection and remediation in one workflow: save tfplan, export JSON with terraform show -json, classify by resource type, auto-apply only when no high-risk types change and no delete actions appear, and open a security-labeled issue otherwise. Log every run even when no drift is found to track drift rate and time-to-remediation as platform health indicators. Each drift event is a process signal: was the manual change necessary because Terraform was too rigid, or because CI bypass is too easy?
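Drift rate and time-to-remediation only become usable indicators if every run is recorded in a consistent shape. As a minimal sketch, assuming each run is logged as a record with startedAt, driftDetected, and remediatedAt fields (hypothetical names, ISO 8601 timestamps), the two metrics could be derived like this:

```javascript
// Hypothetical run-log record:
//   { startedAt: string, driftDetected: boolean, remediatedAt: string | null }
function driftMetrics(runs) {
  const drifted = runs.filter((run) => run.driftDetected);
  const resolved = drifted.filter((run) => run.remediatedAt);

  // Total time between detection and remediation, in milliseconds
  const totalMs = resolved.reduce(
    (sum, run) => sum + (Date.parse(run.remediatedAt) - Date.parse(run.startedAt)),
    0
  );

  return {
    // Share of scans that found drift
    driftRate: runs.length === 0 ? 0 : drifted.length / runs.length,
    // Mean hours from detection to remediation, or null if nothing has been resolved yet
    meanHoursToRemediate: resolved.length === 0 ? null : totalMs / resolved.length / 3600000,
    // Drift events still waiting for a decision
    openDriftEvents: drifted.length - resolved.length,
  };
}

// Illustrative data only
const sampleRuns = [
  { startedAt: '2024-01-01T00:00:00Z', driftDetected: false, remediatedAt: null },
  { startedAt: '2024-01-01T06:00:00Z', driftDetected: true, remediatedAt: '2024-01-01T08:00:00Z' },
  { startedAt: '2024-01-01T12:00:00Z', driftDetected: false, remediatedAt: null },
  { startedAt: '2024-01-01T18:00:00Z', driftDetected: true, remediatedAt: '2024-01-01T22:00:00Z' },
];

console.log(driftMetrics(sampleRuns));
```

Runs with no drift still count in the denominator, which is why clean scans need to be logged too.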
name: Drift Detection and Remediation
on:
  schedule:
    - cron: '0 */4 * * *'   # every four hours
permissions:
  contents: read
  issues: write      # required to open high-risk drift issues
  id-token: write    # required for OIDC federation to the cloud provider
jobs:
  detect-and-remediate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Configure cloud credentials here (for example via OIDC) before terraform init.
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false
      - name: Init and plan
        id: plan
        run: |
          # Let init failures abort the step; only the plan exit code needs special handling
          terraform init -input=false
          set +e
          terraform plan -detailed-exitcode -input=false -out=tfplan
          EXIT_CODE=$?
          set -e
          echo "exit=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          if [ "$EXIT_CODE" -eq 2 ]; then
            terraform show -json tfplan > plan.json
          fi
          # Fail on plan errors so a broken scan is not mistaken for a clean one
          if [ "$EXIT_CODE" -ne 0 ] && [ "$EXIT_CODE" -ne 2 ]; then
            exit "$EXIT_CODE"
          fi
      - name: Classify drift risk
        if: steps.plan.outputs.exit == '2'
        id: classify
        run: |
          HIGH_RISK=false
          # Replace actions are recorded as delete plus create, so this covers both
          if jq -e '.resource_changes[]? | select(.change.actions | index("delete"))' plan.json >/dev/null; then
            HIGH_RISK=true
          fi
          while IFS= read -r TYPE; do
            case "$TYPE" in
              aws_security_group|aws_vpc_security_group_*|aws_network_acl|aws_vpc|aws_subnet|aws_kms_key)
                HIGH_RISK=true
                ;;
            esac
            if [[ "$TYPE" == aws_iam_* ]]; then HIGH_RISK=true; fi
          done < <(jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .type' plan.json | sort -u)
          echo "high_risk=$HIGH_RISK" >> "$GITHUB_OUTPUT"
      - name: Auto-remediate low-risk drift
        if: steps.plan.outputs.exit == '2' && steps.classify.outputs.high_risk == 'false'
        run: terraform apply -input=false tfplan
      - name: Create issue for high-risk drift
        if: steps.plan.outputs.exit == '2' && steps.classify.outputs.high_risk == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const changes = JSON.parse(require('fs').readFileSync('plan.json', 'utf8'));
            // List every resource with a real change, one line per address
            const drift = changes.resource_changes.filter(r => !r.change.actions.includes('no-op'));
            const lines = drift.map(r => `- ${r.address} (${r.type}): ${r.change.actions.join(', ')}`);
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `High-risk infrastructure drift - ${new Date().toISOString().split('T')[0]}`,
              body: ['Manual review required:', '', ...lines].join('\n'),
              labels: ['drift', 'security', 'high-priority']
            });

Drift checks complement pre-merge validation from our Terraform and Kitchen-Terraform testing guide.
When drift signals process failure, track remediation latency alongside SLO, SLI, and error budget practices for platform teams.
