Replace VMs by rebaking images instead of patching them in place
Immutable infrastructure with Packer and Terraform: zero-downtime VM deployments at scale
14 min read
Ad hoc SSH patches and config drift make mutable VMs unreliable. Bake application and OS state into images with Packer, provision infrastructure with Terraform, and replace instances through an Auto Scaling rolling instance refresh, without ever treating servers as pets.
Why mutable VMs drift and where immutability still matters
The classic mutable server lifecycle ends in archaeology: SSH in, install a package, tweak a config, restart a service. Weeks later another engineer repeats the process. Identical machines diverge, staging stops matching production, and security patches become roulette. Containers enforce immutability at the process layer, but not every workload fits a container: legacy OS dependencies, GPU or latency-sensitive jobs, databases on local disks, and regulated images with fixed OS baselines still run on VMs. The goal is container-like discipline—rebuild instead of patch—applied to machine images. When something changes, bake a new AMI or cloud image and replace instances; never mutate production in place.
Bake then deploy: separate image and infrastructure cadences
Phase one uses Packer to produce a machine image with OS packages, application artifacts, monitoring agents, and hardening scripts baked in. The artifact is immutable after build: any update means a new image ID. Phase two uses Terraform to wire networking, Auto Scaling Groups, load balancers, and IAM around that image. Application releases flow through the image pipeline; capacity, subnet, and security group changes flow through the infrastructure pipeline. Independent cadences keep each change small and testable. The end-to-end flow: a commit triggers CI to build application artifacts, Packer bakes them into an AMI, the Packer manifest records the AMI ID, Terraform writes that ID into a new launch template version, and an instance refresh rolls replacement instances in behind the load balancer.
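The manifest's artifact_id is what links the two phases. For the amazon-ebs builder it is written as region:ami-id, with comma-separated pairs when the image is copied to several regions (for example us-east-1:ami-111,us-west-2:ami-222), so a plain split on the colon only works for single-region builds. A minimal sketch, assuming that format, of picking the AMI for the region Terraform deploys to; ami_for_region is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the AMI ID for one region from a Packer amazon-ebs artifact_id.
# Assumed format: "region:ami-id" pairs joined by commas.
# Returns 1 (and prints nothing) when the region is not present.
ami_for_region() {
  local artifact_id="$1" region="$2" pair
  local -a pairs=()
  IFS=',' read -r -a pairs <<< "$artifact_id"
  for pair in "${pairs[@]}"; do
    # ${pair%%:*} is the region part, ${pair#*:} the AMI ID.
    if [[ "${pair%%:*}" == "$region" ]]; then
      printf '%s\n' "${pair#*:}"
      return 0
    fi
  done
  return 1
}

ami_for_region "us-east-1:ami-111,us-west-2:ami-222" "us-west-2"   # prints ami-222
```

Returning a non-zero status for a missing region lets a deploy script under set -e stop before Terraform ever sees an empty ami_id.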
Packer template: provision, validate, emit manifest
Pin the Amazon plugin, parameterize app_version and base_ami, tag images with build provenance, and run a self-test before Packer registers the AMI. Copy the application tarball produced by CI, install monitoring agents (pin exact package versions for reproducible builds), and emit packer-manifest.json for downstream Terraform. Keep user_data in Terraform minimal: secrets and environment-specific values belong in SSM Parameter Store, read at boot through the instance role, not baked into the golden image.
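Because the template builds ami_name from app_version, CI can check a candidate name up front and fail with a clear message at the start of the job. A minimal sketch, assuming EC2's documented AMI name rules (3 to 128 characters drawn from letters, digits, spaces, and the symbols ( ) [ ] . / - ' @ _); check_ami_name is a hypothetical helper, and Packer's own template validation remains authoritative.

```shell
#!/usr/bin/env bash
# Return 0 if the name satisfies the assumed EC2 AMI name rules:
# 3-128 characters from letters, digits, space, and ( ) [ ] . / - ' @ _
check_ami_name() {
  local name="$1"
  # "]" must come first inside the bracket expression and "-" last to be literal.
  local pattern="^[][A-Za-z0-9 ()./@_'-]{3,128}$"
  [[ "$name" =~ $pattern ]]
}

check_ami_name "webserver-1.4.2-20240501120000" && echo "ok"
check_ami_name "webserver-1.4.2-2024-05-01T12:00:00Z" || echo "rejected: contains ':'"
```

An RFC 3339 timestamp with colons fails the check, so any timestamp embedded in the name needs its separators stripped first.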
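packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = "~> 1.3"
    }
  }
}

variable "app_version" { type = string }
variable "base_ami" { type = string }

# Build the timestamp with HCL2 functions; AMI names cannot contain colons,
# so strip the separators from the RFC 3339 value.
locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "webserver" {
  ami_name      = "webserver-${var.app_version}-${local.timestamp}"
  instance_type = "t3.medium"
  region        = "us-east-1"
  source_ami    = var.base_ami
  ssh_username  = "ec2-user"
  tags = {
    Name       = "webserver-${var.app_version}"
    AppVersion = var.app_version
    SourceAMI  = var.base_ami
    BuildTime  = local.timestamp
    ManagedBy  = "packer"
  }
}

build {
  sources = ["source.amazon-ebs.webserver"]

  provisioner "shell" {
    inline = [
      "sudo dnf update -y",
      "sudo dnf install -y amazon-cloudwatch-agent awscli jq",
      "sudo systemctl enable amazon-cloudwatch-agent",
    ]
  }

  # Relative to the directory packer build runs in (the repository root).
  provisioner "file" {
    source      = "build/app-${var.app_version}.tar.gz"
    destination = "/tmp/app.tar.gz"
  }

  # Assumes the base image or the tarball already provides the appuser
  # account and the app-server systemd unit.
  provisioner "shell" {
    inline = [
      "sudo mkdir -p /opt/app",
      "sudo tar -xzf /tmp/app.tar.gz -C /opt/app",
      "sudo chown -R appuser:appuser /opt/app",
      "sudo systemctl enable app-server",
      "rm -f /tmp/app.tar.gz",
    ]
  }

  # A failing self-test aborts the build before the AMI is registered.
  provisioner "shell" {
    inline = ["sudo /opt/app/bin/healthcheck --self-test"]
  }

  # path.root is the template's directory, so this writes
  # packer/packer-manifest.json, the path the deploy script reads.
  post-processor "manifest" {
    output     = "${path.root}/packer-manifest.json"
    strip_path = true
  }
}
Terraform: launch template, Auto Scaling Group, and rolling instance refresh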
Reference the baked ami_id in a launch template with create_before_destroy; changing image_id publishes a new launch template version. Attach the Auto Scaling Group to an Application Load Balancer target group and use ELB health checks, so an instance counts as healthy only once the load balancer sees it pass. Trigger a rolling instance refresh whenever the launch template changes, so replacements respect min_healthy_percentage and instance_warmup. Ignore desired_capacity drift if scaling policies adjust capacity outside Terraform.
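To check whether min_healthy_percentage and instance_warmup fit a deployment window, a rough estimate helps. This is a back-of-the-envelope sketch, not AWS's exact algorithm: it assumes the refresh keeps at least ceil(N × pct / 100) of N instances in service, replaces the remainder per batch (never fewer than one), and that each batch takes about instance_warmup seconds. Launch time, health check thresholds, and deregistration delay all add to the real figure. refresh_estimate is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Rough instance refresh estimate under simplifying assumptions:
#   keep    = ceil(n * pct / 100) instances stay in service
#   batch   = n - keep instances replaced at once, minimum 1
#   batches = ceil(n / batch); each batch takes ~warmup seconds
refresh_estimate() {
  local n="$1" pct="$2" warmup="$3"
  local keep=$(( (n * pct + 99) / 100 ))
  local batch=$(( n - keep ))
  if (( batch < 1 )); then
    batch=1
  fi
  local batches=$(( (n + batch - 1) / batch ))
  echo "batch=${batch} batches=${batches} seconds=$(( batches * warmup ))"
}

refresh_estimate 4 75 120    # batch=1 batches=4 seconds=480
refresh_estimate 10 75 120   # batch=2 batches=5 seconds=600
```

With min_healthy_percentage = 75 and instance_warmup = 120, a four-instance group is replaced one instance at a time over roughly eight minutes of warmup, while ten instances go two at a time in five batches. A 100 percent minimum degrades to one instance per batch.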
resource "aws_launch_template" "webserver" {
  name_prefix   = "webserver-"
  image_id      = var.ami_id
  instance_type = "t3.medium"

  iam_instance_profile {
    name = aws_iam_instance_profile.webserver.name
  }

  network_interfaces {
    security_groups             = [aws_security_group.webserver.id]
    associate_public_ip_address = false
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name       = "webserver"
      AppVersion = var.app_version
      AmiId      = var.ami_id
      ManagedBy  = "terraform"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "webserver" {
  name                = "webserver-asg"
  desired_capacity    = var.desired_capacity
  min_size            = var.desired_capacity
  max_size            = var.desired_capacity + 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id = aws_launch_template.webserver.id
    # Reference the concrete version: "$Latest" never changes in the plan,
    # so a new AMI would not start an instance refresh.
    version = aws_launch_template.webserver.latest_version
  }

  target_group_arns         = [aws_lb_target_group.webserver.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 120

  # The AWS provider configures instance refresh on the ASG itself (there is
  # no separate refresh resource); a refresh starts when launch_template changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 75
      instance_warmup        = 120
    }
  }

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

resource "aws_lb_target_group" "webserver" {
  name     = "webserver-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/healthz"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
CI pipeline: bake, apply, wait for refresh
Chain the Packer build, extract the AMI ID from the manifest, run terraform apply with the new variables, then poll the instance refresh until it reports Successful. The AWS CLI has no waiter for instance refreshes (there is no aws autoscaling wait instance-refresh command), so call describe-instance-refreshes in a loop and treat any other terminal state as a failure. Store previous AMI IDs in SSM or a manifest registry so rollback is a single command.
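A minimal sketch of that rollback registry, using a local text file as a hypothetical stand-in for SSM Parameter Store or a manifest registry: each successful deploy appends a version and AMI ID, and the rollback target is the entry before the newest. record_release, previous_release, and the REGISTRY path are illustrative names, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical registry file standing in for SSM Parameter Store:
# one "version ami-id" line per successful deploy, oldest first.
REGISTRY="${REGISTRY:-releases.txt}"

record_release() {
  local version="$1" ami_id="$2"
  printf '%s %s\n' "$version" "$ami_id" >> "$REGISTRY"
}

# Print "version ami-id" for the release before the newest one.
previous_release() {
  local count=0
  if [[ -f "$REGISTRY" ]]; then
    count=$(wc -l < "$REGISTRY")
  fi
  if (( count < 2 )); then
    echo "no previous release to roll back to" >&2
    return 1
  fi
  tail -n 2 "$REGISTRY" | head -n 1
}
```

A rollback reads that pair and re-runs terraform apply with it as app_version and ami_id, which rolls the fleet back through the same instance refresh path as a forward deploy.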
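#!/usr/bin/env bash
set -euo pipefail

APP_VERSION="${1:?Usage: deploy.sh <version>}"

packer init packer/
packer build -var "app_version=${APP_VERSION}" packer/web-server.pkr.hcl

# The manifest accumulates builds across runs; select the one from this run.
AMI_ID=$(jq -r '.last_run_uuid as $run
  | .builds[] | select(.packer_run_uuid == $run) | .artifact_id' \
  packer/packer-manifest.json | cut -d: -f2)

cd infrastructure/
terraform init -input=false
terraform apply -input=false -auto-approve \
  -var "app_version=${APP_VERSION}" \
  -var "ami_id=${AMI_ID}"

# Assumes the first entry is the refresh this apply just started.
REFRESH_ID=$(aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name webserver-asg \
  --query 'InstanceRefreshes[0].InstanceRefreshId' --output text)

# Poll until a terminal state; any outcome other than Successful fails the job.
while true; do
  STATUS=$(aws autoscaling describe-instance-refreshes \
    --auto-scaling-group-name webserver-asg \
    --instance-refresh-ids "$REFRESH_ID" \
    --query 'InstanceRefreshes[0].Status' --output text)
  case "$STATUS" in
    Successful) break ;;
    Pending|InProgress|Baking|Cancelling|RollbackInProgress) sleep 15 ;;
    *) echo "Instance refresh ${REFRESH_ID} ended with status ${STATUS}" >&2; exit 1 ;;
  esac
done

echo "Deployed ${APP_VERSION} on AMI ${AMI_ID}"
Operational practices: provenance, no SSH, lean images, and rollback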
Tag every image and instance with commit SHA, build ID, and template version. Block SSH to production with security groups and detect configuration drift with agents rather than fixing it by hand. Pin package versions in Packer provisioners. Split a hardened base image, rebuilt quarterly, from frequently rebaked application layers. Run integration tests against a temporary instance launched from each new AMI before promoting its ID. Retain the last three to five images: rolling back means re-applying Terraform with a previous ami_id, which moves the fleet through another instance refresh. Keep application bake times under ten minutes with caching and smaller artifacts. Wire the full lineage into CI so auditors can trace commit to AMI to instance refresh without manual steps. Immutable VMs deliver reproducibility and safe rollback for workloads that still need the machine model, without the pet-server habit.
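Retention is straightforward to automate once images carry a ManagedBy tag. A minimal sketch of the selection step: given AMI IDs oldest first, print everything outside the newest N as deregistration candidates. amis_to_prune is a hypothetical helper; the AWS CLI pipeline in the comment is one way to feed it and needs credentials to run.

```shell
#!/usr/bin/env bash
# Example feed (requires AWS credentials):
#   aws ec2 describe-images --owners self \
#     --filters "Name=tag:ManagedBy,Values=packer" \
#     --query 'sort_by(Images, &CreationDate)[].ImageId' --output text \
#     | tr '\t' '\n' | amis_to_prune 5
# Deregistering an AMI does not delete its EBS snapshots, and any AMI still
# referenced by a launch template version should be skipped.

# Read AMI IDs oldest first on stdin; print all but the newest $1.
amis_to_prune() {
  local keep="$1" ami
  local -a amis=()
  while IFS= read -r ami; do
    [[ -n "$ami" ]] && amis+=("$ami")
  done
  local excess=$(( ${#amis[@]} - keep ))
  if (( excess > 0 )); then
    printf '%s\n' "${amis[@]:0:excess}"
  fi
}
```

Keeping selection separate from deregistration makes it easy to review the candidate list in CI before anything is deleted.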
Validate Terraform modules before production rollout using patterns from our Terraform and Kitchen-Terraform testing guide.
Immutable deploys still need drift detection on the orchestration layer, covered in our infrastructure drift detection and remediation with Terraform guide.
