Make disaster recovery repeatable, testable, and aligned to RTO and RPO
Disaster Recovery as Code: automate RTO (recovery time) and RPO (recovery point) with infrastructure templates
RTO caps how long a service may stay down; RPO caps how much data you can lose. This guide encodes both targets in Terraform, automates backup verification, provisions failover infrastructure, and orchestrates recovery with tested pipelines.
What RTO and RPO mean—and why every DR plan needs them
Recovery Time Objective (RTO) is the maximum acceptable downtime after a failure: how quickly the service must be running again. Recovery Point Objective (RPO) is the maximum acceptable data loss: how far back in time you are willing to restore. If RPO is one hour, backups or replication must be at least that fresh; if RTO is four hours, your runbooks, failover stack, and drills must prove you can recover within that window. These two numbers drive backup cadence, multi-region spend, and orchestration depth—they turn vague “we need DR” requests into measurable engineering targets you can test in game days.
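As a minimal sketch of what "measurable" means here, the check below compares one drill measurement against the two targets. The class name RecoveryTargets and its breaches method are hypothetical, and the numbers reuse the four-hour RTO and one-hour RPO from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """RTO and RPO in minutes (hypothetical helper, not part of any library)."""
    rto_minutes: float
    rpo_minutes: float

    def breaches(self, downtime_minutes: float, data_loss_minutes: float) -> list[str]:
        """Describe every target that a measured drill result misses."""
        found = []
        if downtime_minutes > self.rto_minutes:
            found.append(f"RTO: down {downtime_minutes:g}m, target {self.rto_minutes:g}m")
        if data_loss_minutes > self.rpo_minutes:
            found.append(f"RPO: lost {data_loss_minutes:g}m, target {self.rpo_minutes:g}m")
        return found


# Targets from the example in the text: RTO of four hours, RPO of one hour.
targets = RecoveryTargets(rto_minutes=240, rpo_minutes=60)
print(targets.breaches(downtime_minutes=190, data_loss_minutes=45))
print(targets.breaches(downtime_minutes=190, data_loss_minutes=75))
```

An empty list means the drill met both targets; anything else names the target that was missed and by how much.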
Why manual disaster recovery plans fail in real incidents
Most teams have a disaster recovery document that was written once and rarely exercised. When production fails, runbooks reference decommissioned services, backup formats have changed without notice, and the restore order lives in someone's memory. Four gaps repeat across organizations: documentation drifts from live infrastructure, restore procedures are never timed end to end, RTO and RPO targets vary by team and are never measured, and cutover steps are tribal knowledge. Disaster Recovery as Code (DRaC) applies IaC discipline to recovery: versioned templates, automated drills, auditable changes, and explicit targets you can test.
DRaC building blocks: targets, topology, verification, orchestration
Declaring RTO and RPO in code does not guarantee them; you prove both with timed drills. Pick a recovery topology deliberately. Backup and restore costs the least but has the highest RTO. Pilot light keeps minimal DR resources running and scales them up on failover. Warm standby runs a reduced replica stack continuously. Multi-region active-passive or active-active costs more but shrinks RTO and RPO. Whichever you choose, declare targets as versioned variables, provision DR infrastructure from the same modules as production with region or account variants, verify backups with scheduled restore tests, and orchestrate cutover with an explicit dependency graph rather than ad hoc runbook steps.
Declare RTO and RPO targets and align backup cadence
Store RTO and RPO as Terraform variables and review changes to them like any other production change. Map RPO to a backup schedule that AWS Backup actually supports: schedules are cron expressions, not arbitrary minute rates, and the finest scheduled cadence is typically hourly, so an RPO under an hour needs continuous backup (point-in-time recovery) or replication rather than a tighter cron. AWS Organizations backup policies are useful at enterprise scale; aws_backup_plan is a clearer starting point for application teams. Document that meeting RPO requires a replication lag or backup interval shorter than the target, plus monitoring of last-successful-backup age.
variable "rto_minutes" {
  description = "Maximum acceptable service downtime in minutes"
  type        = number
  default     = 60
}

variable "rpo_minutes" {
  description = "Maximum acceptable data loss window in minutes"
  type        = number
  default     = 60
}

locals {
  # The backup interval must be shorter than the RPO: hourly for targets up to
  # 24 hours, daily beyond that. Targets of an hour or less also need continuous
  # backup or replication, because an hourly cron alone cannot meet them.
  backup_schedule = var.rpo_minutes <= 1440 ? "cron(0 * * * ? *)" : "cron(0 0 * * ? *)"
}

resource "aws_backup_vault" "dr" {
  name = "application-dr-vault"
}

# Resources are assigned to this plan with an aws_backup_selection (not shown).
resource "aws_backup_plan" "application" {
  name = "application-dr-plan"

  rule {
    rule_name         = "rpo-aligned-backup"
    target_vault_name = aws_backup_vault.dr.name
    schedule          = local.backup_schedule

    lifecycle {
      delete_after = 35 # days
    }
  }
}

Automate backup restore drills that measure RPO and RTO
Backups without restore tests are storage costs. Run isolated drills that restore to a throwaway instance name, measure elapsed time for the database availability waiter, and assert snapshot age against RPO. Full RTO includes DNS propagation and application warm-up—scope drill metrics clearly so teams do not confuse database restore time with end-to-end service recovery. Publish results to CloudWatch or your incident metrics store and alert on failure.
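One way to keep database restore time separate from end-to-end recovery time is to record each phase of a drill on its own. The sketch below assumes a hypothetical DrillReport class; the phase names, the metric name DrillPhaseDuration, and the Service and Phase dimensions are illustrative choices, and the timings are made-up inputs rather than measurements. The metric_data output is shaped for the MetricData argument of CloudWatch put_metric_data, but nothing is sent here.

```python
from dataclasses import dataclass, field


@dataclass
class DrillReport:
    """Elapsed seconds per recovery phase, recorded by the drill runner."""
    phases: dict[str, float] = field(default_factory=dict)

    def record(self, phase: str, seconds: float) -> None:
        self.phases[phase] = seconds

    def end_to_end_minutes(self) -> float:
        # End-to-end recovery time is the sum of every phase,
        # not just the database restore.
        return sum(self.phases.values()) / 60

    def metric_data(self, service: str) -> list[dict]:
        # One datum per phase, in the shape CloudWatch expects for MetricData.
        return [
            {
                "MetricName": "DrillPhaseDuration",
                "Dimensions": [
                    {"Name": "Service", "Value": service},
                    {"Name": "Phase", "Value": phase},
                ],
                "Value": seconds,
                "Unit": "Seconds",
            }
            for phase, seconds in self.phases.items()
        ]


report = DrillReport()
report.record("database_restore", 1260)  # illustrative numbers only
report.record("dns_cutover", 150)
report.record("application_warmup", 300)
print(f"database only: {report.phases['database_restore'] / 60:.1f}m")
print(f"end to end:    {report.end_to_end_minutes():.1f}m")
```

Reporting the phases separately makes it obvious when a drill that "passed" only timed the database and left DNS and warm-up out of the RTO figure.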
import time
from datetime import datetime, timezone

import boto3


class BackupVerifier:
    """Restore the newest snapshot to a throwaway instance and check RPO and RTO."""

    def __init__(self, rto_minutes=60, rpo_minutes=60, snapshot_type="automated"):
        self.rto_minutes = rto_minutes
        self.rpo_minutes = rpo_minutes
        # "automated" covers RDS automated backups; AWS Backup snapshots use "awsbackup".
        self.snapshot_type = snapshot_type
        self.rds = boto3.client("rds")

    def test_restore(self, source_instance_id, drill_instance_id):
        start = time.time()

        # Page through all snapshots and keep only completed ones.
        paginator = self.rds.get_paginator("describe_db_snapshots")
        snapshots = [
            snapshot
            for page in paginator.paginate(
                DBInstanceIdentifier=source_instance_id,
                SnapshotType=self.snapshot_type,
            )
            for snapshot in page["DBSnapshots"]
            if snapshot.get("Status") == "available"
        ]
        if not snapshots:
            raise RuntimeError(f"No available {self.snapshot_type} snapshots found")

        latest = max(snapshots, key=lambda item: item["SnapshotCreateTime"])
        snapshot_time = latest["SnapshotCreateTime"]
        if snapshot_time.tzinfo is None:
            snapshot_time = snapshot_time.replace(tzinfo=timezone.utc)

        # RPO check: the newest snapshot must be no older than the target.
        snapshot_age = (datetime.now(timezone.utc) - snapshot_time).total_seconds() / 60
        if snapshot_age > self.rpo_minutes:
            raise AssertionError(
                f"RPO breach: snapshot age {snapshot_age:.1f}m exceeds {self.rpo_minutes}m"
            )

        # Restore to an isolated drill instance; the caller deletes it afterwards.
        self.rds.restore_db_instance_from_db_snapshot(
            DBInstanceIdentifier=drill_instance_id,
            DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
            DBInstanceClass="db.r6g.large",
            MultiAZ=False,
            PubliclyAccessible=False,
            DeletionProtection=False,
        )

        # The default waiter gives up after 30 minutes; poll for the full RTO budget instead.
        waiter = self.rds.get_waiter("db_instance_available")
        waiter.wait(
            DBInstanceIdentifier=drill_instance_id,
            WaiterConfig={"Delay": 30, "MaxAttempts": int(self.rto_minutes * 2) + 1},
        )

        # This measures database restore time only, not end-to-end service recovery.
        elapsed = (time.time() - start) / 60
        if elapsed > self.rto_minutes:
            raise AssertionError(
                f"RTO breach: restore took {elapsed:.1f}m, target {self.rto_minutes}m"
            )

        return {
            "status": "PASS",
            "rto_actual_minutes": round(elapsed, 1),
            "rpo_actual_minutes": round(snapshot_age, 1),
            "snapshot_id": latest["DBSnapshotIdentifier"],
        }

Provision DR infrastructure and DNS failover as code
Define secondary region or account stacks as module variants of production and sync them through the same CI gates. A scheduled terraform plan in the DR workspace catches template drift before an incident. Route 53 failover routing moves traffic when the primary health check fails: use FQDN-based HTTPS checks, or rely on ALB target health evaluation with alias records. DNS TTL and resolver caching still add minutes to cutover, so factor that into RTO budgets.
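To see how DNS eats into the RTO budget, a rough estimate adds the time needed to detect the failure to the time cached answers take to expire. The function below is a hypothetical sketch, not a Route 53 API. It uses a failure threshold of 3 and a 30-second check interval as example inputs and assumes a 60-second record TTL. Real cutover can take longer when resolvers ignore TTLs.

```python
def estimated_dns_cutover_seconds(
    failure_threshold: int,
    request_interval_seconds: int,
    record_ttl_seconds: int,
) -> int:
    """Rough estimate: failure detection time plus cache expiry time."""
    # Consecutive failed checks needed before the endpoint is marked unhealthy.
    detection = failure_threshold * request_interval_seconds
    # Clients may keep using the old answer until its TTL runs out.
    return detection + record_ttl_seconds


# 3 failed checks at 30s intervals, plus an assumed 60s TTL on the record.
budget = estimated_dns_cutover_seconds(3, 30, 60)
print(f"DNS cutover budget: {budget}s ({budget / 60:.1f}m)")
```

Subtracting this figure from the RTO shows how much time is left for data restore and application warm-up.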
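name: DR Infrastructure Sync
on:
  push:
    paths: ['infrastructure/**']
  schedule:
    - cron: '0 */6 * * *'
  workflow_dispatch:
jobs:
  dr-sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false
      # Cloud credentials for the DR account must be configured before this step (not shown).
      - name: Terraform plan (DR region)
        id: plan
        working-directory: infrastructure/dr
        run: |
          terraform init -input=false
          # -detailed-exitcode returns 0 (no changes), 1 (error), or 2 (changes pending).
          # Capture it so that pending changes do not fail the step.
          set +e
          terraform plan -detailed-exitcode -input=false -var="environment=dr" -out=dr.tfplan
          code=$?
          set -e
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          if [ "$code" -eq 1 ]; then exit 1; fi
      - name: Apply DR infrastructure
        if: github.ref == 'refs/heads/main' && steps.plan.outputs.exitcode == '2'
        working-directory: infrastructure/dr
        run: terraform apply -input=false dr.tfplan

resource "aws_route53_health_check" "primary" {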
  fqdn              = var.primary_healthcheck_fqdn
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30

  tags = {
    Name = "primary-api-health-check"
  }
}

resource "aws_route53_record" "failover_primary" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "failover_secondary" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.dr_alb_dns
    zone_id                = var.dr_alb_zone_id
    evaluate_target_health = true
  }
}

Orchestrate recovery with an explicit dependency graph
Tie validation, data restore, cache warm-up, DNS cutover, and post-cutover health checks into a workflow you can trigger during game days. Argo Workflows suits Kubernetes estates; Step Functions or a CI workflow may fit simpler stacks. Keep human approval gates for declaring disaster and executing DNS failover in production—these are policy decisions, not fully autonomous actions.
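The idea of an explicit dependency graph with a human gate can be sketched in a few lines of Python using the standard library's graphlib. The step names, the APPROVAL_REQUIRED set, and plan_recovery are hypothetical labels for the stages listed above; a real orchestrator would run the steps rather than just list them.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
RECOVERY_GRAPH = {
    "validate-dr-infra": set(),
    "restore-databases": {"validate-dr-infra"},
    "sync-cache-layer": {"restore-databases"},
    "update-dns": {"sync-cache-layer"},
    "verify-recovery": {"update-dns"},
}

# Steps that need a named human to approve them before they run.
APPROVAL_REQUIRED = {"update-dns"}


def plan_recovery(graph, approved):
    """Return steps in dependency order, stopping at the first unapproved gate."""
    plan = []
    for step in TopologicalSorter(graph).static_order():
        if step in APPROVAL_REQUIRED and step not in approved:
            plan.append(f"WAIT for approval: {step}")
            break
        plan.append(step)
    return plan


print(plan_recovery(RECOVERY_GRAPH, approved=set()))
print(plan_recovery(RECOVERY_GRAPH, approved={"update-dns"}))
```

Because the order is derived from declared dependencies, adding a new data store means adding one entry to the graph instead of editing a numbered runbook. TopologicalSorter also raises an error if someone introduces a circular dependency.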
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: disaster-recovery
spec:
  entrypoint: recover
  templates:
    - name: recover
      dag:
        tasks:
          - name: validate-dr-infra
            template: validate-infra
          - name: restore-databases
            template: restore-db
            dependencies: [validate-dr-infra]
          - name: sync-cache-layer
            template: sync-cache
            dependencies: [restore-databases]
          - name: update-dns
            template: dns-failover
            dependencies: [sync-cache-layer]
          - name: verify-recovery
            template: health-check
            dependencies: [update-dns]
    - name: validate-infra
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/validate_dr_infra.py]
    - name: restore-db
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/restore_database.py, --from-latest-snapshot, --verify-rpo]
    - name: sync-cache
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/warm_cache.py]
    - name: dns-failover
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/failover_dns.py]
    - name: health-check
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/verify_recovery.py]

Operational practices that keep DR credible
Run timed game days monthly in an isolated account or region, not annually from a slide deck. Version DR templates beside production and review DR impact in the same pull request when adding data stores. Alert when the age of the last successful backup exceeds RPO-derived thresholds. Store backups in immutable storage, such as S3 with Object Lock, as a ransomware defense. Right-size DR spend: pilot light or warm standby for most workloads, active-active only when the business requires sub-minute RTO. Continuously audit backup encryption, retention, and cross-region copy with AWS Config or OPA. Document who may declare a disaster and when to fail over versus fail back. DR is an ongoing practice, and DRaC makes it testable instead of theoretical.
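An RPO-derived alert threshold can be as simple as comparing backup age to the target, with an early warning before the breach. This is a minimal sketch: backup_age_status is a hypothetical helper, and the 0.8 warning fraction is an assumed margin, not a value from any standard.

```python
from datetime import datetime, timedelta, timezone


def backup_age_status(last_success, now, rpo_minutes, warn_fraction=0.8):
    """Classify the age of the last successful backup against the RPO."""
    age_minutes = (now - last_success).total_seconds() / 60
    if age_minutes > rpo_minutes:
        return "BREACH"
    # Warn early so there is time to act before the target is missed.
    if age_minutes > rpo_minutes * warn_fraction:
        return "WARN"
    return "OK"


# Fixed reference time so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for minutes_ago in (30, 50, 70):
    last_success = now - timedelta(minutes=minutes_ago)
    status = backup_age_status(last_success, now, rpo_minutes=60)
    print(f"last backup {minutes_ago}m ago -> {status}")
```

In practice the timestamp would come from the backup service or a metrics store, and the returned status would feed whatever alerting system the team already uses.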
RTO and RPO targets should be explicit reliability commitments, which we define in our SLO, SLI, and error budget guide for platform teams.
Recovery drills pair naturally with controlled failure practice from our Chaos Engineering playbook for DevOps.
