Make disaster recovery repeatable, testable, and aligned to RTO and RPO

Disaster Recovery as Code: automate RTO (recovery time) and RPO (recovery point) with infrastructure templates

13 min read

RTO caps how long a service may stay down; RPO caps how much data you can lose. This guide encodes both targets in Terraform, automates backup verification, provisions failover infrastructure, and orchestrates recovery with tested pipelines.

What RTO and RPO mean—and why every DR plan needs them

Recovery Time Objective (RTO) is the maximum acceptable downtime after a failure: how quickly the service must be running again. Recovery Point Objective (RPO) is the maximum acceptable data loss: how far back in time you are willing to restore. If RPO is one hour, backups or replication must be at least that fresh; if RTO is four hours, your runbooks, failover stack, and drills must prove you can recover within that window. These two numbers drive backup cadence, multi-region spend, and orchestration depth—they turn vague “we need DR” requests into measurable engineering targets you can test in game days.
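
As a minimal sketch of how the two numbers become checks, the Python below compares a backup cadence and a list of recovery steps against the targets. It assumes each backup captures state as of its start time, so the worst-case loss is one full interval plus the time the backup takes to complete. The names RecoveryTargets, worst_case_data_loss, and check_plan, and all durations, are hypothetical placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecoveryTargets:
    rto_minutes: float  # maximum acceptable downtime
    rpo_minutes: float  # maximum acceptable data loss window


def worst_case_data_loss(interval_minutes: float, duration_minutes: float) -> float:
    """Data lost if the failure hits just before the next backup completes.

    Assumes each backup captures state as of its start time.
    """
    return interval_minutes + duration_minutes


def check_plan(
    targets: RecoveryTargets,
    interval_minutes: float,
    duration_minutes: float,
    recovery_steps: Dict[str, float],
) -> List[str]:
    """Return violations; an empty list means the plan fits on paper only."""
    violations = []
    loss = worst_case_data_loss(interval_minutes, duration_minutes)
    if loss > targets.rpo_minutes:
        violations.append(
            f"RPO: worst-case loss {loss:g}m > target {targets.rpo_minutes:g}m"
        )
    total = sum(recovery_steps.values())
    if total > targets.rto_minutes:
        violations.append(
            f"RTO: recovery steps total {total:g}m > target {targets.rto_minutes:g}m"
        )
    return violations


# Hypothetical numbers: hourly backups that take 5 minutes, RPO of one hour.
targets = RecoveryTargets(rto_minutes=240, rpo_minutes=60)
steps = {"restore_database": 90, "dns_cutover": 5, "app_warm_up": 15}
print(check_plan(targets, interval_minutes=60, duration_minutes=5, recovery_steps=steps))
```

With these placeholder values, an hourly cadence misses a one-hour RPO by the duration of the backup itself, which is why the backup interval has to sit below the target rather than equal it. A paper check like this only finds arithmetic gaps; timed drills still have to confirm the step durations.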

Why manual disaster recovery plans fail in real incidents

Most teams have a disaster recovery document that was written once and rarely exercised. When production fails, runbooks reference decommissioned services, backup formats have changed without notice, and the restore order lives in someone's memory. Four gaps repeat across organizations: documentation drifts from live infrastructure, restore procedures are never timed end to end, each team assumes its own RTO and RPO without measuring either, and cutover steps are tribal knowledge. Disaster Recovery as Code (DRaC) applies IaC discipline to recovery: versioned templates, automated drills, auditable changes, and explicit targets you can test.

DRaC building blocks: targets, topology, verification, orchestration

Declaring RTO and RPO in code does not guarantee them; you prove both with timed drills. Pick a recovery topology deliberately. Backup and restore costs the least but has the highest RTO. Pilot light keeps minimal DR resources running and scales up on failover. Warm standby runs a reduced replica stack continuously. Multi-region active-passive or active-active costs more but shrinks RTO and RPO. Whichever topology you choose, declare targets as versioned variables, provision DR infrastructure from the same modules as production with region or account variants, verify backups with scheduled restore tests, and orchestrate cutover with an explicit dependency graph rather than ad hoc runbook steps.
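
One way to make the topology choice explicit is to pick the cheapest option whose measured drill results meet both targets. The sketch below is illustrative only: the Topology type, the cheapest_fit function, the cost ranking, and every number in the options list are hypothetical placeholders that you would replace with results from your own timed drills.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Topology:
    name: str
    cost_rank: int               # 1 = cheapest; ordering follows the text above
    measured_rto_minutes: float  # taken from your own timed drills
    measured_rpo_minutes: float


def cheapest_fit(
    options: List[Topology], rto_minutes: float, rpo_minutes: float
) -> Optional[Topology]:
    """Pick the lowest-cost topology whose drill results meet both targets."""
    fits = [
        t
        for t in options
        if t.measured_rto_minutes <= rto_minutes
        and t.measured_rpo_minutes <= rpo_minutes
    ]
    return min(fits, key=lambda t: t.cost_rank) if fits else None


# Placeholder drill results, NOT benchmarks: replace with your own measurements.
options = [
    Topology("backup-and-restore", 1, 480, 1440),
    Topology("pilot-light", 2, 120, 15),
    Topology("warm-standby", 3, 30, 5),
    Topology("multi-region-active", 4, 2, 1),
]
choice = cheapest_fit(options, rto_minutes=60, rpo_minutes=60)
print(choice.name if choice else "no topology meets the targets")
```

Feeding the function measured values rather than vendor estimates keeps the decision tied to what drills actually demonstrated.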

Declare RTO and RPO targets and align backup cadence

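Store RTO and RPO as Terraform variables and review changes to them like any production change. Map RPO to a backup schedule AWS Backup actually supports: typically cron expressions at hourly or coarser granularity, not arbitrary minute rates. An RPO shorter than the backup interval calls for continuous backup or replication instead. AWS Organizations backup policies are useful at enterprise scale; aws_backup_plan is a clearer starting point for application teams. The plan below defines only the schedule and retention; an aws_backup_selection must still assign resources to it. Meeting RPO requires a replication lag or backup interval shorter than the target, plus monitoring of last-successful-backup age.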

HCL · RTO/RPO variables and AWS Backup plan
variable "rto_minutes" {
  description = "Maximum acceptable service downtime in minutes"
  type        = number
  default     = 60
}

variable "rpo_minutes" {
  description = "Maximum acceptable data loss window in minutes"
  type        = number
  default     = 60
}

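locals {
  # Hourly backups roughly cover RPOs from one hour up to a day; a daily
  # schedule only satisfies an RPO of 24 hours (1440 minutes) or more.
  # RPOs under an hour need continuous backup or replication, not this plan.
  backup_schedule = var.rpo_minutes < 1440 ? "cron(0 * * * ? *)" : "cron(0 0 * * ? *)"
}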

resource "aws_backup_vault" "dr" {
  name = "application-dr-vault"
}

resource "aws_backup_plan" "application" {
  name = "application-dr-plan"

  rule {
    rule_name         = "rpo-aligned-backup"
    target_vault_name = aws_backup_vault.dr.name
    schedule          = local.backup_schedule

    lifecycle {
      delete_after = 35
    }
  }
}
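
Monitoring last-successful-backup age can be sketched as a small check that classifies the age against the RPO. This is an illustration under stated assumptions: the function names and the 80 percent warning threshold are hypothetical, naive timestamps are treated as UTC, and fetching the last successful backup time from your backup service is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def _as_utc(value: datetime) -> datetime:
    # Treat naive timestamps as UTC (an assumption; adjust for your source).
    return value if value.tzinfo else value.replace(tzinfo=timezone.utc)


def backup_age_minutes(last_success: datetime, now: Optional[datetime] = None) -> float:
    """Minutes elapsed since the last successful backup finished."""
    current = _as_utc(now) if now else datetime.now(timezone.utc)
    return (current - _as_utc(last_success)).total_seconds() / 60


def rpo_status(age_minutes: float, rpo_minutes: float, warn_fraction: float = 0.8) -> str:
    """Classify backup age as OK, WARN (close to the RPO), or BREACH."""
    if age_minutes > rpo_minutes:
        return "BREACH"
    if age_minutes > rpo_minutes * warn_fraction:
        return "WARN"
    return "OK"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(minutes=50)
print(rpo_status(backup_age_minutes(last, now), rpo_minutes=60))
```

Warning before the breach gives on-call engineers time to rerun a failed backup job while the data loss window is still inside the target.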

Automate backup restore drills that measure RPO and RTO

Backups without restore tests are just storage costs. Run isolated drills that restore to a throwaway instance, measure the elapsed time until the database availability waiter returns, assert snapshot age against RPO, and delete the drill instance afterwards. Full RTO also includes DNS propagation and application warm-up, so scope drill metrics clearly and do not let teams confuse database restore time with end-to-end service recovery. Publish results to CloudWatch or your incident metrics store and alert on failure.

Python · RDS restore drill with RPO and partial RTO check
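import time
from datetime import datetime, timezone

import boto3


class BackupVerifier:
    """Restore the newest snapshot to a throwaway instance and time the restore."""

    def __init__(self, rto_minutes=60, rpo_minutes=60, snapshot_type="automated"):
        self.rto_minutes = rto_minutes
        self.rpo_minutes = rpo_minutes
        # RDS automated snapshots are taken once a day; pass "awsbackup" to
        # check snapshots created by an AWS Backup plan instead.
        self.snapshot_type = snapshot_type
        self.rds = boto3.client("rds")

    def _latest_snapshot(self, source_instance_id):
        # Paginate so instances with many snapshots are fully listed.
        paginator = self.rds.get_paginator("describe_db_snapshots")
        pages = paginator.paginate(
            DBInstanceIdentifier=source_instance_id,
            SnapshotType=self.snapshot_type,
        )
        # Snapshots that are still being created cannot be restored.
        snapshots = [
            snapshot
            for page in pages
            for snapshot in page["DBSnapshots"]
            if snapshot.get("Status") == "available"
        ]
        if not snapshots:
            raise RuntimeError(f"No available {self.snapshot_type} snapshots found")
        return max(snapshots, key=lambda item: item["SnapshotCreateTime"])

    def test_restore(self, source_instance_id, drill_instance_id):
        start = time.monotonic()

        latest = self._latest_snapshot(source_instance_id)
        snapshot_time = latest["SnapshotCreateTime"]
        if snapshot_time.tzinfo is None:
            snapshot_time = snapshot_time.replace(tzinfo=timezone.utc)

        snapshot_age = (
            datetime.now(timezone.utc) - snapshot_time
        ).total_seconds() / 60

        if snapshot_age > self.rpo_minutes:
            raise AssertionError(
                f"RPO breach: snapshot age {snapshot_age:.1f}m exceeds {self.rpo_minutes}m"
            )

        created = False
        try:
            self.rds.restore_db_instance_from_db_snapshot(
                DBInstanceIdentifier=drill_instance_id,
                DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
                DBInstanceClass="db.r6g.large",
                MultiAZ=False,
                PubliclyAccessible=False,
                DeletionProtection=False,
            )
            created = True

            # The default waiter gives up after about 30 minutes; poll until
            # just past the RTO so a slow restore is reported as a breach.
            waiter = self.rds.get_waiter("db_instance_available")
            waiter.wait(
                DBInstanceIdentifier=drill_instance_id,
                WaiterConfig={"Delay": 30, "MaxAttempts": int(self.rto_minutes * 2) + 10},
            )
            elapsed = (time.monotonic() - start) / 60
        finally:
            # Delete only an instance this drill created, never a pre-existing one.
            if created:
                self.rds.delete_db_instance(
                    DBInstanceIdentifier=drill_instance_id,
                    SkipFinalSnapshot=True,
                    DeleteAutomatedBackups=True,
                )

        # Database restore time only: end-to-end RTO also includes DNS
        # cutover and application warm-up.
        if elapsed > self.rto_minutes:
            raise AssertionError(
                f"RTO breach: restore took {elapsed:.1f}m, target {self.rto_minutes}m"
            )

        return {
            "status": "PASS",
            "rto_actual_minutes": round(elapsed, 1),
            "rpo_actual_minutes": round(snapshot_age, 1),
            "snapshot_id": latest["DBSnapshotIdentifier"],
        }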
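
Publishing drill results can be sketched as a pure function that builds the arguments for CloudWatch put_metric_data. The namespace "DR/Drills", the metric names, and the Service dimension are hypothetical choices, not an established convention. CloudWatch has no minutes unit, so the sketch converts durations to seconds.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_drill_metrics(
    service: str,
    restore_minutes: float,
    snapshot_age_minutes: float,
    passed: bool,
    timestamp: Optional[datetime] = None,
    namespace: str = "DR/Drills",  # hypothetical namespace
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_data.

    CloudWatch has no "Minutes" unit, so durations are sent in seconds.
    """
    timestamp = timestamp or datetime.now(timezone.utc)
    dimensions = [{"Name": "Service", "Value": service}]

    def metric(name: str, value: float, unit: str) -> Dict[str, Any]:
        return {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": value,
            "Unit": unit,
        }

    return {
        "Namespace": namespace,
        "MetricData": [
            metric("RestoreDurationSeconds", restore_minutes * 60, "Seconds"),
            metric("SnapshotAgeSeconds", snapshot_age_minutes * 60, "Seconds"),
            metric("DrillFailed", 0 if passed else 1, "Count"),
        ],
    }


payload = build_drill_metrics(
    "orders-db", restore_minutes=22.5, snapshot_age_minutes=41.0, passed=True
)
print([m["MetricName"] for m in payload["MetricData"]])
```

With boto3 installed and credentials configured, boto3.client("cloudwatch").put_metric_data(**payload) would send the data points, and an alarm on DrillFailed would cover the alert-on-failure step. Keeping the payload builder free of AWS calls makes it easy to unit test.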

Provision DR infrastructure and DNS failover as code

Define secondary region or account stacks as module variants of production and promote them through the same CI gates. A scheduled terraform plan in the DR workspace catches template drift before an incident. Route 53 failover routing moves traffic when the primary health check fails: use FQDN-based HTTPS checks, or rely on ALB target health evaluation with alias records. DNS TTL and resolver caching can still add minutes to cutover; factor that into RTO budgets.

GitHub Actions · DR stack plan and controlled apply
name: DR Infrastructure Sync
on:
  push:
    paths: ['infrastructure/**']
  schedule:
    - cron: '0 */6 * * *'
  workflow_dispatch:

jobs:
  dr-sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false

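      - name: Terraform plan (DR region)
        id: plan
        working-directory: infrastructure/dr
        run: |
          terraform init -input=false
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
          # Capture the code so that "changes present" does not fail the step.
          set +e
          terraform plan -detailed-exitcode -input=false -var="environment=dr" -out=dr.tfplan
          code=$?
          set -e
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          if [ "$code" -eq 1 ]; then exit 1; fi

      - name: Apply DR infrastructure
        if: github.event_name != 'schedule' && github.ref == 'refs/heads/main' && steps.plan.outputs.exitcode == '2'
        working-directory: infrastructure/dr
        run: terraform apply -input=false dr.tfplan

      - name: Fail scheduled run on drift
        if: github.event_name == 'schedule' && steps.plan.outputs.exitcode == '2'
        run: |
          echo "DR stack has drifted from its templates" >&2
          exit 1

HCL · Route 53 failover records for primary and DR ALB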
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_healthcheck_fqdn
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30

  tags = {
    Name = "primary-api-health-check"
  }
}

resource "aws_route53_record" "failover_primary" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "failover_secondary" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.dr_alb_dns
    zone_id                = var.dr_alb_zone_id
    evaluate_target_health = true
  }
}
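
To budget DNS cutover inside the RTO, a rough estimate adds health check detection time to the record TTL. The sketch below is an approximation under stated assumptions: detection is taken as failure_threshold consecutive failed checks at the configured interval, the 60-second TTL is an assumed value, and the function names are hypothetical. Real cutover time also depends on resolver and client caching behavior.

```python
def dns_failover_seconds(
    failure_threshold: int,
    request_interval_seconds: int,
    record_ttl_seconds: int,
) -> int:
    """Rough estimate of DNS cutover time.

    Detection takes about failure_threshold consecutive failed checks, and
    resolvers may serve the old answer until the record TTL expires.
    """
    detection = failure_threshold * request_interval_seconds
    return detection + record_ttl_seconds


def remaining_rto_seconds(rto_minutes: float, dns_seconds: int) -> float:
    """RTO budget left for data restore and warm-up after DNS cutover."""
    return rto_minutes * 60 - dns_seconds


# Health check values from the example above; the 60 s TTL is an assumption.
dns = dns_failover_seconds(
    failure_threshold=3, request_interval_seconds=30, record_ttl_seconds=60
)
print(dns, remaining_rto_seconds(rto_minutes=60, dns_seconds=dns))
```

Subtracting the estimate from the RTO shows how much time remains for the slower steps, which is usually where tighter targets are won or lost.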

Orchestrate recovery with an explicit dependency graph

Tie validation, data restore, cache warm-up, DNS cutover, and post-cutover health checks into a workflow you can trigger during game days. Argo Workflows suits Kubernetes estates; Step Functions or a CI workflow may fit simpler stacks. Keep human approval gates for declaring disaster and executing DNS failover in production—these are policy decisions, not fully autonomous actions.

YAML · Argo Workflows disaster recovery DAG
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
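  # generateName gives each game day run a unique name, so repeat
  # submissions do not collide with a previous Workflow object.
  generateName: disaster-recovery-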
spec:
  entrypoint: recover
  templates:
    - name: recover
      dag:
        tasks:
          - name: validate-dr-infra
            template: validate-infra
          - name: restore-databases
            template: restore-db
            dependencies: [validate-dr-infra]
          - name: sync-cache-layer
            template: sync-cache
            dependencies: [restore-databases]
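          # Human approval gate: the workflow pauses here until an operator resumes it.
          - name: approve-dns-failover
            template: approval
            dependencies: [sync-cache-layer]
          - name: update-dns
            template: dns-failover
            dependencies: [approve-dns-failover]
          - name: verify-recovery
            template: health-check
            dependencies: [update-dns]

    - name: approval
      suspend: {}
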
    - name: validate-infra
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/validate_dr_infra.py]

    - name: restore-db
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/restore_database.py, --from-latest-snapshot, --verify-rpo]

    - name: sync-cache
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/warm_cache.py]

    - name: dns-failover
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/failover_dns.py]

    - name: health-check
      container:
        image: registry.example.com/dr-tools:latest
        command: [python, /scripts/verify_recovery.py]
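
For stacks without Argo, the same idea can be sketched in plain Python with the standard library's graphlib: declare the dependency edges explicitly and let a topological sort decide the order. The run_recovery function and the stub steps are hypothetical; real steps would call your own recovery scripts, and the sketch leaves out approval gates, retries, and timing.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable, Dict, List, Set


def run_recovery(
    steps: Dict[str, Callable[[], bool]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run steps in dependency order; stop at the first step that fails.

    Returns the names of the steps that completed successfully.
    """
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = TopologicalSorter(dependencies).static_order()
    completed: List[str] = []
    for name in order:
        if not steps[name]():
            break
        completed.append(name)
    return completed


# Edges mirroring the automated steps of the DAG above:
# each step maps to the set of steps it depends on.
dependencies = {
    "validate-dr-infra": set(),
    "restore-databases": {"validate-dr-infra"},
    "sync-cache-layer": {"restore-databases"},
    "update-dns": {"sync-cache-layer"},
    "verify-recovery": {"update-dns"},
}
# Stub steps that always succeed; real steps would invoke recovery scripts.
steps = {name: (lambda: True) for name in dependencies}
print(run_recovery(steps, dependencies))
```

Because the order comes from declared edges rather than from the sequence of lines in a runbook, adding a new data store means adding one entry and its dependencies, and a cycle is rejected before any step runs.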

Operational practices that keep DR credible

Run timed game days monthly in an isolated account or region, not annually from a slide deck. Version DR templates beside production and review DR impact in the same pull request when adding data stores. Alert when last successful backup age exceeds RPO-derived thresholds. Keep backups immutable, for example with S3 Object Lock or AWS Backup Vault Lock, as a ransomware defense. Right-size DR spend: pilot light or warm standby for most workloads, active-active only when the business requires sub-minute RTO. Continuously audit backup encryption, retention, and cross-region copy with AWS Config or OPA. Document who may declare a disaster and when to fail over versus fail back. DR is an ongoing practice, and DRaC makes it testable instead of theoretical.

RTO and RPO targets should be explicit reliability commitments, which we define in our SLO, SLI, and error budget guide for platform teams.

Recovery drills pair naturally with controlled failure practice from our Chaos Engineering playbook for DevOps.