DR Test Results

Tested: April 2026

Scaled the primary (us-east-2) to 0 and watched what happened. Route 53 failed over to us-west-2 automatically. No manual steps.

RTO:      83 seconds (Route 53 ~61s + ALB health check ~22s)
RPO:      ~60 seconds (DynamoDB ticker interval, not data loss)
Failover: automatic

How the test was run

Two terminal loops ran in parallel: one watching HTTP status and which region was serving, one watching DynamoDB writes to confirm the data plane shifted too.

# Terminal 1 — HTTP traffic + region
while true; do
  IP=$(dig +short app.jspoth.com | head -1)
  REGION=$(dig +short -x $IP | grep -oE 'us-(east|west)-[0-9]')
  STATUS=$(curl -sI https://app.jspoth.com/health/live | head -1 | tr -d '\r')
  echo "$(date +%T) | $STATUS | $REGION ($IP)"
  sleep 5
done

# Terminal 2 — DynamoDB writes
while true; do
  clear
  aws dynamodb scan --table-name app_events --region us-east-2 \
    --query "sort_by(Items[*].{region:region.S, timestamp:timestamp.S}, &timestamp)[-1::-1]" \
    --output table
  sleep 30
done

What happened

07:28:42 | HTTP/2 503  | us-east-2 (3.132.95.107)   ← scaled to 0, primary ALB unhealthy
07:28:47 | HTTP/2 503  | us-east-2 (3.132.95.107)
07:28:53 | HTTP/2 200  | us-east-2 (3.132.95.107)   ← briefly 200 (in-flight requests draining)
07:28:59 | HTTP/2 200  | us-east-2 (3.132.95.107)
07:29:04 | HTTP/2 503  | us-east-2 (3.132.95.107)   ← primary fully down
07:29:10 | HTTP/2 503  | us-east-2 (18.217.129.208)
07:29:15 | HTTP/2 503  | us-east-2 (18.217.129.208)
07:29:21 | HTTP/2 503  | us-east-2 (18.217.129.208)
07:29:26 | HTTP/2 503  | us-east-2 (18.217.129.208)
07:29:32 | HTTP/2 503  | us-east-2 (18.217.129.208)
07:29:37 | HTTP/2 503  | us-east-2 (18.217.129.208)
07:29:43 | HTTP/2 503  | us-west-2 (184.32.80.19)   ← Route 53 failed over (~61s)
07:29:48 | HTTP/2 503  | us-west-2 (184.32.80.19)   ← DR ALB health check still failing (404)
07:29:54 | HTTP/2 503  | us-west-2 (184.32.80.19)
07:29:59 | HTTP/2 503  | us-west-2 (184.32.80.19)
07:30:05 | HTTP/2 200  | us-west-2 (184.32.80.19)   ← DR serving normally
07:30:10 | HTTP/2 200  | us-west-2 (184.32.80.19)
07:30:15 | HTTP/2 200  | us-west-2 (52.13.238.180)
07:30:53 | HTTP/2 200  | us-west-2 (52.13.238.180)  ← stable

DynamoDB confirmed writes shifted to us-west-2:

| us-west-2 | 2026-04-26T14:30:08Z |
| us-east-2 | 2026-04-26T14:28:57Z |  ← last primary write
| us-west-2 | 2026-04-26T14:27:57Z |
| us-west-2 | 2026-04-26T14:21:57Z |  ← DR writing continuously before test

RTO breakdown

├── 07:28:42  primary scaled to 0
├── 07:29:43  Route 53 failed over to us-west-2     (+61s)
├── 07:30:05  DR ALB healthy, serving 200            (+22s)
└── Total: 83 seconds
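
The arithmetic behind the breakdown can be reproduced from the three timestamps in the Terminal 1 log. This is a minimal bash sketch, not part of the original test tooling: `to_secs` and `rto_phases` are hypothetical helper names, and it assumes all three times fall on the same day (no midnight rollover).

```shell
#!/usr/bin/env bash
# Sketch: compute the RTO phases from HH:MM:SS timestamps read off the log.

# HH:MM:SS -> seconds since midnight. The 10# prefix forces base 10 so
# values like 08 or 09 are not rejected as invalid octal.
to_secs() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Args: outage start, time DNS moved to DR, time of first 200 from DR.
rto_phases() {
  local start dns ok
  start=$(to_secs "$1")
  dns=$(to_secs "$2")
  ok=$(to_secs "$3")
  echo "dns_failover=$(( dns - start ))s alb_healthy=$(( ok - dns ))s total=$(( ok - start ))s"
}

rto_phases 07:28:42 07:29:43 07:30:05
# prints: dns_failover=61s alb_healthy=22s total=83s
```

Because the loop polls every 5 seconds, each phase boundary is only accurate to within about one polling interval.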

To reduce RTO further

The Route 53 health check interval is at the standard 30s. Fast health checks (10s) are the only shorter option and cost extra.
Pre-warming the ALB target group eliminates most of that 22s registration delay.
Dropping the DynamoDB ticker from 60s to 10s reduces RPO with no architecture changes.
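
If the 10s Route 53 interval is worth the cost, note that the request interval of an existing health check cannot be changed, so a new health check has to be created and attached to the failover record. The sketch below only builds the config JSON. `fast_check_config` is a hypothetical helper, `primary.example.com` is a placeholder for whatever endpoint the real check targets, and the HTTPS/443 settings and failure threshold of 3 are assumptions, not values taken from this setup.

```shell
#!/usr/bin/env bash
# Sketch: build a Route 53 health check config that uses the 10s (fast) interval.
# Assumed values: HTTPS on port 443, failure threshold of 3.
fast_check_config() {
  local fqdn=$1 path=$2
  printf '{"Type":"HTTPS","FullyQualifiedDomainName":"%s","Port":443,"ResourcePath":"%s","RequestInterval":10,"FailureThreshold":3}\n' \
    "$fqdn" "$path"
}

# Usage (not run here; needs AWS credentials and a real endpoint):
#   aws route53 create-health-check \
#     --caller-reference "fast-check-$(date +%s)" \
#     --health-check-config "$(fast_check_config primary.example.com /health/live)"
fast_check_config primary.example.com /health/live
```

With a failure threshold of 3, detection would take roughly three failed checks at the chosen interval, plus however long resolvers cache the old DNS answer.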

SQS during the failover

SQS doesn't replicate across regions. Messages in the primary queue at the moment of failure stay there. Messages after failover go to the DR queue and process normally. The gap is whatever was sitting unprocessed at the moment things went down.
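
To size that gap after a failover, the approximate message counts on the primary queue can be read with the AWS CLI. This is a sketch under assumptions: `stranded_messages` is a hypothetical helper, and the queue URL in the usage line is a placeholder, since the real queue name is not given here.

```shell
#!/usr/bin/env bash
# Sketch: report approximate counts of messages still sitting in a queue.
# ApproximateNumberOfMessages           = waiting to be received
# ApproximateNumberOfMessagesNotVisible = received but not yet deleted (in flight)
stranded_messages() {
  local queue_url=$1 region=$2
  aws sqs get-queue-attributes \
    --queue-url "$queue_url" \
    --region "$region" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query 'Attributes' \
    --output json
}

# Usage (needs AWS credentials; the URL is a placeholder):
#   stranded_messages https://sqs.us-east-2.amazonaws.com/123456789012/example-queue us-east-2
```

The counts are approximate by design, so they are a rough estimate of the backlog to drain once the primary is back, not an exact figure.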