Tested: April 26, 2026
Scaled the primary to 0 and watched what happened. Route 53 failed over automatically. No manual steps.
RTO: 83 seconds (Route 53 ~61s + ALB health check ~22s)
RPO: ~60 seconds (DynamoDB ticker interval, not data loss)
Failover: automatic
Two terminal loops ran in parallel: one watching HTTP traffic and which region was serving, the other watching DynamoDB writes to confirm the data plane shifted too.
# Terminal 1 — HTTP traffic + region
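while true; do
  # Resolve the app hostname and take the first record returned
  IP=$(dig +short app.jspoth.com | head -1)
  # Reverse-lookup the IP; the AWS PTR name includes the region
  REGION=$(dig +short -x "$IP" | grep -oE 'us-(east|west)-[0-9]')
  # Status line of the health endpoint, with the trailing CR stripped
  STATUS=$(curl -sI https://app.jspoth.com/health/live | head -1 | tr -d '\r')
  echo "$(date +%T) | $STATUS | $REGION ($IP)"
  sleep 5
done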
# Terminal 2 — DynamoDB writes
while true; do
  clear
  # Newest items first: sort by timestamp ascending, then reverse
  aws dynamodb scan --table-name app_events --region us-east-2 \
    --query "sort_by(Items[*].{region:region.S, timestamp:timestamp.S}, &timestamp)[-1::-1]" \
    --output table
  sleep 30
done
07:28:42 | HTTP/2 503 | us-east-2 (3.132.95.107)    ← scaled to 0, primary ALB unhealthy
07:28:47 | HTTP/2 503 | us-east-2 (3.132.95.107)
07:28:53 | HTTP/2 200 | us-east-2 (3.132.95.107)    ← briefly 200 (in-flight requests draining)
07:28:59 | HTTP/2 200 | us-east-2 (3.132.95.107)
07:29:04 | HTTP/2 503 | us-east-2 (3.132.95.107)    ← primary fully down
07:29:10 | HTTP/2 503 | us-east-2 (18.217.129.208)
07:29:15 | HTTP/2 503 | us-east-2 (18.217.129.208)
07:29:21 | HTTP/2 503 | us-east-2 (18.217.129.208)
07:29:26 | HTTP/2 503 | us-east-2 (18.217.129.208)
07:29:32 | HTTP/2 503 | us-east-2 (18.217.129.208)
07:29:37 | HTTP/2 503 | us-east-2 (18.217.129.208)
07:29:43 | HTTP/2 503 | us-west-2 (184.32.80.19)    ← Route 53 failed over (~61s)
07:29:48 | HTTP/2 503 | us-west-2 (184.32.80.19)    ← DR ALB health check still failing (404)
07:29:54 | HTTP/2 503 | us-west-2 (184.32.80.19)
07:29:59 | HTTP/2 503 | us-west-2 (184.32.80.19)
07:30:05 | HTTP/2 200 | us-west-2 (184.32.80.19)    ← DR serving normally
07:30:10 | HTTP/2 200 | us-west-2 (184.32.80.19)
07:30:15 | HTTP/2 200 | us-west-2 (52.13.238.180)
07:30:53 | HTTP/2 200 | us-west-2 (52.13.238.180)   ← stable
DynamoDB confirmed writes shifted to us-west-2:
| us-west-2 | 2026-04-26T14:30:08Z |
| us-east-2 | 2026-04-26T14:28:57Z |  ← last primary write
| us-west-2 | 2026-04-26T14:27:57Z |
| us-west-2 | 2026-04-26T14:21:57Z |  ← DR writing continuously before test
├── 07:28:42  primary scaled to 0
├── 07:29:43  Route 53 failed over to us-west-2 (+61s)
├── 07:30:05  DR ALB healthy, serving 200 (+22s)
└── Total: 83 seconds
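The 61s / 22s / 83s figures come straight from the three timestamps in the timeline. A small sketch for re-deriving them on a future test run; the helper names `to_seconds` and `elapsed` are made up here, and it assumes all timestamps fall on the same day:

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds elapsed between two same-day timestamps (start, end).
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

scaled_down="07:28:42"   # primary scaled to 0
dns_failover="07:29:43"  # first response served from us-west-2
dr_healthy="07:30:05"    # first 200 from the DR ALB

echo "Route 53 failover: $(elapsed "$scaled_down" "$dns_failover")s"
echo "DR ALB healthy:    $(elapsed "$dns_failover" "$dr_healthy")s"
echo "Total RTO:         $(elapsed "$scaled_down" "$dr_healthy")s"
```

The curl loop polls every 5 seconds, so each measured interval is only accurate to roughly that granularity.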
SQS doesn't replicate across regions. Messages in the primary queue at the moment of failure stay there. Messages after failover go to the DR queue and process normally. The gap is whatever was sitting unprocessed at the moment things went down.
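One way to size that gap after a failover is to ask the primary queue how many messages it is still holding. A minimal sketch, not something run during this test: `queue_depth` is a hypothetical helper, and the queue URL in the usage comment is a placeholder, not the real queue.

```shell
# Print the approximate number of unprocessed messages in a queue.
# $1 = queue URL, $2 = region
queue_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$1" \
    --region "$2" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Usage (placeholder URL and account ID):
# queue_depth "https://sqs.us-east-2.amazonaws.com/123456789012/example-queue" us-east-2
```

SQS reports this count as approximate, so treat a small non-zero number as "something is stranded" rather than an exact backlog.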