Jay — Site Reliability Engineer

Building Cloud Infrastructure with AI

Last updated: April 2026

15+ years in SRE and platform engineering. Exploring senior or staff SRE / Platform Engineering roles (remote or SF Bay Area). Open to a technical deep-dive on any of the projects below.

I gave Claude Code a project plan and had it build phase by phase — EKS, Terraform, Go app, SQS pipeline, multi-region DR — with me steering the architecture and Claude doing the implementation. What used to take weeks took days.

Building Infrastructure with AI: Where AI Got It Wrong

Building infra from scratch used to take weeks. Not because the concepts are hard — it's the execution loop. Write Terraform, hit an error, dig through docs, fix it, hit the next one. Every component has its own quirks. It compounds fast.

Go app on EKS, everything provisioned with Terraform, deployed via GitHub Actions. Event-driven SQS pipeline, config managed through SSM and External Secrets Operator, Karpenter handling node scaling.
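The consumer side of an event-driven SQS pipeline usually follows one rule: delete a message only after it has been handled successfully, so a failed message stays on the queue and is delivered again. The sketch below shows that rule with an in-memory stand-in for the queue. The Queue interface, memQueue, consumeOnce, and demoHandle are hypothetical names for illustration, not the project's actual code, and the fake does not model visibility timeouts.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical stand-in for an SQS message.
type Message struct {
	ID   string
	Body string
}

// Queue is a hypothetical interface covering only the two operations
// the consumer loop needs: receive a batch and delete one message.
type Queue interface {
	Receive(max int) []Message
	Delete(id string)
}

// memQueue is an in-memory fake used only for this sketch. It does not
// model visibility timeouts: an undeleted message simply stays pending.
type memQueue struct {
	pending []Message
}

// Receive returns a copy of up to max pending messages.
func (q *memQueue) Receive(max int) []Message {
	if max > len(q.pending) {
		max = len(q.pending)
	}
	batch := make([]Message, max)
	copy(batch, q.pending[:max])
	return batch
}

// Delete removes the message with the given ID, if present.
func (q *memQueue) Delete(id string) {
	for i, m := range q.pending {
		if m.ID == id {
			q.pending = append(q.pending[:i], q.pending[i+1:]...)
			return
		}
	}
}

// consumeOnce handles one batch. A message is deleted only when the
// handler succeeds; a failed message is left on the queue for redelivery.
func consumeOnce(q Queue, max int, handle func(Message) error) (done, failed int) {
	for _, m := range q.Receive(max) {
		if err := handle(m); err != nil {
			failed++
			continue
		}
		q.Delete(m.ID)
		done++
	}
	return done, failed
}

var errBad = errors.New("cannot process message")

// demoHandle is an illustrative handler that rejects bodies equal to "bad".
func demoHandle(m Message) error {
	if m.Body == "bad" {
		return errBad
	}
	return nil
}

func main() {
	q := &memQueue{pending: []Message{
		{ID: "1", Body: "ok"},
		{ID: "2", Body: "bad"},
		{ID: "3", Body: "ok"},
	}}
	done, failed := consumeOnce(q, 10, demoHandle)
	fmt.Println(done, failed, len(q.pending)) // prints: 2 1 1
}
```

With real SQS, the undeleted message becomes visible again after its visibility timeout, and a dead-letter queue typically caps the number of retries.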

AI handled the heavy lifting, but a passing terraform plan doesn't mean the infrastructure works. Karpenter CRDs were applied before the cluster was ready, the first deploy targeted the wrong CPU architecture, the ALB health check was misconfigured, and DR was duplicated as a folder copy with no mention that Terragrunt exists. The pattern: config that looks right but breaks at runtime, and things AI won't volunteer unless you ask. Five specific examples →
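The ALB health check is a good example of config that plans cleanly and fails at runtime. A target group marks a target healthy only when the configured path returns a status code matching its matcher (200 by default), so the app's route and the target group's path have to agree. The text above does not say which setting was wrong in this project. The sketch below only illustrates the app side of that contract. The /healthz path and the healthStatus and healthHandler names are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthStatus maps readiness to the status code the load balancer sees.
// 200 matches the default ALB matcher; 503 marks the target unhealthy.
func healthStatus(ready bool) int {
	if ready {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// healthHandler wraps a readiness probe in an HTTP handler.
// The ready callback is a hypothetical hook for dependency checks.
func healthHandler(ready func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(healthStatus(ready()))
	}
}

func main() {
	mux := http.NewServeMux()
	// Assumed path: it must match the health check path on the target group.
	mux.Handle("/healthz", healthHandler(func() bool { return true }))

	// Exercise the route in-process instead of binding a port.
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println(rec.Code) // prints: 200
}
```

If the target group still points at the default path "/" while the app only serves /healthz, the plan succeeds and every target is reported unhealthy.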

System Overview

Ingress

Route 53 · ALB

↓

Compute

Go Service · SQS Consumer
EKS (Karpenter)

↓

Data

DynamoDB · SQS Queue
SSM Parameter Store

Multi-Region Disaster Recovery

Pilot light DR in us-west-2. Route 53 health checks trigger automatic failover. The infrastructure was mirrored with Terraform and deployed identically across both regions — same app, same pipeline, different variables. Validated with a live test: scaled the primary to zero and watched what happened.

DR validated — April 2026: Route 53 failed over to us-west-2 in ~61s, full recovery in 83s. DynamoDB writes shifted regions automatically. No manual steps. Full test results →
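Numbers like these can be derived from a simple probe log. The sketch below assumes one probe sample every few seconds and defines failover as the time from the first failed probe to the first response served by the DR region, and recovery as the time to the first successful response from it. Those definitions, the Sample type, and the measure function are assumptions for illustration. The sample data in main is synthetic and is not the test result reported above.

```go
package main

import "fmt"

// Sample is one probe result. Field names are hypothetical.
type Sample struct {
	AtSec  int    // seconds since the probe run started
	Region string // region that served the response, "" if none did
	OK     bool   // true if the request succeeded
}

// measure returns the seconds from the first failed probe to
// (a) the first response served by the DR region and
// (b) the first successful response from the DR region.
// found is false if the samples never show a recovery.
func measure(samples []Sample, dr string) (failoverSec, recoverySec int, found bool) {
	start, failover := -1, -1
	for _, s := range samples {
		if start < 0 {
			// Still waiting for the outage to begin.
			if !s.OK {
				start = s.AtSec
			}
			continue
		}
		if s.Region != dr {
			continue
		}
		if failover < 0 {
			failover = s.AtSec - start
		}
		if s.OK {
			return failover, s.AtSec - start, true
		}
	}
	return 0, 0, false
}

func main() {
	// Synthetic timeline for illustration only.
	samples := []Sample{
		{AtSec: 0, Region: "us-east-2", OK: true},
		{AtSec: 5, Region: "", OK: false},           // primary scaled to zero
		{AtSec: 10, Region: "", OK: false},          // DNS still points at primary
		{AtSec: 15, Region: "us-west-2", OK: false}, // DNS flipped, DR still scaling
		{AtSec: 20, Region: "us-west-2", OK: true},  // DR serving traffic
	}
	failover, recovery, ok := measure(samples, "us-west-2")
	fmt.Println(failover, recovery, ok) // prints: 10 15 true
}
```

The gap between the two numbers is the pilot-light cost: DNS can flip before the DR region has scaled up enough to serve traffic.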

DR Topology

us-east-2 — Primary

EKS · ALB · DynamoDB

↓ Route 53 ~61s

us-west-2 — DR

Pilot light · auto-scaled

61s

DNS Failover

83s

Full RTO

AI-Powered Cost Optimization

The biggest strength of cloud infrastructure is how easily it scales on demand, which lets organizations test and iterate quickly. That same ease cuts the other way: resources accumulate quietly, nobody rightsizes after traffic drops, and dev infrastructure gets forgotten. This is where AI can help. LLMs are strong at analysis: given data, they can reason about it, draw conclusions, and recommend actions.

A Go service runs as a Kubernetes CronJob, scanning AWS resources and sending findings to Claude via Bedrock. The model reasons about each resource — whether it's wasteful, why, and what to do. AI identified ~$120–130/month of waste in my test environment. See the details here →
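The scan-then-analyze flow can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: Resource, monthlyUSD, buildPrompt, Analyzer, and stubAnalyzer are hypothetical names, the prices and utilization figures are placeholders, and the Bedrock call is replaced by a stub so the example runs without AWS credentials.

```go
package main

import (
	"fmt"
	"strings"
)

// hoursPerMonth is a common approximation for monthly cost estimates.
const hoursPerMonth = 730.0

// Resource is a hypothetical scanner output for one AWS resource.
type Resource struct {
	ID        string
	Kind      string  // e.g. "nat-gateway", "alb", "eip"
	HourlyUSD float64 // placeholder price, not a real AWS rate
	AvgUtil   float64 // average utilization over the lookback window, 0..1
}

// monthlyUSD estimates what the resource costs per month.
func monthlyUSD(r Resource) float64 {
	return r.HourlyUSD * hoursPerMonth
}

// buildPrompt renders the scan results as text for the model:
// one instruction line followed by one line per resource.
func buildPrompt(rs []Resource) string {
	var b strings.Builder
	b.WriteString("For each resource, say whether it is wasteful, why, and what to do.\n")
	for _, r := range rs {
		fmt.Fprintf(&b, "- %s (%s): avg utilization %.0f%% over 7 days, ~$%.2f/month\n",
			r.ID, r.Kind, r.AvgUtil*100, monthlyUSD(r))
	}
	return b.String()
}

// Analyzer abstracts the model call so the scanner logic can be tested
// without network access. A real implementation would call Bedrock here.
type Analyzer interface {
	Analyze(prompt string) (string, error)
}

// stubAnalyzer stands in for the model in this sketch.
type stubAnalyzer struct{}

func (stubAnalyzer) Analyze(prompt string) (string, error) {
	return fmt.Sprintf("stub: received %d bytes", len(prompt)), nil
}

func main() {
	// Placeholder data for illustration only.
	rs := []Resource{
		{ID: "nat-example", Kind: "nat-gateway", HourlyUSD: 0.5, AvgUtil: 0.01},
		{ID: "eip-example", Kind: "eip", HourlyUSD: 0.25, AvgUtil: 0},
	}
	var a Analyzer = stubAnalyzer{}
	out, err := a.Analyze(buildPrompt(rs))
	if err != nil {
		fmt.Println("analyze failed:", err)
		return
	}
	fmt.Println(out)
}
```

Keeping the model behind a small interface means the scanners and prompt construction can be unit-tested on their own, and the CronJob only needs credentials for the real analyzer.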

Cost Optimizer

Scanners

NAT · ALB · EIP
EKS Nodes · Cluster

↓

Analysis

Claude via Bedrock
7-day CloudWatch lookback

↓

Findings

5 resources flagged

~$130

Waste / Month

Resources