Jay — Site Reliability Engineer
Last updated: April 2026
15+ years in SRE and platform engineering. Exploring senior or staff SRE / Platform Engineering roles (remote or SF Bay Area). Open to a technical deep-dive on any of the projects below.
I gave Claude Code a project plan and had it build phase by phase — EKS, Terraform, Go app, SQS pipeline, multi-region DR — with me steering the architecture and Claude doing the implementation. What used to take weeks took days.
Building infra from scratch used to take weeks. Not because the concepts are hard — it's the execution loop. Write Terraform, hit an error, dig through docs, fix it, hit the next one. Every component has its own quirks. It compounds fast.
Go app on EKS, everything provisioned with Terraform, deployed via GitHub Actions. Event-driven SQS pipeline, config managed through SSM and External Secrets Operator, Karpenter handling node scaling.
AI handled the heavy lifting, but a passing terraform plan doesn't mean the infrastructure works. Karpenter CRDs were applied before the cluster was ready, the first deploy targeted the wrong CPU architecture, the ALB health check was misconfigured, and DR was duplicated as a folder copy with no mention that Terragrunt exists. These are configs that look right and break at runtime, and things AI won't volunteer unless you ask.
Five specific examples →
Pilot light DR in us-west-2. Route 53 health checks trigger automatic failover. The infrastructure was mirrored with Terraform and deployed identically across both regions — same app, same pipeline, different variables. Validated with a live test: scaled the primary to zero and watched what happened.
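The failover behavior in that test can be modeled in a few lines. This is a simplified sketch, not Route 53's implementation: it assumes a single checker and a failure threshold of 3 (the Route 53 default), and the `healthTracker` and `resolve` names are hypothetical. An endpoint changes status only after that many consecutive results disagree with its current status.

```go
package main

import "fmt"

// healthTracker is a simplified model of one health check: status flips
// only after `threshold` consecutive results contradict it.
type healthTracker struct {
	threshold int
	healthy   bool
	streak    int // consecutive results that disagree with current status
}

func newHealthTracker(threshold int) *healthTracker {
	return &healthTracker{threshold: threshold, healthy: true}
}

// observe records one check result and returns the resulting status.
func (h *healthTracker) observe(passed bool) bool {
	if passed == h.healthy {
		h.streak = 0 // an agreeing result resets the count
		return h.healthy
	}
	h.streak++
	if h.streak >= h.threshold {
		h.healthy = passed
		h.streak = 0
	}
	return h.healthy
}

// resolve returns the record that failover routing would answer with.
func resolve(primaryHealthy bool, primary, secondary string) string {
	if primaryHealthy {
		return primary
	}
	return secondary
}

func main() {
	ht := newHealthTracker(3)
	// Primary scaled to zero: every check after the first one fails.
	for i, passed := range []bool{true, false, false, false, false} {
		status := ht.observe(passed)
		fmt.Printf("check %d: passed=%v answer=%s\n",
			i+1, passed, resolve(status, "primary", "dr-us-west-2"))
	}
}
```

With these inputs the answer switches to the DR record on the third consecutive failure. Real failover time also depends on the check interval and the DNS TTL, which this model leaves out.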
The biggest strength of cloud infrastructure is how easy it is to scale on demand — giving organizations the ability to test and iterate quickly. But this is a double-edged sword. Costs can spiral out of control: resources accumulate quietly, nobody rightsizes after traffic drops, dev infrastructure gets forgotten. This is where AI can help. One of the biggest strengths of LLMs is analysis — given data, they can reason about it, draw conclusions, and recommend actions.
A Go service runs as a Kubernetes CronJob, scanning AWS resources and sending findings to Claude via Bedrock. The model reasons about each resource — whether it's wasteful, why, and what to do. AI identified ~$120–130/month of waste in my test environment. See the details here →