A complete microservices platform running on Amazon EKS — built from scratch to demonstrate what production infrastructure actually looks like, not what tutorials pretend it looks like.
The Problem
Every DevOps portfolio project looks the same. Terraform creates a VPC, spins up an EKS cluster, deploys nginx, and calls it “production-ready.” The README has 30 bullet points of AWS services. The repo has one commit that adds 200 files. Nobody learned anything building it, and interviewers can tell.
I wanted to build something different. Something that answers the questions that actually come up when you run infrastructure for a team:
- How do pods get database passwords without anyone putting credentials in Git?
- What happens when your Spot instances get a 2-minute termination notice at 3 AM?
- How do you know if a service is slow because of a bad query or because a node is running out of memory?
- How do you make sure a junior dev can’t accidentally `kubectl delete namespace production`?
These aren’t theoretical questions. They’re the difference between a portfolio project and a real platform.
What I Built
Five microservices — UI, Catalog, Cart, Orders, Checkout — each with its own database, deployed on EKS across three availability zones. The kind of setup you’d find at a mid-size company running retail services on AWS, with the same constraints around cost, security, and operational readiness.
Infrastructure That Actually Makes Decisions
The VPC has three subnet tiers, not two. Public subnets for ALBs. Private app subnets for EKS nodes. Private data subnets for RDS and ElastiCache.
Why? Because when an EKS node gets compromised, the attacker lands in the app tier. They can reach the database port, but they’re in a different subnet with a different security group — they still need credentials from Secrets Manager, which requires a Pod Identity role they don’t have.
Every security group uses standalone rules instead of inline blocks. This matters because Terraform destroys and recreates a security group when you modify an inline rule — cascading through every resource that references it. Standalone rules are additive. No downtime, no cascading destroys.
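As a sketch of what a standalone rule looks like in the AWS provider (v5+), assuming illustrative names — `app`, `alb`, and port 8080 are placeholders, not the repo's actual values:

```hcl
resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = var.vpc_id
  # No inline ingress {} / egress {} blocks — rules live in their own resources.
}

# Standalone rule: adding or changing it never forces the group to be recreated.
resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  from_port                    = 8080
  to_port                      = 8080
  ip_protocol                  = "tcp"
}
```

Mixing the two styles on the same group causes Terraform to fight itself over rule ownership, so the pattern only works if it's applied consistently.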
KMS has three separate keys — one for S3, one for EKS etcd, one for RDS. Revoking one doesn’t break the others. All toggled off in dev to save $3/month, on in prod. Same code, different variable.
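A minimal sketch of the toggle pattern, assuming a hypothetical `enable_kms` variable — the real variable and key names may differ:

```hcl
variable "enable_kms" {
  description = "Create customer-managed KMS keys (off in dev, on in prod)"
  type        = bool
  default     = false
}

# One key per service; count gates the whole resource on the flag.
resource "aws_kms_key" "rds" {
  count               = var.enable_kms ? 1 : 0
  description         = "CMK for RDS encryption"
  enable_key_rotation = true
}
```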
Secrets Nobody Ever Sees
Terraform generates database passwords with random_password. Stores them in AWS Secrets Manager. The Secrets Store CSI Driver fetches them at pod startup using the service account’s Pod Identity role and mounts them as Kubernetes Secrets.
No passwords in Git. No passwords in Helm charts. No passwords typed into kubectl. No human ever sees the password.
If a pod tries to access a secret it shouldn’t have, it gets AccessDeniedException — not a silent fallback to an overpermissioned node role. That’s why I chose Pod Identity over IRSA. IRSA fails silently. Pod Identity fails loudly.
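The wiring above can be sketched as a `SecretProviderClass` for the AWS provider of the Secrets Store CSI Driver — the secret path, service name, and key below are assumptions for illustration:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: catalog-db
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "catalog/db-password"   # Secrets Manager secret name
        objectType: "secretsmanager"
  # Also sync the mounted value into a Kubernetes Secret for env-var use.
  secretObjects:
    - secretName: catalog-db
      type: Opaque
      data:
        - objectName: "catalog/db-password"
          key: password
```

The fetch happens with the credentials of the pod's service account, so a pod without the matching Pod Identity association gets the `AccessDeniedException` described above.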
Autoscaling That Thinks
HPA watches CPU utilisation and scales pods. But more pods need more nodes. Karpenter watches for unschedulable pods and provisions a right-sized node in about 60 seconds — not 3-5 minutes like Cluster Autoscaler. If a pod needs 256 MB, Karpenter picks a t3.micro, not a t3.large.
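The pod-scaling half of that loop is a standard `autoscaling/v2` HPA; the service name, replica bounds, and 70% target here are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when avg CPU crosses 70%
```

When the HPA adds pods that no node can fit, those pods go `Pending` — which is exactly the signal Karpenter acts on.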
Two NodePools:
- On-demand for baseline stability
- Spot for burst capacity at 60-70% savings
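The spot pool can be sketched as a Karpenter `NodePool` (v1 API, assuming a default `EC2NodeClass` exists — names and the CPU limit are placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]      # this pool only provisions Spot capacity
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "64"                   # cap total burst capacity from this pool
```

The on-demand pool is the same shape with `values: ["on-demand"]` and no aggressive limit.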
An SQS queue catches spot interruption warnings via EventBridge. Karpenter drains the node before AWS reclaims it. The pod moves to another node. Users don’t notice.
GitOps, Not “Push and Pray”
ArgoCD watches the Helm values in Git. Push a change, ArgoCD syncs it to the cluster.
- `selfHeal: true` — if someone `kubectl edit`s a deployment in production, ArgoCD reverts it within seconds
- `prune: true` — if you delete a resource from Git, ArgoCD deletes it from the cluster
- Rollback is `git revert`, not “find the last working image tag”
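Those two flags live in the Application's `syncPolicy`. A sketch, with a placeholder repo URL and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: catalog
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/devops-eks-ausmart   # placeholder
    path: charts/catalog
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: catalog
  syncPolicy:
    automated:
      selfHeal: true   # revert out-of-band cluster changes
      prune: true      # delete resources removed from Git
```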
GitHub Actions builds the image, pushes to ECR, updates the Helm values file. ArgoCD picks it up. No kubectl apply in CI pipelines. No imperative commands. Git is the single source of truth.
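The release job reduces to roughly this shape — registry login steps omitted, and the `yq` path, image name, and values file location are assumptions about the repo layout:

```yaml
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image to ECR
        run: |
          docker build -t "$ECR_REGISTRY/catalog:${GITHUB_SHA::7}" .
          docker push "$ECR_REGISTRY/catalog:${GITHUB_SHA::7}"
      - name: Bump the image tag ArgoCD watches
        run: |
          yq -i ".image.tag = \"${GITHUB_SHA::7}\"" charts/catalog/values.yaml
          git commit -am "release: catalog ${GITHUB_SHA::7}"
          git push
```

The pipeline never talks to the cluster; its only output is a Git commit.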
Observability From Day One
ADOT collectors run as DaemonSets. Traces go to X-Ray — you can follow a single request from the ALB through the UI service, into the Catalog API, down to the MySQL query.
| Signal | Destination | Dev Retention | Prod Retention |
|---|---|---|---|
| Traces | AWS X-Ray | 14 days | 90 days |
| Logs | CloudWatch | 14 days | 90 days |
| Metrics | Amazon Managed Prometheus → Grafana | Real-time | Real-time |
When latency spikes on checkout, you don’t guess. You look at the trace, find the slow span, check if it’s the Redis connection or the SQS publish, and fix it.
The Numbers
| | Dev | Prod |
|---|---|---|
| Monthly cost | ~$293 | ~$465 |
| KMS encryption | Off | On |
| WAF | Off | On |
| Multi-AZ RDS | Off | On |
| NAT Gateways | 1 | 3 |
| VPC endpoints | Off | On |
Same code. Same modules. Different terraform.tfvars.
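Side by side, the two files differ only in values — these variable names are hypothetical, standing in for whatever the modules actually expose:

```hcl
# dev.tfvars
enable_kms           = false
enable_waf           = false
rds_multi_az         = false
nat_gateway_count    = 1
enable_vpc_endpoints = false

# prod.tfvars
enable_kms           = true
enable_waf           = true
rds_multi_az         = true
nat_gateway_count    = 3
enable_vpc_endpoints = true
```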
Stack
| Layer | Technology |
|---|---|
| Cloud | AWS (EKS, RDS, ElastiCache, SQS, Secrets Manager, KMS) |
| IaC | Terraform (reusable modules, per-environment configs) |
| Containers | Docker (multi-stage, non-root, multi-arch) |
| Orchestration | Kubernetes on EKS, Helm charts, ArgoCD |
| Autoscaling | HPA + Karpenter (on-demand + spot) |
| Security | KMS, WAF, NetworkPolicies, Pod Identity, IMDSv2 |
| Observability | OpenTelemetry, X-Ray, CloudWatch, Prometheus, Grafana |
| CI/CD | GitHub Actions → ECR → ArgoCD |
Key Architectural Decisions
- Pod Identity over IRSA — fails loudly instead of silently falling back to node role
- 3-tier subnets — isolates app and data layers so compromised nodes can’t reach databases directly
- Karpenter over Cluster Autoscaler — right-sizes nodes in 60 seconds, not 3-5 minutes
- Standalone security group rules — prevents Terraform cascading destroys on rule changes
- Separate KMS keys per service — revoking one doesn’t break the others
Full decision log: DECISIONS.md
Links
- GitHub: devops-eks-ausmart
- Based on: AWS Retail Store Sample App
- Architecture Decisions: DECISIONS.md