A complete microservices platform running on Amazon EKS — built from scratch to demonstrate what production infrastructure actually looks like, not what tutorials pretend it looks like.

The Problem

Every DevOps portfolio project looks the same. Terraform creates a VPC, spins up an EKS cluster, deploys nginx, and calls it “production-ready.” The README has 30 bullet points of AWS services. The repo has one commit that adds 200 files. Nobody learned anything building it, and interviewers can tell.

I wanted to build something different. Something that answers the questions that actually come up when you run infrastructure for a team:

  • How do pods get database passwords without anyone putting credentials in Git?
  • What happens when your Spot instances get a 2-minute termination notice at 3 AM?
  • How do you know if a service is slow because of a bad query or because a node is running out of memory?
  • How do you make sure a junior dev can’t accidentally kubectl delete namespace production?

These aren’t theoretical questions. They’re the difference between a portfolio project and a real platform.


What I Built

Five microservices — UI, Catalog, Cart, Orders, Checkout — each with its own database, deployed on EKS across three availability zones. The kind of setup you’d find at a mid-size company running retail services on AWS, with the same constraints around cost, security, and operational readiness.


Infrastructure That Actually Makes Decisions

The VPC has three subnet tiers, not two. Public subnets for ALBs. Private app subnets for EKS nodes. Private data subnets for RDS and ElastiCache.

Why? Because when an EKS node gets compromised, the attacker lands in the app tier. They can reach the database port, but they’re in a different subnet with a different security group — they still need credentials from Secrets Manager, which requires a Pod Identity role they don’t have.
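In Terraform, the three tiers are just three subnet sets per AZ. A minimal sketch, assuming a `var.azs` list and illustrative CIDR offsets (not the repo's actual module):

```hcl
# Three subnet tiers per AZ (CIDR offsets are illustrative)
resource "aws_subnet" "public" {
  count                   = length(var.azs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true # ALBs live here
}

resource "aws_subnet" "app" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = var.azs[count.index] # EKS nodes, no public IPs
}

resource "aws_subnet" "data" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 20)
  availability_zone = var.azs[count.index] # RDS and ElastiCache only
}
```

Each tier gets its own route tables and security groups, so reachability between tiers is an explicit decision, not a default.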

Every security group uses standalone rules instead of inline blocks. With inline rules, any edit makes Terraform rewrite the group's entire rule set in a single update, and changes that force the group to be replaced cascade through every resource that references it. Standalone rules are additive: each rule is its own resource, so a change touches only that rule. No downtime, no cascading destroys.
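A sketch of the standalone-rule pattern, using the AWS provider's `aws_vpc_security_group_ingress_rule` resource (group names are illustrative):

```hcl
resource "aws_security_group" "rds" {
  name_prefix = "rds-"
  vpc_id      = aws_vpc.main.id
}

# Each rule is its own resource: editing or deleting it never
# touches the security group or anything that references it.
resource "aws_vpc_security_group_ingress_rule" "mysql_from_app_nodes" {
  security_group_id            = aws_security_group.rds.id
  referenced_security_group_id = aws_security_group.app_nodes.id
  ip_protocol                  = "tcp"
  from_port                    = 3306
  to_port                      = 3306
}
```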

KMS has three separate keys — one for S3, one for EKS etcd, one for RDS. Revoking one doesn’t break the others. All toggled off in dev to save $3/month, on in prod. Same code, different variable.
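The toggle is a plain boolean driving `count`; a sketch with an assumed `enable_kms` variable name:

```hcl
variable "enable_kms" {
  type        = bool
  description = "Create customer-managed KMS keys (prod) or skip them (dev)"
  default     = false
}

resource "aws_kms_key" "rds" {
  count               = var.enable_kms ? 1 : 0
  description         = "CMK for RDS storage encryption"
  enable_key_rotation = true
}

# Pass null downstream to fall back to the AWS-managed key in dev
locals {
  rds_kms_key_arn = var.enable_kms ? aws_kms_key.rds[0].arn : null
}
```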


Secrets Nobody Ever Sees

Terraform generates database passwords with random_password and stores them in AWS Secrets Manager. At pod startup, the Secrets Store CSI Driver fetches them using the service account's Pod Identity role, mounts them into the pod as files, and syncs them into Kubernetes Secrets.

No passwords in Git. No passwords in Helm charts. No passwords typed into kubectl. No human ever sees the password.
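The CSI wiring looks roughly like this SecretProviderClass for the AWS provider of the Secrets Store CSI Driver (secret and key names here are assumptions, not the repo's actual ones):

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: catalog-db
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "catalog-db-password"   # Secrets Manager secret name
        objectType: "secretsmanager"
  # Optionally mirror the mounted file into a Kubernetes Secret
  secretObjects:
    - secretName: catalog-db
      type: Opaque
      data:
        - objectName: "catalog-db-password"
          key: password
```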

If a pod tries to access a secret it shouldn’t have, it gets AccessDeniedException — not a silent fallback to an overpermissioned node role. That’s why I chose Pod Identity over IRSA. IRSA fails silently. Pod Identity fails loudly.


Autoscaling That Thinks

HPA watches CPU utilisation and scales pods. But more pods need more nodes. Karpenter watches for unschedulable pods and provisions a right-sized node in about 60 seconds — not 3-5 minutes like Cluster Autoscaler. If a pod needs 256 MB, Karpenter picks a t3.micro, not a t3.large.
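The HPA half is standard autoscaling/v2; a sketch for one service (name, replica bounds, and threshold are assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out past 70% average CPU
```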

Two NodePools:

  • On-demand for baseline stability
  • Spot for burst capacity at 60-70% savings
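The spot pool can be sketched as a Karpenter NodePool (v1 schema; names, instance categories, and limits are assumptions):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]          # the on-demand pool says ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["t", "m", "c"]
  limits:
    cpu: "100"                      # cap total vCPUs this pool can provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```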

An SQS queue catches spot interruption warnings via EventBridge. Karpenter drains the node before AWS reclaims it. The pod moves to another node. Users don’t notice.


GitOps, Not “Push and Pray”

ArgoCD watches the Helm values in Git. Push a change, ArgoCD syncs it to the cluster.

  • selfHeal: true — if someone kubectl edits a deployment in production, ArgoCD reverts it within seconds
  • prune: true — if you delete a resource from Git, ArgoCD deletes it from the cluster
  • Rollback is git revert, not “find the last working image tag”
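Those flags live in the Application's syncPolicy; a sketch with an assumed repo URL and chart path:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: catalog
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform.git  # assumed
    targetRevision: main
    path: charts/catalog
  destination:
    server: https://kubernetes.default.svc
    namespace: catalog
  syncPolicy:
    automated:
      selfHeal: true   # revert manual kubectl edits
      prune: true      # delete resources removed from Git
```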

GitHub Actions builds the image, pushes to ECR, updates the Helm values file. ArgoCD picks it up. No kubectl apply in CI pipelines. No imperative commands. Git is the single source of truth.
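A rough shape of that workflow, with assumed paths, region, and role (the yq bump is one common way to update the values file; the repo may do it differently):

```yaml
name: deploy-catalog
on:
  push:
    branches: [main]
    paths: ["src/catalog/**"]

permissions:
  id-token: write   # OIDC federation into AWS, no long-lived keys
  contents: write   # push the values bump back to Git

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        run: |
          IMAGE=${{ steps.ecr.outputs.registry }}/catalog:${{ github.sha }}
          docker build -t "$IMAGE" src/catalog
          docker push "$IMAGE"
      - name: Bump Helm values (ArgoCD syncs the rest)
        run: |
          yq -i '.image.tag = "${{ github.sha }}"' charts/catalog/values.yaml
          git config user.name ci-bot
          git config user.email ci-bot@users.noreply.github.com
          git commit -am "deploy: catalog ${{ github.sha }}"
          git push
```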


Observability From Day One

ADOT collectors run as DaemonSets. Traces go to X-Ray — you can follow a single request from the ALB through the UI service, into the Catalog API, down to the MySQL query.

| Signal  | Destination                         | Dev Retention | Prod Retention |
|---------|-------------------------------------|---------------|----------------|
| Traces  | AWS X-Ray                           | 14 days       | 90 days        |
| Logs    | CloudWatch                          | 14 days       | 90 days        |
| Metrics | Amazon Managed Prometheus → Grafana | Real-time     | Real-time      |

When latency spikes on checkout, you don’t guess. You look at the trace, find the slow span, check if it’s the Redis connection or the SQS publish, and fix it.


The Numbers

|                | Dev   | Prod  |
|----------------|-------|-------|
| Monthly cost   | ~$293 | ~$465 |
| KMS encryption | Off   | On    |
| WAF            | Off   | On    |
| Multi-AZ RDS   | Off   | On    |
| NAT Gateways   | 1     | 3     |
| VPC endpoints  | Off   | On    |

Same code. Same modules. Different terraform.tfvars.
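The split above reduces to a handful of variables; a sketch of what the two files might contain (variable names are assumptions, not the repo's actual ones):

```hcl
# dev.tfvars
enable_kms           = false
enable_waf           = false
rds_multi_az         = false
nat_gateway_count    = 1
enable_vpc_endpoints = false

# prod.tfvars
enable_kms           = true
enable_waf           = true
rds_multi_az         = true
nat_gateway_count    = 3
enable_vpc_endpoints = true
```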


Stack

| Layer         | Technology                                             |
|---------------|--------------------------------------------------------|
| Cloud         | AWS (EKS, RDS, ElastiCache, SQS, Secrets Manager, KMS) |
| IaC           | Terraform (reusable modules, per-environment configs)  |
| Containers    | Docker (multi-stage, non-root, multi-arch)             |
| Orchestration | Kubernetes on EKS, Helm charts, ArgoCD                 |
| Autoscaling   | HPA + Karpenter (on-demand + spot)                     |
| Security      | KMS, WAF, NetworkPolicies, Pod Identity, IMDSv2        |
| Observability | OpenTelemetry, X-Ray, CloudWatch, Prometheus, Grafana  |
| CI/CD         | GitHub Actions → ECR → ArgoCD                          |

Key Architectural Decisions

  • Pod Identity over IRSA — fails loudly instead of silently falling back to node role
  • 3-tier subnets — isolates app and data layers so compromised nodes can’t reach databases directly
  • Karpenter over Cluster Autoscaler — right-sizes nodes in 60 seconds, not 3-5 minutes
  • Standalone security group rules — prevents Terraform cascading destroys on rule changes
  • Separate KMS keys per service — revoking one doesn’t break the others

Full decision log: DECISIONS.md