A complete microservices platform running on Amazon EKS — built from scratch to demonstrate what production infrastructure actually looks like, not what tutorials pretend it looks like.
The Problem
Every DevOps portfolio project looks the same. Terraform creates a VPC, spins up an EKS cluster, deploys nginx, and calls it “production-ready.” The README has 30 bullet points of AWS services. The repo has one commit that adds 200 files. Nobody learned anything building it, and interviewers can tell.
I wanted to build something different. Something that answers the questions that actually come up when you run infrastructure for a team:
- How do pods get database passwords without anyone putting credentials in Git?
- What happens when your Spot instances get a 2-minute termination notice at 3 AM?
- How do you know if a service is slow because of a bad query or because a node is running out of memory?
- How do you make sure a junior dev can’t accidentally `kubectl delete namespace production`?
These aren’t theoretical questions. They’re the difference between a portfolio project and a real platform.
What I Built
Five microservices — UI, Catalog, Cart, Orders, Checkout — each with its own database, deployed on EKS across three availability zones. The kind of setup you’d find at a mid-size company running retail services on AWS, with the same constraints around cost, security, and operational readiness.
Infrastructure That Actually Makes Decisions
The VPC has three subnet tiers, not two. Public subnets for ALBs. Private app subnets for EKS nodes. Private data subnets for RDS and ElastiCache.
Why? Because when an EKS node gets compromised, the attacker lands in the app tier. They can reach the database port, but they’re in a different subnet with a different security group — they still need credentials from Secrets Manager, which requires a Pod Identity role they don’t have.
Every security group uses standalone rules instead of inline blocks. This matters because Terraform destroys and recreates a security group when you modify an inline rule — cascading through every resource that references it. Standalone rules are additive. No downtime, no cascading destroys.
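As a sketch of what a standalone rule looks like in the AWS provider (v5+), assuming illustrative names — `app`, `alb`, and port 8080 are placeholders, not the repo's actual values:

```hcl
resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = var.vpc_id
  # No inline ingress {} / egress {} blocks — rules live in their own resources.
}

# Standalone rule: adding or changing it never forces the group to be recreated.
resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  from_port                    = 8080
  to_port                      = 8080
  ip_protocol                  = "tcp"
}
```

Mixing the two styles on the same group causes Terraform to fight itself over rule ownership, so the pattern only works if it's applied consistently.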
KMS has three separate keys — one for S3, one for EKS etcd, one for RDS. Revoking one doesn’t break the others. All toggled off in dev to save $3/month, on in prod. Same code, different variable.
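A minimal sketch of the toggle pattern, assuming a hypothetical `enable_kms` variable — the real variable and key names may differ:

```hcl
variable "enable_kms" {
  description = "Create customer-managed KMS keys (off in dev, on in prod)"
  type        = bool
  default     = false
}

# One key per service; count gates the whole resource on the flag.
resource "aws_kms_key" "rds" {
  count               = var.enable_kms ? 1 : 0
  description         = "CMK for RDS encryption"
  enable_key_rotation = true
}
```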
Secrets Nobody Ever Sees
Terraform generates database passwords with random_password. Stores them in AWS Secrets Manager. The Secrets Store CSI Driver fetches them at pod startup using the service account’s Pod Identity role and mounts them as Kubernetes Secrets.
No passwords in Git. No passwords in Helm charts. No passwords typed into kubectl. No human ever sees the password.
If a pod tries to access a secret it shouldn’t have, it gets AccessDeniedException — not a silent fallback to an overpermissioned node role. That’s why I chose Pod Identity over IRSA. IRSA fails silently. Pod Identity fails loudly.
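The wiring above can be sketched as a `SecretProviderClass` for the AWS provider of the Secrets Store CSI Driver — the secret path, service name, and key below are assumptions for illustration:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: catalog-db
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "catalog/db-password"   # Secrets Manager secret name
        objectType: "secretsmanager"
  # Also sync the mounted value into a Kubernetes Secret for env-var use.
  secretObjects:
    - secretName: catalog-db
      type: Opaque
      data:
        - objectName: "catalog/db-password"
          key: password
```

The fetch happens with the credentials of the pod's service account, so a pod without the matching Pod Identity association gets the `AccessDeniedException` described above.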
Autoscaling That Thinks
HPA watches CPU utilisation and scales pods. But more pods need more nodes. Karpenter watches for unschedulable pods and provisions a right-sized node in about 60 seconds — not 3-5 minutes like Cluster Autoscaler. If a pod needs 256 MB, Karpenter picks a t3.micro, not a t3.large.
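The pod-scaling half of that loop is a standard `autoscaling/v2` HPA; the service name, replica bounds, and 70% target here are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when avg CPU crosses 70%
```

When the HPA adds pods that no node can fit, those pods go `Pending` — which is exactly the signal Karpenter acts on.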
Two NodePools:
- On-demand for baseline stability
- Spot for burst capacity at 60-70% savings
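The spot pool can be sketched as a Karpenter `NodePool` (v1 API, assuming a default `EC2NodeClass` exists — names and the CPU limit are placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]      # this pool only provisions Spot capacity
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "64"                   # cap total burst capacity from this pool
```

The on-demand pool is the same shape with `values: ["on-demand"]` and no aggressive limit.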
An SQS queue catches spot interruption warnings via EventBridge. Karpenter drains the node before AWS reclaims it. The pod moves to another node. Users don’t notice.
GitOps, Not “Push and Pray”
ArgoCD watches the Helm values in Git. Push a change, ArgoCD syncs it to the cluster.
- `selfHeal: true` — if someone `kubectl edit`s a deployment in production, ArgoCD reverts it within seconds
- `prune: true` — if you delete a resource from Git, ArgoCD deletes it from the cluster
- Rollback is `git revert`, not “find the last working image tag”
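Those two flags live in the Application's `syncPolicy`. A sketch, with a placeholder repo URL and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: catalog
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/devops-eks-ausmart   # placeholder
    path: charts/catalog
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: catalog
  syncPolicy:
    automated:
      selfHeal: true   # revert out-of-band cluster changes
      prune: true      # delete resources removed from Git
```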
GitHub Actions builds the image, pushes to ECR, updates the Helm values file. ArgoCD picks it up. No kubectl apply in CI pipelines. No imperative commands. Git is the single source of truth.
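The release job reduces to roughly this shape — registry login steps omitted, and the `yq` path, image name, and values file location are assumptions about the repo layout:

```yaml
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image to ECR
        run: |
          docker build -t "$ECR_REGISTRY/catalog:${GITHUB_SHA::7}" .
          docker push "$ECR_REGISTRY/catalog:${GITHUB_SHA::7}"
      - name: Bump the image tag ArgoCD watches
        run: |
          yq -i ".image.tag = \"${GITHUB_SHA::7}\"" charts/catalog/values.yaml
          git commit -am "release: catalog ${GITHUB_SHA::7}"
          git push
```

The pipeline never talks to the cluster; its only output is a Git commit.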
Observability From Day One
ADOT collectors run as DaemonSets. Traces go to X-Ray — you can follow a single request from the ALB through the UI service, into the Catalog API, down to the MySQL query.
| Signal | Destination | Dev Retention | Prod Retention |
|---|---|---|---|
| Traces | AWS X-Ray | 14 days | 90 days |
| Logs | CloudWatch | 14 days | 90 days |
| Metrics | Amazon Managed Prometheus → Grafana | Real-time | Real-time |
When latency spikes on checkout, you don’t guess. You look at the trace, find the slow span, check if it’s the Redis connection or the SQS publish, and fix it.
The Numbers
| | Dev | Prod |
|---|---|---|
| Monthly cost | ~$293 | ~$465 |
| KMS encryption | Off | On |
| WAF | Off | On |
| Multi-AZ RDS | Off | On |
| NAT Gateways | 1 | 3 |
| VPC endpoints | Off | On |
Same code. Same modules. Different terraform.tfvars.
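Side by side, the two files differ only in values — these variable names are hypothetical, standing in for whatever the modules actually expose:

```hcl
# dev.tfvars
enable_kms           = false
enable_waf           = false
rds_multi_az         = false
nat_gateway_count    = 1
enable_vpc_endpoints = false

# prod.tfvars
enable_kms           = true
enable_waf           = true
rds_multi_az         = true
nat_gateway_count    = 3
enable_vpc_endpoints = true
```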
Stack
| Layer | Technology |
|---|---|
| Cloud | AWS (EKS, RDS, ElastiCache, SQS, Secrets Manager, KMS) |
| IaC | Terraform (reusable modules, per-environment configs) |
| Containers | Docker (multi-stage, non-root, multi-arch) |
| Orchestration | Kubernetes on EKS, Helm charts, ArgoCD |
| Autoscaling | HPA + Karpenter (on-demand + spot) |
| Security | KMS, WAF, NetworkPolicies, Pod Identity, IMDSv2 |
| Observability | OpenTelemetry, X-Ray, CloudWatch, Prometheus, Grafana |
| CI/CD | GitHub Actions → ECR → ArgoCD |
Key Architectural Decisions
- Pod Identity over IRSA — fails loudly instead of silently falling back to node role
- 3-tier subnets — isolates app and data layers so compromised nodes can’t reach databases directly
- Karpenter over Cluster Autoscaler — right-sizes nodes in 60 seconds, not 3-5 minutes
- Standalone security group rules — prevents Terraform cascading destroys on rule changes
- Separate KMS keys per service — revoking one doesn’t break the others
Full decision log: DECISIONS.md
Links
- GitHub: devops-eks-ausmart
- Based on: AWS Retail Store Sample App
- Architecture Decisions: DECISIONS.md