Black Friday Resilient E-Commerce Architecture on AWS
📌 Index
-
Quick Summary
- Defence-in-depth overview
- Key technical components
-
Before and After Security Layer
- Legacy architecture
- Hardened architecture with SG chaining, mTLS, IRSA
-
Security Overview of Small-Scale Platform: ShopNow
- Audit findings
- Phased security program
- Outcomes
- Deep Dives (Hard, Interview-Relevant Parts)
-
Supply Chain, CI/CD & Runtime Controls
- CI/CD identity hardening (OIDC)
- Image scanning & signing
- SAST/SCA in CI gates
- Admission controllers
- Runtime detection & immutable infra
-
Data Protection and PII / Payments
- Encryption at rest & in transit
- S3 hardening
- Payment and PII handling
-
Threat Model (Top Threats + Mitigations)
- Account takeover / credential theft
- Lateral movement from compromised pod
- Data exfiltration
- Supply chain compromise
- DDoS / Flash sale disruption
- Ransomware / snapshot deletion
- Compliance Checklist (PCI / GDPR / HIPAA Highlights)
-
Prioritized Rollout Plan (Practical, with Tasks)
- Quick wins (days → 2 weeks)
- Mid-term (2–8 weeks)
- Long-term (2–6 months)
-
Operational Rules, KPIs & Runbook Snippet
- Policies to codify
- KPIs / SLOs
- Runbook example — Suspicious S3 exfil detection
-
Final Thoughts & Recommended Next Steps
- Lessons learned
- Operational and security takeaways
1 — Quick Summary
E-commerce platforms handle vast amounts of sensitive data — including PII, payment flows, and inventory logic — making them extremely high-value targets. During global flash sales like Amazon Great Indian Festival
, Flipkart Big Billion Days
and Black Friday
, these platforms face hundreds of millions of requests within hours, testing both performance and security at massive scale.
I was fascinated by how these large-scale platforms maintain resilience and security under such intense traffic, ensuring seamless shopping experiences while protecting customer data and financial transactions. The key lies in defence-in-depth — layering security and operational controls across edge, network, platform, application, and data layers, all while enforcing least privilege, zero-trust, and observability.
Defence-in-depth means layering protections at the edge, network, platform, application, and data layers while designing for least privilege, zero-trust, and operational observability.
The most technical, high-impact components that separate advanced designs from beginner ones are:
- mTLS for east-west trust between services (service identity)
- Security Group chaining for micro-segmentation
- WAF tuning at CloudFront edge (not just default rules)
- Shield Advanced for flash sale / DDoS resilience
- IRSA (IAM Roles for Service Accounts) to give Kubernetes pods least privilege AWS access
- Supply-chain controls (image signing, admission controllers, SCA/SAST)
2 — Let’s take a peak at before and after Security Layer (Overview)
Before
[Internet]
│
[CloudFront/ALB] (sg_alb) --> inbound 0.0.0.0/0:443
│
+-- sg_alb (Ingress: 0.0.0.0/0:443) --> accepts any IP
│
[VPC]
├─ EKS Worker Nodes / EC2 (sg_node)
│ ├─ Pods --> (no IRSA, use node role)
│ └─ sg_node --> (Ingress: 0.0.0.0/0:22, 0.0.0.0/0:443, Egress: 0.0.0.0/0:*)
│
├─ RDS (sg_rds)
│ └─ sg_rds --> (Ingress: 0.0.0.0/0:3306) <-- DB widely open
│
├─ ElastiCache (sg_cache)
│ └─ sg_cache --> (Ingress: 0.0.0.0/0:6379)
│
└─ Kafka (sg_kafka)
└─ sg_kafka --> (Ingress: 0.0.0.0/0:9092)
After
[Internet]
│
[CloudFront + WAF + Shield] --> only CloudFront IPs / WAF validated requests
│
[ALB / API Gateway] (sg_alb)
├─ Ingress --> CloudFront only (or ALB origin: CF signature header) | Only allow 443
│
[VPC (10.0.0.0/16)]
├─ Public Subnets
│ ├─ ALB (sg_alb) --> Ingress: CloudFront
│ └─ NAT (for egress) --> (managed)
│
└─ Private Subnets (per AZ)
├─ EKS Worker Nodes (sg_node) --> Ingress: sg_alb (443) via node ports only
│ └─ Pods --> (networkPolicy + per-service SG-model or node SG + strict NP)
│ - svc-orders (sg_orders) --> IRSA: OrdersRole
│ - svc-payments (sg_payments) --> IRSA: PaymentsRole
│ - svc-search (sg_search) --> IRSA: SearchRole
│ - envoy sidecars (mTLS enforced)
│
└─ Private DB Subnets
├─ RDS Aurora (sg_db) --> Ingress: sg_app (3306/5432) only
├─ ElastiCache (sg_cache) --> Ingress: sg_app (6379) only
└─ Kafka (sg_kafka) --> Ingress: sg_app (9092) only
3 — Real Time: Security Overview of Small-Scale Platform
For example ShopNow is a mid-stage e-commerce startup.
Audit findings:
- Shared node IAM roles
- Plaintext east-west traffic
- SGs with wide CIDRs
- Minimally tuned WAF
Phased security program executed:
- Edge first — CloudFront + WAF tuned + Shield Advanced.
→ Immediately reduced malicious bots and offered DDoS response. - Network segmentation — SG chaining + SSM (no SSH).
→ Stopped direct DB reachability. - IRSA roll-out — pod-level AWS permissions, tight trust policies, CloudTrail monitoring of role usage.
- mTLS deployment — App Mesh with Envoy for payments & orders first, certs via ACM PCA, then full mesh.
- Supply chain — ECR scanning, image signing (cosign), OPA Gatekeeper admission control, CI OIDC federation.
Outcome:
- MTTD reduced from days to <1 hour
- Phishing/data exfil prevented
- Zero-trust posture established for production
4 — Deep Dives (the hard, interview-relevant parts)
4.1 TLS vs mTLS — Why and How (Advanced)
TLS (client → server)
- Protects client → edge communications.
- Use ACM for browser/mobile certificates.
- TLS 1.2+ or 1.3, secure ciphers, HSTS.
mTLS (mutual TLS — east-west)
- Both parties present and validate certs.
- Creates service identity, not just network identity.
- Prevents internal impersonation: compromised pod cannot claim identity of
payments-service
without its cert.
How to implement at scale
- Service mesh: Istio or AWS App Mesh + Envoy sidecars. Sidecar handles mTLS, cert rotation and telemetry.
- Certificate authority: ACM Private CA or Vault PKI. Use short TTLs (days) and auto-rotation.
- Policy enforcement: Deny plaintext by default. Enforce mTLS via mesh PeerAuthentication & DestinationRule equivalents.
- Observability: Integrate Envoy metrics + OpenTelemetry traces. Since payloads are encrypted, sidecar telemetry is essential.
Flow: mTLS in App Mesh
- Pod spins up (e.g.,
payments-service
). - Envoy sidecar requests a cert from ACM PCA (via App Mesh integration).
- Cert is mounted into Envoy → Envoy advertises identity
spiffe://payments
. - When
orders
callspayments
:- Orders Envoy presents its cert (
spiffe://orders
). - Payments Envoy validates it against ACM PCA trust bundle.
- Both sides verify each other (mutual TLS).
- Orders Envoy presents its cert (
- If cert check fails (wrong service, expired cert, revoked cert), traffic is dropped.
Visual
Without mTLS (flat trust)
orders -----> payments
(any pod in VPC can spoof calls)
With mTLS
[orders pod]--[Envoy] <==TLS+cert==> [Envoy]--[payments pod]
| |
Cert (orders) Cert (payments)
issued by ACM PCA issued by ACM PCA
Operational Pitfalls & Mitigations (mTLS)
-
Pitfall: Certificate expiry causing outages
→ Mitigation: Stagger rollouts, automated rotation, readiness checks. -
Pitfall: Debugging complexity
→ Mitigation: Sidecar debug tooling, mutual TLS verification endpoints, and strong logging (Envoy + X-Ray/OpenTelemetry).
4.2 IRSA (IAM Roles for Service Accounts) — Deep Technical Notes
Problem IRSA Solves
Without IRSA, pods inherit node IAM roles.
- This leads to over-privilege (every pod on the node can use the same role).
- Increases lateral movement risk if a pod is compromised.
How IRSA Works
- EKS cluster has an OIDC provider (cluster OIDC).
- A Kubernetes ServiceAccount is annotated with an IAM Role ARN.
- Pod uses a web identity token → AWS STS issues temporary credentials for that role.
Best Practices
- Scope trust policy: Restrict
sts:AssumeRoleWithWebIdentity
Condition to:aud
= cluster OIDC audiencesub
= exact ServiceAccount ARN or namespace
- Least privilege IAM: One role per ServiceAccount with minimal permissions.
- Tag sessions: Require
iam:TagSession
→ add environment tags (e.g.,env=prod
) for conditional permissions. - Monitor: Use CloudTrail + Insights to detect abnormal role usage (frequency, geolocation).
- Rotate: Review roles/policies in CI/CD pipelines; enforce automated linters (
policy-sentry
,terraform-compliance
).
Example Trust Policy (Conceptual)
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::111111111111:oidc-provider/oidc.eks.region.amazonaws.com/id/CLUSTER_ID"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.region.amazonaws.com/id/CLUSTER_ID:sub": "system:serviceaccount:orders-ns:orders-sa",
"oidc.eks.region.amazonaws.com/id/CLUSTER_ID:aud": "sts.amazonaws.com"
}
}
}
Common Failure Modes
- Wrong OIDC URL or audience → Pods cannot assume role.
- Trust too broad (wildcard sub) → Risk of impersonation.
4.3 Security Groups (SG) — Advanced Patterns & Egress Control
Principles
- SG-to-SG referencing over CIDR → more dynamic, reduces hardcoding of IPs.
- Principle of least privilege: minimal ports, minimal sources.
- Example:
- ALB → App: allow 443 only
- App → DB: allow DB port only
- Example:
- Egress deny-by-default:
- Allow only explicitly required external endpoints:
- VPC Endpoints for S3, ECR, KMS
- Blocks arbitrary internet exfiltration
- Allow only explicitly required external endpoints:
Example Topology
- sg_alb
- Ingress: CloudFront IP space (or 0.0.0.0/0 if behind CF)
- Prefer WAF header verification over raw IP restrictions
- sg_app
- Ingress: from
sg_alb
- Egress: to
sg_db
,sg_cache
- Ingress: from
- sg_db
- Ingress: from
sg_app
only
- Ingress: from
- sg_management
- Egress: AWS SSM endpoints
- Admin access via SSM Session Manager, not open SSH
Automation & Drift
- AWS Config rules → detect overly permissive/public SGs
- IaC GitOps (Terraform, CDK) → source of truth
- Policy guardrails → OPA/Sentinel to prevent
0.0.0.0/0
drift - Nightly drift detection → reconcile SG state with IaC
4.4 WAF (CloudFront Edge) — Advanced Tuning
Why CloudFront + WAF
- Stops malicious traffic before it reaches ALB/API Gateway
- Protects origin cost, performance, and blast radius
Advanced Controls
- Managed rulesets (AWS / 3rd-party) → OWASP protection
- Rate limiting: fine-grained thresholds (e.g.,
/checkout
stricter than/home
) - Geo + bot filtering:
- Block scrapers / bad UAs
- Allowlist trusted marketplaces
- Header & cookie validation:
- Enforce presence of custom app headers
- Block invalid mobile signatures
- Progressive challenges:
- CAPTCHA or JavaScript challenge for signup / forgot password endpoints
Rollout Strategy
- Use COUNT mode first to measure false positives
- Switch to BLOCK after validation in staging/prod
- Canary rollout: staged CloudFront distributions with updated WAF
4.5 Shield Advanced — Justification & Playbook
When to Use
- High-value events (Black Friday, seasonal spikes)
- When scaling costs from attacks could be financially damaging
- Need AWS DDoS Response Team (DRT) involvement
Features
- DDoS cost protection (credits for scale-up during attack)
- DRT integration with custom WAF rulesets
- Global dashboards with real-time threat intelligence
Integration
- Combine Shield Advanced + Route 53 health checks
- Failover routing to backup region during multi-region attack
- Layer with WAF → block malicious sources early
4.6 Use SSM over SSH for Bastion Host
SSH (Legacy)
- Works via key pair / password on port 22
- Security surface:
- Requires port 22 open (brute-force + scanning risk)
- Keys/passwords are long-lived secrets → leak/rotation risks
- Manual
.ssh/authorized_keys
management - Weak audit — no native per-command logs
- Bastion compromise = high lateral movement risk
SSM (AWS Systems Manager Session Manager)
- Works via AWS API (HTTPS 443), no inbound ports
- Security surface:
- No port 22 exposure → closed SG
- No keys → IAM role-based auth
- Strong audit: sessions logged to CloudTrail, CloudWatch Logs, S3
- Enforce MFA on IAM before sessions allowed
- Just-in-time: temporary session permissions (e.g., 1 hour)
- Reduced lateral movement → IAM boundaries + scoped access
5 — Supply Chain, CI/CD & Runtime Controls
CI/CD Identity Hardening with OIDC
- Problem: Long-lived IAM keys in CI/CD pipelines → high theft/exfiltration risk.
- Solution: Use OIDC federation for ephemeral credentials.
- GitHub Actions, GitLab CI, or AWS CodeBuild runners request OIDC tokens.
- IAM trust policies validate repo/branch and issue short-lived STS creds.
- No static AWS keys stored in pipeline secrets.
OIDC Flow for GitHub Actions → AWS
[GitHub Actions Runner]
│ (1) Requests OIDC token
▼
[GitHub OIDC Provider]
│ (2) Issues signed OIDC token (short-lived)
▼
[AWS IAM OIDC Federation Trust]
│ (3) Validates token (aud, repo, branch, etc.)
▼
[AWS STS]
│ (4) Issues temporary credentials
▼
[Deploy Role in AWS Account]
Permissions:
- eks:UpdateCluster
- s3:PutObject
- cloudformation:DeployStack
- Image scanning & signing: ECR scan on push + sign images with cosign. Admission controllers enforce signature presence.
- SAST/SCA: integrate Snyk/Trivy/Sonar in CI gates (blocking release on critical CVEs).
- Admission controllers: OPA/Gatekeeper or Kyverno for policies (no hostPath, required resource limits, required image registry signed).
- Runtime detection: Falco / Sysdig for suspicious syscalls, detect container escapes, file changes.
- Immutable infra: Replace AMIs via image pipelines; use managed node groups and node image rotation.
6. Data Protection and PII / Payments
- Encryption at rest: RDS/Aurora, EBS, S3, ElastiCache with CMK (customer-managed) and rotation policy.
- Encryption in transit: TLS everywhere, mTLS for services. Use KMS for envelope encryption.
- S3 hardening: Block public access, bucket policies, Access Points, S3 Object Lock for critical buckets, Macie for PII detection.
- Payment data: Don’t store card PANs — tokenize via PCI-certified provider. If processing requires in-scope infra, isolate payment service in a dedicated VPC/subnet with strict logging and controls.
- PII: Data classification, retention/erase workflows (GDPR SARs) and attribute-level encryption when needed.
7. Threat Model (Top Threats + Mitigations)
1. Account takeover / credential theft
Mitigations: Enforce MFA, strong password policy, monitoring for abnormal console/API access, short-lived tokens.
2. Lateral movement from compromised pod
Mitigations: IRSA, SG chaining, network policies, mTLS, runtime detection (Falco).
3. Data exfiltration from S3/RDS
Mitigations: Block public access, Macie, VPC endpoints, KMS key policy restricting deletion, Egress deny lists.
4. Supply chain compromise (CI/CD)
Mitigations: OIDC federation, artifact signing, SCA/SAST, admission controller.
5. DDoS / Flash sale disruption
Mitigations: CloudFront + WAF, Shield Advanced, Route 53 failover.
6. Ransomware / snapshot deletion
Mitigations: Immutable backups, cross-region replication, restricted KMS keys, MFA Delete on S3.
8. Compliance Checklist (PCI / GDPR / HIPAA Highlights)
PCI DSS (if handling card data)
- Segregated network for payment stack.
- Strict logging/monitoring; retain logs per PCI.
- Third-party PCI-certified processor preferred; if in scope, implement cardholder data environment (CDE) controls, encryption, access logs.
GDPR
- Data minimization, subject access/deletion workflows, data mapping, DPIAs for new features.
- Data residency controls via region selection and cross-region replication policies.
- Data processing agreements and breach notification timelines.
HIPAA (if handling PHI)
- BAA with AWS + any third parties.
- Encryption at rest/in transit, strict access controls, audit logging, and disaster recovery.
9. Prioritized Rollout Plan (Practical, with Tasks)
Quick wins (days → 2 weeks)
- Enforce MFA + disable root/long-lived API keys.
- Enable multi-region CloudTrail + centralized encrypted S3 bucket for logs.
- Block public S3 access; run Macie initial scan.
- Replace SSH bastions with SSM Session Manager; delete public SSH keys.
- Attach AWS WAF at CloudFront with managed OWASP rules + rate limiting (COUNT → BLOCK).
- Enable GuardDuty.
Mid-term (2–8 weeks)
- Implement IRSA for all EKS services (create per-svc roles, trust policies).
- Rework Security Groups into SG-to-SG chains; deny broad AWS SGs/CIDRs.
- ECR scanning + sign images (cosign); add CI checks to fail on critical CVEs.
- Deploy OPA/Gatekeeper admission controller policies (disallow privileged containers, require signed images).
- Configure Security Hub and SIEM ingestion (Security Lake / Splunk / Elastic).
Long-term (2–6 months)
- Deploy full mTLS service mesh (App Mesh / Istio), start with critical services (payments, orders), then full rollout.
- Onboard Shield Advanced and multi-region failover for SLAs around flash sales.
- Build full supply chain: signed artifacts, provenance, automated policy gates, SBOM generation.
- Run continuous red/purple team exercises and chaos security testing.
- Establish continuous drift detection & automatic remediation for SGs/IAM via GitOps.
10. Operational Rules, KPIs & Runbook Snippet
Policies to Codify
- IAM least privilege, role review cadence.
- Secret rotation policy (Secrets Manager / Vault).
- Logging retention policy and access controls.
- Incident response SLAs & notification matrices.
KPIs / SLOs
- Mean Time to Detect (MTTD) target: < 60 minutes (goal: < 15 minutes).
- Mean Time to Remediate (MTTR) target: < 4 hours for critical incidents.
- % of production images scanned & passing: > 99%.
- Number of high-severity CVEs open: target 0.
- Successful DR restore time: target < RTO.
Runbook Snippet — Suspicious S3 Exfil Detection
- Trigger: Macie or GuardDuty flags large GETs from unusual IP + CloudTrail shows new role assumption.
- Contain:
- Immediately add Deny policy to suspected role (using
aws iam put-role-policy
or via AWS Organizations SCP). - Add WAF rule to block the source/gateway IPs or rate limit.
- Immediately add Deny policy to suspected role (using
- Collect:
- Preserve CloudTrail logs, S3 access logs, VPC Flow Logs; create forensic snapshot of involved EC2/EKS nodes.
- Eradicate & Recover:
- Rotate affected secrets, invalidate session tokens, restore compromised objects from last known good backups.
- Post-Mortem:
- Root cause analysis, remediation of the XSS/bug/exposed credentials, update WAF rules, notify legal/CISO if PII affected.
11. Final Thoughts & Recommended Next Steps
This merged, advanced guide shifts the focus from “add services” to how you use them correctly:
- mTLS establishes service identity and prevents internal impersonation. Operationalize it with a mesh.
- IRSA turns pod compromise from catastrophic into survivable by scoping role permissions.
- Security Groups are not a minor detail — they’re the last network fence and must be designed as SG-to-SG chains with egress controls.
- WAF and Shield are resilience layers — tune WAF at edge, use Shield Advanced for critical events.
- Supply chain and CI/CD changes are as important as runtime changes; enforce signing, scanning, and policy gates.