API Gateway
Why It Exists
With a monolith, every client talks to one server. Simple. The moment the architecture moves to microservices, clients suddenly need to know the address of every service, manage auth tokens per service, and handle retries individually. That's a mess.
The API Gateway fixes this. One URL, one TLS termination point, one place to enforce policies. Clients don't need to know (or care) how many services sit behind it.
How It Works
- Request Ingress - Client sends an HTTPS request to the gateway's public endpoint.
- Authentication - The gateway validates JWT/OAuth tokens before routing anything. Bad requests get killed at the edge, which saves backends from doing pointless work.
- Rate Limiting - A token bucket or sliding window algorithm checks per-client quotas. Exceed the limit and back comes a 429 Too Many Requests. No negotiation.
- Routing - Path-based or header-based rules map the request to an upstream service. Most modern gateways also support weighted routing, which is how canary deployments work. (A minimal sketch combining the auth, rate-limit, and routing steps appears after this list.)
- Request Transformation - Headers get added (trace IDs, tenant context), and bodies may be transformed (REST to gRPC, protocol buffers to JSON).
- Response Aggregation - For composite endpoints, the gateway fans out to multiple services, merges the responses, and returns a single payload to the client.
- Caching - Read-heavy endpoints that get hammered can be cached at the gateway layer with configurable TTLs. This is cheap and surprisingly effective.
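To make the authentication, rate-limiting, and routing steps concrete, here is a minimal Go sketch. Everything specific in it is an assumption: the orders.internal upstream, the stub validateToken, and the single global limiter (a real gateway validates actual JWTs and keys one bucket per client).

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"

	"golang.org/x/time/rate"
)

// validateToken is a stand-in for real JWT/OAuth validation (assumption:
// a real gateway verifies signature, expiry, and audience claims).
func validateToken(token string) bool {
	return token != ""
}

// newTraceID is a placeholder; real gateways use W3C traceparent or similar.
func newTraceID() string { return "trace-id-placeholder" }

func main() {
	// Hypothetical upstream; in practice this comes from service discovery.
	upstream, _ := url.Parse("http://orders.internal:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Token bucket: 100 req/s sustained, bursts up to 200.
	limiter := rate.NewLimiter(rate.Limit(100), 200)

	http.HandleFunc("/api/orders/", func(w http.ResponseWriter, r *http.Request) {
		// Authentication: reject bad tokens at the edge, before routing anything.
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !validateToken(token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// Rate limiting: over quota means 429, no negotiation.
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		// Transformation: attach context headers for downstream tracing.
		r.Header.Set("X-Trace-Id", newTraceID())
		// Routing: hand off to the upstream reverse proxy.
		proxy.ServeHTTP(w, r)
	})

	// TLS termination omitted; the real listener would be ListenAndServeTLS.
	http.ListenAndServe(":8080", nil)
}
```

The ordering is the point: reject cheaply (auth, quota) before paying for the proxy hop.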
Decision Criteria
| Factor | Managed (AWS APIGW) | Self-Hosted (Kong/Envoy) |
|---|---|---|
| Ops burden | Low | High |
| Customization | Limited | Full control |
| Latency | ~10-30ms overhead | ~1-5ms overhead |
| Cost at scale | Expensive (per-request) | Infrastructure cost only |
| Vendor lock-in | High | None |
Production Considerations
- Horizontal scaling - Gateways must be stateless. Store rate limit counters in Redis, not in-process memory (see the counter sketch after this list). The moment state goes into the gateway, it becomes a scaling headache.
- Health checks - Run active health checks against upstream services (a minimal prober sketch also follows this list). Passive checks catch failures after they happen. Active checks prevent routing to dead backends in the first place.
- Graceful degradation - When a downstream service gets slow, the gateway should time out and return a degraded response. Don't hold connections open waiting for a miracle.
- Observability - Every request should emit structured logs with correlation IDs, latency histograms, and error rates per upstream. If it's not possible to tell which upstream is misbehaving within 30 seconds, the observability is broken.
- Blue-green deployments - Gateway routing rules shift traffic between service versions for zero-downtime deploys. This is one of the gateway's most practical benefits, and it comes essentially for free.
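On the horizontal-scaling point: a shared fixed-window counter is about ten lines with go-redis. Because the count lives in Redis, adding or killing gateway replicas never resets a client's quota. The key scheme and parameters here are illustrative, not a standard.

```go
package ratelimit

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow is a fixed-window counter shared by every gateway replica.
// The key scheme "rl:<client>" is illustrative.
func Allow(ctx context.Context, rdb *redis.Client, clientID string, limit int64, window time.Duration) (bool, error) {
	key := "rl:" + clientID
	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First request of the window starts the TTL clock.
		rdb.Expire(ctx, key, window)
	}
	return count <= limit, nil
}
```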
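And the active health-check prober, as a sketch: the /healthz path, the 2s probe timeout, and the polling interval are assumptions, not any particular gateway's defaults. Kong and Envoy ship active checking built in; this just shows the shape.

```go
package health

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Prober actively polls one upstream so the router can evict dead
// backends before any client request reaches them.
type Prober struct {
	URL     string // assumed convention: "http://orders.internal:8080/healthz"
	healthy atomic.Bool
}

func (p *Prober) Healthy() bool { return p.healthy.Load() }

// Run polls forever. Any transport error or non-200 marks the backend
// unhealthy until the next successful probe.
func (p *Prober) Run(interval time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(interval) {
		resp, err := client.Get(p.URL)
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		p.healthy.Store(ok)
	}
}
```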
Failure Scenarios
Scenario 1: Gateway-Wide TLS Certificate Expiry - A Let's Encrypt auto-renewal job silently fails. When the cert expires, every client gets TLS handshake errors. All external traffic drops to zero within seconds. Mobile apps with certificate pinning fail even after renewal until users update the app.
Detection: monitor ssl_certificate_expiry_seconds and alert when it drops below 14 days. Run synthetic probes from external monitors (Datadog Synthetics, Pingdom).
Recovery: pre-stage backup certificates in a secrets manager, automate renewal with cert-manager, and keep a manual rotation runbook. Test it quarterly. Nobody wants to be figuring this out at 3am.
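Feeding that ssl_certificate_expiry_seconds metric is simple enough to do yourself: dial the endpoint and read the leaf certificate's NotAfter. The address below is a placeholder.

```go
package tlscheck

import (
	"crypto/tls"
	"time"
)

// SecondsUntilExpiry dials addr (e.g. the placeholder "api.example.com:443")
// and returns how long the presented leaf certificate stays valid. Export
// the result as ssl_certificate_expiry_seconds and alert below 14 days.
func SecondsUntilExpiry(addr string) (float64, error) {
	conn, err := tls.Dial("tcp", addr, nil)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	leaf := conn.ConnectionState().PeerCertificates[0]
	return time.Until(leaf.NotAfter).Seconds(), nil
}
```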
Scenario 2: Upstream Service Discovery Goes Stale - The gateway's route table stops getting updates from the service registry (say, a Consul agent crashes). Requests keep routing to decommissioned IPs, resulting in 502s on 30% of traffic.
Detection: track upstream_connect_failures_total and service_registry_last_sync_epoch. Alert if sync age exceeds 2x the expected refresh interval (typically anything over 60s).
Recovery: gateways should have a fallback static route table and aggressive health-check eviction. Mark backends unhealthy after 2 consecutive 5xx responses within 10s.
Scenario 3: Response Aggregation Cascade Timeout - A composite endpoint fans out to 4 services. One of them (let's say Recommendations) develops 8s P99 latency. Without per-upstream timeouts, the gateway holds connections open, exhausting the connection pool within minutes. Thread starvation causes all routes to degrade, not just the composite one. This is the kind of failure that ruins a weekend.
Detection: per-route P99 latency dashboards, connection pool utilization gauge (alert at >75%).
Recovery: set per-upstream timeouts (e.g., 2s hard cutoff), implement bulkhead isolation so one slow upstream can't starve shared resources, and configure circuit breakers that open after a 50% error rate over a 30s window.
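A sketch of that recovery pattern: fan-out with a 2s per-upstream cutoff, degrading instead of failing. The upstream URLs would be whatever the composite endpoint actually calls; everything else here is stdlib Go.

```go
package aggregate

import (
	"context"
	"io"
	"net/http"
	"sync"
	"time"
)

// fanOut calls each upstream with a 2s hard cutoff and returns whatever
// succeeded. A slow upstream loses its slot in the payload instead of
// holding gateway connections open.
func fanOut(ctx context.Context, urls []string) map[string][]byte {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string][]byte, len(urls))
	)
	for _, u := range urls {
		u := u // capture loop variable (pre-Go 1.22 semantics)
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Per-upstream timeout: the 2s hard cutoff from the recovery notes.
			callCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
			defer cancel()
			req, err := http.NewRequestWithContext(callCtx, http.MethodGet, u, nil)
			if err != nil {
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return // degrade: drop this upstream's data, don't fail the payload
			}
			defer resp.Body.Close()
			body, err := io.ReadAll(resp.Body)
			if err != nil {
				return
			}
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}()
	}
	wg.Wait()
	return results
}
```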
Capacity Planning
A single Kong or NGINX gateway instance typically handles 20,000-40,000 RPS with sub-5ms added latency on commodity hardware (4 vCPU, 8GB RAM). AWS API Gateway supports up to 10,000 RPS per region by default (soft limit, raisable to 100,000+).
| Metric | Threshold | Action |
|---|---|---|
| CPU utilization | > 60% sustained | Scale horizontally |
| P99 latency | > 50ms (excluding upstream) | Profile plugins, reduce chain |
| Connection pool usage | > 75% | Increase pool size or add instances |
| Error rate (5xx) | > 0.1% | Investigate upstream health |
| Memory | > 70% | Check for connection leaks, tune buffers |
Real-world numbers worth knowing: Netflix's Zuul 2 fleet handles ~1.5M RPS total, with each instance running about 83K RPS. Plan capacity at 3x the observed peak traffic to handle organic spikes and seasonal surges. The formula: required_instances = (peak_rps * 3) / per_instance_rps_at_60%_cpu. Always load-test the gateway with all active plugins enabled. Each plugin (auth, rate-limit, logging) adds 0.5-2ms of cumulative latency. It's surprising how fast that adds up.
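To make that formula concrete with illustrative numbers: suppose observed peak traffic is 120,000 RPS and one instance sustains 25,000 RPS at 60% CPU. Then required_instances = (120,000 × 3) / 25,000 = 14.4, so round up and provision 15 instances.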
Architecture Decision Record
ADR: Choosing an API Gateway Strategy
Context: The team needs to pick a gateway pattern that balances operational cost, latency, and flexibility.
| Criteria (Weight) | Managed (AWS APIGW) | Self-Hosted (Kong) | Mesh-Native (Envoy) |
|---|---|---|---|
| Ops overhead (25%) | Low, fully managed | Medium, requires tuning | High, needs deep Envoy expertise |
| Latency (25%) | 10-30ms added | 1-5ms added | 1-3ms added |
| Cost at 1M req/day (20%) | ~$105/mo | ~$50/mo (infra) | ~$50/mo (infra) |
| Cost at 1B req/day (20%) | ~$105K/mo | ~$2K/mo (infra) | ~$2K/mo (infra) |
| Plugin ecosystem (10%) | Limited | 100+ plugins | Filter-based, Lua/WASM |
Decision framework:
- Team < 20 engineers AND traffic < 50K RPM AND on AWS - Go with AWS API Gateway. The reduced ops burden is worth the per-request cost at this scale. Integration with Lambda, Cognito, and WAF saves weeks of glue code.
- Team 20-100 engineers AND traffic 50K-500K RPM - Deploy Kong or NGINX on Kubernetes with at least 3 replicas across 2 AZs. Plugin flexibility is needed for custom auth, transformation, and observability. This is the sweet spot for most mid-size teams.
- Team > 100 engineers OR traffic > 500K RPM OR multi-region - Build on Envoy with a control plane (Gloo Edge or custom xDS). At this scale, per-request managed pricing becomes absurd and sub-2ms gateway overhead is essential. Netflix, Uber, and Lyft all run Envoy-based custom gateways at this tier, and there's a reason for that.
- Hybrid pattern - Use a managed gateway for external/partner APIs (WAF and DDoS protection come with minimal effort) and a self-hosted gateway for internal east-west traffic (low latency, high throughput). This is honestly the most pragmatic approach for many organizations.
Key Points
- Single entry point for all client requests, centralizing cross-cutting concerns like auth and rate limiting
- Handles authentication, rate limiting, routing, and protocol translation in one place
- Can aggregate multiple microservice calls into a single client response
- Sits on the critical path. Must be highly available and low-latency or everything suffers
- Decouples client interface from internal service topology
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Kong | Open Source | Plugin ecosystem, Lua extensibility | Medium-Enterprise |
| AWS API Gateway | Managed | Serverless, Lambda integration | Small-Enterprise |
| Envoy | Open Source | Service mesh sidecar, gRPC-native | Large-Enterprise |
| NGINX | Open Source | High-performance reverse proxy | Small-Enterprise |
Common Mistakes
- Single point of failure without redundancy. Always deploy at least two instances behind a load balancer
- Putting business logic in the gateway layer. Keep it thin: route and validate, nothing else
- Not implementing circuit breakers for downstream service failures (see the sketch after this list)
- Ignoring tail latency. The gateway adds P99 overhead to every single request
- Skipping request/response transformation versioning, which breaks clients on deploy
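Since circuit breakers come up twice in this section and teams keep skipping them, here is a minimal consecutive-failure breaker. It's a sketch, not a production implementation; libraries like sony/gobreaker handle half-open probing and metrics properly.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is refusing calls.
var ErrOpen = errors.New("circuit open")

// Breaker trips after maxFails consecutive failures and stays open
// for cooldown before letting a trial request through.
type Breaker struct {
	mu       sync.Mutex
	fails    int
	maxFails int
	cooldown time.Duration
	openedAt time.Time
}

func New(maxFails int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFails: maxFails, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. A success resets the failure
// count; a failure increments it and may (re)open the breaker.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.fails >= b.maxFails && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast instead of piling onto a sick upstream
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openedAt = time.Now() // trial failure re-opens for another cooldown
		}
		return err
	}
	b.fails = 0
	return nil
}
```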