API Gateway
Why It Exists
With a monolith, every client talks to one server. Simple. The moment the architecture moves to microservices, clients suddenly need to know the address of every service, manage auth tokens per service, and handle retries individually. That's a mess.
The API Gateway fixes this. One URL, one TLS termination point, one place to enforce policies. Clients don't need to know (or care) how many services sit behind it.
How It Works
- Request Ingress - Client sends an HTTPS request to the gateway's public endpoint.
- Authentication - The gateway validates JWT/OAuth tokens before routing anything. Bad requests get killed at the edge, which saves backends from doing pointless work.
- Rate Limiting - A token bucket or sliding window algorithm checks per-client quotas. Exceed the limit and back comes a 429 Too Many Requests. No negotiation.
- Routing - Path-based or header-based rules map the request to an upstream service. Most modern gateways also support weighted routing, which is how canary deployments work. (A minimal sketch combining the auth, rate-limit, and routing steps appears after this list.)
- Request Transformation - Headers get added (trace IDs, tenant context), and bodies may be transformed (REST to gRPC, protocol buffers to JSON).
- Response Aggregation - For composite endpoints, the gateway fans out to multiple services, merges the responses, and returns a single payload to the client.
- Caching - Read-heavy endpoints that get hammered can be cached at the gateway layer with configurable TTLs. This is cheap and surprisingly effective.
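To make the authentication, rate-limiting, and routing steps concrete, here is a minimal Go sketch. Everything specific in it is an assumption: the orders.internal upstream, the stub validateToken, and the single global limiter (a real gateway validates actual JWTs and keys one bucket per client).

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"

	"golang.org/x/time/rate"
)

// validateToken is a stand-in for real JWT/OAuth validation (assumption:
// a real gateway verifies signature, expiry, and audience claims).
func validateToken(token string) bool {
	return token != ""
}

// newTraceID is a placeholder; real gateways use W3C traceparent or similar.
func newTraceID() string { return "trace-id-placeholder" }

func main() {
	// Hypothetical upstream; in practice this comes from service discovery.
	upstream, _ := url.Parse("http://orders.internal:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Token bucket: 100 req/s sustained, bursts up to 200.
	limiter := rate.NewLimiter(rate.Limit(100), 200)

	http.HandleFunc("/api/orders/", func(w http.ResponseWriter, r *http.Request) {
		// Authentication: reject bad tokens at the edge, before routing anything.
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !validateToken(token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// Rate limiting: over quota means 429, no negotiation.
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		// Transformation: attach context headers for downstream tracing.
		r.Header.Set("X-Trace-Id", newTraceID())
		// Routing: hand off to the upstream reverse proxy.
		proxy.ServeHTTP(w, r)
	})

	// TLS termination omitted; the real listener would be ListenAndServeTLS.
	http.ListenAndServe(":8080", nil)
}
```

The ordering is the point: reject cheaply (auth, quota) before paying for the proxy hop.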
Decision Criteria
| Factor | Managed (AWS APIGW) | Self-Hosted (Kong/Envoy) |
|---|---|---|
| Ops burden | Low | High |
| Customization | Limited | Full control |
| Latency | ~10-30ms overhead | ~1-5ms overhead |
| Cost at scale | Expensive (per-request) | Infrastructure cost only |
| Vendor lock-in | High | None |
Production Considerations
- Horizontal scaling - Gateways must be stateless. Store rate limit counters in Redis, not in-process memory (see the counter sketch after this list). The moment state goes into the gateway, it becomes a scaling headache.
- Health checks - Run active health checks against upstream services (a minimal prober sketch also follows this list). Passive checks catch failures after they happen. Active checks prevent routing to dead backends in the first place.
- Graceful degradation - When a downstream service gets slow, the gateway should time out and return a degraded response. Don't hold connections open waiting for a miracle.
- Observability - Every request should emit structured logs with correlation IDs, latency histograms, and error rates per upstream. If it's not possible to tell which upstream is misbehaving within 30 seconds, the observability is broken.
- Blue-green deployments - Gateway routing rules shift traffic between service versions for zero-downtime deploys. This is one of the gateway's most practical benefits, and it comes essentially for free.
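On the horizontal-scaling point: a shared fixed-window counter is about ten lines with go-redis. Because the count lives in Redis, adding or killing gateway replicas never resets a client's quota. The key scheme and parameters here are illustrative, not a standard.

```go
package ratelimit

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow is a fixed-window counter shared by every gateway replica.
// The key scheme "rl:<client>" is illustrative.
func Allow(ctx context.Context, rdb *redis.Client, clientID string, limit int64, window time.Duration) (bool, error) {
	key := "rl:" + clientID
	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First request of the window starts the TTL clock.
		rdb.Expire(ctx, key, window)
	}
	return count <= limit, nil
}
```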
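And the active health-check prober, as a sketch: the /healthz path, the 2s probe timeout, and the polling interval are assumptions, not any particular gateway's defaults. Kong and Envoy ship active checking built in; this just shows the shape.

```go
package health

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Prober actively polls one upstream so the router can evict dead
// backends before any client request reaches them.
type Prober struct {
	URL     string // assumed convention: "http://orders.internal:8080/healthz"
	healthy atomic.Bool
}

func (p *Prober) Healthy() bool { return p.healthy.Load() }

// Run polls forever. Any transport error or non-200 marks the backend
// unhealthy until the next successful probe.
func (p *Prober) Run(interval time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(interval) {
		resp, err := client.Get(p.URL)
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		p.healthy.Store(ok)
	}
}
```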
Failure Scenarios
Scenario 1: Gateway-Wide TLS Certificate Expiry - A Let's Encrypt auto-renewal job silently fails. When the cert expires, every client gets TLS handshake errors. All external traffic drops to zero within seconds. Mobile apps with certificate pinning fail even after renewal until users update the app.
Detection: monitor ssl_certificate_expiry_seconds and alert when it drops below 14 days. Run synthetic probes from external monitors (Datadog Synthetics, Pingdom).
Recovery: pre-stage backup certificates in a secrets manager, automate renewal with cert-manager, and keep a manual rotation runbook. Test it quarterly. Nobody wants to be figuring this out at 3am.
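Feeding that ssl_certificate_expiry_seconds metric is simple enough to do yourself: dial the endpoint and read the leaf certificate's NotAfter. The address below is a placeholder.

```go
package tlscheck

import (
	"crypto/tls"
	"time"
)

// SecondsUntilExpiry dials addr (e.g. the placeholder "api.example.com:443")
// and returns how long the presented leaf certificate stays valid. Export
// the result as ssl_certificate_expiry_seconds and alert below 14 days.
func SecondsUntilExpiry(addr string) (float64, error) {
	conn, err := tls.Dial("tcp", addr, nil)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	leaf := conn.ConnectionState().PeerCertificates[0]
	return time.Until(leaf.NotAfter).Seconds(), nil
}
```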
Scenario 2: Upstream Service Discovery Goes Stale - The gateway's route table stops getting updates from the service registry (say, a Consul agent crashes). Requests keep routing to decommissioned IPs, resulting in 502s on 30% of traffic.
Detection: track upstream_connect_failures_total and service_registry_last_sync_epoch. Alert if sync age exceeds 2x the expected refresh interval (typically anything over 60s).
Recovery: gateways should have a fallback static route table and aggressive health-check eviction. Mark backends unhealthy after 2 consecutive 5xx responses within 10s.
Scenario 3: Response Aggregation Cascade Timeout - A composite endpoint fans out to 4 services. One of them (let's say Recommendations) develops 8s P99 latency. Without per-upstream timeouts, the gateway holds connections open, exhausting the connection pool within minutes. Thread starvation causes all routes to degrade, not just the composite one. This is the kind of failure that ruins a weekend.
Detection: per-route P99 latency dashboards, connection pool utilization gauge (alert at >75%).
Recovery: set per-upstream timeouts (e.g., 2s hard cutoff), implement bulkhead isolation so one slow upstream can't starve shared resources, and configure circuit breakers that open after a 50% error rate over a 30s window.
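A sketch of that recovery pattern: fan-out with a 2s per-upstream cutoff, degrading instead of failing. The upstream URLs would be whatever the composite endpoint actually calls; everything else here is stdlib Go.

```go
package aggregate

import (
	"context"
	"io"
	"net/http"
	"sync"
	"time"
)

// fanOut calls each upstream with a 2s hard cutoff and returns whatever
// succeeded. A slow upstream loses its slot in the payload instead of
// holding gateway connections open.
func fanOut(ctx context.Context, urls []string) map[string][]byte {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string][]byte, len(urls))
	)
	for _, u := range urls {
		u := u // capture loop variable (pre-Go 1.22 semantics)
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Per-upstream timeout: the 2s hard cutoff from the recovery notes.
			callCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
			defer cancel()
			req, err := http.NewRequestWithContext(callCtx, http.MethodGet, u, nil)
			if err != nil {
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return // degrade: drop this upstream's data, don't fail the payload
			}
			defer resp.Body.Close()
			body, err := io.ReadAll(resp.Body)
			if err != nil {
				return
			}
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}()
	}
	wg.Wait()
	return results
}
```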
Capacity Planning
A single Kong or NGINX gateway instance typically handles 20,000-40,000 RPS with sub-5ms added latency on commodity hardware (4 vCPU, 8GB RAM). AWS API Gateway supports up to 10,000 RPS per region by default (soft limit, raisable to 100,000+).
| Metric | Threshold | Action |
|---|---|---|
| CPU utilization | > 60% sustained | Scale horizontally |
| P99 latency | > 50ms (excluding upstream) | Profile plugins, reduce chain |
| Connection pool usage | > 75% | Increase pool size or add instances |
| Error rate (5xx) | > 0.1% | Investigate upstream health |
| Memory | > 70% | Check for connection leaks, tune buffers |
Real-world numbers worth knowing: Netflix's Zuul 2 fleet handles ~1.5M RPS total, with each instance running about 83K RPS. Plan capacity at 3x the observed peak traffic to handle organic spikes and seasonal surges. The formula: required_instances = (peak_rps * 3) / per_instance_rps_at_60%_cpu. Always load-test the gateway with all active plugins enabled. Each plugin (auth, rate-limit, logging) adds 0.5-2ms of cumulative latency. It's surprising how fast that adds up.
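To make that formula concrete with illustrative numbers: suppose observed peak traffic is 120,000 RPS and one instance sustains 25,000 RPS at 60% CPU. Then required_instances = (120,000 × 3) / 25,000 = 14.4, so round up and provision 15 instances.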
Architecture Decision Record
ADR: Choosing an API Gateway Strategy
Context: The team needs to pick a gateway pattern that balances operational cost, latency, and flexibility.
| Criteria (Weight) | Managed (AWS APIGW) | Self-Hosted (Kong) | Mesh-Native (Envoy) |
|---|---|---|---|
| Ops overhead (25%) | Low, fully managed | Medium, requires tuning | High, needs deep Envoy expertise |
| Latency (25%) | 10-30ms added | 1-5ms added | 1-3ms added |
| Cost at 1M req/day (20%) | ~$105/mo | ~$50/mo (infra) | ~$50/mo (infra) |
| Cost at 1B req/day (20%) | ~$105K/mo | ~$2K/mo (infra) | ~$2K/mo (infra) |
| Plugin ecosystem (10%) | Limited | 100+ plugins | Filter-based, Lua/WASM |
Decision framework:
- Team < 20 engineers AND traffic < 50K RPM AND on AWS - Go with AWS API Gateway. The reduced ops burden is worth the per-request cost at this scale. Integration with Lambda, Cognito, and WAF saves weeks of glue code.
- Team 20-100 engineers AND traffic 50K-500K RPM - Deploy Kong or NGINX on Kubernetes with at least 3 replicas across 2 AZs. Plugin flexibility is needed for custom auth, transformation, and observability. This is the sweet spot for most mid-size teams.
- Team > 100 engineers OR traffic > 500K RPM OR multi-region - Build on Envoy with a control plane (Gloo Edge or custom xDS). At this scale, per-request managed pricing becomes absurd and sub-2ms gateway overhead is essential. Netflix, Uber, and Lyft all run Envoy-based custom gateways at this tier, and there's a reason for that.
- Hybrid pattern - Use a managed gateway for external/partner APIs (WAF and DDoS protection come with minimal effort) and a self-hosted gateway for internal east-west traffic (low latency, high throughput). This is honestly the most pragmatic approach for many organizations.
Key Points
- Single entry point for all client requests, centralizing cross-cutting concerns like auth and rate limiting
- Handles authentication, rate limiting, routing, and protocol translation in one place
- Can aggregate multiple microservice calls into a single client response
- Sits on the critical path. Must be highly available and low-latency or everything suffers
- Decouples client interface from internal service topology
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Kong | Open Source | Plugin ecosystem, Lua extensibility | Medium-Enterprise |
| AWS API Gateway | Managed | Serverless, Lambda integration | Small-Enterprise |
| Envoy | Open Source | Service mesh sidecar, gRPC-native | Large-Enterprise |
| NGINX | Open Source | High-performance reverse proxy | Small-Enterprise |
Common Mistakes
- Single point of failure without redundancy. Always deploy at least two instances behind a load balancer
- Putting business logic in the gateway layer. Keep it thin: route and validate, nothing else
- Not implementing circuit breakers for downstream service failures (see the sketch after this list)
- Ignoring tail latency. The gateway adds P99 overhead to every single request
- Skipping request/response transformation versioning, which breaks clients on deploy
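Since circuit breakers come up twice in this section and teams keep skipping them, here is a minimal consecutive-failure breaker. It's a sketch, not a production implementation; libraries like sony/gobreaker handle half-open probing and metrics properly.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is refusing calls.
var ErrOpen = errors.New("circuit open")

// Breaker trips after maxFails consecutive failures and stays open
// for cooldown before letting a trial request through.
type Breaker struct {
	mu       sync.Mutex
	fails    int
	maxFails int
	cooldown time.Duration
	openedAt time.Time
}

func New(maxFails int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFails: maxFails, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. A success resets the failure
// count; a failure increments it and may (re)open the breaker.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.fails >= b.maxFails && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast instead of piling onto a sick upstream
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openedAt = time.Now() // trial failure re-opens for another cooldown
		}
		return err
	}
	b.fails = 0
	return nil
}
```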