Why the API Gateway Became a Critical Control Plane
As systems decompose into microservices, the entry point where external traffic meets internal services becomes one of the highest-leverage locations in your architecture. A well-designed API gateway is not just a reverse proxy — it is the place where you enforce authentication, apply rate limits, shape traffic, observe every request, and route to the right service version without touching application code.
The patterns that make gateways production-grade are often underspecified. Most documentation covers basic routing. This article covers what actually matters in production: rate limiting algorithms, edge authentication with JWTs and API keys, circuit breaking, canary deployments, request/response transformation, and the operational model for tools like Kong Gateway, AWS API Gateway, and Envoy Proxy.
| Tool | Model | Config | Best for |
|---|---|---|---|
| Kong Gateway | OSS / Enterprise | Admin API / declarative YAML | Self-hosted, plugin ecosystem |
| AWS API Gateway | Managed SaaS | CloudFormation / CDK | Serverless Lambda backends |
| Envoy / Istio | OSS sidecar / mesh | xDS API / CRDs | Kubernetes, service mesh, mTLS |
| Traefik | OSS / Enterprise | Labels / CRDs / static config | Docker & Kubernetes auto-discovery |
Rate Limiting — Algorithms That Actually Matter in Production
Rate limiting at the gateway prevents upstream services from being overwhelmed, enforces per-customer quotas, and is the first line of defence against credential stuffing and scraping. The algorithm you choose has a significant impact on how fair the limiting feels to legitimate clients and on how bursty traffic is handled.
Token Bucket
The token bucket algorithm maintains a bucket of tokens that refills at a constant rate (e.g., 100 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected with HTTP 429. The bucket has a maximum capacity — burst headroom above the steady-state rate. This is the most widely implemented algorithm because it naturally allows short bursts while enforcing a long-term average.
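As a concrete illustration, here is a minimal in-process token bucket in Python. The class and parameter names are our own; a production gateway implements the same logic against shared storage (e.g., Redis) so that all replicas draw from one bucket:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # steady-state refill rate (tokens/second)
        self.capacity = capacity      # burst headroom above the steady-state rate
        self.tokens = capacity        # start full so initial bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Lazy refill: add tokens proportional to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False    # caller should respond with HTTP 429
```

A bucket with `rate=100, capacity=200` enforces a long-term average of 100 requests/second while tolerating a 200-request burst, which is exactly the burst-plus-average behavior described above.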
# Kong Gateway — rate limiting plugin via declarative config (deck)
# File: kong.yaml
services:
- name: user-api
url: http://user-service:8080
routes:
- name: user-api-route
paths:
- /api/v1/users
plugins:
- name: rate-limiting
config:
minute: 60 # 60 requests per minute steady-state
hour: 1000 # 1000 requests per hour cap
policy: redis # use Redis for distributed rate limiting
redis_host: redis
redis_port: 6379
redis_password: ${REDIS_PASSWORD}
limit_by: consumer # per authenticated consumer, not per IP
hide_client_headers: false
# Response headers added:
# X-RateLimit-Limit-Minute: 60
# X-RateLimit-Remaining-Minute: 45
# X-RateLimit-Reset-Minute: 1714089600
Sliding Window Counter
A fixed window counter (e.g., "100 requests per minute, resetting at :00") has a well-known edge case: a client can send 100 requests at :59 and 100 more at :01, effectively 200 requests in two seconds without triggering the limit. Sliding window counters fix this by computing the rate over a rolling window. The implementation uses two adjacent window counts weighted by the fraction of the current window that has elapsed.
import time
import redis
class SlidingWindowRateLimiter:
"""
Sliding window rate limiter backed by Redis.
Uses two fixed-window counters and weights them by elapsed fraction.
"""
def __init__(self, redis_client: redis.Redis, limit: int, window_seconds: int):
self.redis = redis_client
self.limit = limit
self.window = window_seconds
def is_allowed(self, key: str) -> tuple[bool, dict]:
now = time.time()
current_window = int(now // self.window)
previous_window = current_window - 1
elapsed_fraction = (now % self.window) / self.window
current_key = f"rl:{key}:{current_window}"
previous_key = f"rl:{key}:{previous_window}"
pipe = self.redis.pipeline()
pipe.get(current_key)
pipe.get(previous_key)
current_count_raw, previous_count_raw = pipe.execute()
current_count = int(current_count_raw or 0)
previous_count = int(previous_count_raw or 0)
# Weighted estimate: previous window contributes (1 - elapsed) fraction
estimated_count = previous_count * (1 - elapsed_fraction) + current_count
if estimated_count >= self.limit:
reset_at = (current_window + 1) * self.window
return False, {
"limit": self.limit,
"remaining": 0,
"reset": int(reset_at),
}
# Increment current window counter
pipe = self.redis.pipeline()
pipe.incr(current_key)
pipe.expire(current_key, self.window * 2) # keep for 2 windows
pipe.execute()
remaining = max(0, self.limit - int(estimated_count) - 1)
return True, {
"limit": self.limit,
"remaining": remaining,
"reset": int((current_window + 1) * self.window),
}
# FastAPI middleware usage
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
app = FastAPI()
limiter = SlidingWindowRateLimiter(
redis_client=redis.Redis(host="redis", port=6379),
limit=100,
window_seconds=60,
)
@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
# Use authenticated user ID if available, fall back to IP
client_key = request.headers.get("X-Consumer-ID") or request.client.host
allowed, headers = limiter.is_allowed(client_key)
if not allowed:
return JSONResponse(
status_code=429,
content={"error": "rate_limit_exceeded", "retry_after": headers["reset"]},
headers={
"X-RateLimit-Limit": str(headers["limit"]),
"X-RateLimit-Remaining": "0",
"X-RateLimit-Reset": str(headers["reset"]),
"Retry-After": str(headers["reset"] - int(time.time())),
},
)
response = await call_next(request)
response.headers["X-RateLimit-Limit"] = str(headers["limit"])
response.headers["X-RateLimit-Remaining"] = str(headers["remaining"])
response.headers["X-RateLimit-Reset"] = str(headers["reset"])
    return response
Authentication at the Edge — JWT Validation and API Keys
Authenticating at the gateway — before requests ever reach upstream services — eliminates auth duplication across every microservice and creates a single enforcement point. Two patterns dominate: JWT validation for user-facing APIs with short-lived tokens, and API key authentication for machine-to-machine integrations.
JWT Validation at the Gateway
The gateway validates the JWT signature against a public key (fetched from a JWKS endpoint), checks expiry and required claims, and forwards the decoded claims as request headers to upstream services. Upstream services trust the gateway — they do not re-validate the token. This pattern keeps cryptographic operations centralized and lets services focus on business logic.
# Kong JWT plugin — declarative configuration
# Validates RS256 tokens issued by your identity provider
plugins:
- name: jwt
config:
uri_param_names: [] # do not accept JWT in query params
cookie_names: [] # do not accept JWT in cookies
header_names:
- Authorization # Bearer token in Authorization header
claims_to_verify:
- exp # verify expiry
- nbf # verify not-before (if present)
key_claim_name: iss # use iss claim to look up the public key
secret_is_base64: false
anonymous: null # null = reject unauthenticated requests
run_on_preflight: true
---
# Consumer with associated RSA public key (JWKS-based setup)
# In production, use the OIDC plugin or an external JWKS URL instead
consumers:
- username: my-service
jwt_secrets:
- key: "https://auth.example.com" # matches iss claim in JWT
algorithm: RS256
rsa_public_key: |
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA...
      -----END PUBLIC KEY-----
# Nginx / OpenResty — inline JWT validation with lua-resty-jwt
# Alternative to a full gateway when you need custom claim routing
location /api/ {
access_by_lua_block {
local jwt = require "resty.jwt"
local cjson = require "cjson"
local auth_header = ngx.req.get_headers()["Authorization"]
if not auth_header or not auth_header:match("^Bearer ") then
ngx.status = 401
ngx.header["WWW-Authenticate"] = 'Bearer realm="api"'
ngx.say(cjson.encode({error = "missing_token"}))
return ngx.exit(401)
end
local token = auth_header:sub(8) -- strip "Bearer "
local verified = jwt:verify(
os.getenv("JWT_PUBLIC_KEY"),
token,
{
lifetime_grace_period = 30, -- 30s clock skew tolerance
valid_issuers = {"https://auth.example.com"},
valid_audiences = {"api.example.com"},
}
)
if not verified.verified then
ngx.status = 401
ngx.say(cjson.encode({error = "invalid_token", detail = verified.reason}))
return ngx.exit(401)
end
-- Forward decoded claims to upstream
ngx.req.set_header("X-Consumer-ID", verified.payload.sub)
ngx.req.set_header("X-Consumer-Roles", table.concat(verified.payload.roles or {}, ","))
ngx.req.set_header("X-Tenant-ID", verified.payload.tenant_id)
-- Remove raw Authorization header before forwarding (optional)
-- ngx.req.clear_header("Authorization")
}
proxy_pass http://upstream_service;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
API Key Authentication
API keys are long-lived opaque tokens suitable for server-to-server integrations where short-lived JWTs would add unnecessary complexity. The key is stored hashed in the gateway's database (never in plaintext), and the gateway performs a lookup on every request. For high-throughput APIs, cache the lookup result in Redis with a short TTL (30–60 seconds) to avoid database hotspots.
# Generating and storing API keys securely — Python example
import hashlib
import json
import secrets
from datetime import datetime, UTC
def generate_api_key() -> tuple[str, str]:
"""
Returns (raw_key, hashed_key).
raw_key is shown to the user once, hashed_key is stored.
"""
raw = secrets.token_urlsafe(32) # 256 bits of entropy
hashed = hashlib.sha256(raw.encode()).hexdigest()
return raw, hashed
def create_api_key_for_service(service_name: str, scopes: list[str], db) -> str:
raw_key, hashed_key = generate_api_key()
db.execute(
"""
INSERT INTO api_keys (key_hash, service_name, scopes, created_at, last_used_at)
VALUES ($1, $2, $3, $4, NULL)
""",
hashed_key, service_name, scopes, datetime.now(UTC),
)
# Return raw key — this is the ONLY time the client sees it
return raw_key
# Gateway middleware — validate incoming API key
async def validate_api_key(request: Request, db, cache: redis.Redis) -> dict:
    api_key = (
        request.headers.get("X-API-Key")
        # Query-string fallback is convenient for testing, but keys in query
        # params leak into access logs; prefer header-only in production.
        or request.query_params.get("api_key")
    )
if not api_key:
raise HTTPException(status_code=401, detail="missing_api_key")
key_hash = hashlib.sha256(api_key.encode()).hexdigest()
# Check Redis cache first
cached = await cache.get(f"apikey:{key_hash}")
if cached:
return json.loads(cached)
# Fall back to database
row = await db.fetchrow(
"SELECT service_name, scopes, revoked_at FROM api_keys WHERE key_hash = $1",
key_hash,
)
if not row or row["revoked_at"] is not None:
raise HTTPException(status_code=401, detail="invalid_api_key")
service_data = {"service": row["service_name"], "scopes": row["scopes"]}
# Cache for 60 seconds — short enough to respect revocations quickly
await cache.setex(f"apikey:{key_hash}", 60, json.dumps(service_data))
await db.execute(
"UPDATE api_keys SET last_used_at = NOW() WHERE key_hash = $1", key_hash
)
    return service_data
Circuit Breaking — Failing Fast Before Cascading Failures
When an upstream service becomes slow or unhealthy and there is no circuit breaking, the gateway keeps forwarding requests: threads accumulate waiting for responses, connection pools exhaust, and the failure cascades back through every caller. Circuit breaking interrupts this by tracking failure rates and temporarily rejecting requests to unhealthy upstreams, giving them time to recover.
The classic circuit breaker state machine has three states: Closed (normal operation, failures tracked), Open (requests rejected immediately, upstream gets recovery time), and Half-Open (a probe request is allowed through — if it succeeds, the circuit closes; if it fails, it re-opens).
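That state machine is small enough to sketch directly. The Python below is illustrative (class names and thresholds are ours, not any particular gateway's API); real gateways implement the equivalent natively:

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation, failures tracked
    OPEN = "open"            # reject immediately, give the upstream recovery time
    HALF_OPEN = "half_open"  # allow a single probe request through

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.consecutive_failures = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN   # let one probe through
                return True
            return False                       # fail fast, no upstream call
        return True                            # CLOSED or HALF_OPEN

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.state = State.CLOSED              # probe succeeded, resume normal flow

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if (self.state is State.HALF_OPEN
                or self.consecutive_failures >= self.failure_threshold):
            self.state = State.OPEN            # trip, or re-trip after a failed probe
            self.opened_at = time.monotonic()
```

The gateway calls `allow_request()` before forwarding and `record_success()` / `record_failure()` after each upstream response; everything rejected while Open gets an immediate 503 without touching the upstream.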
# Envoy circuit breaker configuration — cluster-level
# Applied to all requests routed to the upstream cluster
clusters:
- name: user_service
type: STRICT_DNS
load_assignment:
cluster_name: user_service
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: user-service
port_value: 8080
circuit_breakers:
thresholds:
- priority: DEFAULT
max_connections: 100 # max concurrent TCP connections
max_pending_requests: 50 # max requests queued while waiting for connection
max_requests: 200 # max concurrent active requests
max_retries: 3 # max concurrent retries in flight
track_remaining: true # expose stats on remaining capacity
# Outlier detection — ejects unhealthy hosts from the load balancer
outlier_detection:
consecutive_5xx: 5 # eject after 5 consecutive 5xx responses
interval: 10s # evaluation interval
base_ejection_time: 30s # minimum ejection duration
max_ejection_percent: 50 # never eject more than 50% of hosts
success_rate_minimum_hosts: 3 # need at least 3 hosts to compute success rate
success_rate_request_volume: 100
success_rate_stdev_factor: 1900
# Upstream connection timeout settings
connect_timeout: 2s
upstream_connection_options:
tcp_keepalive:
keepalive_probes: 3
keepalive_time: 30
        keepalive_interval: 5
# Kong circuit breaker via the upstream health checks config
# Passive health checks detect failures from actual traffic
upstreams:
- name: user-service-upstream
algorithm: round-robin
healthchecks:
passive:
healthy:
http_statuses: [200, 201, 202, 204]
successes: 3 # 3 consecutive successes re-marks target healthy
unhealthy:
http_statuses: [429, 500, 502, 503, 504]
http_failures: 5 # 5 failures marks target unhealthy
tcp_failures: 2
timeouts: 3
active:
healthy:
interval: 10 # probe every 10s when healthy
http_statuses: [200, 204]
successes: 2
unhealthy:
interval: 5 # probe every 5s when unhealthy
http_statuses: [429, 500, 503]
http_failures: 3
http_path: /health
timeout: 2
concurrency: 5
targets:
- target: user-service-1:8080
weight: 100
- target: user-service-2:8080
    weight: 100
Canary Deployments and Traffic Shaping at the Gateway
The gateway is the ideal place to implement progressive delivery — routing a small percentage of traffic to a new service version before a full rollout. This decouples deployment (shipping the new binary) from release (sending traffic to it), and lets you validate the new version with real traffic while limiting blast radius.
There are two approaches: weight-based routing (route X% of all requests to v2 regardless of requester) and header-based routing (route requests with a specific header to v2, used for internal testing or beta user groups). Both can be combined.
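One refinement worth noting on top of a plain percentage split: hashing a stable identifier makes the assignment sticky, so a given user sees the same version on every request instead of flipping between v1 and v2. A sketch (the function name and salt are ours, for illustration):

```python
import hashlib

def canary_bucket(user_id: str, canary_percent: int,
                  salt: str = "user-api-rollout") -> str:
    """Deterministically assign ~canary_percent% of users to v2.

    Hashing (salt, user_id) spreads users uniformly over 100 buckets;
    changing the salt reshuffles assignments for a fresh rollout.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "v2" if bucket < canary_percent else "v1"
```

Ramping the rollout is then just raising `canary_percent`: a user whose bucket was below the old threshold stays in v2 as the threshold grows, so nobody bounces back to v1 mid-rollout.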
# Kubernetes Gateway API — HTTPRoute for canary deployment
# Routes 10% of traffic to the new service version
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: user-api-canary
namespace: production
spec:
parentRefs:
- name: main-gateway
sectionName: https
hostnames:
- api.example.com
rules:
# Header-based routing — internal/beta users always hit v2
- matches:
- headers:
- name: X-Feature-Flag
value: canary-v2
backendRefs:
- name: user-service-v2
port: 8080
weight: 100
# Weight-based split — 10% canary, 90% stable
- matches:
- path:
type: PathPrefix
value: /api/v1/users
backendRefs:
- name: user-service-v1
port: 8080
weight: 90
- name: user-service-v2
port: 8080
        weight: 10
# Istio VirtualService — fine-grained traffic shaping
# Combines header matching, weight-based splitting, and fault injection for testing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: user-service
namespace: production
spec:
hosts:
- user-service
http:
# Internal testers: 100% to v2
- match:
- headers:
x-canary-user:
exact: "true"
route:
- destination:
host: user-service
subset: v2
weight: 100
# Fault injection for chaos testing — 5% of requests get 500ms delay
- match:
- headers:
x-chaos-test:
exact: "true"
fault:
delay:
percentage:
value: 5.0
fixedDelay: 500ms
route:
- destination:
host: user-service
subset: v1
# Default: 95% stable, 5% canary
- route:
- destination:
host: user-service
subset: v1
weight: 95
- destination:
host: user-service
subset: v2
weight: 5
timeout: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: gateway-error,connect-failure,retriable-4xx
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: user-service
namespace: production
spec:
host: user-service
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
      version: v2
Request and Response Transformation
Gateways can modify requests before they reach upstream services and responses before they reach clients. Common use cases: adding correlation IDs, stripping internal headers, normalizing paths across API versions, injecting upstream auth headers, and rewriting response bodies to mask internal service details.
# Kong request-transformer plugin — add/remove/replace headers and body fields
plugins:
- name: request-transformer
config:
add:
headers:
- "X-Request-ID:${uuid()}" # inject correlation ID
- "X-Forwarded-Tenant:${consumer.custom_id}"
- "X-Internal-Auth:Bearer ${env.INTERNAL_SERVICE_TOKEN}"
querystring: []
remove:
headers:
- Authorization # strip external auth before forwarding
- X-Forwarded-For # handled by gateway
replace:
headers:
- "Host:internal-user-service.svc.cluster.local"
- name: response-transformer
config:
remove:
headers:
- X-Powered-By # don't expose internal stack
- Server # don't expose server version
- X-Internal-Request-ID # strip internal headers from response
add:
headers:
- "X-Request-ID:${request_id}" # echo back for client correlation
      - "Cache-Control:no-store" # force no-cache on authenticated responses
Observability — What to Instrument at the Gateway Layer
The gateway sees every request, making it the ideal place to emit traces, metrics, and structured logs. The key metrics to track at the gateway level, per route and per consumer:
Request rate and error rate
Track requests_total labeled by route, method, and status code family (2xx, 4xx, 5xx). Error rate = 5xx / total. Alert when error rate exceeds 1% for 5 minutes. Track 429 rate separately — a spike indicates either a legitimate traffic surge or a client misbehaving.
Latency percentiles
Track p50, p95, p99 latency per route. The gap between p95 and p99 reveals tail latency. Alert on p99 latency exceeding your SLO. Track upstream latency separately from gateway overhead — if gateway overhead exceeds 5ms p99, investigate middleware or plugin ordering.
Rate limit hit rate
Track the ratio of rate-limited (429) to total requests per consumer. A consumer consistently hitting rate limits may need quota increases or may indicate a client bug (e.g., retry storms). Track this metric to proactively reach out before it becomes a support incident.
Upstream connection pool saturation
Track the number of pending requests waiting for a connection to the upstream pool. If this metric is consistently above zero, the connection pool is undersized for your traffic or the upstream is too slow. This is the leading indicator before circuit breakers trigger.
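These four signals translate directly into alerting and recording rules. A sketch in Prometheus rule syntax, assuming the metric names emitted by Kong's Prometheus plugin (thresholds here are illustrative; use your own SLOs):

```yaml
groups:
  - name: api-gateway
    rules:
      # Error rate = 5xx / total, per service, alerting per the 1%-for-5m guideline
      - alert: GatewayErrorRateHigh
        expr: |
          sum(rate(kong_http_requests_total{code=~"5.."}[5m])) by (service)
            /
          sum(rate(kong_http_requests_total[5m])) by (service) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 1% on {{ $labels.service }}"
      # p99 upstream latency, per service, precomputed from the latency histogram
      - record: service:upstream_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(kong_latency_bucket{type="upstream"}[5m])) by (service, le))
```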
# Kong Prometheus plugin — export gateway metrics to Prometheus
plugins:
- name: prometheus
config:
status_code_metrics: true # per-route status code counters
latency_metrics: true # gateway + upstream latency histograms
bandwidth_metrics: true # bytes in/out per route
upstream_health_metrics: true
# Resulting metric examples (scrape at /metrics on Kong's metrics port 8001):
#
# kong_http_requests_total{service="user-api",route="user-api-route",
# method="GET",code="200"} 15432
#
# kong_latency_bucket{service="user-api",type="kong",le="5"} 14000
# kong_latency_bucket{service="user-api",type="upstream",le="50"} 13800
#
# kong_upstream_target_health{upstream="user-service-upstream",
# target="user-service-1:8080",address="10.0.1.1:8080",
# state="healthchecks_off|healthy|unhealthy|dns_error"} 1
Production Hardening Checklist
Run the gateway in HA mode with distributed state
A single gateway instance is a single point of failure. Run at least two replicas behind a load balancer. For rate limiting and session state, use Redis (not in-process memory) so all replicas share the same counters. Test failover by killing a gateway pod and verifying traffic continues without disruption.
Set explicit timeouts on every upstream
Without explicit timeouts, a slow upstream holds connections indefinitely, exhausting the gateway's connection pool. Set three timeout values per upstream: connection timeout (how long to wait for a TCP connection — typically 1–2s), read timeout (how long to wait for the first byte of the response — typically 5–30s depending on the endpoint), and write timeout (how long to wait for the client to send the request body).
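In Kong, for example, these map onto three service-level fields, all in milliseconds (the values below are illustrative starting points, not universal recommendations):

```yaml
services:
  - name: user-api
    url: http://user-service:8080
    connect_timeout: 2000   # TCP connect: fail fast when the upstream is unreachable
    read_timeout: 10000     # time allowed waiting for response bytes
    write_timeout: 10000    # time allowed sending the request to the upstream
```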
Implement request size limits
Without body size limits, a malicious client can send a multi-gigabyte payload that buffers in gateway memory. Set a global maximum request body size (e.g., 10MB) with a lower limit on specific endpoints that handle only small JSON payloads. Kong's request-size-limiting plugin handles this; Nginx's client_max_body_size directive works similarly.
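With Kong's plugin this is a few lines of declarative config (the 10 MB figure mirrors the example above; tighten it per route where payloads are known to be small):

```yaml
plugins:
  - name: request-size-limiting
    config:
      allowed_payload_size: 10       # reject bodies larger than 10 size_units
      size_unit: megabytes
      require_content_length: false  # do not reject requests that omit Content-Length
```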
Validate and version your gateway configuration
Gateway configuration should live in version control and go through CI. For Kong, use deck validate to lint declarative configs before applying. For AWS API Gateway, use CDK or CloudFormation with staged deployment and rollback. Never apply gateway config changes directly in production consoles — config drift is a production reliability risk.
Separate admin and data plane traffic
Kong's admin API (port 8001 by default) must never be exposed to the public internet — it allows full configuration changes without authentication unless explicitly secured. Bind admin API listeners to a private network interface or loop it through an internal-only load balancer. Apply mutual TLS and RBAC if multiple teams need admin API access.
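In kong.conf terms, that separation looks roughly like this (a loopback-only admin listener; expose it internally through a private load balancer if remote access is needed):

```ini
# kong.conf (sketch): data plane on all interfaces, admin API on loopback only
proxy_listen = 0.0.0.0:8000, 0.0.0.0:8443 ssl
admin_listen = 127.0.0.1:8001, 127.0.0.1:8444 ssl
```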
Plan your plugin execution order carefully
Plugins execute in a defined order. In Kong, authentication plugins run before rate limiting — which means unauthenticated requests are rejected before consuming a rate limit slot (correct behavior). Verify that your plugin ordering matches your intended policy: auth first, then rate limiting by consumer, then request transformation, then routing. Incorrect ordering can create security gaps (rate limiting before auth allows unauthenticated requests to exhaust quotas).
Designing or hardening an API gateway for your microservices platform?
We design and implement production-grade API gateway architectures — from rate limiting strategies and edge authentication to circuit breaking, canary routing, and full observability. Let’s talk.
Get in Touch