Every public API and service eventually needs rate limiting. The naive version — “max N requests per minute” — is rarely what production needs. Choosing between token bucket, leaky bucket, fixed window, and sliding window algorithms determines how your service behaves under burst, sustained load, and edge cases.
This post compares the algorithms at a practical level: how each works, where each shines, what they look like in code, and how to choose for your use case.
Why Rate Limiting Exists
Rate limiting protects:
- Capacity — your service from being overwhelmed.
- Costs — your downstream services and infrastructure bills.
- Fairness — heavy users from starving others.
- Abuse — automated tools from hammering endpoints.
The goal is to allow legitimate use while throttling or rejecting excessive use. The algorithm choice determines how strict or lenient the boundary feels to the user.
Fixed Window
The simplest algorithm. Divide time into fixed windows (e.g., 1-minute buckets). Count requests in the current window. If count exceeds the limit, reject.
const counts: Record<string, { window: number, count: number }> = {}
function isAllowed(key: string, limit: number, windowMs: number): boolean {
const now = Date.now()
const currentWindow = Math.floor(now / windowMs)
const entry = counts[key]
if (!entry || entry.window !== currentWindow) {
counts[key] = { window: currentWindow, count: 1 }
return true
}
if (entry.count >= limit) return false
entry.count++
return true
}
Pros:
- Trivial to implement and reason about.
- O(1) memory per key.
- Fast.
Cons:
- Boundary problem. A user can make 100 requests in the last second of one window and 100 more in the first second of the next — 200 requests in 2 seconds while ostensibly limited to 100/minute.
- Bursty rejections at window boundaries.
For most “rough rate limit” use cases, fixed window is fine. For strict enforcement, look at sliding window.
Sliding Window Log
Track every request’s timestamp. To check if a new request is allowed, count requests in the last windowMs milliseconds.
const logs: Record<string, number[]> = {}
function isAllowed(key: string, limit: number, windowMs: number): boolean {
const now = Date.now()
const cutoff = now - windowMs
const log = logs[key] ?? []
const recent = log.filter(t => t > cutoff)
if (recent.length >= limit) {
logs[key] = recent
return false
}
recent.push(now)
logs[key] = recent
return true
}
Pros:
- Perfectly accurate. No boundary problems.
- Easy to reason about.
Cons:
- O(N) memory per key, where N = limit.
- For 10,000 keys with 1000 RPM limits, that’s 10 million timestamps in memory.
- Sorting / filtering on every check.
For low-volume or high-stakes limits, sliding window log works well. For massive scale, the memory blows up.
Sliding Window Counter
A compromise. Use two counters: one for the current window, one for the previous. Estimate the count “if the window were exactly N milliseconds back from now” by weighting the two.
function isAllowed(key: string, limit: number, windowMs: number): boolean {
const now = Date.now()
const currentWindow = Math.floor(now / windowMs)
const positionInWindow = (now % windowMs) / windowMs
const entry = state[key] ?? { current: 0, currentWindow, previous: 0 }
if (entry.currentWindow !== currentWindow) {
entry.previous = entry.currentWindow === currentWindow - 1 ? entry.current : 0
entry.current = 0
entry.currentWindow = currentWindow
}
const weightedCount = entry.previous * (1 - positionInWindow) + entry.current
if (weightedCount >= limit) {
state[key] = entry
return false
}
entry.current++
state[key] = entry
return true
}
Pros:
- O(1) memory per key.
- Much smoother than fixed window.
- Approximately accurate (~5% drift from true sliding).
Cons:
- Approximate, not exact.
- Slightly more complex than fixed window.
For most production use cases, sliding window counter is the sweet spot. Cloudflare and many large rate-limiting systems use variations of this.
Token Bucket
A different mental model: each key has a “bucket” that holds up to N tokens. Tokens are added at rate R per second. Each request consumes one token. If the bucket is empty, reject.
function isAllowed(key: string, capacity: number, refillPerMs: number): boolean {
const now = Date.now()
const bucket = buckets[key] ?? { tokens: capacity, lastRefill: now }
// Refill tokens based on elapsed time
const elapsed = now - bucket.lastRefill
bucket.tokens = Math.min(capacity, bucket.tokens + elapsed * refillPerMs)
bucket.lastRefill = now
if (bucket.tokens < 1) {
buckets[key] = bucket
return false
}
bucket.tokens--
buckets[key] = bucket
return true
}
Pros:
- Allows bursts up to the bucket capacity, then steady rate.
- O(1) memory.
- Maps cleanly to “sustained rate + burst allowance.”
Cons:
- Two parameters to tune (capacity and rate) instead of one.
- The “burst allowance” can mask sustained problems if capacity is too high.
For APIs where bursts are expected (e.g., a user logging in and immediately fetching 10 resources), token bucket feels natural. AWS API rate limiting uses token bucket extensively.
Leaky Bucket
Conceptually similar but inverts the model: requests enter a bucket; the bucket “leaks” at a constant rate. If the bucket overflows, reject.
function isAllowed(key: string, capacity: number, leakPerMs: number): boolean {
const now = Date.now()
const bucket = buckets[key] ?? { level: 0, lastLeak: now }
// Drain the bucket based on elapsed time
const elapsed = now - bucket.lastLeak
bucket.level = Math.max(0, bucket.level - elapsed * leakPerMs)
bucket.lastLeak = now
if (bucket.level >= capacity) {
buckets[key] = bucket
return false
}
bucket.level++
buckets[key] = bucket
return true
}
Pros:
- Smooths out bursts. The output rate is constant.
- Good for systems that need a predictable output rate.
Cons:
- Rejects more aggressively than token bucket (no burst allowance).
- Less intuitive for “user-facing” rate limits.
Leaky bucket is the classic choice for request queuing — you want a constant outflow rate to a downstream system regardless of inflow bursts. Token bucket is the classic choice for request limiting — allow some burst, throttle sustained excess.
Comparing in a Concrete Scenario
“Limit users to 60 requests per minute” with each algorithm:
Fixed window (60/min)
- 0:00:00 — user does 60 requests. OK.
- 0:00:01 — user does 1 more. Rejected.
- 0:01:00 — counter resets. Available again.
- Edge case: user does 60 at 0:00:59 and 60 at 0:01:00 — 120 requests in 2 seconds.
Sliding window log (60/60s window)
- 0:00:00 — user does 60 requests. OK.
- 0:00:01 — rejected (60 in last 60s).
- 0:00:59 — rejected.
- 0:01:01 — user can do 60 more (oldest fell out of window).
- No edge cases. Perfectly fair.
Token bucket (capacity 60, refill 1/sec)
- 0:00:00 — user does 60 requests (drains bucket). OK.
- 0:00:01 — 1 token regenerated. 1 more OK.
- 0:00:02 — 1 more.
- After 60 seconds, 60 tokens back.
- Allows initial burst of 60 then sustained 1/sec.
Leaky bucket (capacity 60, leak 1/sec)
- 0:00:00 — user does 60 requests (fills bucket). OK.
- 0:00:01 — bucket leaks 1, allowing 1 more. OK.
- Steady 1/sec from then.
- Smooths output to 1/sec.
Each algorithm handles the same target throughput differently. Pick based on what behavior you want.
Distributed Rate Limiting
Single-process rate limiting is easy. Multi-instance rate limiting is harder because each instance needs to coordinate.
Option A: Redis-backed
Centralize state in Redis. Each instance does atomic increment-and-check operations.
async function isAllowed(key: string, limit: number, windowMs: number): Promise<boolean> {
const count = await redis.incr(`rl:${key}`)
if (count === 1) await redis.pexpire(`rl:${key}`, windowMs)
return count <= limit
}
This is a fixed-window counter. Variations for sliding window or token bucket are straightforward but more complex.
Option B: Approximate distributed
Each instance keeps a local counter and periodically syncs to a shared source. Approximate; faster than centralized; can over- or under-shoot the limit slightly.
Option C: At the edge
Push rate limiting to your CDN (Cloudflare Rate Limiting, AWS WAF Rate Limiting). The CDN sees all traffic and enforces centrally before your origin sees the request.
For high-scale services in 2026, edge rate limiting is usually the right answer. Your application code becomes simpler; the CDN does the heavy lifting.
Identifying Users to Rate Limit
What you rate-limit by matters:
- Per IP — Simple. Vulnerable to CGNAT mixing legitimate users.
- Per API key — Best for authenticated APIs.
- Per account — Better than per-IP for user-facing apps.
- Per device fingerprint — Best for anti-abuse on unauthenticated endpoints.
- Per endpoint + IP — Tighter limits on expensive endpoints.
For rate-limiting an IP-lookup API specifically, see rate limiting IP lookup API.
Communicating Limits
Standard HTTP headers for telling clients about rate limits:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 23
X-RateLimit-Reset: 1700000000
Or the newer standardized headers (RFC 9421):
RateLimit: limit=60, remaining=23, reset=42
When rejecting:
HTTP/1.1 429 Too Many Requests
Retry-After: 60
The Retry-After should be honored by well-behaved clients (and SDKs that wrap your API).
Common Pitfalls
Per-instance limits in a multi-instance service
You set a 100 RPM limit. You have 10 instances. Effective limit is 1000 RPM. Centralize state or accept the over-shoot.
Counting too soon
If you count before the work succeeds, failed requests still consume budget. Depends on your goal — sometimes that’s right, sometimes not.
Forgetting cleanup
Old keys accumulate. Use Redis TTLs or periodic cleanup to prevent memory bloat.
Treating all endpoints the same
A login endpoint should be tighter than a static-content endpoint. Per-endpoint limits, with different thresholds, beat one global limit.
Hard rejections instead of degradation
Rejecting with 429 is harsh. For interactive UX, consider degrading (slower response, less data, queued processing) instead of rejecting outright.
TL;DR
- Fixed window — simplest; boundary problems at window edges.
- Sliding window log — exact but O(N) memory.
- Sliding window counter — approximate but O(1) memory; the production sweet spot.
- Token bucket — allows bursts then steady rate; good for user-facing APIs.
- Leaky bucket — smooths output to constant rate; good for queue-style systems.
- Distributed: Redis-backed or CDN-edge for scale.
- Communicate limits via standard headers.
- Per-endpoint tuning beats one global limit.
Rate limiting is one of those infrastructure pieces that’s invisible when it works and catastrophic when it doesn’t. Picking the right algorithm for your use case isn’t huge (most are interchangeable for typical APIs) but understanding the trade-offs helps you predict edge-case behavior. For application-specific patterns, see rate limiting IP lookup API; for the broader defensive picture, DDoS protection basics.