How to Design a Rate Limiter: Algorithms, Architecture, and Trade-offs


On This Page
What a rate limiter actually does
Why rate limiting matters (the business case)
Requirements gathering
Capacity estimation
The four rate limiting algorithms
- Token bucket
- Leaky bucket
- Fixed window counter
- Sliding window
Picking among the four
API design
Data model and storage
Architecture: where does the rate limiter live?
Pattern 1: API gateway rate limiting
Pattern 2: Middleware rate limiting in the service
Pattern 3: Sidecar rate limiter
The distributed rate limiting problem
Problem: Redis becomes a bottleneck
Problem: Redis latency adds to every request
Problem: Redis fails
Problem: consistency across regions
How real companies do it
Common interview follow-up questions
Putting it all together
Keep learning
"How would you design a rate limiter?"
That question has shown up in more FAANG system design interviews than almost any other component-level question. It's asked because rate limiting sits at the intersection of almost every system design concept you care about: distributed counters, consistency trade-offs, Redis, algorithmic choice, fault tolerance, graceful degradation. If you can design a rate limiter cleanly, you've demonstrated competence across half the system design curriculum in one 45-minute conversation.
This guide is the full answer. By the time you're done, you'll know:
- What a rate limiter does and why every non-trivial system has one
- The four main rate limiting algorithms — how each works, with pseudocode, and when to pick which
- How real companies (Stripe, GitHub, Twitter) build rate limiters at scale
- The hardest part: distributed rate limiting across many servers
- How to answer the interview question step by step, following a senior-engineer structure
Let's dig in.
What a rate limiter actually does
A rate limiter is a mechanism that caps the number of requests or actions a client can perform within a time window. When a client exceeds the limit, the rate limiter rejects the excess (usually with an HTTP 429 status code — "Too Many Requests").
Think of it as a bouncer at a club. The club has capacity for 500 people. The bouncer counts who's inside, and once the count hits 500, new arrivals wait outside. When someone leaves, the bouncer lets one more in.
In production systems, rate limiting shows up everywhere:
- API gateways throttling requests per client API key (Stripe: 100 requests/sec per account)
- Login endpoints preventing brute-force attacks (5 attempts per username per minute)
- Write-heavy endpoints protecting expensive operations (post creation, file uploads)
- Expensive queries limiting resource-intensive calls (search, aggregation, exports)
- External service calls staying under third-party API quotas (Twilio, SendGrid, payment providers)
- CDN and DDoS protection absorbing traffic spikes at the edge (Cloudflare, AWS Shield)
Without rate limiting, one misbehaving client — intentional or accidental — can take down your whole service. The classic example: a customer's script gets stuck in a loop and hammers your API with 10,000 requests per second. Without a rate limiter, this brings down every other customer too. With a rate limiter, only the misbehaving customer sees errors.
Why rate limiting matters (the business case)
Before algorithms, the "why" matters — interviewers listen for whether you understand the business reasons, not just the mechanics.
Prevent overload and downtime. A traffic spike (intentional or a Denial-of-Service attack) can cascade through your system in seconds. Rate limiting is the first line of defense — it caps traffic before it reaches the vulnerable parts of your stack.
Fair resource allocation. On a shared service, one noisy client can monopolize resources and degrade everyone else's experience. Rate limiting ensures each client gets a fair share.
Security against abuse. Credential stuffing, web scraping, enumeration attacks — they all rely on sending lots of requests fast. Rate limiting at the right layer kills these attacks structurally, not just via detection.
Cost control. Every request costs you — CPU, bandwidth, database load, downstream API calls. If you call a paid API (Twilio, OpenAI, Stripe) on every user request and one user goes into a loop, the unbounded cost is yours. Rate limiting caps that.
Contractual enforcement. If you sell tiered API access (free = 100 req/day, pro = 10,000 req/day, enterprise = unlimited), rate limiting is literally how you enforce the tier.
In an interview, lead with these reasons briefly, then get to the design. Candidates who jump straight to algorithms without explaining the "why" signal that they've memorized material rather than internalized it.
Requirements gathering
For any case-study-shaped interview question, start with requirements. The full technique is covered in how to gather and prioritize requirements. For a rate limiter:
Functional requirements:
- Limit requests per client (by user ID, IP, API key, or a composite key)
- Support configurable limits (different limits for different endpoints or client tiers)
- Return meaningful responses when the limit is hit (429 status + Retry-After header)
- Support multiple time windows (per-second, per-minute, per-hour, per-day limits coexisting)
Non-functional requirements:
- Very low latency. Every request passes through the rate limiter, so it has to add minimal overhead — target <1ms per check.
- High availability. If the rate limiter goes down, the whole API becomes unreachable (or worse, uncapped). It must be at least as available as the service it protects.
- Distributed-safe. The service runs on hundreds of servers; the rate limiter must work correctly when checks happen in parallel across machines.
- Accurate enough. Perfect accuracy is expensive. A rate limiter that's 99% accurate is usually acceptable; one that drifts by 20% under load is not.
Out of scope (for initial design):
- Complex abuse detection (that's a separate anti-abuse service)
- Request prioritization or fair queuing under load (adjacent problem)
- Long-term quota management across billing periods (billing system, not rate limiter)
Capacity estimation
Concrete numbers matter. Imagine we're building the rate limiter behind an API gateway handling Twitter-scale traffic:
- 500M daily active users
- Average user makes 100 API calls/day → 50B requests/day
- Peak QPS: 50B / 86,400 × 3 (peak multiplier) ≈ 1.7M QPS
- Each rate limit check: need ~50 bytes to track (key + counter + timestamp)
- Active rate-limit keys at any moment: ~500M (one per user)
- Total memory: 500M × 50 bytes = ~25 GB
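A quick sanity check of those numbers in Python (a throwaway sketch; the 3x peak multiplier and the 50-byte-per-key figure are the assumptions stated above):
daily_requests = 500_000_000 * 100      # 500M users x 100 calls/day = 50B requests/day
avg_qps = daily_requests / 86_400       # ~580K QPS on average
peak_qps = avg_qps * 3                  # ~1.7M QPS at peak (assumed 3x multiplier)
state_bytes = 500_000_000 * 50          # one ~50-byte entry per active user
print(f"avg ~{avg_qps:,.0f} QPS, peak ~{peak_qps:,.0f} QPS, state ~{state_bytes / 1e9:.0f} GB")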
Two critical implications:
- At 1.7M QPS, the rate limiter can't involve a disk read. Every check has to be in-memory. This is why Redis is the canonical choice — in-memory, atomic operations, sub-millisecond latency.
- 25 GB of state is too much for a single Redis node. We need Redis clustering or sharding. More on that in the distributed section.
For the full estimation technique, see back-of-the-envelope estimation.
The four rate limiting algorithms
This is the technical core. There are four widely-used rate limiting algorithms, each with a different trade-off profile. In an interview, you should be able to name all four, explain the trade-offs, and justify your choice.
1. Token bucket
The most popular algorithm, used by AWS, Stripe, and most modern API gateways.
How it works: Each client has a "bucket" of tokens. Tokens refill at a constant rate (say, 10 tokens per second). Every request consumes one token. If the bucket is empty, the request is rejected.
on request(client_id):
    bucket = get_or_create_bucket(client_id)
    refill(bucket)  # add tokens based on time elapsed
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return ALLOW
    else:
        return DENY

refill(bucket):
    now = current_time()
    elapsed = now - bucket.last_refill
    new_tokens = elapsed * refill_rate
    bucket.tokens = min(bucket.capacity, bucket.tokens + new_tokens)
    bucket.last_refill = now
Key property: allows bursts. If a client hasn't made requests for a while, their bucket fills up to capacity. They can then spend the accumulated tokens in a quick burst. This is usually what you want — real users don't send perfectly even traffic.
Parameters: bucket capacity (max tokens) + refill rate. Example: capacity=100, refill=10/sec gives "average 10/sec, burst up to 100."
Strengths: flexible, handles bursts naturally, memory-efficient (one bucket per client). Weaknesses: under sustained overload, can allow more than the nominal rate (bursts add up).
When to use: API rate limiting where bursts are acceptable. This is the default choice for almost all modern APIs.
2. Leaky bucket
A close cousin of token bucket with different burst semantics.
How it works: Think of a bucket with a hole in the bottom. Incoming requests fill the bucket; the hole lets them drain at a constant rate. If the bucket overflows, excess requests are rejected.
on request(client_id):
    bucket = get_or_create_bucket(client_id)
    drain(bucket)  # remove requests based on time elapsed
    if bucket.size < bucket.capacity:
        bucket.size += 1
        return ALLOW  # request queued for processing at drain rate
    else:
        return DENY

drain(bucket):
    now = current_time()
    elapsed = now - bucket.last_drain
    drained = elapsed * drain_rate
    bucket.size = max(0, bucket.size - drained)
    bucket.last_drain = now
Key property: enforces smooth output rate. Unlike token bucket, leaky bucket processes requests at a constant rate regardless of input bursts. Incoming bursts get queued (or rejected if queue overflows), but downstream systems see a smooth request stream.
Strengths: smooth output, protects downstream systems from bursty load. Weaknesses: can add latency (requests wait in the bucket), doesn't absorb bursts as gracefully as token bucket from the client's perspective.
When to use: when downstream systems need smooth input (e.g., rate-limiting writes to a database that can't handle bursts). Less common than token bucket for general API rate limiting.
3. Fixed window counter
The simplest algorithm — and the one with the most famous edge case.
How it works: Divide time into fixed windows (e.g., 1-minute windows). For each client, count requests in the current window. If count exceeds the limit, reject.
on request(client_id):
    current_window = floor(now / window_size)
    key = client_id + ":" + current_window
    count = increment(key)
    if count > limit:
        return DENY
    else:
        return ALLOW
Strengths: dead simple, minimal memory (one counter per client per window), easy to implement in Redis with INCR + EXPIRE.
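For illustration, that check might look roughly like this with a Redis client (a minimal sketch assuming redis-py; the key naming, default limit, and window size are made up for the example):
import time
import redis

r = redis.Redis()

def fixed_window_allow(client_id: str, limit: int = 100, window_size: int = 60) -> bool:
    window_id = int(time.time() // window_size)
    key = f"rl:{client_id}:{window_id}"
    count = r.incr(key)              # atomic increment; creates the key at 1 if missing
    if count == 1:
        r.expire(key, window_size)   # let the counter expire once the window has passed
    return count <= limit
In a production setup you would set the expiry in the same atomic step (for example via a short Lua script) so a crash between the INCR and the EXPIRE can't leave behind a counter that never expires.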
The famous edge case — burst at window boundaries. If the limit is 100 per minute, a client could make 100 requests in the last second of one window AND 100 requests in the first second of the next window — effectively 200 requests in 2 seconds, double the intended rate. For strict rate limiting, this is unacceptable.
When to use: when you need simplicity and the boundary-burst issue doesn't matter. Good for non-critical throttling (e.g., UI-level rate limiting, user-facing search suggestions).
4. Sliding window
The accurate answer that fixes the fixed-window edge case. Two variants:
Sliding window log: Store the timestamp of every request in the last window. On each request, count timestamps newer than now - window_size. Accurate, but memory grows with request rate.
on request(client_id):
    log = get_log(client_id)
    cutoff = now - window_size
    log.remove_older_than(cutoff)
    if len(log) < limit:
        log.add(now)
        return ALLOW
    else:
        return DENY
Sliding window counter: Weighted combination of current and previous fixed-window counts. Approximates the sliding log with O(1) memory.
on request(client_id):
    current = count_in_current_window(client_id)
    previous = count_in_previous_window(client_id)
    window_progress = (now % window_size) / window_size
    # weight the previous window by how much of it overlaps "the last window from now"
    estimated = previous * (1 - window_progress) + current
    if estimated < limit:
        increment_current(client_id)
        return ALLOW
    else:
        return DENY
Strengths: much more accurate than fixed window, no boundary-burst problem. Weaknesses: log variant is memory-heavy; counter variant is an approximation (off by a few percent under adversarial patterns).
When to use: when fixed window's inaccuracy matters — billing-linked limits, security-critical limits (login attempts), tight SLAs with customers.
Picking among the four
The decision usually looks like:
- Token bucket — default choice for API rate limiting (allows bursts, memory-efficient, widely supported)
- Leaky bucket — when you specifically need smooth output (protecting fragile downstream systems)
- Fixed window counter — simple throttling where accuracy doesn't matter
- Sliding window — when boundary-burst accuracy matters (security, billing, strict SLAs)
In an interview, lead with token bucket and explicitly compare against the others. "I'd use token bucket because it handles bursts naturally and is simple to implement in Redis. If we needed stricter accuracy — say this were a login rate limit for security — I'd switch to sliding window log."
API design
The rate limiter exposes a minimal API to the services in front of it:
POST /v1/ratelimit/check
Body: { client_id, endpoint, [optional: cost] }
Returns: {
    allowed: bool,
    remaining: int,
    reset_at: timestamp
}
Three things to note:
- The cost parameter. Not every request consumes one token — an expensive search might cost 10 tokens, a cheap GET might cost 1. Passing the cost gives the caller control without changing the API.
- The response contains remaining and reset_at. Clients can surface these as response headers (X-RateLimit-Remaining, X-RateLimit-Reset) so well-behaved clients self-throttle.
- This is a synchronous check. Every request waits for the response. That's why latency matters so much — the rate limiter latency is added to every API call.
When the check denies a request, the caller should return HTTP 429 Too Many Requests with a Retry-After header indicating when the client can retry.
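As a purely illustrative example, a middleware that consumes the check result might build the denied response roughly like this (the header names follow the common X-RateLimit-* convention mentioned above; the response shape is a placeholder, not any particular framework's API):
import time

def to_429_response(check: dict) -> dict:
    # check is the body returned by /v1/ratelimit/check: {allowed, remaining, reset_at}
    retry_after = max(0, int(check["reset_at"] - time.time()))
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(retry_after),                   # seconds until the client may retry
            "X-RateLimit-Remaining": str(check["remaining"]),
            "X-RateLimit-Reset": str(int(check["reset_at"])),
        },
        "body": '{"error": "rate_limited"}',
    }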
For the deeper treatment of API design, see mastering the API interview.
Data model and storage
At 1.7M QPS, the storage has to be:
- In-memory (disk reads are too slow)
- Atomic (multi-threaded checks must not race)
- Distributed-safe (state must be shared across all rate limiter instances)
- TTL-aware (we don't want to store counters for clients that went inactive forever)
That's Redis, essentially by definition. For the Redis deep dive, see Redis in system design.
Redis data structures for each algorithm:
- Fixed window: INCR client:endpoint:window_id with EXPIRE at window end. One command, atomic, trivial.
- Token bucket: Hash per client with fields tokens and last_refill. Use a Lua script to do refill + consume atomically (critical — otherwise two concurrent checks can both pass when only one should).
- Sliding window log: Sorted set keyed by client. Members are timestamps, scores are timestamps. Use ZADD to add, ZREMRANGEBYSCORE to trim old entries, ZCARD to count.
- Sliding window counter: Two counters per client (current + previous window) with weighted logic in the check.
For all algorithms beyond fixed window, use Lua scripts for atomic multi-step logic. Redis guarantees single-threaded execution of Lua scripts — eliminating race conditions that would otherwise require distributed locks.
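To make that concrete, here is a sketch of a token bucket refill-plus-consume script wrapped in redis-py. It is illustrative rather than a drop-in implementation: the key prefix, the one-hour idle TTL, and passing the caller's clock are simplifying assumptions (a production script would typically read the clock from Redis's own TIME), and the variadic HSET assumes a reasonably recent Redis.
import time
import redis

r = redis.Redis()

TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(state[1])
local last   = tonumber(state[2])
if tokens == nil then
  tokens = capacity   -- first request from this client: start with a full bucket
  last = now
end

tokens = math.min(capacity, tokens + (now - last) * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', KEYS[1], 3600)   -- drop state for clients idle for an hour
return allowed
"""

token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(client_id: str, capacity: int = 100, refill_rate: float = 10.0, cost: int = 1) -> bool:
    # Redis executes the whole script atomically, so refill + consume cannot race.
    result = token_bucket(keys=[f"tb:{client_id}"],
                          args=[capacity, refill_rate, time.time(), cost])
    return result == 1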
Architecture: where does the rate limiter live?
Three common placement patterns, each with trade-offs:
Pattern 1: API gateway rate limiting
Rate limiting lives in the API gateway (Nginx, Kong, AWS API Gateway, Cloudflare), in front of all services.
Pros: Traffic never reaches your services if it's rate-limited — maximum protection. Centralized configuration. Usually has built-in rate limiting features you don't have to build yourself.
Cons: Gateway becomes a single point of contention. Custom rate limiting logic (per-endpoint costs, tier-specific limits) may require gateway extensions.
When to use: As the default for public-facing APIs. Combine with the next pattern for per-service control.
Pattern 2: Middleware rate limiting in the service
Rate limiting is a library/middleware inside each service (e.g., a middleware in Express.js, a filter in Spring Boot).
Pros: Service-specific logic, easy customization, no separate infrastructure.
Cons: Each service has to implement/maintain rate limiting. Traffic already hit your service before being rate-limited.
When to use: For internal service-to-service rate limiting where the overhead of a separate gateway isn't justified.
Pattern 3: Sidecar rate limiter
A separate rate limiter service (process or container) that every service calls before processing a request. Envoy's rate limiting service is the canonical example.
Pros: Centralized logic, language-independent, updatable without touching services.
Cons: Extra network hop per request (latency). Another service to operate.
When to use: Large microservice architectures where centralization matters and latency budget allows.
The common production pattern: Pattern 1 at the edge (coarse rate limiting) + Pattern 2 or 3 internally (fine-grained per-service rate limiting). Two layers, each doing what it's best at.
For the deeper architecture discussion, see load balancer vs reverse proxy vs API gateway.
The distributed rate limiting problem
Now for the part that actually makes this a senior-level interview question. Everything above assumes a single rate limiter instance. In production, you have hundreds of service instances, each potentially doing their own rate limit checks. This creates real correctness problems.
The naive approach — per-instance rate limiting — breaks immediately. If the limit is "100 req/sec per user" and you have 10 service instances, each maintaining its own counter, a user could actually get 1,000 req/sec (100 on each instance). No good.
The standard solution — centralized Redis. All rate limit state lives in a single Redis cluster. Every service instance checks Redis on every request. Redis handles the atomicity via Lua scripts or INCR.
This works, but raises new problems:
Problem: Redis becomes a bottleneck
At 1.7M QPS, a single Redis node can't handle it. Solution: Redis Cluster with sharding by client_id. Each rate limit key lives on one shard; cluster fans out across many nodes. Add shards as QPS grows.
Problem: Redis latency adds to every request
Even with Redis at sub-millisecond, at 1.7M QPS you're doing 1.7M network round trips per second. Solutions:
- Connection pooling to amortize TCP overhead
- Pipelining multiple checks into one round trip where possible
- Local caching with periodic Redis sync — each service keeps a small local bucket, syncs with Redis every few hundred requests. Trades a small amount of accuracy for significant latency savings. Used by Stripe's rate limiter.
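A very rough sketch of that third idea, under simplifying assumptions (redis-py, a fixed one-minute window, and requests "borrowed" from the shared counter in batches of 50; names and numbers are illustrative):
import time
import redis

r = redis.Redis()
BATCH = 50      # requests borrowed from the shared counter per Redis round trip
WINDOW = 60     # one-minute fixed window for the shared counter

_local: dict[str, tuple[int, int]] = {}   # client_id -> (window_id, requests left locally)

def allow(client_id: str, limit: int) -> bool:
    window_id = int(time.time() // WINDOW)
    win, left = _local.get(client_id, (window_id, 0))
    if win == window_id and left > 0:
        _local[client_id] = (window_id, left - 1)   # served locally, no Redis round trip
        return True
    key = f"rl:{client_id}:{window_id}"
    total = r.incrby(key, BATCH)        # claim a batch from the shared counter
    r.expire(key, WINDOW)
    if total - BATCH >= limit:          # shared limit already exhausted before this batch
        return False
    _local[client_id] = (window_id, BATCH - 1)
    return True
Only about one request in fifty pays the Redis round trip; the cost is that near the limit each instance can over-allow by up to one batch, which is exactly the accuracy-for-latency trade described above.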
Problem: Redis fails
When Redis is down, you have three options, each a trade-off:
- Fail closed (deny all requests). Most correct — you genuinely don't know if the request is allowed. Costs availability.
- Fail open (allow all requests). Preserves availability. Costs protection — a DDoS during a Redis outage goes through.
- Fall back to local counters per instance. Allows more than the intended limit, but puts an upper cap on damage. Usually the right answer.
Most production rate limiters fail open or fall back to local counters; failing closed means an outage of your rate limiter takes your whole API down with it, which is a bad outcome.
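In code, the fallback decision might look something like this sketch (window handling and TTLs are omitted, and the per-instance share is something you would configure, for example the limit divided by the expected instance count):
import redis

r = redis.Redis()
_fallback_counts: dict[str, int] = {}    # coarse per-instance counters used only during outages

def check_with_fallback(client_id: str, limit: int, instance_share: int) -> bool:
    try:
        count = r.incr(f"rl:{client_id}")            # normal path: shared Redis counter
        return count <= limit
    except redis.ConnectionError:
        # Redis is unreachable: fall back to a local counter. Each instance enforces
        # only its own share, so the system over-allows, but the damage is bounded.
        _fallback_counts[client_id] = _fallback_counts.get(client_id, 0) + 1
        return _fallback_counts[client_id] <= instance_share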
This trade-off connects directly to CAP theorem and PACELC — rate limiters usually pick availability over perfect consistency, accepting that during network partitions they'll over-allow rather than under-allow.
Problem: consistency across regions
If your service runs in multiple regions, do you want a global rate limit or a per-region limit? Global is harder (cross-region Redis replication is slow) but more correct. Per-region is easy but can allow more than the nominal global limit.
Most systems use per-region rate limits with periodic reconciliation, accepting the inaccuracy. If you truly need strict global limits, you accept the latency hit of cross-region coordination.
How real companies do it
Anchoring your interview answer in real-world practice is powerful. A few documented approaches:
Stripe uses token bucket with Redis-backed state, heavily tuned for latency. They famously wrote about pushing Redis to handle hundreds of thousands of rate limit checks per second from thousands of workers, using local caches + Redis reconciliation.
GitHub's API uses fixed-window counters exposed via standard HTTP headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset). They also use a secondary rate limit for anti-abuse — a sliding window that catches burst patterns the fixed window misses.
Twitter (now X) uses a multi-tier approach — coarse gateway-level limiting at the edge, per-endpoint limits internally, and aggressive rate limiting on write endpoints vs read endpoints.
AWS API Gateway uses token bucket with configurable burst and steady-state rates. You configure two numbers: the sustained rate and the burst capacity.
Common interview follow-up questions
Once you've designed the initial rate limiter, expect these follow-ups:
- "How do you handle a hot shard in the Redis cluster?" (Composite keys with write sharding — if user_id "celebrity" is a hot key, split it into "celebrity:0", "celebrity:1", ... "celebrity:9", aggregate on read.)
- "What happens when the Redis cluster becomes unavailable?" (Fallback strategy — local counters + log for reconciliation, or fail-open with monitoring.)
- "How do you support different limits for different customer tiers?" (Store limits in a separate config store, looked up per-client on check. Cache config aggressively.)
- "How does this interact with autoscaling?" (When new service instances spin up, they should reuse the shared Redis state — the rate limit is per-user, not per-instance.)
- "What's the memory footprint at scale?" (Walk through the math — 500M users × 50 bytes ≈ 25GB, need multi-shard Redis.)
- "How do you prevent clients from bypassing the rate limiter by using many IPs?" (Rate limit by account/API key primarily, fall back to IP for unauthenticated requests, use device fingerprinting for abuse detection — but that's a separate anti-abuse system.)
- "How do you rate-limit by multiple dimensions simultaneously (per-user AND per-endpoint AND per-IP)?" (Multiple parallel checks with AND logic — deny if any limit is exceeded. Each check is its own Redis key.)
- "How would you design a rate limiter for a high-traffic event — Black Friday, ticket sales?" (Pre-provision Redis capacity, implement priority tiers, use request queuing instead of rejection for premium users. The ticketing system case study covers the flash-sale version of this.)
If you can handle six of these eight without hesitating, you've nailed the interview.
Putting it all together
The one-sentence version: Rate limiting caps request rate per client to protect your system. Use token bucket by default, back it with Redis for distributed state, accept small accuracy trade-offs in exchange for availability during failures, and place the rate limiter at the API gateway for maximum reach.
In an interview, structure your answer like this:
- Clarify requirements out loud (what counts as a client, what are the limits, what's the latency budget)
- Do rough capacity estimation (QPS, memory, Redis sizing)
- Pick an algorithm (token bucket by default, explain why you'd pick differently)
- Describe the data model and storage (Redis, data structure per algorithm, atomicity via Lua)
- Place the rate limiter architecturally (gateway + service middleware, usually both)
- Discuss distributed correctness (centralized Redis, local cache + sync, Redis failure strategy)
- Handle the follow-ups gracefully — and notice that most of them are about failure modes, not happy paths
That structure separates juniors ("I'd use token bucket") from seniors ("I'd use token bucket for typical API rate limiting, back it with Redis Cluster sharded by client_id, implement the refill+consume logic in a Lua script for atomicity, fall back to per-instance counters if Redis goes unavailable, and expose standard X-RateLimit-* headers for well-behaved clients to self-throttle").
Good luck with your next interview.
Keep learning
Related posts for the topics this guide touched:
- Redis Guide for System Design — the deep dive on the storage backend that makes rate limiting work.
- Caching for System Design — related concept; rate limiting and caching overlap in several patterns.
- Mastering the API Interview — rate limiting is a core API design topic.
- Load Balancer vs Reverse Proxy vs API Gateway — where rate limiting fits in the request path.
- High Availability in System Design — relevant for rate limiter failure modes.
- CAP Theorem vs PACELC — the framework for understanding the consistency trade-offs in distributed rate limiting.
- Grokking Idempotency — related correctness concept for safe retries after 429s.
- How to Design a Ticketing System — a case study where rate limiting is the critical component (flash sales).
- Rate limiting is one of the 15+ case studies in Grokking System Design.
For the full system design interview roadmap, start with my complete system design interview guide.