How to Design a Rate Limiter: Algorithms, Architecture, and Trade-offs


On This Page
What a rate limiter actually does
Why rate limiting matters (the business case)
Requirements gathering
Capacity estimation
The four rate limiting algorithms
- Token bucket
- Leaky bucket
- Fixed window counter
- Sliding window
Picking among the four
API design
Data model and storage
Architecture: where does the rate limiter live?
Pattern 1: API gateway rate limiting
Pattern 2: Middleware rate limiting in the service
Pattern 3: Sidecar rate limiter
The distributed rate limiting problem
Problem: Redis becomes a bottleneck
Problem: Redis latency adds to every request
Problem: Redis fails
Problem: consistency across regions
How real companies do it
Common interview follow-up questions
Putting it all together
Keep learning
"How would you design a rate limiter?"
That question has shown up in more FAANG system design interviews than almost any other component-level question. It's asked because rate limiting sits at the intersection of almost every system design concept you care about: distributed counters, consistency trade-offs, Redis, algorithmic choice, fault tolerance, graceful degradation. If you can design a rate limiter cleanly, you've demonstrated competence across half the system design curriculum in one 45-minute conversation.
This guide is the full answer. By the time you're done, you'll know:
- What a rate limiter does and why every non-trivial system has one
- The four main rate limiting algorithms — how each works, with pseudocode, and when to pick which
- How real companies (Stripe, GitHub, Twitter) build rate limiters at scale
- The hardest part: distributed rate limiting across many servers
- How to answer the interview question step by step, following a senior-engineer structure
Let's dig in.
What a rate limiter actually does
A rate limiter is a mechanism that caps the number of requests or actions a client can perform within a time window. When a client exceeds the limit, the rate limiter rejects the excess (usually with an HTTP 429 status code — "Too Many Requests").
Think of it as a bouncer at a club. The club has capacity for 500 people. The bouncer counts who's inside, and once the count hits 500, new arrivals wait outside. When someone leaves, the bouncer lets one more in.
In production systems, rate limiting shows up everywhere:
- API gateways throttling requests per client API key (Stripe: 100 requests/sec per account)
- Login endpoints preventing brute-force attacks (5 attempts per username per minute)
- Write-heavy endpoints protecting expensive operations (post creation, file uploads)
- Expensive queries limiting resource-intensive calls (search, aggregation, exports)
- External service calls staying under third-party API quotas (Twilio, SendGrid, payment providers)
- CDN and DDoS protection absorbing traffic spikes at the edge (Cloudflare, AWS Shield)
Without rate limiting, one misbehaving client — intentional or accidental — can take down your whole service. The classic example: a customer's script gets stuck in a loop and hammers your API with 10,000 requests per second. Without a rate limiter, this brings down every other customer too. With a rate limiter, only the misbehaving customer sees errors.
Why rate limiting matters (the business case)
Before algorithms, the "why" matters — interviewers listen for whether you understand the business reasons, not just the mechanics.
Prevent overload and downtime. A traffic spike (intentional or a Denial-of-Service attack) can cascade through your system in seconds. Rate limiting is the first line of defense — it caps traffic before it reaches the vulnerable parts of your stack.
Fair resource allocation. On a shared service, one noisy client can monopolize resources and degrade everyone else's experience. Rate limiting ensures each client gets a fair share.
Security against abuse. Credential stuffing, web scraping, enumeration attacks — they all rely on sending lots of requests fast. Rate limiting at the right layer kills these attacks structurally, not just via detection.
Cost control. Every request costs you — CPU, bandwidth, database load, downstream API calls. If you call a paid API (Twilio, OpenAI, Stripe) on every user request and one user goes into a loop, the unbounded cost is yours. Rate limiting caps that.
Contractual enforcement. If you sell tiered API access (free = 100 req/day, pro = 10,000 req/day, enterprise = unlimited), rate limiting is literally how you enforce the tier.
In an interview, lead with these reasons briefly, then get to the design. Candidates who jump straight to algorithms without explaining the "why" signal that they've memorized material rather than internalized it.
Requirements gathering
For any case-study-shaped interview question, start with requirements. The full technique is covered in how to gather and prioritize requirements. For a rate limiter:
Functional requirements:
- Limit requests per client (by user ID, IP, API key, or a composite key)
- Support configurable limits (different limits for different endpoints or client tiers)
- Return meaningful responses when the limit is hit (429 status + Retry-After header)
- Support multiple time windows (per-second, per-minute, per-hour, per-day limits coexisting)
Non-functional requirements:
- Very low latency. Every request passes through the rate limiter, so it has to add minimal overhead — target <1ms per check.
- High availability. If the rate limiter goes down, the whole API becomes unreachable (or worse, uncapped). It must be at least as available as the service it protects.
- Distributed-safe. The service runs on hundreds of servers; the rate limiter must work correctly when checks happen in parallel across machines.
- Accurate enough. Perfect accuracy is expensive. A rate limiter that's 99% accurate is usually acceptable; one that drifts by 20% under load is not.
Out of scope (for initial design):
- Complex abuse detection (that's a separate anti-abuse service)
- Request prioritization or fair queuing under load (adjacent problem)
- Long-term quota management across billing periods (billing system, not rate limiter)
Capacity estimation
Concrete numbers matter. Imagine we're building the rate limiter behind an API gateway handling Twitter-scale traffic:
- 500M daily active users
- Average user makes 100 API calls/day → 50B requests/day
- Peak QPS: 50B / 86,400 × 3 (peak multiplier) ≈ 1.7M QPS
- Each rate limit check: need ~50 bytes to track (key + counter + timestamp)
- Active rate-limit keys at any moment: ~500M (one per user)
- Total memory: 500M × 50 bytes = ~25 GB
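A quick sanity check of those numbers in Python (a throwaway sketch; the 3x peak multiplier and the 50-byte-per-key figure are the assumptions stated above):
daily_requests = 500_000_000 * 100      # 500M users x 100 calls/day = 50B requests/day
avg_qps = daily_requests / 86_400       # ~580K QPS on average
peak_qps = avg_qps * 3                  # ~1.7M QPS at peak (assumed 3x multiplier)
state_bytes = 500_000_000 * 50          # one ~50-byte entry per active user
print(f"avg ~{avg_qps:,.0f} QPS, peak ~{peak_qps:,.0f} QPS, state ~{state_bytes / 1e9:.0f} GB")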
Two critical implications:
- At 1.7M QPS, the rate limiter can't involve a disk read. Every check has to be in-memory. This is why Redis is the canonical choice — in-memory, atomic operations, sub-millisecond latency.
- 25 GB of state is too much for a single Redis node. We need Redis clustering or sharding. More on that in the distributed section.
For the full estimation technique, see back-of-the-envelope estimation.
The four rate limiting algorithms
This is the technical core. There are four widely-used rate limiting algorithms, each with a different trade-off profile. In an interview, you should be able to name all four, explain the trade-offs, and justify your choice.
1. Token bucket
The most popular algorithm, used by AWS, Stripe, and most modern API gateways.
How it works: Each client has a "bucket" of tokens. Tokens refill at a constant rate (say, 10 tokens per second). Every request consumes one token. If the bucket is empty, the request is rejected.
on request(client_id):
    bucket = get_or_create_bucket(client_id)
    refill(bucket)  # add tokens based on time elapsed
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return ALLOW
    else:
        return DENY

refill(bucket):
    now = current_time()
    elapsed = now - bucket.last_refill
    new_tokens = elapsed * refill_rate
    bucket.tokens = min(bucket.capacity, bucket.tokens + new_tokens)
    bucket.last_refill = now
Key property: allows bursts. If a client hasn't made requests for a while, their bucket fills up to capacity. They can then spend the accumulated tokens in a quick burst. This is usually what you want — real users don't send perfectly even traffic.
Parameters: bucket capacity (max tokens) + refill rate. Example: capacity=100, refill=10/sec gives "average 10/sec, burst up to 100."
Strengths: flexible, handles bursts naturally, memory-efficient (one bucket per client). Weaknesses: under sustained overload, can allow more than the nominal rate (bursts add up).
When to use: API rate limiting where bursts are acceptable. This is the default choice for almost all modern APIs.
2. Leaky bucket
A close cousin of token bucket with different burst semantics.
How it works: Think of a bucket with a hole in the bottom. Incoming requests fill the bucket; the hole lets them drain at a constant rate. If the bucket overflows, excess requests are rejected.
on request(client_id):
    bucket = get_or_create_bucket(client_id)
    drain(bucket)  # remove requests based on time elapsed
    if bucket.size < bucket.capacity:
        bucket.size += 1
        return ALLOW  # request queued for processing at drain rate
    else:
        return DENY

drain(bucket):
    now = current_time()
    elapsed = now - bucket.last_drain
    drained = elapsed * drain_rate
    bucket.size = max(0, bucket.size - drained)
    bucket.last_drain = now
Key property: enforces smooth output rate. Unlike token bucket, leaky bucket processes requests at a constant rate regardless of input bursts. Incoming bursts get queued (or rejected if queue overflows), but downstream systems see a smooth request stream.
Strengths: smooth output, protects downstream systems from bursty load. Weaknesses: can add latency (requests wait in the bucket), doesn't absorb bursts as gracefully as token bucket from the client's perspective.
When to use: when downstream systems need smooth input (e.g., rate-limiting writes to a database that can't handle bursts). Less common than token bucket for general API rate limiting.
3. Fixed window counter
The simplest algorithm — and the one with the most famous edge case.
How it works: Divide time into fixed windows (e.g., 1-minute windows). For each client, count requests in the current window. If count exceeds the limit, reject.
on request(client_id):
    current_window = floor(now / window_size)
    key = client_id + ":" + current_window
    count = increment(key)
    if count > limit:
        return DENY
    else:
        return ALLOW
Strengths: dead simple, minimal memory (one counter per client per window), easy to implement in Redis with INCR + EXPIRE.
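For illustration, that check might look roughly like this with a Redis client (a minimal sketch assuming redis-py; the key naming, default limit, and window size are made up for the example):
import time
import redis

r = redis.Redis()

def fixed_window_allow(client_id: str, limit: int = 100, window_size: int = 60) -> bool:
    window_id = int(time.time() // window_size)
    key = f"rl:{client_id}:{window_id}"
    count = r.incr(key)              # atomic increment; creates the key at 1 if missing
    if count == 1:
        r.expire(key, window_size)   # let the counter expire once the window has passed
    return count <= limit
In a production setup you would set the expiry in the same atomic step (for example via a short Lua script) so a crash between the INCR and the EXPIRE can't leave behind a counter that never expires.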
The famous edge case — burst at window boundaries. If the limit is 100 per minute, a client could make 100 requests in the last second of one window AND 100 requests in the first second of the next window — effectively 200 requests in 2 seconds, double the intended rate. For strict rate limiting, this is unacceptable.
When to use: when you need simplicity and the boundary-burst issue doesn't matter. Good for non-critical throttling (e.g., UI-level rate limiting, user-facing search suggestions).
4. Sliding window
The accurate answer that fixes the fixed-window edge case. Two variants:
Sliding window log: Store the timestamp of every request in the last window. On each request, count timestamps newer than now - window_size. Accurate, but memory grows with request rate.
on request(client_id):
    log = get_log(client_id)
    cutoff = now - window_size
    log.remove_older_than(cutoff)
    if len(log) < limit:
        log.add(now)
        return ALLOW
    else:
        return DENY
Sliding window counter: Weighted combination of current and previous fixed-window counts. Approximates the sliding log with O(1) memory.
on request(client_id):
    current = count_in_current_window(client_id)
    previous = count_in_previous_window(client_id)
    window_progress = (now % window_size) / window_size
    # weight the previous window by how much of it overlaps "the last window from now"
    estimated = previous * (1 - window_progress) + current
    if estimated < limit:
        increment_current(client_id)
        return ALLOW
    else:
        return DENY
Strengths: much more accurate than fixed window, no boundary-burst problem. Weaknesses: log variant is memory-heavy; counter variant is an approximation (off by a few percent under adversarial patterns).
When to use: when fixed window's inaccuracy matters — billing-linked limits, security-critical limits (login attempts), tight SLAs with customers.
Picking among the four
The decision usually looks like:
- Token bucket — default choice for API rate limiting (allows bursts, memory-efficient, widely supported)
- Leaky bucket — when you specifically need smooth output (protecting fragile downstream systems)
- Fixed window counter — simple throttling where accuracy doesn't matter
- Sliding window — when boundary-burst accuracy matters (security, billing, strict SLAs)
In an interview, lead with token bucket and explicitly compare against the others. "I'd use token bucket because it handles bursts naturally and is simple to implement in Redis. If we needed stricter accuracy — say this were a login rate limit for security — I'd switch to sliding window log."
API design
The rate limiter exposes a minimal API to the services in front of it:
POST /v1/ratelimit/check
Body: { client_id, endpoint, [optional: cost] }
Returns: {
    allowed: bool,
    remaining: int,
    reset_at: timestamp
}
Three things to note:
- The cost parameter. Not every request consumes one token — an expensive search might cost 10 tokens, a cheap GET might cost 1. Passing the cost gives the caller control without changing the API.
- The response contains remaining and reset_at. Clients can surface these as response headers (X-RateLimit-Remaining, X-RateLimit-Reset) so well-behaved clients self-throttle.
- This is a synchronous check. Every request waits for the response. That's why latency matters so much — the rate limiter latency is added to every API call.
When the check denies a request, the caller should return HTTP 429 Too Many Requests with a Retry-After header indicating when the client can retry.
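As a purely illustrative example, a middleware that consumes the check result might build the denied response roughly like this (the header names follow the common X-RateLimit-* convention mentioned above; the response shape is a placeholder, not any particular framework's API):
import time

def to_429_response(check: dict) -> dict:
    # check is the body returned by /v1/ratelimit/check: {allowed, remaining, reset_at}
    retry_after = max(0, int(check["reset_at"] - time.time()))
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(retry_after),                   # seconds until the client may retry
            "X-RateLimit-Remaining": str(check["remaining"]),
            "X-RateLimit-Reset": str(int(check["reset_at"])),
        },
        "body": '{"error": "rate_limited"}',
    }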
For the deeper treatment of API design, see mastering the API interview.
Data model and storage
At 1.7M QPS, the storage has to be:
- In-memory (disk reads are too slow)
- Atomic (multi-threaded checks must not race)
- Distributed-safe (state must be shared across all rate limiter instances)
- TTL-aware (we don't want to store counters for clients that went inactive forever)
That's Redis, essentially by definition. For the Redis deep dive, see Redis in system design.
Redis data structures for each algorithm:
- Fixed window: INCR client:endpoint:window_id with EXPIRE at window end. One command, atomic, trivial.
- Token bucket: Hash per client with fields tokens and last_refill. Use a Lua script to do refill + consume atomically (critical — otherwise two concurrent checks can both pass when only one should).
- Sliding window log: Sorted set keyed by client. Members are timestamps, scores are timestamps. Use ZADD to add, ZREMRANGEBYSCORE to trim old entries, ZCARD to count.
- Sliding window counter: Two counters per client (current + previous window) with weighted logic in the check.
For all algorithms beyond fixed window, use Lua scripts for atomic multi-step logic. Redis guarantees single-threaded execution of Lua scripts — eliminating race conditions that would otherwise require distributed locks.
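To make that concrete, here is a sketch of a token bucket refill-plus-consume script wrapped in redis-py. It is illustrative rather than a drop-in implementation: the key prefix, the one-hour idle TTL, and passing the caller's clock are simplifying assumptions (a production script would typically read the clock from Redis's own TIME), and the variadic HSET assumes a reasonably recent Redis.
import time
import redis

r = redis.Redis()

TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(state[1])
local last   = tonumber(state[2])
if tokens == nil then
  tokens = capacity   -- first request from this client: start with a full bucket
  last = now
end

tokens = math.min(capacity, tokens + (now - last) * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', KEYS[1], 3600)   -- drop state for clients idle for an hour
return allowed
"""

token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(client_id: str, capacity: int = 100, refill_rate: float = 10.0, cost: int = 1) -> bool:
    # Redis executes the whole script atomically, so refill + consume cannot race.
    result = token_bucket(keys=[f"tb:{client_id}"],
                          args=[capacity, refill_rate, time.time(), cost])
    return result == 1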
Architecture: where does the rate limiter live?
Three common placement patterns, each with trade-offs:
Pattern 1: API gateway rate limiting
Rate limiting lives in the API gateway (Nginx, Kong, AWS API Gateway, Cloudflare), in front of all services.
Pros: Traffic never reaches your services if it's rate-limited — maximum protection. Centralized configuration. Usually has built-in rate limiting features you don't have to build yourself.
Cons: Gateway becomes a single point of contention. Custom rate limiting logic (per-endpoint costs, tier-specific limits) may require gateway extensions.
When to use: As the default for public-facing APIs. Combine with the next pattern for per-service control.
Pattern 2: Middleware rate limiting in the service
Rate limiting is a library/middleware inside each service (e.g., a middleware in Express.js, a filter in Spring Boot).
Pros: Service-specific logic, easy customization, no separate infrastructure.
Cons: Each service has to implement/maintain rate limiting. Traffic already hit your service before being rate-limited.
When to use: For internal service-to-service rate limiting where the overhead of a separate gateway isn't justified.
Pattern 3: Sidecar rate limiter
A separate rate limiter service (process or container) that every service calls before processing a request. Envoy's rate limiting service is the canonical example.
Pros: Centralized logic, language-independent, updatable without touching services.
Cons: Extra network hop per request (latency). Another service to operate.
When to use: Large microservice architectures where centralization matters and latency budget allows.
The common production pattern: Pattern 1 at the edge (coarse rate limiting) + Pattern 2 or 3 internally (fine-grained per-service rate limiting). Two layers, each doing what it's best at.
For the deeper architecture discussion, see load balancer vs reverse proxy vs API gateway.
The distributed rate limiting problem
Now for the part that actually makes this a senior-level interview question. Everything above assumes a single rate limiter instance. In production, you have hundreds of service instances, each potentially doing their own rate limit checks. This creates real correctness problems.
The naive approach — per-instance rate limiting — breaks immediately. If the limit is "100 req/sec per user" and you have 10 service instances, each maintaining its own counter, a user could actually get 1,000 req/sec (100 on each instance). No good.
The standard solution — centralized Redis. All rate limit state lives in a single Redis cluster. Every service instance checks Redis on every request. Redis handles the atomicity via Lua scripts or INCR.
This works, but raises new problems:
Problem: Redis becomes a bottleneck
At 1.7M QPS, a single Redis node can't handle it. Solution: Redis Cluster with sharding by client_id. Each rate limit key lives on one shard; cluster fans out across many nodes. Add shards as QPS grows.
Problem: Redis latency adds to every request
Even with Redis at sub-millisecond, at 1.7M QPS you're doing 1.7M network round trips per second. Solutions:
- Connection pooling to amortize TCP overhead
- Pipelining multiple checks into one round trip where possible
- Local caching with periodic Redis sync — each service keeps a small local bucket, syncs with Redis every few hundred requests. Trades a small amount of accuracy for significant latency savings. Used by Stripe's rate limiter.
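A very rough sketch of that third idea, under simplifying assumptions (redis-py, a fixed one-minute window, and requests "borrowed" from the shared counter in batches of 50; names and numbers are illustrative):
import time
import redis

r = redis.Redis()
BATCH = 50      # requests borrowed from the shared counter per Redis round trip
WINDOW = 60     # one-minute fixed window for the shared counter

_local: dict[str, tuple[int, int]] = {}   # client_id -> (window_id, requests left locally)

def allow(client_id: str, limit: int) -> bool:
    window_id = int(time.time() // WINDOW)
    win, left = _local.get(client_id, (window_id, 0))
    if win == window_id and left > 0:
        _local[client_id] = (window_id, left - 1)   # served locally, no Redis round trip
        return True
    key = f"rl:{client_id}:{window_id}"
    total = r.incrby(key, BATCH)        # claim a batch from the shared counter
    r.expire(key, WINDOW)
    if total - BATCH >= limit:          # shared limit already exhausted before this batch
        return False
    _local[client_id] = (window_id, BATCH - 1)
    return True
Only about one request in fifty pays the Redis round trip; the cost is that near the limit each instance can over-allow by up to one batch, which is exactly the accuracy-for-latency trade described above.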
Problem: Redis fails
When Redis is down, you have three options, each a trade-off:
- Fail closed (deny all requests). Most correct — you genuinely don't know if the request is allowed. Costs availability.
- Fail open (allow all requests). Preserves availability. Costs protection — a DDoS during a Redis outage goes through.
- Fall back to local counters per instance. Allows more than the intended limit, but puts an upper cap on damage. Usually the right answer.
Most production rate limiters fail open or fall back to local counters; failing closed means an outage of your rate limiter takes your whole API down with it, which is a bad outcome.
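In code, the fallback decision might look something like this sketch (window handling and TTLs are omitted, and the per-instance share is something you would configure, for example the limit divided by the expected instance count):
import redis

r = redis.Redis()
_fallback_counts: dict[str, int] = {}    # coarse per-instance counters used only during outages

def check_with_fallback(client_id: str, limit: int, instance_share: int) -> bool:
    try:
        count = r.incr(f"rl:{client_id}")            # normal path: shared Redis counter
        return count <= limit
    except redis.ConnectionError:
        # Redis is unreachable: fall back to a local counter. Each instance enforces
        # only its own share, so the system over-allows, but the damage is bounded.
        _fallback_counts[client_id] = _fallback_counts.get(client_id, 0) + 1
        return _fallback_counts[client_id] <= instance_share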
This trade-off connects directly to CAP theorem and PACELC — rate limiters usually pick availability over perfect consistency, accepting that during network partitions they'll over-allow rather than under-allow.
Problem: consistency across regions
If your service runs in multiple regions, do you want a global rate limit or a per-region limit? Global is harder (cross-region Redis replication is slow) but more correct. Per-region is easy but can allow more than the nominal global limit.
Most systems use per-region rate limits with periodic reconciliation, accepting the inaccuracy. If you truly need strict global limits, you accept the latency hit of cross-region coordination.
How real companies do it
Anchoring your interview answer in real-world practice is powerful. A few documented approaches:
Stripe uses token bucket with Redis-backed state, heavily tuned for latency. They famously wrote about pushing Redis to handle hundreds of thousands of rate limit checks per second from thousands of workers, using local caches + Redis reconciliation.
GitHub's API uses fixed-window counters exposed via standard HTTP headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset). They also use a secondary rate limit for anti-abuse — a sliding window that catches burst patterns the fixed window misses.
Twitter (now X) uses a multi-tier approach — coarse gateway-level limiting at the edge, per-endpoint limits internally, and aggressive rate limiting on write endpoints vs read endpoints.
AWS API Gateway uses token bucket with configurable burst and steady-state rates. You configure two numbers: the sustained rate and the burst capacity.
Common interview follow-up questions
Once you've designed the initial rate limiter, expect these follow-ups:
- "How do you handle a hot shard in the Redis cluster?" (Composite keys with write sharding — if user_id "celebrity" is a hot key, split it into "celebrity:0", "celebrity:1", ... "celebrity:9", aggregate on read.)
- "What happens when the Redis cluster becomes unavailable?" (Fallback strategy — local counters + log for reconciliation, or fail-open with monitoring.)
- "How do you support different limits for different customer tiers?" (Store limits in a separate config store, looked up per-client on check. Cache config aggressively.)
- "How does this interact with autoscaling?" (When new service instances spin up, they should reuse the shared Redis state — the rate limit is per-user, not per-instance.)
- "What's the memory footprint at scale?" (Walk through the math — 500M users × 50 bytes ≈ 25GB, need multi-shard Redis.)
- "How do you prevent clients from bypassing the rate limiter by using many IPs?" (Rate limit by account/API key primarily, fall back to IP for unauthenticated requests, use device fingerprinting for abuse detection — but that's a separate anti-abuse system.)
- "How do you rate-limit by multiple dimensions simultaneously (per-user AND per-endpoint AND per-IP)?" (Multiple parallel checks with AND logic — deny if any limit is exceeded. Each check is its own Redis key.)
- "How would you design a rate limiter for a high-traffic event — Black Friday, ticket sales?" (Pre-provision Redis capacity, implement priority tiers, use request queuing instead of rejection for premium users. The ticketing system case study covers the flash-sale version of this.)
If you can handle six of these eight without hesitating, you've nailed the interview.
Putting it all together
The one-sentence version: Rate limiting caps request rate per client to protect your system. Use token bucket by default, back it with Redis for distributed state, accept small accuracy trade-offs in exchange for availability during failures, and place the rate limiter at the API gateway for maximum reach.
In an interview, structure your answer like this:
- Clarify requirements out loud (what counts as a client, what are the limits, what's the latency budget)
- Do rough capacity estimation (QPS, memory, Redis sizing)
- Pick an algorithm (token bucket by default, explain why you'd pick differently)
- Describe the data model and storage (Redis, data structure per algorithm, atomicity via Lua)
- Place the rate limiter architecturally (gateway + service middleware, usually both)
- Discuss distributed correctness (centralized Redis, local cache + sync, Redis failure strategy)
- Handle the follow-ups gracefully — and notice that most of them are about failure modes, not happy paths
That structure separates juniors ("I'd use token bucket") from seniors ("I'd use token bucket for typical API rate limiting, back it with Redis Cluster sharded by client_id, implement the refill+consume logic in a Lua script for atomicity, fall back to per-instance counters if Redis goes unavailable, and expose standard X-RateLimit-* headers for well-behaved clients to self-throttle").
Good luck with your next interview.
Keep learning
Related posts for the topics this guide touched:
- Redis Guide for System Design — the deep dive on the storage backend that makes rate limiting work.
- Caching for System Design — related concept; rate limiting and caching overlap in several patterns.
- Mastering the API Interview — rate limiting is a core API design topic.
- Load Balancer vs Reverse Proxy vs API Gateway — where rate limiting fits in the request path.
- High Availability in System Design — relevant for rate limiter failure modes.
- CAP Theorem vs PACELC — the framework for understanding the consistency trade-offs in distributed rate limiting.
- Grokking Idempotency — related correctness concept for safe retries after 429s.
- How to Design a Ticketing System — a case study where rate limiting is the critical component (flash sales).
- Rate limiting is one of the 15+ case studies in Grokking System Design.
For the full system design interview roadmap, start with my complete system design interview guide.