You usually discover you need rate limiting the same way you discover you need backups: something catches fire, you say “huh, that’s weird,” and then you spend the next 48 hours building guardrails.
Rate limiting controls unfairness. You decide who gets to make how many requests, to which endpoints, in what time window, and what happens when they go over. Done well, it prevents noisy neighbors, protects downstream dependencies, and gives you a clean lever for “free vs paid” product tiers. Done poorly, it becomes random 429s, thundering herds, and a support queue full of angry screenshots.
The trick is to treat rate limiting as a product feature and an infrastructure feature at the same time: consistent rules, predictable headers, sensible burst behavior, and an implementation that holds up under concurrency.
What “good” looks like in real APIs
You want three things that experienced API consumers immediately notice:
First, correct HTTP semantics. When you throttle, return 429 Too Many Requests. If you can compute a meaningful wait time, include Retry-After so clients can back off without guessing.
Second, a transparent quota state. Clients should be able to inspect response headers and know what just happened: what the limit is, how much remains, and when they can safely retry. Some APIs use legacy X-RateLimit-* headers because they are widely recognized. Newer guidance is moving toward standardized RateLimit-* headers. Either way, pick one approach and document it.
Third, burst behavior that matches reality. Many systems are fine with short spikes but fail under sustained load. Your limiter should reflect that.
Start by choosing what you are limiting (this matters more than the algorithm)
Most teams start with “requests per IP per minute.” That’s an okay emergency brake and a bad long-term fairness policy.
A better approach is to define the quota partition based on identity and cost:
- API key or user ID for authenticated clients (usually the fairest).
- Tenant or account ID for B2B SaaS (protects multi-tenant systems).
- IP address as a backstop for unauthenticated endpoints (signup, login, password reset).
- Route-level limits for expensive endpoints (report generation, search, export).
- Concurrency limits for operations that tie up scarce resources (long-running jobs, streaming, heavy DB queries).
In practice, the best pattern is two layers: a global per-client limit that protects everything, plus a tighter per-route (or per-cost) limit that protects what melts your stack.
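To make the two-layer idea concrete, here is a minimal sketch. The route paths and numbers are illustrative, not a recommendation; the point is that a request must pass every layer that applies to it.

```python
# Hypothetical two-layer policy: a global per-client limit that protects
# everything, plus tighter per-route caps for expensive endpoints.
# All routes and numbers below are illustrative.
GLOBAL_LIMIT = 600          # requests/minute per API key
ROUTE_LIMITS = {            # tighter caps for what melts your stack
    "/v1/export": 10,
    "/v1/search": 60,
}

def applicable_limits(route):
    """Return every limit that applies to this request; a request must
    pass ALL of them, so each layer is checked independently."""
    limits = {"global": GLOBAL_LIMIT}
    if route in ROUTE_LIMITS:
        limits[route] = ROUTE_LIMITS[route]
    return limits
```

Checking each layer independently (rather than taking a single minimum) lets you report which rule triggered, which matters later for observability.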
Pick an algorithm that matches your traffic shape
Rate limiting algorithms are not “better or worse.” They are “honest about bursts” vs “honest about averages.”
Token bucket
This is the workhorse for API gateways. You get a steady average rate plus controlled bursts. If you want to allow a client to spike briefly but not sustain a flood, the token bucket is your friend.
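A minimal in-memory sketch of the token bucket, assuming single-process state (a production limiter needs shared, atomic state, covered below):

```python
import time

class TokenBucket:
    """Minimal token bucket: a steady average rate plus a bounded burst.
    Illustrative sketch only; not safe across processes or instances."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate            # tokens refilled per second (average rate)
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so clients can burst immediately
        self.updated = now if now is not None else time.monotonic()

    def allow(self, now=None):
        now = now if now is not None else time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=1, capacity=20`, a client can spend 20 requests instantly, then settles to one request per second — exactly the "brief spike, no sustained flood" shape described above.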
Leaky bucket
This smooths traffic more aggressively. Requests “drip” through at a fixed rate. It’s useful when bursts are specifically what hurts you, but it can feel more restrictive for clients.
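For contrast with the token bucket, here is the leaky bucket as a meter, again as an illustrative single-process sketch: each request adds to the bucket, the bucket drains at a fixed rate, and a full bucket means rejection.

```python
class LeakyBucket:
    """Leaky bucket as a meter: requests fill the bucket, which drains at
    a fixed rate; a full bucket rejects. Illustrative sketch only."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # drain rate, requests per second
        self.capacity = capacity  # how much backlog we tolerate
        self.level = 0.0
        self.updated = now

    def allow(self, now):
        # Drain since the last check, never below empty.
        self.level = max(0.0, self.level - (now - self.updated) * self.rate)
        self.updated = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

Note the difference in feel: with `capacity=2` here, two back-to-back requests fill the bucket and the third is rejected, even though the long-run average is the same as a token bucket with the same rate.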
Sliding window
This avoids the classic “window boundary” loophole where clients can double-dip across minute boundaries. It is also more fair under spiky traffic, at the cost of slightly more bookkeeping.
A quick worked example (so your numbers stop being vibes)
Say your read endpoint has a limit of 60 requests per minute per API key.
With a naive fixed window, a client can send 60 requests at 12:00:59 and another 60 at 12:01:00, which is effectively 120 requests in about one second. If your downstream system falls over when it sees spikes, that loophole becomes a real incident.
With a token bucket tuned to 1 request per second average and a burst capacity of 20, the client can spend 20 tokens quickly, then they naturally settle into the average rate. This gives you “small bursts are fine, sustained load is not.”
With a sliding window approach, it’s much harder to game the boundary at all, because you are always measuring the last 60 seconds, not a calendar minute.
Implement in the right place: edge, app, or shared limiter
There are three common deployment options. Most production systems use two of them.
1) Edge limiting (gateway/proxy)
Great for blocking obvious abuse early and cheaply. You stop bad traffic before it consumes app CPU, DB connections, or upstream calls. This is also where you enforce “global safety limits” for unauthenticated traffic.
2) App-level limiting (middleware)
Best when limits depend on business logic: plan tier, endpoint cost, user risk, partner contracts, or tenant-level budgets. You can also do smarter exceptions here, like allowing internal services or trusted partner IP ranges.
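A sketch of what "limits depend on business logic" looks like in middleware. The plan names and numbers are hypothetical, and `limiter` stands in for any shared `check(key, limit) -> bool`, such as one backed by Redis:

```python
# Hypothetical app-level check: the limit comes from the user's plan tier,
# not from a static config. `limiter` is any shared check(key, limit) -> bool.
PLAN_LIMITS = {"free": 60, "pro": 600}  # requests/minute, illustrative

def check_request(limiter, user):
    limit = PLAN_LIMITS.get(user["plan"], PLAN_LIMITS["free"])
    key = "rl:" + user["id"]
    return limiter(key, limit)
```

Because the decision lives in application code, the same hook can also grant exceptions for internal services or trusted partners.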
3) Distributed limiter (Redis or a dedicated service)
If you run multiple app instances, you usually need a shared source of truth so limits stay consistent across pods and zones. Redis is a common choice because it can do atomic counters and scripting. A dedicated rate-limit service can be cleaner for very large systems, but it’s another component to run.
A pragmatic architecture: coarse limits at the edge, then fine-grained per-tenant and per-route rules inside the app backed by a distributed store.
A Redis approach you can ship: sliding window counter (atomic)
If you want fairness across instances and fewer boundary spikes, a sliding window counter is a strong default.
The idea: keep counters for the current window and the previous window. Weight the previous window based on how far you are into the current one. Do it atomically so concurrency does not break your math.
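The weighting math in plain Python, so the numbers are easy to check by hand: 15 seconds into a 60-second window, the previous window still counts for 45/60 = 75% of its requests.

```python
def effective_count(curr, prev, elapsed_ms, window_ms=60_000):
    """Weighted sliding-window count: the previous window's requests decay
    linearly as the current window fills up."""
    weight = (window_ms - elapsed_ms) / window_ms
    return curr + prev * weight
```

So 10 requests in the current window plus 40 in the previous one, checked 15 seconds in, count as 10 + 40 × 0.75 = 40 effective requests against the limit.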
```lua
-- Keys:
--   KEYS[1] = current window key (e.g. "rl:{client}:1709553720")
--   KEYS[2] = previous window key (e.g. "rl:{client}:1709553660")
-- Args:
--   ARGV[1] = limit (integer)
--   ARGV[2] = window_size_ms (integer, e.g. 60000)
--   ARGV[3] = now_ms (integer)
--
-- Returns:
--   { allowed (0/1), remaining (integer), retry_after_ms (integer) }

local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local curr = tonumber(redis.call("GET", KEYS[1]) or "0")
local prev = tonumber(redis.call("GET", KEYS[2]) or "0")

local elapsed = now % window
local weight = (window - elapsed) / window
local effective = curr + (prev * weight)

if effective >= limit then
  local retry_after = window - elapsed
  return {0, 0, math.floor(retry_after)}
end

curr = tonumber(redis.call("INCR", KEYS[1]))
redis.call("PEXPIRE", KEYS[1], window * 2)

local new_effective = curr + (prev * weight)
local remaining = math.floor(limit - new_effective)
if remaining < 0 then remaining = 0 end

return {1, remaining, 0}
```
You can map retry_after_ms to Retry-After (seconds) and return 429 when allowed is 0.
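A sketch of that glue code, assuming the three values the script returns. The header names follow the RateLimit-*/Retry-After convention discussed below; the function shape itself is hypothetical.

```python
import math

def to_response(allowed, remaining, retry_after_ms, limit):
    """Map limiter output to HTTP semantics: (status, headers, body).
    Retry-After is in whole seconds, rounded up so clients never retry early."""
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
    }
    if allowed:
        return 200, headers, None
    retry_after_s = max(1, math.ceil(retry_after_ms / 1000))
    headers["Retry-After"] = str(retry_after_s)
    body = {
        "error": "rate_limited",
        "message": "Too many requests",
        "retry_after_seconds": retry_after_s,
    }
    return 429, headers, body
```

Rounding up (and never below 1 second) is deliberate: telling a client to retry slightly late is cheap; telling it to retry early guarantees a second rejection.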
Make rejection predictable: status codes, headers, and body
When you throttle, your goal is “client can recover automatically,” not “client panics.”
Use:
- HTTP 429 for throttling.
- Retry-After when you can compute a sane delay.
- Rate limit headers that describe policy and current state.
If you want an easy, client-friendly set, these are common fields to expose:
- Limit: the quota for this scope
- Remaining: how many requests are left
- Reset: when the counter resets (or the next token is available)
- Optional Policy: a string describing the window and rules
In the response body, keep it short and machine-readable:
```json
{
  "error": "rate_limited",
  "message": "Too many requests",
  "retry_after_seconds": 12
}
```
Make it observable and resilient, or you will debug ghosts
Rate limiting is a control system. You need visibility into what it’s doing.
Track:
- allowed vs rejected counts
- which rule triggered (global, route, tenant)
- top limited clients and endpoints
- average and p95 Retry-After values handed out
Then decide what happens when the limiter dependency fails:
- Fail open: allow traffic if Redis is down, but keep coarse edge limits to prevent disasters.
- Fail closed: block traffic when you cannot check limits, usually only for high-risk endpoints.
Most APIs fail open for general traffic and fail closed for the endpoints that attackers love (auth, password reset, expensive compute).
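The fail-open/fail-closed decision can live in one small wrapper. This is a sketch; `check` stands in for any limiter call that can raise when its backend (e.g. Redis) is unreachable:

```python
def check_with_fallback(check, key, limit, fail_open=True):
    """Run the limiter check; if its backend is unreachable, fall back to
    the configured policy instead of erroring the whole request."""
    try:
        return check(key, limit)
    except ConnectionError:
        # Limiter backend is down: fail open for general traffic (coarse
        # edge limits remain the guardrail), fail closed for risky routes.
        return fail_open
```

In practice you would also emit a metric on the fallback path, since a limiter silently failing open is exactly the kind of ghost this section warns about.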
FAQ
Should I rate limit by IP?
As a fallback, yes, especially for unauthenticated endpoints. But IP-only limits can punish innocent users behind NATs or corporate proxies. Prefer API key, user ID, or tenant ID when you have it.
Token bucket vs sliding window, which should I pick?
If you want controlled bursts, token bucket. If you want fewer boundary games and smoother fairness, sliding window.
Is gateway throttling enough?
It’s a great first layer, but it rarely understands plan tiers, tenant budgets, or endpoint cost. Use it for coarse protection, then enforce business-aware rules in the app.
Honest Takeaway
If you implement only one thing, implement consistent 429 responses with Retry-After and clear rate limit headers, keyed on real identity (API key, user, tenant). That alone turns “random throttling” into something clients can engineer around.
If you want a limiter that behaves under concurrency and across multiple instances, add a distributed limiter (often Redis) for your expensive routes, and keep coarse limits at the edge so bad traffic dies early.
Kirstie is a technology news reporter at DevX. She reports on emerging technologies and startups waiting to skyrocket.