You do not optimize your way to millions of requests per minute. You architect your way there, then you operate your way there.
At that load, the enemy is not a single slow endpoint. It is death by a thousand paper cuts: one extra network hop, one chatty database call, one retry storm, one GC pause, one missing timeout. The trick is to design an API that stays boring under stress, where overload looks like graceful degradation instead of a cascading outage.
Let’s make this concrete. 1 million requests per minute is 16,667 requests per second. 5 million per minute is 83,333 rps. If your p95 budget is 200 ms end-to-end, you are managing tens of thousands of concurrent in-flight requests, plus retries, plus the fact that traffic is spiky and not a flat line.
Do the math first, or you will guess wrong later
Before you talk about microservices, gRPC, or service meshes, do a back-of-the-napkin capacity model. It forces you to confront real constraints: CPU, memory, connection limits, and the database.
A worked example:
- Target: 5,000,000 requests per minute = 83,333 rps
- Assume a single API instance can safely handle 2,000 rps at p95, after auth, routing, and serialization overhead.
- You need 83,333 / 2,000 = 41.7, call it 50 instances to cover headroom, uneven load, deployments, and a regional failure.
- If average payload is 2 KB in and 2 KB out, that is ~4 KB per request. At 83,333 rps, you are pushing ~333 MB per second, roughly 2.7 Gbps, before TLS overhead and spikes.
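The napkin math above is worth scripting so you can re-run it as your assumptions change. A minimal sketch, using the numbers from the worked example (the 2,000 rps per instance and 4 KB payload figures are the same assumptions as above, not measured values):

```python
# Capacity model for the worked example. All inputs are assumptions;
# replace them with numbers you have actually measured under load.
target_rps = 5_000_000 / 60                    # ~83,333 rps
raw_instances = target_rps / 2_000             # ~41.7 instances at the safe per-node rate
planned = 50                                   # round up for headroom, deploys, AZ loss
bandwidth_mbps = target_rps * 4_000 * 8 / 1e6  # ~2,667 Mbps, before TLS overhead and spikes

print(round(raw_instances, 1), planned, round(bandwidth_mbps))
```

Re-running this with your own measured per-instance rps is the cheapest load test you will ever do.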
This is why autoscaling alone is not a strategy. It helps, but only if the rest of the system behaves under surge.
Treat failure as normal, not exceptional
This is where serious system design starts: you assume things will break, and you design the blast radius.
Werner Vogels, CTO at Amazon, has consistently argued that distributed systems must be designed with the expectation of failure. In Amazon’s architecture guidance, reliability is framed around isolating components and minimizing cascading impact. Failure is not an edge case, it is background noise.
Ben Treynor Sloss, who helped found Google’s SRE practice, made reliability a quantifiable engineering discipline. The SRE approach treats uptime and latency as measurable objectives with explicit tradeoffs, not vague aspirations. You set error budgets and build systems that respect them.
Stripe’s engineering team operationalized this mindset by baking idempotency into their API design. They introduced idempotency keys for write operations so clients can safely retry without duplicating side effects. That design decision directly addresses the messy reality of real networks.
Put those together, and you get a blunt synthesis: at a million-plus RPM, resilience is a product feature. Your API contract must define what happens when the system is slow, overloaded, or partially degraded.
Control the firehose at the edge with hard limits
Your first scaling win is preventing bad traffic from becoming expensive traffic.
A mature edge layer, often an API gateway or reverse proxy, should handle:
- TLS termination and authentication
- Routing and request normalization
- Rate limiting and quotas
- Timeouts and circuit breaking
- Structured logging and trace propagation
Rate limiting is not optional at this scale. It is load management.
Here is a practical comparison of common rate-limiting strategies:
| Strategy | Best for | Weakness under surge |
|---|---|---|
| Fixed window | Simple per-minute quotas | Burst at window boundaries |
| Sliding window | Fairer distribution | More state and computational complexity |
| Token bucket | Controlled bursts | Harder distributed coordination |
| Leaky bucket | Smooth steady throughput | Can introduce perceived latency |
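To make the tradeoffs concrete, here is a minimal single-node token bucket. It is a sketch, not a distributed limiter; the rate and capacity numbers are illustrative, and timestamps are passed in explicitly so the refill logic is easy to follow:

```python
class TokenBucket:
    """Token bucket: bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller turns this into a fast 429


bucket = TokenBucket(rate=100, capacity=10)  # 100 req/s steady, bursts of 10
burst = sum(bucket.allow(0.0) for _ in range(15))
print(burst)  # 10 of 15 simultaneous requests pass; the rest are rejected
```

Notice the weakness the table calls out: making `tokens` consistent across 50 instances requires shared state (for example, a Redis script), which is exactly the "harder distributed coordination" cost.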
Two rules matter more than the algorithm name.
First, fail closed for unsafe endpoints. If overload causes duplicate charges or duplicate writes, your problem is no longer performance.
Second, return fast 429 responses before upstream threads, connection pools, or databases saturate. Protect the core at all costs.
Make request handling cheap and predictable
At high RPS, you pay for every abstraction with CPU cycles and latency variance. Your goal is to make each request boring and inexpensive.
Practical levers that consistently move the needle:
- Keep connections warm and reuse them.
- Avoid excessive per-request allocations.
- Bound concurrency instead of allowing unlimited in-flight work.
- Set strict timeouts on every network hop.
- Retry only idempotent operations, with jitter and caps.
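Two of those levers, bounded concurrency and strict per-hop timeouts, compose naturally. A sketch using asyncio (the limits and the fake downstream call are illustrative; assume your real client library is async):

```python
import asyncio

MAX_IN_FLIGHT = 100      # bound on concurrent in-flight work
PER_HOP_TIMEOUT = 0.25   # strict timeout on this network hop, in seconds


async def call_downstream(payload):
    # Stand-in for a real network call.
    await asyncio.sleep(0.01)
    return {"ok": True, "echo": payload}


async def handle(sem, payload):
    async with sem:  # beyond the bound, requests wait here instead of piling up downstream
        try:
            return await asyncio.wait_for(call_downstream(payload), PER_HOP_TIMEOUT)
        except asyncio.TimeoutError:
            return {"ok": False, "error": "upstream_timeout"}  # degrade, don't hang


async def main():
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(handle(sem, {"id": i}) for i in range(200)))


results = asyncio.run(main())
print(sum(r["ok"] for r in results))
```

The semaphore is the governor: when a dependency slows down, excess requests queue at a known point with a known bound, instead of exhausting threads and connections.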
If you do write operations, build idempotency into the contract. That single design decision prevents entire classes of catastrophic retry amplification.
Also watch for hidden multipliers. Logging every request synchronously, performing per-request schema validation with heavy reflection, or making three downstream calls when one would do are all common self-inflicted bottlenecks.
Scale the data layer by reducing database work
Most APIs do not collapse in the stateless tier. They collapse when the database becomes the shared choke point.
Patterns that scale in practice:
Aggressive caching with explicit TTLs.
Cache at the edge where possible, then in memory, then fall back to the database. Decide what can be slightly stale and for how long.
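The "slightly stale, for how long" decision shows up directly in code as a per-entry TTL. A tiny in-memory sketch (timestamps are injected so expiry is explicit; a real deployment would layer this behind an edge cache and in front of the database):

```python
class TTLCache:
    """Tiny in-memory cache with explicit per-entry TTLs."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # miss or stale: caller falls back to the database
        return entry[0]

    def set(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)


cache = TTLCache()
cache.set("user:42", {"name": "Ada"}, ttl=5, now=0)
print(cache.get("user:42", now=3))  # fresh hit
print(cache.get("user:42", now=6))  # expired: None, go to the database
```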
Partitioning with a plan for hot keys.
If one user or object can become extremely popular, you need mitigation. That might mean key salting, request coalescing, or isolating hot tenants.
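Request coalescing (sometimes called single-flight) is the cheapest of those mitigations: concurrent requests for the same hot key share one backend fetch. A sketch with asyncio (error propagation to waiters is omitted for brevity; production code must also fail the shared future on fetch errors):

```python
import asyncio

_inflight = {}  # key -> Future; concurrent callers for a hot key share one fetch


async def coalesced_get(key, fetch, calls):
    if key in _inflight:
        return await _inflight[key]        # piggyback on the in-flight fetch
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        calls.append(key)                  # count real backend fetches for the demo
        value = await fetch(key)
        fut.set_result(value)
        return value
    finally:
        del _inflight[key]


async def main():
    calls = []

    async def slow_fetch(key):
        await asyncio.sleep(0.01)          # stand-in for a database read
        return f"value-for-{key}"

    results = await asyncio.gather(
        *(coalesced_get("hot", slow_fetch, calls) for _ in range(50))
    )
    return results, calls


results, calls = asyncio.run(main())
print(len(results), len(calls))  # 50 requests served by 1 backend call
```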
Asynchronous side effects.
Turn “write and perform five additional operations” into “write, enqueue, acknowledge.” Process the rest outside the critical path.
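The shape of that change is small but structural: the handler writes, enqueues, and returns. A sketch with an in-process queue standing in for a durable one (Kafka, SQS, or similar; the order API here is illustrative):

```python
from collections import deque

queue = deque()  # stand-in for a durable queue in production
orders = {}


def create_order(order_id, item):
    orders[order_id] = {"item": item, "status": "accepted"}  # 1. write
    queue.append(("send_email", order_id))                   # 2. enqueue side effects
    queue.append(("update_search_index", order_id))
    return {"order_id": order_id, "status": "accepted"}      # 3. acknowledge fast


def drain_worker():
    # Runs outside the request path, at its own pace, with its own retries.
    processed = []
    while queue:
        processed.append(queue.popleft())
    return processed


ack = create_order("o1", "widget")
done = drain_worker()
print(ack["status"], len(done))
```

The request path now does one write and two appends; the five additional operations happen on the worker's schedule, not the caller's.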
Dedicated read models for read-heavy endpoints.
Search, feeds, and analytics endpoints often need denormalized stores. Serving them from normalized OLTP schemas is a recipe for constant query tuning.
The meta principle is simple: treat database IOPS like a scarce resource. Because at this scale, it is.
Operate as if reliability is a feature
Once you cross into tens of thousands of requests per second, scaling becomes an operational discipline.
Autoscaling must be tuned to avoid flapping. If you scale down too aggressively, you will oscillate between shortage and surplus. Stabilization windows and conservative scale-down policies are there for a reason.
Circuit breakers and outlier detection protect you from sick downstream instances. If one dependency starts timing out, you must shed traffic to it quickly before it drags down healthy peers.
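A basic consecutive-failure breaker illustrates the mechanism. This sketch uses illustrative thresholds and injected timestamps; real implementations (Envoy, resilience4j, and similar) add rolling windows and outlier ejection on top of the same idea:

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            return True   # half-open: let a probe request test the dependency
        return False      # shed traffic instead of queueing on a sick dependency

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now


cb = CircuitBreaker(threshold=3, cooldown=30)
for t in range(3):
    cb.record(success=False, now=t)
print(cb.allow(now=5), cb.allow(now=40))  # open at t=5, half-open at t=40
```

The point of the open state is speed: a rejected call returns in microseconds, while a timed-out call holds a connection for your full timeout budget.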
If you only track one dashboard, make it this set:
- p50, p95, p99 latency per endpoint
- Error rates split by 4xx and 5xx
- Saturation metrics such as thread pools and queue depth
- Dependency latency and cache hit rate
- Retry rate, which is your early warning for storms
Most large outages start with a small latency regression that triggers retries. Retries increase load. Load increases latency. Without controls, the system can DDoS itself.
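Breaking that loop is why retries need caps and jitter. A sketch of capped exponential backoff with full jitter, for idempotent operations only (the sleep is injected as a no-op so the demo runs instantly; real code sleeps for the computed delay):

```python
import random


def call_with_retries(op, max_attempts=4, base=0.1, cap=5.0,
                      rng=random.Random(0), sleep=lambda s: None):
    """Retry an idempotent operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # hard cap: never retry forever
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)].
            # Randomness desynchronizes clients so they don't retry in lockstep.
            sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))


attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(call_with_retries(flaky), attempts["n"])  # succeeds on the third attempt
```

The cap bounds the worst-case load multiplier: with `max_attempts=4`, a total outage costs at most 4x traffic, not an unbounded storm.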
FAQ
Do you need microservices to reach millions of requests per minute?
No. A well designed modular monolith can scale extremely far. Microservices help with team autonomy but add network hops and operational complexity.
REST or gRPC at this scale?
Either can work. If you control clients and need efficient binary serialization and multiplexing, gRPC is attractive. If you need broad interoperability and simpler debugging, REST remains practical.
What is the most common scaling failure mode?
Retry amplification combined with missing timeouts.
How do you know you are ready?
Load test the entire system with realistic traffic shapes. Inject failures into dependencies. If you have never tested a partial database slowdown or cache outage, you do not truly know how the system behaves.
Honest Takeaway
Building APIs that handle millions of requests per minute is not about chasing a magic framework. It is about discipline.
You add governors everywhere: rate limits, bounded concurrency, timeouts, idempotency, backpressure, and graceful degradation. You move expensive work out of the request path. You treat databases as scarce resources. You observe everything.
If you do this well, your API does not look impressive under load. It looks boring.
And at 3 a.m., boring is exactly what you want.
Kirstie is a technology news reporter at DevX. She reports on emerging technologies and startups waiting to skyrocket.