You do not optimize your way to millions of requests per minute. You architect your way there, then you operate your way there.
At that load, the enemy is not a single slow endpoint. It is death by a thousand paper cuts: one extra network hop, one chatty database call, one retry storm, one GC pause, one missing timeout. The trick is to design an API that stays boring under stress, where overload looks like graceful degradation instead of a cascading outage.
Let’s make this concrete. 1 million requests per minute is 16,667 requests per second. 5 million per minute is 83,333 rps. If your p95 budget is 200 ms end-to-end, you are managing tens of thousands of concurrent in-flight requests, plus retries, plus the fact that traffic is spiky and not a flat line.
Do the math first, or you will guess wrong later
Before you talk about microservices, gRPC, or service meshes, do a back-of-the-napkin capacity model. It forces you to confront real constraints: CPU, memory, connection limits, and the database.
A worked example:
- Target: 5,000,000 requests per minute = 83,333 rps
- Assume a single API instance can safely handle 2,000 rps at p95, after auth, routing, and serialization overhead.
- You need 83,333 / 2,000 = 41.7, call it 50 instances to cover headroom, uneven load, deployments, and a regional failure.
- If average payload is 2 KB in and 2 KB out, that is ~4 KB per request. At 83,333 rps, you are pushing ~333 MB per second, roughly 2.7 Gbps, before TLS overhead and spikes.
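The napkin math above is worth scripting so you can re-run it as your assumptions change. A minimal sketch, using the numbers from the worked example (the 2,000 rps per instance and 4 KB payload figures are the same assumptions as above, not measured values):

```python
# Capacity model for the worked example. All inputs are assumptions;
# replace them with numbers you have actually measured under load.
target_rps = 5_000_000 / 60                    # ~83,333 rps
raw_instances = target_rps / 2_000             # ~41.7 instances at the safe per-node rate
planned = 50                                   # round up for headroom, deploys, AZ loss
bandwidth_mbps = target_rps * 4_000 * 8 / 1e6  # ~2,667 Mbps, before TLS overhead and spikes

print(round(raw_instances, 1), planned, round(bandwidth_mbps))
```

Re-running this with your own measured per-instance rps is the cheapest load test you will ever do.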
This is why autoscaling alone is not a strategy. It helps, but only if the rest of the system behaves under surge.
Treat failure as normal, not exceptional
This is where serious system design starts: you assume things will break, and you design the blast radius.
Werner Vogels, CTO at Amazon, has consistently argued that distributed systems must be designed with the expectation of failure. In Amazon’s architecture guidance, reliability is framed around isolating components and minimizing cascading impact. Failure is not an edge case, it is background noise.
Ben Treynor Sloss, who helped found Google’s SRE practice, made reliability a quantifiable engineering discipline. The SRE approach treats uptime and latency as measurable objectives with explicit tradeoffs, not vague aspirations. You set error budgets and build systems that respect them.
Stripe’s engineering team operationalized this mindset by baking idempotency into their API design. They introduced idempotency keys for write operations so clients can safely retry without duplicating side effects. That design decision directly addresses the messy reality of real networks.
Put those together, and you get a blunt synthesis: at a million-plus RPM, resilience is a product feature. Your API contract must define what happens when the system is slow, overloaded, or partially degraded.
Control the firehose at the edge with hard limits
Your first scaling win is preventing bad traffic from becoming expensive traffic.
A mature edge layer, often an API gateway or reverse proxy, should handle:
- TLS termination and authentication
- Routing and request normalization
- Rate limiting and quotas
- Timeouts and circuit breaking
- Structured logging and trace propagation
Rate limiting is not optional at this scale. It is load management.
Here is a practical comparison of common rate-limiting strategies:
| Strategy | Best for | Weakness under surge |
|---|---|---|
| Fixed window | Simple per-minute quotas | Burst at window boundaries |
| Sliding window | Fairer distribution | More state and computational complexity |
| Token bucket | Controlled bursts | Harder distributed coordination |
| Leaky bucket | Smooth steady throughput | Can introduce perceived latency |
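To make the tradeoffs concrete, here is a minimal single-node token bucket. It is a sketch, not a distributed limiter; the rate and capacity numbers are illustrative, and timestamps are passed in explicitly so the refill logic is easy to follow:

```python
class TokenBucket:
    """Token bucket: bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller turns this into a fast 429


bucket = TokenBucket(rate=100, capacity=10)  # 100 req/s steady, bursts of 10
burst = sum(bucket.allow(0.0) for _ in range(15))
print(burst)  # 10 of 15 simultaneous requests pass; the rest are rejected
```

Notice the weakness the table calls out: making `tokens` consistent across 50 instances requires shared state (for example, a Redis script), which is exactly the "harder distributed coordination" cost.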
Two rules matter more than the algorithm name.
First, fail closed for unsafe endpoints. If overload causes duplicate charges or duplicate writes, your problem is no longer performance.
Second, return fast 429 responses before upstream threads, connection pools, or databases saturate. Protect the core at all costs.
Make request handling cheap and predictable
At high RPS, you pay for every abstraction with CPU cycles and latency variance. Your goal is to make each request boring and inexpensive.
Practical levers that consistently move the needle:
- Keep connections warm and reuse them.
- Avoid excessive per-request allocations.
- Bound concurrency instead of allowing unlimited in-flight work.
- Set strict timeouts on every network hop.
- Retry only idempotent operations, with jitter and caps.
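Two of those levers, bounded concurrency and strict per-hop timeouts, compose naturally. A sketch using asyncio (the limits and the fake downstream call are illustrative; assume your real client library is async):

```python
import asyncio

MAX_IN_FLIGHT = 100      # bound on concurrent in-flight work
PER_HOP_TIMEOUT = 0.25   # strict timeout on this network hop, in seconds


async def call_downstream(payload):
    # Stand-in for a real network call.
    await asyncio.sleep(0.01)
    return {"ok": True, "echo": payload}


async def handle(sem, payload):
    async with sem:  # beyond the bound, requests wait here instead of piling up downstream
        try:
            return await asyncio.wait_for(call_downstream(payload), PER_HOP_TIMEOUT)
        except asyncio.TimeoutError:
            return {"ok": False, "error": "upstream_timeout"}  # degrade, don't hang


async def main():
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(handle(sem, {"id": i}) for i in range(200)))


results = asyncio.run(main())
print(sum(r["ok"] for r in results))
```

The semaphore is the governor: when a dependency slows down, excess requests queue at a known point with a known bound, instead of exhausting threads and connections.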
If you do write operations, build idempotency into the contract. That single design decision prevents entire classes of catastrophic retry amplification.
Also watch for hidden multipliers. Logging every request synchronously, performing per-request schema validation with heavy reflection, or making three downstream calls when one would do are all common self-inflicted bottlenecks.
Scale the data layer by reducing database work
Most APIs do not collapse in the stateless tier. They collapse when the database becomes the shared choke point.
Patterns that scale in practice:
Aggressive caching with explicit TTLs.
Cache at the edge where possible, then in memory, then fall back to the database. Decide what can be slightly stale and for how long.
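The "slightly stale, for how long" decision shows up directly in code as a per-entry TTL. A tiny in-memory sketch (timestamps are injected so expiry is explicit; a real deployment would layer this behind an edge cache and in front of the database):

```python
class TTLCache:
    """Tiny in-memory cache with explicit per-entry TTLs."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # miss or stale: caller falls back to the database
        return entry[0]

    def set(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)


cache = TTLCache()
cache.set("user:42", {"name": "Ada"}, ttl=5, now=0)
print(cache.get("user:42", now=3))  # fresh hit
print(cache.get("user:42", now=6))  # expired: None, go to the database
```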
Partitioning with a plan for hot keys.
If one user or object can become extremely popular, you need mitigation. That might mean key salting, request coalescing, or isolating hot tenants.
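Request coalescing (sometimes called single-flight) is the cheapest of those mitigations: concurrent requests for the same hot key share one backend fetch. A sketch with asyncio (error propagation to waiters is omitted for brevity; production code must also fail the shared future on fetch errors):

```python
import asyncio

_inflight = {}  # key -> Future; concurrent callers for a hot key share one fetch


async def coalesced_get(key, fetch, calls):
    if key in _inflight:
        return await _inflight[key]        # piggyback on the in-flight fetch
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        calls.append(key)                  # count real backend fetches for the demo
        value = await fetch(key)
        fut.set_result(value)
        return value
    finally:
        del _inflight[key]


async def main():
    calls = []

    async def slow_fetch(key):
        await asyncio.sleep(0.01)          # stand-in for a database read
        return f"value-for-{key}"

    results = await asyncio.gather(
        *(coalesced_get("hot", slow_fetch, calls) for _ in range(50))
    )
    return results, calls


results, calls = asyncio.run(main())
print(len(results), len(calls))  # 50 requests served by 1 backend call
```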
Asynchronous side effects.
Turn “write and perform five additional operations” into “write, enqueue, acknowledge.” Process the rest outside the critical path.
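The shape of that change is small but structural: the handler writes, enqueues, and returns. A sketch with an in-process queue standing in for a durable one (Kafka, SQS, or similar; the order API here is illustrative):

```python
from collections import deque

queue = deque()  # stand-in for a durable queue in production
orders = {}


def create_order(order_id, item):
    orders[order_id] = {"item": item, "status": "accepted"}  # 1. write
    queue.append(("send_email", order_id))                   # 2. enqueue side effects
    queue.append(("update_search_index", order_id))
    return {"order_id": order_id, "status": "accepted"}      # 3. acknowledge fast


def drain_worker():
    # Runs outside the request path, at its own pace, with its own retries.
    processed = []
    while queue:
        processed.append(queue.popleft())
    return processed


ack = create_order("o1", "widget")
done = drain_worker()
print(ack["status"], len(done))
```

The request path now does one write and two appends; the five additional operations happen on the worker's schedule, not the caller's.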
Dedicated read models for read-heavy endpoints.
Search, feeds, and analytics endpoints often need denormalized stores. Serving them from normalized OLTP schemas is a recipe for constant query tuning.
The meta principle is simple: treat database IOPS like a scarce resource. Because at this scale, it is.
Operate as if reliability is a feature
Once you cross into tens of thousands of requests per second, scaling becomes an operational discipline.
Autoscaling must be tuned to avoid flapping. If you scale down too aggressively, you will oscillate between shortage and surplus. Stabilization windows and conservative scale-down policies are there for a reason.
Circuit breakers and outlier detection protect you from sick downstream instances. If one dependency starts timing out, you must shed traffic to it quickly before it drags down healthy peers.
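A basic consecutive-failure breaker illustrates the mechanism. This sketch uses illustrative thresholds and injected timestamps; real implementations (Envoy, resilience4j, and similar) add rolling windows and outlier ejection on top of the same idea:

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            return True   # half-open: let a probe request test the dependency
        return False      # shed traffic instead of queueing on a sick dependency

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now


cb = CircuitBreaker(threshold=3, cooldown=30)
for t in range(3):
    cb.record(success=False, now=t)
print(cb.allow(now=5), cb.allow(now=40))  # open at t=5, half-open at t=40
```

The point of the open state is speed: a rejected call returns in microseconds, while a timed-out call holds a connection for your full timeout budget.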
If you only track one dashboard, make it this set:
- p50, p95, p99 latency per endpoint
- Error rates split by 4xx and 5xx
- Saturation metrics such as thread pools and queue depth
- Dependency latency and cache hit rate
- Retry rate, which is your early warning for storms
Most large outages start with a small latency regression that triggers retries. Retries increase load. Load increases latency. Without controls, the system can DDoS itself.
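Breaking that loop is why retries need caps and jitter. A sketch of capped exponential backoff with full jitter, for idempotent operations only (the sleep is injected as a no-op so the demo runs instantly; real code sleeps for the computed delay):

```python
import random


def call_with_retries(op, max_attempts=4, base=0.1, cap=5.0,
                      rng=random.Random(0), sleep=lambda s: None):
    """Retry an idempotent operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # hard cap: never retry forever
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)].
            # Randomness desynchronizes clients so they don't retry in lockstep.
            sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))


attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(call_with_retries(flaky), attempts["n"])  # succeeds on the third attempt
```

The cap bounds the worst-case load multiplier: with `max_attempts=4`, a total outage costs at most 4x traffic, not an unbounded storm.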
FAQ
Do you need microservices to reach millions of requests per minute?
No. A well designed modular monolith can scale extremely far. Microservices help with team autonomy but add network hops and operational complexity.
REST or gRPC at this scale?
Either can work. If you control clients and need efficient binary serialization and multiplexing, gRPC is attractive. If you need broad interoperability and simpler debugging, REST remains practical.
What is the most common scaling failure mode?
Retry amplification combined with missing timeouts.
How do you know you are ready?
Load test the entire system with realistic traffic shapes. Inject failures into dependencies. If you have never tested a partial database slowdown or cache outage, you do not truly know how the system behaves.
Honest Takeaway
Building APIs that handle millions of requests per minute is not about chasing a magic framework. It is about discipline.
You add governors everywhere: rate limits, bounded concurrency, timeouts, idempotency, backpressure, and graceful degradation. You move expensive work out of the request path. You treat databases as scarce resources. You observe everything.
If you do this well, your API does not look impressive under load. It looks boring.
And at 3 a.m., boring is exactly what you want.
Kirstie is a technology news reporter at DevX. She reports on emerging technologies and startups waiting to skyrocket.