
How to Scale WebSocket Connections in Production


Your first WebSocket feature usually ships as a small miracle. One server, a handful of clients, and suddenly your product feels alive. Then production traffic shows up with its own hobbies: mobile networks that nap mid-handshake, load balancers that quietly reap “idle” TCP sessions, and one customer who opens your dashboard on eight monitors because “it looks cool.”

Scaling WebSockets in production is not “scale your HTTP API, but keep the socket open.” It is a different operational animal. WebSockets are long-lived, stateful connections that pin memory, file descriptors, and per-connection bookkeeping to your fleet. When you scale the number of users, you scale the number of concurrent open connections, not just requests per second. That difference changes everything: load balancing strategies, autoscaling signals, message fan-out architecture, and even how you do deploys.

The goal is simple to say and annoyingly hard to do: keep connections stable, route messages to the right client fast, and survive failures without a reconnect storm taking your system down.

What practitioners keep repeating once you ask about real scale

After reading through engineering blogs and postmortems from teams running large real-time systems, a few themes keep repeating, even when their stacks differ.

Sameera Thangudu, Senior Software Engineer at Slack, has described how Slack built dedicated gateway servers that terminate WebSocket connections and hold user state. They rely heavily on consistent hashing to route traffic predictably across internal services. The subtext is clear: connection ownership and routing are architectural decisions, not incidental details.

Austin Whyte, Software Engineer on Discord’s Realtime Infrastructure team, has explained how long-lived gateway connections make traditional autoscaling awkward. When users stay connected for hours, scaling down is not trivial. Discord focused on squeezing bandwidth and memory efficiency to reduce instability at peak load. Efficiency work became availability work.

Jo Stichbury, CTO at Ably, has consistently emphasized that WebSockets themselves are not the scaling problem. The real challenge is lifecycle management, fallback handling, load balancing behavior, and operational resilience at scale.

Put together, these perspectives point to the same reality: scaling WebSockets is less about adding servers and more about designing predictable connection ownership plus a reliable message distribution backbone.


The constraints that cap you first

Before architecture diagrams, understand what usually breaks first.

1) Idle timeouts you did not know you had.
Most cloud load balancers have default idle timeouts. If your app stays quiet for a minute and you do not send heartbeats, connections get dropped. In practice, teams learn this only after seeing periodic, mysterious disconnects in production.

2) Per-connection resource burn.
Every connection costs you at least:

  • A file descriptor
  • Kernel socket buffers
  • TLS session state, if you terminate TLS in-process
  • Application-level connection state, such as auth and subscriptions

Here is a simple capacity estimate you can adapt:

  • Assume your application holds 35 KB per connection.
  • Assume kernel and TLS overhead add roughly 25 KB.
  • The total is about 60 KB per connection.

If you target 150,000 concurrent connections per node:

150,000 × 60 KB = 9,000,000 KB
9,000,000 KB ÷ 1,024 ≈ 8,789 MB

That is roughly 8.6 GB of RAM just to keep sockets alive, before real business logic runs.

When you run that math honestly, it forces clarity. Either reduce per-connection state, or shard aggressively, or both.
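That estimate is easy to script so you can rerun it with your own measurements. In this sketch the 35 KB and 25 KB figures are just the worked example's assumptions, not universal constants; substitute numbers you have actually measured under load.

```python
def memory_for_connections(connections: int,
                           app_state_kb: float = 35.0,
                           kernel_tls_kb: float = 25.0) -> float:
    """Estimate the RAM (in GB) needed just to keep sockets alive.

    The 35 KB / 25 KB defaults come from the worked example above;
    measure your own runtime and substitute real per-connection costs.
    """
    per_connection_kb = app_state_kb + kernel_tls_kb   # ~60 KB in the example
    total_kb = connections * per_connection_kb         # 150,000 x 60 KB
    return total_kb / (1024 * 1024)                    # KB -> GB

# The example from the text: 150,000 connections at ~60 KB each
print(f"{memory_for_connections(150_000):.1f} GB")     # roughly 8.6 GB
```

Running the same function with your measured per-connection cost tells you immediately whether a target node size is realistic.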

3) Upgrade handling and proxy limits.
WebSockets depend on HTTP Upgrade semantics. Not every proxy handles that path identically. If you use Envoy, NGINX, or a managed ingress controller, you must explicitly configure support for upgrades and long-lived connections. Defaults are usually tuned for short-lived HTTP requests, not hours-long sessions.

Pick an architecture that matches your message patterns

At scale, two patterns show up repeatedly.

A. Dedicated connection gateways plus a message backbone.
A fleet of gateway servers terminates WebSockets, tracks subscriptions, and pushes messages to clients. Other services publish events into a broker or pub/sub layer. Gateways subscribe and fan out to relevant connections.

This is the most common pattern in large systems. It decouples message production from connection management and gives you clean scaling boundaries.

B. Sticky sessions with per-node local state.
Session affinity ensures a user reconnects to the same node. This simplifies early development because the connection state lives locally. The downside appears during failures and rebalancing. Reconnect storms can concentrate load, and scaling down is harder because connections are pinned.

If you want a safe long-term direction, build toward dedicated gateways and shared messaging infrastructure, even if you begin with stickiness.


How to scale it in practice

Step 1: Make connection ownership explicit

Decide which tier owns WebSocket connections and what state lives there.

Keep per-connection state minimal and reconstructable. Treat subscription state as data that can be rehydrated after reconnect. If you use consistent hashing, you get predictable routing and reduced churn when nodes scale up or down.

The key mindset shift: a WebSocket connection is not “just a socket.” It is a stateful contract that must survive partial failures.
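As a sketch of the consistent-hashing idea mentioned above, here is a minimal hash ring in Python. The `HashRing` class and the gateway names are hypothetical; production systems add replication, health checks, and far more virtual nodes, but the routing property is the same: a given key maps to the same node until the ring itself changes.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: maps a user or connection ID to a
    gateway node so routing stays predictable when nodes join or leave.

    Each node is placed at many points ("virtual nodes") to smooth out
    the key distribution across the ring."""

    def __init__(self, nodes, replicas: int = 100):
        self._replicas = replicas
        self._ring = []                       # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        for i in range(self._replicas):
            self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def route(self, key: str) -> str:
        """Return the node that owns this key (first point clockwise)."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["gw-1", "gw-2", "gw-3"])
owner = ring.route("user-42")   # the same key always routes the same way
```

The payoff is reduced churn: removing one node only reassigns the keys that node owned, while every other connection keeps its existing owner.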

Step 2: Put a real fan-out backbone behind your gateways

Once you have more than one gateway, broadcasting means coordinating across instances.

Most teams choose between lightweight pub/sub systems and durable event logs. The right answer depends on your guarantees.

  • Redis Pub/Sub: simple and fast, but offers no durable replay. Typical use: ephemeral signals.
  • NATS: very low latency, but requires an app-level replay strategy. Typical use: high fan-out events.
  • Kafka: durable and replayable, at the cost of operational overhead. Typical use: audit trails and rebuildable state.

For chat or collaboration systems, you often combine approaches: a fast real-time path plus a durable store of record. The WebSocket delivers immediacy, not durability.
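To make the gateway-plus-backbone shape concrete, here is an in-process Python sketch. The `Broker` class is a stand-in for whatever pub/sub layer you choose (Redis, NATS, Kafka), and the names are hypothetical; the point is the wiring: services publish to the broker, each gateway subscribes once per topic and fans out to its local connections.

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """In-process stand-in for the pub/sub backbone (Redis, NATS, ...)."""
    def __init__(self):
        self._subs = defaultdict(list)        # topic -> list of handlers

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        for handler in self._subs[topic]:
            handler(message)

class Gateway:
    """Terminates WebSockets and fans broker messages out to local clients."""
    def __init__(self, broker: Broker):
        self._broker = broker
        self._clients = defaultdict(set)      # topic -> local client senders

    def connect(self, topic: str, send: Callable[[str], None]) -> None:
        if topic not in self._clients:
            # First local subscriber: start listening on the backbone.
            self._broker.subscribe(topic, lambda m, t=topic: self._fan_out(t, m))
        self._clients[topic].add(send)

    def _fan_out(self, topic: str, message: str) -> None:
        for send in self._clients[topic]:
            send(message)

# Two gateways share one backbone; a publish reaches clients on both.
broker = Broker()
gw_a, gw_b = Gateway(broker), Gateway(broker)
inbox_a, inbox_b = [], []
gw_a.connect("room:1", inbox_a.append)
gw_b.connect("room:1", inbox_b.append)
broker.publish("room:1", "hello")
```

Note that producers never talk to gateways directly, which is exactly the decoupling that gives you clean scaling boundaries.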

Step 3: Treat heartbeats and reconnects as product features

Assume something in the network path will close idle connections.

Design for:

  • Server pings every 20 to 30 seconds
  • Client pong responses
  • Exponential backoff with jitter on reconnect
  • Hard caps on per-connection outbound queue size

That last item is critical. A slow client should not consume unbounded memory on your node. Backpressure must be deliberate.
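A hard-capped outbound queue can be sketched in a few lines. The `BoundedOutbox` class and its drop-oldest policy are illustrative assumptions; a real gateway might instead close the slow connection or send a resync marker, but the invariant is the same: queue depth never exceeds the cap.

```python
from collections import deque

class BoundedOutbox:
    """Per-connection outbound queue with a hard depth cap.

    When a slow client falls behind, we refuse to buffer unboundedly:
    this sketch sheds the oldest message and counts the drop, so memory
    per connection stays bounded no matter how slow the reader is."""

    def __init__(self, max_depth: int = 1000):
        self._queue = deque()
        self._max_depth = max_depth
        self.dropped = 0

    def push(self, message: str) -> None:
        if len(self._queue) >= self._max_depth:
            self._queue.popleft()      # shed oldest: deliberate backpressure
            self.dropped += 1
        self._queue.append(message)

    def drain(self, n: int) -> list:
        """Take up to n messages for the socket writer."""
        return [self._queue.popleft()
                for _ in range(min(n, len(self._queue)))]

outbox = BoundedOutbox(max_depth=3)
for i in range(5):
    outbox.push(f"msg-{i}")
# The queue now holds msg-2..msg-4; the two oldest were shed.
```

Exposing `dropped` as a metric also gives you an early-warning signal for clients that cannot keep up.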

Reconnect storms deserve special planning. When a region blips and 200,000 clients reconnect at once, your authentication and subscription systems feel it first. Rate-limit reconnect attempts, and cache auth tokens when possible.
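The backoff-with-jitter advice fits in one function. This is the "full jitter" variant, where the delay is drawn uniformly from zero up to an exponentially growing ceiling; the function name and parameters are illustrative.

```python
import random

def reconnect_delay(attempt: int,
                    base: float = 1.0,
                    cap: float = 60.0,
                    rng=random.random) -> float:
    """Full-jitter exponential backoff for client reconnects.

    The delay is uniform in [0, min(cap, base * 2**attempt)], so clients
    that all disconnected at the same moment spread their reconnects out
    instead of thundering back in lockstep. `rng` is injectable for tests."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling

# attempt 0: up to 1 s, attempt 3: up to 8 s, attempt 10: capped at 60 s
delays = [reconnect_delay(a) for a in range(6)]
```

Pair this client-side spreading with server-side rate limits on reconnect attempts and the storm after a regional blip becomes a manageable ramp.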

Step 4: Scale your ingress like it matters, because it does

Most production disconnects trace back to misconfigured ingress layers.

Tune:

  • Idle timeouts
  • Maximum concurrent connections
  • Buffer sizes
  • Upgrade support

If you are in Kubernetes, treat ingress configuration as part of your application code. Version it, test it, and load test it. Defaults are rarely production-ready for high-concurrency WebSockets.

Step 5: Operate it like a stateful system

CPU is rarely the primary bottleneck. Concurrency and memory are.


Metrics that matter more than request rate:

  • Concurrent connections per node
  • Reconnect rate per minute
  • p99 publish-to-delivery latency
  • Outbound queue depth per connection
  • Event loop lag or scheduler pressure

Long-lived connections complicate autoscaling. Scaling down requires draining connections gracefully. Scaling up may require rebalancing without mass disconnects. Plan deployment strategies that rotate nodes slowly and intentionally.
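Rotating nodes slowly can be as simple as closing connections in paced batches. Everything here (`drain_gateway`, `close_fn`, the batch sizes) is a hypothetical sketch rather than a specific framework API, but it captures the idea: spread the reconnect load over time instead of dumping it on the fleet at once.

```python
import time

def drain_gateway(connections, close_fn, batch_size=500, pause_s=1.0,
                  sleep=time.sleep):
    """Gracefully drain a node before scale-down or a deploy.

    Closes connections in small batches with a pause between batches, so
    clients reconnect (and re-authenticate) against the rest of the fleet
    gradually. In a real gateway, `close_fn` would send a close frame with
    a 'reconnect elsewhere' code."""
    remaining = list(connections)
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for conn in batch:
            close_fn(conn)
        if remaining:
            sleep(pause_s)   # spread reconnect load over time
```

Combined with stopping new accepts first (removing the node from the load balancer), this lets you scale down without a self-inflicted reconnect storm.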

FAQ

Do I need sticky sessions for WebSockets?

No. They simplify early development but complicate rebalancing and failure recovery. Many mature systems move toward stateless gateways with shared state to allow any node to accept reconnects.

How many connections can one server handle?

It depends on memory per connection, runtime efficiency, kernel limits, and message throughput. Always calculate memory headroom first, then load test with realistic message rates, not just idle sockets.

Why do connections die randomly in production?

Most often because of idle timeouts in load balancers or proxies. Heartbeats and correctly tuned ingress settings usually resolve this.

Should I build this or use a managed real-time provider?

If real-time infrastructure is not your core differentiation, managed services can reduce operational burden. If you need deep control over guarantees, latency, or data residency, building in-house may be justified, but it comes with ongoing operational costs.

Honest Takeaway

Scaling WebSockets in production forces you to think in concurrent connections, not requests per second. Once you adopt that mindset, the priorities sharpen: minimize per-connection state, build a reliable message backbone, implement disciplined heartbeats, and treat ingress configuration as critical infrastructure.

The easy path is sticky sessions and a single-region cluster. The resilient path is explicit connection ownership, stateless gateways, and a messaging layer you trust under peak load.

If you do the memory math early and design for reconnect storms before they happen, you avoid the midnight page that every real-time team eventually earns.

steve_gickling

A seasoned technology executive with a proven record of developing and executing innovative strategies to scale high-growth SaaS platforms and enterprise solutions. As a hands-on CTO and systems architect, he combines technical excellence with visionary leadership to drive organizational success.
