You notice it first in the graphs. CPU spikes that look like a heart monitor. Latency creeping up just enough to make your SRE instincts twitch. A few minutes later, your on-call phone lights up. Traffic surged again, and this time the deployment did not keep up.
Horizontal Pod Autoscaling exists to make that moment boring.
At a plain-language level, horizontal pod autoscaling (HPA) is Kubernetes automatically changing the number of running pods for a workload based on real usage. Instead of guessing how many replicas you need at deploy time, you give Kubernetes a target like “keep CPU around 60 percent,” and it continuously adjusts replica count to hit that goal.
This is not magic, and it is not predictive AI. It is a control loop that measures metrics, compares them to a target, and scales pods up or down. When it works well, you stop thinking about scale during normal traffic swings. When it is misconfigured, you get flapping, cold starts, or wasted compute.
Let’s walk through how HPA actually works under the hood, what it does well, where it falls short, and how experienced teams use it in production.
What Horizontal Pod Autoscaling really does (and what it does not)
In Kubernetes, HPA is implemented by the Kubernetes Horizontal Pod Autoscaler controller. Its only job is to adjust the replicas field of a scalable resource like a Deployment, StatefulSet, or ReplicaSet.
It does not:
- Add nodes to your cluster.
- Resize pods vertically.
- Anticipate traffic spikes on its own.
It does:
- Observe metrics on existing pods.
- Calculate a desired replica count.
- Update the workload spec to match that number.
If you also want nodes to scale, that is a separate controller called Cluster Autoscaler. If you want pod CPU or memory limits to change, that is the Vertical Pod Autoscaler. HPA sits squarely in the middle of that stack.
The control loop, step by step
HPA runs as a reconciliation loop, similar to most Kubernetes controllers. Roughly every 15 seconds by default, it goes through the following process.
First, it pulls metrics. Most commonly, these come from the Kubernetes Metrics API, which is usually backed by Metrics Server. For more advanced use cases, it can pull from custom or external metrics pipelines.
Second, it calculates the average value across all eligible pods. For CPU, that is typically the current CPU usage divided by the requested CPU. For memory, it is usually raw usage, though it can also be expressed as a percentage of the request. For custom metrics, it depends on how you define them.
Third, it compares the observed value to your target. If your target CPU utilization is 60 percent and the observed average is 90 percent, Kubernetes computes a scale factor.
Fourth, it updates the replica count using this formula:
desiredReplicas = ceil( currentReplicas × ( currentMetric / targetMetric ) )

The ceiling matters: Kubernetes always rounds up, so even a slight overshoot of the target adds a whole pod.
Finally, it applies bounds. Minimum and maximum replicas are enforced, and scaling behavior policies may further limit how fast changes happen.
That is it. No machine learning. No smoothing unless you configure it. Just math and guardrails.
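In manifest form, that loop is configured with little more than a target and bounds. A minimal sketch of an autoscaling/v2 HPA, assuming a hypothetical Deployment named web:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical Deployment name
  minReplicas: 2           # lower bound the loop will never go below
  maxReplicas: 10          # upper bound the loop will never exceed
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # "keep average CPU around 60 percent"
```

Everything the control loop does happens between minReplicas and maxReplicas; the bounds are the guardrails mentioned above.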
Metrics, the real heart of HPA
HPA is only as good as the metrics you feed it. In practice, there are three classes of metrics teams use.
Resource metrics are the default. CPU utilization is by far the most common. It works well for CPU-bound services with stable request patterns. Memory-based autoscaling is trickier and often dangerous because memory does not drop quickly under load.
Custom metrics come from inside your application. Think requests per second, queue depth, or active connections. These are usually exposed via Prometheus and consumed by HPA through an adapter. This is where autoscaling starts to feel intentional rather than generic.
External metrics come from outside the cluster. Examples include cloud queue length, Kafka lag, or managed load balancer metrics. These are powerful for event-driven systems but add operational complexity.
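As a sketch of what a custom metric looks like in the spec, here is a Pods-type entry assuming a Prometheus adapter exposes a hypothetical http_requests_per_second metric (the metric name and target value are illustrative):

```yaml
  # fragment of an HPA spec: replaces the Resource metric shown earlier
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric served by an adapter
      target:
        type: AverageValue
        averageValue: "100"              # scale to keep roughly 100 req/s per pod
```

The same shape with type External and an external block is how cloud queue length or Kafka lag would be wired in.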
A recurring lesson from real systems is that CPU is easy but often wrong. Teams that invest in application-level metrics tend to get smoother scaling with fewer surprises.
What experienced practitioners say about HPA
In conversations with platform engineers and SREs, a few themes come up again and again.
Tim Hockin, Kubernetes co-founder (Google), has emphasized in talks and interviews that autoscaling is fundamentally a feedback system. If your signal lags reality or your workload has slow startup times, no autoscaler can save you. You must design services that scale quickly and predictably.
Kelsey Hightower, Staff Developer Advocate (Google), has repeatedly pointed out that HPA exposes poor resource requests. If your CPU requests are wildly off, HPA calculations become meaningless. Autoscaling forces you to confront capacity planning instead of hiding it.
Liz Rice, Chief Open Source Officer (Isovalent), has noted that many teams treat HPA as fire-and-forget. In practice, it needs observability. You should graph desired replicas versus actual replicas and understand why they diverge.
Taken together, the message is consistent. HPA is reliable, but only when paired with good metrics, sane requests, and fast-starting pods.
A concrete example with numbers
Imagine a Deployment running 4 pods. Each pod requests 500 millicores of CPU. You configure HPA with a target average CPU utilization of 60 percent.
At some point, metrics show:
- Average CPU usage per pod is 450 millicores.
- That is 90 percent of the requested CPU.
HPA computes:
desiredReplicas = 4 × (90 / 60) = 6
Kubernetes updates the Deployment to 6 replicas. New pods spin up. If those pods start quickly and handle traffic, average CPU drops back toward the target.
If the startup takes 90 seconds and traffic keeps rising, HPA may scale again before the system stabilizes. This is where scale policies and cooldown windows matter.
How to set up HPA without hurting yourself
Here is a high-level, production-oriented approach that works for most teams.
Step 1: Fix resource requests first
Before enabling autoscaling, ensure CPU and memory requests reflect reality. HPA uses requests as its baseline. Bad requests equal bad scaling.
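Because HPA computes CPU utilization against the request, the container spec is the baseline. A sketch of explicit requests (the container name, image, and values are illustrative):

```yaml
    containers:
    - name: web                  # hypothetical container
      image: example/web:1.0     # placeholder image
      resources:
        requests:
          cpu: 500m              # HPA computes utilization against this value
          memory: 256Mi
        limits:
          memory: 512Mi
```

If the 500m request is far from what the service actually uses, the utilization percentage HPA sees is fiction, and so is the scaling.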
Step 2: Start with CPU, but validate
CPU-based HPA is the simplest place to begin. Watch how it behaves during real traffic. Validate that scaling events correlate with load, not noise.
Step 3: Add scale behavior controls
Use behavior.scaleUp and behavior.scaleDown to prevent flapping. Limit how many pods can be added or removed per minute. This single change often stabilizes systems dramatically.
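A minimal sketch of those behavior controls, with illustrative values:

```yaml
  behavior:
    scaleUp:
      policies:
      - type: Pods
        value: 4                        # add at most 4 pods...
        periodSeconds: 60               # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of low usage before shrinking
      policies:
      - type: Percent
        value: 50                       # remove at most half the pods per period
        periodSeconds: 60
```

The scale-down stabilization window is usually the single most effective anti-flapping knob, because it makes shrinking deliberately slower than growing.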
Step 4: Graduate to custom metrics
Once you understand your traffic patterns, scale on something closer to user demand, like requests per second or queue depth. This aligns scaling with business reality.
Step 5: Observe, then tune
Graph desired replicas, actual replicas, and metrics together. Autoscaling is not a one-time config. It is an operational surface.
Common failure modes to watch for
HPA failures are usually silent until they are expensive.
One common issue is scale lag. Traffic spikes faster than pods can start, leading to errors before scaling catches up. Pre-warming or predictive scaling can help, but often the fix is simply faster startup times.
Another is metric noise. Spiky metrics cause oscillation. Smoothing at the metrics layer or adding scale-down delays usually fixes this.
The most dangerous failure is memory-based scaling without limits. Memory rarely drops under pressure, so scale-down never happens. Clusters bloat quietly until someone notices the bill.
FAQs teams actually ask
Does HPA work with StatefulSets?
Yes. HPA can scale StatefulSets, but only when replicas are interchangeable. Stateful workloads with heavy per-pod state often behave poorly when scaled horizontally.
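The only change from the Deployment case is the scaleTargetRef; a sketch assuming a hypothetical StatefulSet named workers:

```yaml
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: workers   # hypothetical StatefulSet with interchangeable replicas
```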
How fast does HPA react?
By default, it evaluates every 15 seconds. Real-world reaction time also depends on metric freshness and pod startup latency.
Should every service use HPA?
No. Batch jobs, cron workloads, and latency-insensitive systems often do not benefit. HPA shines for request-driven services.
The honest takeaway
Horizontal Pod Autoscaling is one of Kubernetes’ most valuable primitives because it turns real usage into concrete capacity decisions. It is simple enough to understand and powerful enough to run large production systems.
But it is not set-and-forget infrastructure. HPA reflects the quality of your metrics, your resource modeling, and your application design. If those are sloppy, autoscaling will faithfully amplify the sloppiness.
When done well, HPA quietly absorbs traffic spikes and shrinks during lulls, and you stop thinking about replicas entirely. That boredom is the success case.
Kirstie is a technology news reporter at DevX. She reports on emerging technologies and startups waiting to skyrocket.