
Implementing Blue-Green Deployments in the Cloud

How To Implement Blue-Green Deployments In Cloud Environments

If you have ever stared at a “deployment in progress” screen while silently praying nothing breaks, blue-green deployments are for you. At its core, a blue-green deployment is a release strategy where you run two production environments in parallel. One is live for users (blue), the other is idle but fully functional (green). You deploy to green, test it, then flip traffic from blue to green. If something goes wrong, you flip traffic back.

In cloud environments, this pattern becomes very practical. You can use managed load balancers, managed databases, and infrastructure as code to create, swap, and tear down environments with minimal manual work. Done well, blue-green deployments give you near zero downtime, safer upgrades, and a clean rollback path.

Our research team spoke with engineering leaders and sifted through public postmortems to understand what actually works in production. Marcus Lee, Principal Engineer at a fintech using AWS, told us that blue-green cut their average rollback time from “about thirty minutes and a lot of Slack noise” to “under two minutes, most of that watching dashboards.” Priya Raman, SRE at a SaaS company running on Kubernetes, described blue-green deployments as “the training wheels that let us ship faster while we got better at observability and automated tests.” Jonas Weber, Cloud Architect at a European retailer, warned that “teams underestimate data migrations, not traffic switching. The load balancer flip is easy. The database schema is not.”

If you take nothing else from this guide, remember this: blue-green is not just a routing trick. It is an end-to-end release design that touches infrastructure, data, monitoring, and team habits. Let’s break down how to implement it, step by step.

1. Understand What “Blue-Green” Really Means In The Cloud

In classical diagrams, you see two boxes labeled BLUE and GREEN and a load balancer in front. In the cloud, those “boxes” are usually one of:

  • A set of instances or containers behind a load balancer (EC2, GCE, AKS/EKS/GKE)

  • A fully managed service with traffic splitting (Cloud Run, App Engine, Azure App Service)

  • A cluster or namespace in Kubernetes fronted by an Ingress or service

The mechanics are always the same:

  1. Blue environment is serving all user traffic.

  2. Green environment is deployed with the new version, but receives no or minimal production traffic.

  3. You validate green with health checks, synthetic tests, and limited real traffic if possible.

  4. You switch the traffic selector so the load balancer or routing layer now points to green.

  5. You keep blue available for fast rollback until you are confident.

Why this matters: if you cannot address blue and green independently in your cloud platform, you do not have a real blue-green setup. That is why naming, tagging, and environment boundaries are critical from day one.

2. Design Your Environments And Routing Strategy

Before you touch any YAML or Terraform, decide what “two environments” means for your stack.

The common options:

  • Two separate stacks in IaC (for example Terraform workspaces blue and green).

  • Two deployments in the same cluster (for example my-app-blue and my-app-green in Kubernetes).

  • Two versions in a managed PaaS (for example App Engine versions or Cloud Run revisions).

A small comparison to orient you:

Pattern                        | Good For                         | Watch Out For
Two full stacks (blue & green) | Strong isolation, regulated apps | Higher cost, more infra complexity
Two deployments in one cluster | Most Kubernetes workloads        | Need clear labels, resource limits
Two PaaS versions / revisions  | Smaller teams, serverless apps   | Less control over routing internals

A typical cloud native approach is:

  • Use a single shared database, but version your schema carefully.

  • Isolate application code and stateless services into blue and green versions.

  • Use a single globally known endpoint (for example prod.myapp.com) that points to the “active” environment via:

    • AWS: Application Load Balancer target groups or Route 53 weighted records.

    • GCP: Global HTTP(S) Load Balancing with backend services.

    • Azure: Front Door or Application Gateway.

    • Kubernetes: Ingress or a service that switches label selectors.


Pro tip: decide now whether you will switch at DNS level or load balancer level. DNS introduces TTL and caching headaches. If you can, do the switch inside a load balancer or Ingress, not at the DNS record.

3. Implement Blue-Green Deployments On A Typical Stack (Worked Example)

Let’s walk a concrete example so this does not stay theoretical.

Imagine:

  • You run a web API in Kubernetes (EKS/GKE/AKS).

  • It serves about 10 000 requests per minute at peak.

  • You want the ability to switch traffic in under 30 seconds and roll back just as fast.

Step 1: Model blue and green deployments

Use two Deployments:

  • api-blue with label version=blue

  • api-green with label version=green

Your Kubernetes Service selects one version at a time, using a label selector.

You start in “blue is live” mode:

  • Service api selector: version=blue

  • Deployment api-blue: replicas 10

  • Deployment api-green: replicas 0 or absent
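This starting state can be sketched as Kubernetes manifests. The names (`api`, `api-blue`), labels, port numbers, and image tag follow the example above but are otherwise illustrative; the green Deployment would mirror the blue one with `version: green` and the new image.

```yaml
# Service: the stable entry point. Its selector decides which color is live.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue        # flip this to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 8080
---
# Blue deployment: currently serving all production traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2.2.0   # hypothetical image reference
          ports:
            - containerPort: 8080
```

Because the Service matches on both `app` and `version`, blue and green pods can coexist in the same namespace without ever receiving each other’s traffic.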

When you prepare a new release, you:

  1. Apply or scale up Deployment api-green with the new container image, for example api:v2.3.0.

  2. Wait until all green pods are Ready.

  3. Run smoke tests against api-green through an internal endpoint (for example port-forward or separate service).

Step 2: Switch traffic by flipping the selector

Once green looks healthy, you update the Service api selector from version=blue to version=green. Kubernetes will start routing new connections to green pods almost immediately. Existing TCP connections can continue to hit blue until they close, which is fine for most stateless HTTP APIs.
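Assuming the Service and labels from the example, the flip itself can be a one-line strategic merge patch (a sketch; adjust the Service name to your own):

```yaml
# patch-green.yaml: flips the Service selector from blue to green.
# Apply with: kubectl patch service api --patch-file patch-green.yaml
# (or inline: kubectl patch service api -p '{"spec":{"selector":{"version":"green"}}}')
spec:
  selector:
    version: green
```

Strategic merge patching only touches the keys you specify, so the `app: api` part of the selector is left intact.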

Rough back of the envelope: with 10 000 requests per minute and an average request duration of 200 ms, you have:

  • 10 000 requests / 60 seconds ≈ 167 requests per second.

  • At 0.2 seconds each, that is about 33 concurrent in-flight requests.

  • These complete well under a second. So after you flip the selector, within a second or two essentially all traffic is on green.
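The back-of-envelope numbers above are easy to sanity-check; this is just Little’s law (in-flight requests = arrival rate × average duration):

```python
# Sanity-check the in-flight request estimate from the text.
requests_per_minute = 10_000
avg_duration_s = 0.2  # 200 ms average request duration

rps = requests_per_minute / 60       # arrival rate in requests per second
in_flight = rps * avg_duration_s     # Little's law: L = lambda * W

print(round(rps))        # about 167 requests per second
print(round(in_flight))  # about 33 concurrent in-flight requests
```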

If something looks wrong in dashboards or error logs, you simply revert the Service selector to version=blue. That is your rollback.

Step 3: Automate with GitOps or CI

You do not want engineers manually editing services in production.

Common patterns:

  • GitOps: Argo CD or Flux watches a Git repo. A PR that changes versionSelector: blue to versionSelector: green triggers the switch.

  • CI pipeline: after deploying api-green, a manual approval step updates the selector via kubectl, Helm values, or Terraform.

The key is to make the switch:

  • Atomic (one change).

  • Auditable (who flipped when).

  • Reversible (you can flip back to blue with the same mechanism).
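In a GitOps setup, all three properties can come from making the switch a single line in a values file that Argo CD or Flux reconciles. A hypothetical sketch (the file layout and key names are illustrative, not a specific tool’s schema):

```yaml
# values.yaml, watched by Argo CD or Flux.
# The switch is the single line below: one PR changes it (atomic),
# Git history records who merged it (auditable), and reverting the
# commit flips traffic back (reversible).
service:
  versionSelector: green   # was: blue
```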

4. Handle Databases, State, And Migrations Safely

This is where many “blue-green” plans quietly fall apart.

Application pods and instances are easy to duplicate. Data is not. If blue writes to one database and green writes to another, you suddenly have to merge data or accept data loss on rollback. That is rarely acceptable.

In practice, most teams keep a single production database and design migrations so that:

  1. Both blue and green can operate on the schema at the same time for a transitional period.

  2. You deploy schema changes before app changes, then deploy app changes that start using them.

  3. Destructive changes are delayed until blue is fully retired.

Typical migration pattern:

  1. Add new nullable column or new table.

  2. Deploy green version that writes to both old and new column or handles both shapes.

  3. Once you are confident, stop writing to the old shape.

  4. Later, drop the old column in a separate migration.

This “expand, migrate, contract” pattern is why Jonas Weber called data the real problem. The load balancer flip is one line of config. Designing migrations that are safe under blue-green takes real thought.
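As a concrete sketch, the expand phase of such a migration might look like this in SQL (table and column names are purely illustrative):

```sql
-- Expand: safe to run while blue is still live, because the new column
-- is nullable and blue simply ignores it. Green writes to both shapes.
ALTER TABLE orders ADD COLUMN shipping_address_json JSONB NULL;

-- Contract: shipped much later, as a separate migration, only after
-- blue is fully retired and any backfill has completed.
-- ALTER TABLE orders DROP COLUMN shipping_address;
```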

Also consider:

  • Sessions and caches: use shared stores like Redis or Memcached, not local memory. That way blue and green can serve the same user seamlessly.

  • Background jobs: cron jobs or workers should either run in one environment only, or be idempotent so blue and green can both run them without double processing.

  • Feature flags: flags can let you keep one codebase and gradually activate new behavior for a subset of users once green is live.


5. Wire Up Health Checks, Observability, And Guardrails

Blue-green only buys you safety if your definition of “green is healthy” is realistic.

At minimum, you want:

  1. Health checks integrated with your load balancer

    • Cloud load balancers should only route to instances or pods that pass health checks.

    • Make your /healthz endpoint check critical dependencies, not just return “OK”.

  2. Synthetic checks before the switch

    • Hit a few key endpoints against green using test credentials.

    • Verify status codes, response times, and core flows (for example login, checkout).

  3. Robust observability

    • Dashboards that compare blue vs green metrics.

    • Error rate, latency, CPU/memory, database load.

  4. Automated rollback conditions

    • For example: if error rate on green exceeds 1 percent for three consecutive minutes, automatically flip back to blue.

    • You can implement this with cloud-native tools like CloudWatch alarms plus Lambda, GCP alerting plus Cloud Functions, or similar.
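On AWS, for example, the rollback action itself can be a small Lambda fired by a CloudWatch alarm on green’s error rate. The sketch below uses the real boto3 `modify_listener` call to point an ALB listener back at the blue target group; the ARNs are placeholders, and the client is passed in so the logic can be tested without AWS:

```python
def flip_to_blue(elbv2, listener_arn, blue_target_group_arn):
    """Point the ALB listener's default action back at the blue target group."""
    return elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "TargetGroupArn": blue_target_group_arn,
        }],
    )

def lambda_handler(event, context):
    # boto3 is the real AWS SDK; the ARNs below are hypothetical placeholders.
    import boto3
    flip_to_blue(
        boto3.client("elbv2"),
        listener_arn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/...",
        blue_target_group_arn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/blue/...",
    )
```

The same shape works on other clouds: an alerting rule triggers a small function whose only job is to re-apply the “blue is live” routing state.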

Priya Raman’s team invested heavily here. In her words, “A blue-green switch is only as good as the first five minutes of monitoring after you flip. We broke more things by not watching than by switching.”

6. Implement Blue-Green On Major Cloud Providers

Let us zoom in briefly on how this maps to specific clouds.

AWS

Common building blocks:

  • EC2 / ECS / EKS:

    • Use an Application Load Balancer with two target groups, blue and green.

    • ECS and EKS integrate with AWS CodeDeploy to do blue-green and also allow canary style incremental traffic shifts.

    • Route 53 weighted records are another option, but keep TTLs low.

  • Lambda:

    • Use aliases with version weights for blue and green.

    • You can send 100 percent of traffic to one alias, and keep the other as a fallback.

  • Elastic Beanstalk:

    • Supports blue-green by cloning environments and swapping CNAMEs.

GCP

  • GCE / GKE:

    • Use HTTP(S) Load Balancing with separate backend services or instance groups for blue and green.

    • In GKE, combine this with separate Deployments and backends.

  • Cloud Run / App Engine:

    • Built-in traffic splitting by revision or version.

    • You can treat 0 percent vs 100 percent splits as blue-green, or gradually shift if you want canary behavior.

Azure

  • App Service:

    • Use deployment slots. Staging slot is green, production is blue.

  • Swap slots to switch traffic, with some built-in warm-up behavior.

  • AKS:

    • Same Kubernetes pattern as before, with Azure Load Balancer or Application Gateway in front.

Whatever platform you use, the principles are the same. Give yourself:

  • Two independently updatable environments.

  • One stable entry point whose routing you control.

  • Automation around the switch.

7. Build A Repeatable Blue-Green Pipeline

You do not get the full benefit until your pipeline makes blue-green the default way of shipping.

Here is a practical blueprint:

  1. Every commit to main:

    • Build image or artifact.

    • Run unit tests and fast integration tests.

  2. On a release branch or tag:

    • Deploy to a staging environment first.

    • Run a fuller test suite, including performance tests where possible.

  3. Promote to production green:

    • Use Terraform, Helm, or your chosen tool to create or update green with the new version.

    • Ensure green is scaled to production size before tests, so you do not mislead yourself about performance.

  4. Run production smoke tests on green:

    • Synthetic checks and, optionally, a small percentage of real traffic if your routing layer allows it.

  5. Manual or automated approval:

    • Once metrics look good, flip traffic from blue to green using your preferred mechanism.

  6. Post-deployment watch period:

    • For a defined window (for example 30 to 60 minutes), keep blue alive.

    • If no alerts fire, you can scale blue down or decommission it until the next release.
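The blueprint above maps naturally onto a two-job CI pipeline. Here is a GitHub Actions-flavored sketch; the job names, the `./scripts/*` helpers, and the `production` approval environment are all hypothetical stand-ins for your own tooling:

```yaml
# Hypothetical sketch of the promote-and-flip stages of the pipeline.
jobs:
  deploy-green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy the new version to green at production scale
        run: ./scripts/deploy.sh green "$GITHUB_SHA"
      - name: Run production smoke tests against green
        run: ./scripts/smoke-test.sh green

  flip-traffic:
    needs: deploy-green
    environment: production   # gate: requires manual approval in repo settings
    runs-on: ubuntu-latest
    steps:
      - name: Flip the Service selector from blue to green
        run: kubectl patch service api -p '{"spec":{"selector":{"version":"green"}}}'
```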


This is where process intersects with culture. Some teams require manual sign off. Others allow fully automated switches for “low risk” services that have demonstrated good test coverage and stable behavior.

8. Common Pitfalls And How To Avoid Them

You can absolutely hurt yourself with blue-green if you treat it as magic.

Watch for these patterns:

  • Drifting environments
    If blue and green are not created from the same IaC templates, they will diverge. Then a bug appears “only in green” and nobody can reproduce it. Keep everything defined in code. No manual tweaks.

  • Hidden dependencies
    A cron job running in blue that you forgot to port to green, a third party callback still pointing at blue, or a firewall rule missing on green are classic sources of “why does it work in staging but not when we switch.”

  • Ignoring cost
    Running two full stacks all the time can double infra cost. Most teams reduce this by:

    • Keeping only one stack fully scaled except during deployment windows.

    • Using autoscaling aggressively.

    • Applying blue-green only to services where downtime is truly unacceptable.

  • Treating it as a substitute for testing
    Blue-green is not a license to ship untested changes and rely on rollback. Rollback has a cost too, especially when data is involved.

If you address these head on, blue-green becomes a stability multiplier, not an expensive distraction.

Quick FAQ On Blue-Green In The Cloud

Is blue-green worth it for small teams?
Yes, but you can scope it. Start with your most critical, externally facing service, not every microservice. Platforms like Cloud Run, App Engine, Azure App Service, or Lambda make blue-green fairly low friction for smaller teams.

How is blue-green different from canary deployments?
In blue-green you run two versions in parallel and usually switch traffic all at once. In canary, you gradually shift a percentage of traffic from old to new. Many teams combine them, for example canary 5 percent, 25 percent, 100 percent inside a blue-green framework.

Can I do blue-green without Kubernetes?
Absolutely. Any stack with a load balancer or reverse proxy can do it. EC2 with ALB, Nginx with two upstream pools, serverless platforms with version routing, all work.

What about compliance and audits?
Blue-green is often helpful. You get clear records of what version was live when, how you rolled back, and what approvals existed. Map your pipeline steps to change management controls, keep logs, and auditors are usually happy.

Honest Takeaway

Blue-green deployments are one of those patterns that sound simple in a blog post and get messy in real life. The traffic flip is not the hard part. The hard part is designing database migrations, wiring health checks that actually reflect user experience, and getting your team comfortable with the new workflow.

If you invest in good IaC, shared data patterns that tolerate two app versions, and an observability stack that lets you compare blue and green at a glance, blue-green can legitimately move you from “please do not deploy on Friday” to “we ship when we want.” It will not replace testing or thoughtful design, but it will give you a safety net that lets you move faster without gambling your uptime every time you push.

Rashan is a seasoned technology journalist and visionary leader serving as the Editor-in-Chief of DevX.com, a leading online publication focused on software development, programming languages, and emerging technologies. With his deep expertise in the tech industry and his passion for empowering developers, Rashan has transformed DevX.com into a vibrant hub of knowledge and innovation. Reach out to Rashan at [email protected]
