
Schema Evolution in Large Engineering Teams


You probably have a scar story.

A downstream service crashes at 2 a.m. because a “harmless” field was renamed. A data warehouse job silently drops a column, and no one notices until the CFO’s dashboard looks off. Or worse, two teams deploy on different cadences and spend a week arguing about whose contract was “right.”

Schema evolution is the disciplined process of changing data structures over time without breaking the systems that depend on them. In small teams, it is a Slack message and a quick redeploy. In large engineering orgs, it is a coordination problem disguised as a JSON file.

If you treat schema changes as code changes, with ownership, review, testing, and observability, you can scale. If you treat them as incidental details, they will eventually take your platform down.

What Experienced Teams Are Actually Doing

We reviewed public guidance from teams that operate at scale, and the patterns are remarkably consistent.

Martin Kleppmann, author of Designing Data-Intensive Applications, has long emphasized compatibility rules in schema systems like Avro and Protobuf. His core point is simple: design your schemas so new versions can coexist with old ones, because distributed systems upgrade slowly and unevenly.

Gwen Shapira, data streaming expert at Confluent, frequently highlights that most “Kafka problems” are actually schema and contract problems. In her talks and writing, she stresses enforcing schema compatibility at the registry level instead of relying on developers to “remember the rules.”

Charity Majors, CTO at Honeycomb, often argues that observability is about understanding unknown unknowns. In the context of schema evolution, that translates into instrumenting producers and consumers so you can see when fields are missing, defaulted, or ignored, before customers feel it.

The synthesis is sobering. Mature teams do three things well:

  1. They enforce compatibility in tooling, not in tribal knowledge.
  2. They treat schemas as first-class versioned artifacts.
  3. They observe how schemas behave in production, not just in CI.

Let’s unpack how to do that.

Understand the Three Types of Compatibility

Before you pick tools, you need a shared mental model. Most schema systems define compatibility along three axes: backward (new readers can decode data written with the old schema), forward (old readers can decode data written with the new schema), and full (both at once).

If you use Avro with a schema registry, backward compatibility means a new schema can read data written with the previous schema. In Protobuf, adding a new field with a fresh tag number is generally safe because older readers simply ignore unknown fields.

What breaks things?

  • Removing a required field
  • Changing a field type incompatibly, like int to string
  • Reusing a field number in Protobuf

Here is a minimal example in Avro that preserves backward compatibility by adding a field with a default:

{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}

The default is not cosmetic. It is what allows old data to be read safely by new consumers.
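To make the role of the default concrete, here is a toy, stdlib-only sketch of the reader-side resolution an Avro library performs. It is deliberately simplified (real Avro resolution also handles type promotion and aliases); it shows only the default fallback that makes the added field backward compatible:

```python
# Toy version of Avro reader-schema resolution, showing why the
# default matters: a record written before "currency" existed can
# still be materialized by a consumer using the newer schema.
NEW_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

def resolve(record, reader_schema):
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]  # old data lacks the field
        else:
            # Without a default, old data is unreadable by the new schema.
            raise ValueError(f"no value and no default for {name!r}")
    return out

old_record = {"order_id": "o-1", "amount": 9.99}  # written pre-currency
print(resolve(old_record, NEW_SCHEMA)["currency"])
```

Drop the default from the currency field and the same old record raises instead of resolving, which is precisely the incompatibility a registry exists to catch before deployment.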

Your first job as a large team is to agree on which compatibility mode is non-negotiable. Many organizations default to backward compatibility for event streams and full compatibility for shared APIs.

Make Schemas First-Class, Not Side Effects

In struggling teams, schemas live inside service repos as incidental code artifacts. In healthy teams, schemas are versioned assets with ownership and review.

Here is what that looks like in practice.

1. Centralize with a Registry

Use a schema registry, not a shared wiki.

Common choices:

  • Confluent Schema Registry for Avro, Protobuf, JSON Schema
  • AWS Glue Schema Registry
  • Apicurio Registry

The registry enforces compatibility rules at publish time. If a producer tries to register an incompatible schema, it fails fast.

This moves schema validation from “hope and PR comments” to automated enforcement.
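The publish-time gate can be approximated in a few lines. This is a deliberately simplified sketch of the kind of check a registry or CI job runs; real implementations apply the full Avro resolution rules, including allowed type promotions:

```python
def backward_compatible(old_schema, new_schema):
    """Return a list of violations; an empty list means the change is safe.

    Simplified sketch: backward compatibility means new readers can decode
    old data, so new fields need defaults and types must not change.
    Real registries also permit promotions like int -> long.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    errors = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            # A field old data does not carry must have a default.
            if "default" not in field:
                errors.append(f"new field {name!r} has no default")
        elif field["type"] != old_fields[name]["type"]:
            errors.append(
                f"field {name!r} changed type "
                f"{old_fields[name]['type']} -> {field['type']}"
            )
    return errors

old = {"fields": [{"name": "order_id", "type": "string"}]}
bad = {"fields": [{"name": "order_id", "type": "int"},
                  {"name": "coupon", "type": "string"}]}
for violation in backward_compatible(old, bad):
    print(violation)
```

Wire a check like this into the publish path and an incompatible schema fails fast at registration time, long before any consumer deserializes a byte.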

2. Assign Explicit Ownership

Every schema should have:

  • A clearly defined owner team
  • A Slack channel or on-call alias
  • A documented lifecycle policy

If your org uses a platform model, the platform team can own the registry, but domain teams should own their domain schemas. Without ownership, evolution turns into negotiation by committee.

3. Version Intentionally

Avoid embedding version numbers in topic names or endpoints unless you are forced to. Instead:

  • Let the registry manage schema versions.
  • Keep topics stable when possible.
  • Use semantic versioning in public APIs where breaking changes are unavoidable.

When you must break compatibility, treat it like a migration, not a refactor. Create a new topic or endpoint, dual-write temporarily, and deprecate gradually.
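The dual-write phase above can be sketched as follows. The topic names, the publish helper, and the v1-to-v2 translation are all illustrative, not any particular client library's API:

```python
# Sketch of the dual-write phase of a breaking migration: the producer
# publishes to both the legacy and the new topic until every consumer
# has moved, then the legacy write is removed.
def to_v2(order_v1):
    # Hypothetical breaking change: v2 stores amounts in minor units
    # and makes currency mandatory.
    return {
        "order_id": order_v1["order_id"],
        "amount_minor": int(round(order_v1["amount"] * 100)),
        "currency": order_v1.get("currency", "USD"),
    }

def publish(topic, event, outbox):
    # Stand-in for a real producer client; collects events per topic.
    outbox.setdefault(topic, []).append(event)

def emit_order(order_v1, outbox):
    publish("orders", order_v1, outbox)            # legacy consumers
    publish("orders.v2", to_v2(order_v1), outbox)  # migrated consumers

outbox = {}
emit_order({"order_id": "o-1", "amount": 9.99}, outbox)
```

The translation function is the contract of the migration: it is written once, reviewed once, and keeps both formats consistent for as long as the dual-write window lasts.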


Design for Additive Change

Large systems evolve best when most changes are additive.

That means:

  • Add fields, do not remove them.
  • Deprecate fields before deletion.
  • Make new fields optional with sensible defaults.

In Protobuf, never reuse field numbers. Mark deprecated fields explicitly, and once a field is finally deleted, reserve its number so it can never be recycled:

message Order {
  string order_id = 1;
  double amount = 2;
  string currency = 3 [deprecated = true];

  // After currency is deleted, replace it with: reserved 3;
}

Deprecation signals intent. Deletion should come later, after telemetry confirms no active consumers rely on the field.

This is where observability becomes operationally critical.

Instrument Schema Usage in Production

You cannot manage what you cannot see.

At scale, consumers are not just services you remember. They are cron jobs, notebooks, third-party integrations, and forgotten experiments.

Here is how advanced teams reduce risk:

  • Log schema version used by each producer and consumer.
  • Emit metrics for deserialization errors and defaulted fields.
  • Track field-level usage in downstream pipelines, especially in analytics stacks.

If you use Kafka, you can tag messages with schema IDs and build dashboards that show which versions are active. If a deprecated field shows zero usage for 30 days, you have evidence for safe removal.
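The consumer-side counting can be sketched with a plain counter. In a real deployment this would be a metrics client emitting to your observability stack; the deprecated-field name and metric naming are illustrative:

```python
# Sketch: count reads of deprecated fields on the consumer side, so
# "zero usage for 30 days" is a queryable fact rather than a guess.
from collections import Counter

DEPRECATED_FIELDS = {"currency"}  # hypothetical deprecation in flight
field_reads = Counter()

def record_usage(event):
    # Only count a deprecated field when it actually carries data.
    for name, value in event.items():
        if name in DEPRECATED_FIELDS and value is not None:
            field_reads[f"deprecated_field_read.{name}"] += 1

for event in [{"order_id": "o-1", "currency": "USD"},
              {"order_id": "o-2", "currency": None}]:
    record_usage(event)

print(field_reads["deprecated_field_read.currency"])
```

Sustained zeros on that counter across the deprecation window are the evidence that makes removal a routine change instead of a gamble.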

This aligns with the observability mindset advocated by leaders like Charity Majors. You want early signals, not postmortems.

Build a Repeatable Migration Playbook

Eventually, you will need a breaking change. The difference between chaos and control is whether you have a playbook.

A pragmatic four-step approach works well in large orgs:

Step 1: Propose and Review the Contract Change

Write a short design doc that covers:

  • What is changing
  • Compatibility impact
  • Rollout plan
  • Rollback plan

Have consumer teams sign off if the change affects them. This avoids “surprise breakage” later.

Step 2: Introduce the New Schema in Parallel

If breaking, create a new versioned endpoint or topic. For events:

  • Dual-write old and new formats.
  • Allow consumers to migrate at their own pace.

For APIs:

  • Introduce /v2 while keeping /v1 stable.

Yes, this increases temporary complexity. It also protects uptime.

Step 3: Monitor and Enforce Deadlines

Set a deprecation window, for example, 60 or 90 days. Use metrics to track which consumers still depend on the old schema.

Escalate based on data, not guesswork.


Step 4: Remove the Old Path Cleanly

Once no active consumers remain, remove:

  • Old fields
  • Old topics or endpoints
  • Registry versions if appropriate

Document the change. Future engineers will thank you.

Align Incentives Across Teams

Most schema failures are social failures.

Producers want to move fast. Consumers want stability. Platform teams want governance. Without alignment, you get shadow topics and undocumented contracts.

A few structural decisions help:

  • Make compatibility checks mandatory in CI.
  • Hold schema reviews to the same rigor as API reviews.
  • Include schema breakage incidents in postmortems.

When a breaking schema change causes an outage, treat it like any other reliability incident. Analyze not just the code, but the process gap.

FAQ

Should we prefer Avro, Protobuf, or JSON Schema?

It depends on your ecosystem. Protobuf works well for strongly typed service-to-service communication. Avro is popular in streaming and analytics contexts. The key is not the format, but enforcing compatibility with a registry.

How do we handle database schema evolution?

Use migration tools like Flyway or Liquibase. Favor additive migrations, avoid destructive changes in the same release, and use feature flags to decouple deployment from activation.
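Decoupling deployment from activation can be sketched with a flag that gates use of a newly added column. The flag name, the environment-variable source, and the queries are illustrative, not a specific framework's API:

```python
# Sketch: a feature flag separates "the column exists" (migration has
# run) from "the code uses it" (flag flipped later, and reversible).
import os

def use_currency_column() -> bool:
    # Stand-in flag source; real systems read a flag service or config.
    return os.environ.get("ORDERS_USE_CURRENCY_COLUMN", "0") == "1"

def order_query() -> str:
    if use_currency_column():
        return "SELECT order_id, amount, currency FROM orders"
    return "SELECT order_id, amount FROM orders"

# Deploy with the flag off; activate only after the additive migration
# has added the column on every database the service talks to.
print(order_query())
```

Because activation is a flag flip rather than a deploy, rolling back the behavior does not require rolling back the schema.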

Is strict compatibility always worth the friction?

For internal experimental systems, maybe not. For shared event streams and public APIs, strict compatibility saves far more time than it costs.

Honest Takeaway

Schema evolution in large engineering teams is not a tooling problem. It is a coordination problem supported by tooling.

If you centralize schemas, enforce compatibility automatically, design for additive change, and observe real-world usage, you turn schema changes from production roulette into a managed process.

It takes discipline. It adds ceremony. But the alternative is waking up at 2 a.m. because someone renamed a field they thought no one was using.

At scale, contracts are infrastructure. Treat them that way.

