Schema Evolution Governance
The Breaking Change That Nobody Saw
A platform team adds a required region field to a UserUpdated Kafka event. They test their producer, everything serializes correctly, they deploy. Within 20 minutes, three downstream services start throwing deserialization errors. The analytics pipeline silently drops 40% of events: it uses lenient JSON parsing, and events written before the change lack the field, triggering null pointer exceptions deep in the transformation logic. The notification service crashes entirely. The fraud detection system keeps running but produces incorrect risk scores because it falls back to a default value for the missing field.
This is not a hypothetical. Variations of this story play out weekly at companies running event-driven architectures without schema governance. The root cause is never technical. It is organizational: no one enforced a review process for schema changes to shared contracts.
Protobuf vs Avro: An Honest Comparison
Protobuf dominates at Google, Square, and most gRPC-native shops. Field numbering makes evolution explicit: you never reuse a field number, and removed numbers are reserved so they cannot be reclaimed with a different meaning. Required fields were removed in proto3 precisely because they make evolution dangerous. Tooling is excellent: buf provides linting, breaking-change detection, and a schema registry in one CLI. Serialization and deserialization run roughly 2-5x faster than Avro.
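A minimal sketch of what explicit evolution looks like in practice (message and field names are illustrative, not from any real system):

```protobuf
syntax = "proto3";

package events.v1;

message UserUpdated {
  string user_id = 1;

  // Numbers 2 and 3 once belonged to fields that were deleted. Reserving
  // them (and their names) prevents anyone from reusing them with a
  // different meaning, which would silently corrupt old data.
  reserved 2, 3;
  reserved "email", "legacy_region";

  // New fields take fresh numbers. In proto3 every field is optional on
  // the wire, so old readers ignore this field and new readers see ""
  // when it is absent.
  string region = 4;
}
```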
Avro is the default in the Kafka ecosystem. Reader-writer schema resolution means a consumer can read data written with a different (compatible) schema version without any code changes. The schema travels with the data (or is looked up from a registry by schema ID). LinkedIn built Avro into their event infrastructure from the beginning, and it handles thousands of schema versions in production. The downside: Avro's dynamic typing means less compile-time safety, and tooling outside the JVM ecosystem is weaker.
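For contrast, a sketch of the same event as an Avro schema (record and namespace names are illustrative). Because region carries a default, a consumer compiled against this version can still read records written before the field existed; reader-writer resolution fills in "unknown":

```json
{
  "type": "record",
  "name": "UserUpdated",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "region", "type": "string", "default": "unknown"}
  ]
}
```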
If you are a Kafka-heavy shop with mostly JVM services, Avro with Confluent Schema Registry is the path of least resistance. If you are polyglot with gRPC or Connect, Protobuf with buf gives you stronger guarantees and better developer ergonomics.
Schema Registry and Compatibility Modes
Confluent Schema Registry (or its open-source alternatives like Karapace and Apicurio) enforces compatibility rules at write time. When a producer tries to register a new schema version, the registry checks it against previous versions.
Backward compatible: the new schema can read data written with the old schema. Safe to deploy consumers first. Allows deleting fields and adding fields that have defaults.
Forward compatible: the old schema can read data written with the new schema. Safe to deploy producers first. Allows adding fields and deleting fields that have defaults.
Full compatible: both directions. Deploy in any order. This is the strictest mode, and the one most teams should default to for shared topics.
Set the compatibility mode per subject (in practice, per topic), not globally. Internal service-to-service topics might tolerate BACKWARD. Topics consumed by external teams or data pipelines should enforce FULL.
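As a sketch, using Confluent Schema Registry's REST API (the host and subject names are placeholders; subjects here follow the default TopicNameStrategy of `<topic>-value`):

```bash
# Pin a shared topic's value subject to FULL compatibility.
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "FULL"}' \
  http://schema-registry:8081/config/user-updated-value

# Dry-run a candidate schema against the latest registered version.
# candidate.json wraps the escaped schema string: {"schema": "{...}"}
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data @candidate.json \
  http://schema-registry:8081/compatibility/subjects/user-updated-value/versions/latest
```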
Schema Governance That Actually Works
LinkedIn's schema review process requires any change to a shared event schema to go through a design review with representatives from consuming teams. Netflix takes a different approach: they auto-generate compatibility reports and block merges that violate the configured compatibility mode.
A practical governance setup for most organizations:
- CI compatibility checks. Every PR that modifies a .proto or .avsc file runs against the schema registry's compatibility endpoint. Failures block the merge. buf breaking does this for Protobuf; Confluent's Maven plugin does it for Avro. A minimal CI sketch follows this list.
- Schema change review. Changes to schemas consumed by more than one team require sign-off from at least one consuming team. This is a CODEOWNERS rule, not a process document.
- Consumer-driven contract tests. Each consumer publishes a Pact contract describing the fields and formats it depends on. Producer CI runs these contracts before deploying. This catches semantic changes that syntactic compatibility checks miss (a simplified sketch follows this list).
- Schema changelog. Maintain a CHANGELOG alongside your schema definitions. Version bumps, field additions, deprecations, and migration guides go here. Treat schemas like public APIs, because that is exactly what they are.
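A minimal CI sketch for the first item, assuming GitHub Actions and a Protobuf module at the repo root (the workflow layout and branch name are assumptions; buf breaking is the real command):

```yaml
# .github/workflows/schema-check.yml (sketch)
name: schema-compatibility
on:
  pull_request:
    paths: ["**/*.proto"]
jobs:
  breaking:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # buf diffs against the base branch
      - uses: bufbuild/buf-setup-action@v1
      - name: Block breaking changes against main
        run: buf breaking --against ".git#branch=origin/main"
```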
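And a deliberately simplified stand-in for the contract-test item. A real setup would use Pact message contracts; this plain test only illustrates the kind of semantic assertion (field presence, type, UTC timestamps) that a registry's syntactic check cannot make. The event shape and field names are hypothetical:

```python
from datetime import datetime, timedelta

def check_user_updated_contract(event: dict) -> None:
    """Assert only the fields and formats this consumer depends on."""
    assert isinstance(event["user_id"], str) and event["user_id"]
    assert isinstance(event["region"], str)
    # A semantic rule no compatibility mode enforces: timestamps stay UTC.
    ts = datetime.fromisoformat(event["updated_at"])
    assert ts.utcoffset() == timedelta(0), "updated_at must be UTC"

# Producer CI runs this against a sample of the events it is about to ship.
check_user_updated_contract({
    "user_id": "u-123",
    "region": "eu-west-1",
    "updated_at": "2024-05-01T12:00:00+00:00",
})
```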
Key Points
- A required field added to a shared Kafka event will break every downstream consumer that uses permissive deserialization. Schema evolution is a coordination problem, not a serialization problem.
- Protobuf and Avro solve different evolution challenges. Protobuf gives you better tooling, type safety, and performance. Avro gives you schema evolution by default with its reader-writer schema resolution. Pick based on your ecosystem, not blog-post benchmarks.
- Schema compatibility modes (backward, forward, full) are not academic categories. They map directly to deployment order. Backward compatibility means you can deploy consumers before producers. Forward means the opposite. Full means deploy in any order.
- Consumer-driven contract testing catches breaking changes that schema registries miss. A field that is technically compatible can still break a consumer if the semantics change (e.g., a timestamp field switching from UTC to local time).
Common Mistakes
- ✗ Treating schema compatibility checks as optional. Without CI enforcement, a developer will eventually push a breaking change to a shared topic at 4pm on a Friday, and three downstream teams will spend their evening debugging silent data loss.
- ✗ Using JSON without a schema at all. Teams love the flexibility of schemaless JSON events until they discover that Producer A sends 'user_id' as a string while Producer B sends 'userId' as an integer, and both have been writing to the same topic for six months.
- ✗ Assuming backward compatibility is always sufficient. If your producers deploy before consumers (common in platform teams that ship shared events), you need forward compatibility so that old consumers can read new messages.
- ✗ Skipping semantic versioning for schemas. A schema that adds an optional field feels safe, but if that field changes the interpretation of existing fields, consumers need to know. Version your schemas explicitly and document what changed.