Schema evolution in data pipelines: the engineer's guide
Schema evolution is a decision every data pipeline makes — most tools make it silently. This post discusses the five common failure modes every data pipeline sees, how dlt handles them, and how you can decide runtime policies for schema evolution with data contracts.
Aman Gupta,
Data Engineer
On this page
- Intro
- Schema evolution vs schema drift vs data contracts
- The five failure modes every pipeline hits
- Adding a column
- Removing a column
- Type change (the painful one)
- Rename
- Nested structure changes
- Schema evolution example in Python with dlt
- Turning schema changes into a signal
- When to stop automatically evolving schemas?
- Schema evolution best practices
- How other tools handle schema evolution
- The pattern
- Further reading
Intro
The modern data stack has been solving schema evolution for a decade, one slice at a time. Confluent built a schema registry. Databricks shipped mergeSchema in Delta. Snowflake added ENABLE_SCHEMA_EVOLUTION. Iceberg made renames a metadata operation. The dominant managed ELT vendor picked "net-additive" as its default. dbt added contracts and model versions. BigQuery has schema autodetection on the top nesting layer.
All of them are useful in the slice they own, but none of them is a portable ingestion solution you can run across platforms. None of them answer the question: the shape of the incoming row no longer matches the shape of the target table, what do I do right now?

That question lives at the ingestion layer. It is a runtime policy, not a storage feature and not a governance framework. And it is the layer every vendor above sits next to but not inside, because they sit either upstream of it (Kafka, producers) or downstream (warehouse, transforms, BI). The ingestion layer is where you (not a vendor) own the decision.
This post is a practitioner's guide to that decision: what schema evolution actually is, the five upstream failure modes that drive it, how to automate it (with working Python code using dlt), when to stop automating and enforce a contract, and how to turn every schema change into a signal your team sees before your dashboard does.
Schema evolution vs schema drift vs data contracts
- Schema evolution is the runtime policy: the decision your pipeline makes when an incoming row doesn't match the target. It is a control question, and every pipeline answers it whether the engineer configured it or not.
- Schema drift is the observation that something upstream changed. It is a detection problem. A pipeline can experience drift without having any evolution policy beyond the default; in that case the default is what the engineer ends up shipping to production.
- Data contracts are the enforcement: a declaration that certain shapes are allowed and others aren't. A contract without a detection mechanism is wishful thinking. A detection mechanism without a policy is a dashboard.
Evolution acts. Drift reports. Contracts constrain.
The vocabulary around this is inconsistent across vendors. Databricks uses "schema evolution" for the whole cluster. One ETL tool uses "schema drift" for the same thing. Confluent uses "evolution" to describe wire-format compatibility. The words are messy. The three functions aren't.
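The three functions can be kept apart in a few lines of plain Python. This is a minimal sketch, not any tool's API: `detect_drift`, `violations`, `decide`, and the `allowed` set are hypothetical names introduced here for illustration.

```python
from typing import Any


def detect_drift(schema: dict[str, type], row: dict[str, Any]) -> dict[str, list[str]]:
    """Drift reports: compare one incoming row against the known schema."""
    return {
        "added": sorted(k for k in row if k not in schema),
        "removed": sorted(k for k in schema if k not in row),
        "type_changed": sorted(
            k for k, v in row.items()
            if k in schema and v is not None and not isinstance(v, schema[k])
        ),
    }


def violations(drift: dict[str, list[str]], allowed: set[str]) -> list[str]:
    """Contracts constrain: any kind of drift not explicitly allowed is a violation."""
    return sorted(kind for kind, cols in drift.items() if cols and kind not in allowed)


def decide(drift: dict[str, list[str]], allowed: set[str]) -> str:
    """Evolution acts: turn the report and the contract into one runtime decision."""
    if violations(drift, allowed):
        return "reject"
    return "evolve" if any(drift.values()) else "load"


# Hypothetical schema and row, mirroring the amount/session_id example used later.
schema = {"user_id": int, "amount": int}
row = {"user_id": 2, "amount": "12.50", "session_id": "abc"}
drift = detect_drift(schema, row)
```

With `allowed={"added"}` the row above is rejected because of the type change. With `allowed={"added", "type_changed"}` the same row evolves the schema. The drift report is identical in both cases, and only the policy differs.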
The five failure modes every pipeline hits
Most schema surprises reduce to five patterns. Each breaks a different downstream thing. Each has a different right answer.

Adding a column
A new field appears in the incoming payload. Warehouses usually support this:
- BigQuery: schemaUpdateOptions=ALLOW_FIELD_ADDITION
- Snowflake: ENABLE_SCHEMA_EVOLUTION=TRUE
- Delta Lake: mergeSchema=true
- Iceberg: ADD COLUMN, metadata-only
This is the easy case, which is why every vendor picked it as the default. It is also where a bad habit starts: if your pipeline silently adds columns, your dashboard silently gets new fields, and nobody on the consumer side knows until a number moves. The question isn't whether to allow it. The question is whether to announce it.
Removing a column
The field disappears from the payload. Your options compress: drop the column (destroys history), keep it and write NULLs forward (preserves history but makes the dashboard look like adoption fell off a cliff), or fail loudly (forces a conversation with the producer).
Most ingestion tools, dlt included, default to option two. That is the right default at the bronze layer. It is the wrong default at the gold layer, where a column going to NULL for every new row will mislead every consumer downstream. The fix is a better data contract at the ingestion layer.
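Options two and three can sit behind a single switch. The sketch below is an illustration in plain Python, not dlt's implementation: `apply_missing_column_policy` and its `policy` values are hypothetical. Option one (dropping the column) is a DDL operation and is not shown.

```python
from typing import Any


def apply_missing_column_policy(
    row: dict[str, Any],
    expected_columns: list[str],
    policy: str = "null_forward",
) -> tuple[dict[str, Any], list[str]]:
    """Handle columns that are expected by the table but absent from the payload.

    Returns the row to load plus the list of missing columns, so the caller
    can announce the gap instead of loading NULLs silently.
    """
    missing = [c for c in expected_columns if c not in row]
    if missing and policy == "fail":
        # Fail loudly: forces a conversation with the producer.
        raise ValueError(f"Columns missing from payload: {missing}")
    padded = dict(row)
    for column in missing:
        # Null forward: history is preserved, new rows carry None.
        padded[column] = None
    return padded, missing


loaded, missing = apply_missing_column_policy(
    {"user_id": 1}, ["user_id", "user_email"]
)
```

Returning `missing` alongside the row is the point of the sketch: the bronze layer can keep loading while the gold layer treats a non-empty list as a reason to switch to `policy="fail"`.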
Type change (the painful one)
amount was int. This morning it arrived as "12.50". Three real options:
- Fail the load. Honest, but your pipeline is down until a human decides what to do.
- Coerce. Cast the string to int or vice versa. Guesses what the producer meant.
- Split. Keep the original column, add a second one for the new type, preserve both.
dlt picks option 3: it creates a variant column (amount__v_text) alongside the existing amount. Existing queries don't break. The new type is preserved. A downstream transform has to decide what to do with both, which is the right place for the decision to live.
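To make the split concrete, here is a simplified sketch of the idea in plain Python. It is not dlt's code and does not attempt any coercion. The `split_variant` function and the `TYPE_SUFFIX` table are assumptions, with suffixes modelled on the `amount__v_text` name above.

```python
from typing import Any

# Assumed suffixes, chosen so that a string variant of `amount`
# lands in `amount__v_text` as in the example above.
TYPE_SUFFIX = {str: "text", int: "bigint", float: "double", bool: "bool"}


def split_variant(row: dict[str, Any], schema: dict[str, type]) -> dict[str, Any]:
    """Route values whose type differs from the schema into a variant column."""
    out: dict[str, Any] = {}
    for name, value in row.items():
        expected = schema.get(name)
        if expected is None or value is None or type(value) is expected:
            # New column, NULL, or matching type: keep the value where it is.
            out[name] = value
        else:
            # Mismatch: preserve the value under a type-suffixed column name.
            suffix = TYPE_SUFFIX.get(type(value), type(value).__name__)
            out[f"{name}__v_{suffix}"] = value
    return out


schema = {"user_id": int, "amount": int}
first = split_variant({"user_id": 1, "amount": 0}, schema)
second = split_variant({"user_id": 2, "amount": "12.50"}, schema)
```

The first row passes through unchanged. The second row keeps `user_id` and carries its amount in `amount__v_text`, so the original `amount` column is never overwritten with a guessed cast.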
Rename
A producer renames user_email to email. This is the one everyone lies about.
Iceberg can actually rename: it tracks columns by ID, not by name, and a rename is a metadata operation. Delta can, in column-mapping mode. Every other system, including dlt, including every warehouse's default mode, cannot. A "rename" at the ingestion layer really means: add the new column and stop populating the old one. Your history splits.
The honest fix is the expand-and-contract migration: add the new column, dual-write for a cycle, backfill the old data into the new column, cut reads over, then drop the old column. No tool automates this because it is a governance decision, not a code path.
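The steps of an expand-and-contract migration can be walked through against an in-memory SQLite database. This is a sketch under assumptions: SQLite stands in for the warehouse, the `users` table and `insert_user` helper are hypothetical, and the final drop is deliberately left as a comment because it is the step that needs a human sign-off.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_email TEXT)")
conn.execute("INSERT INTO users (user_id, user_email) VALUES (1, 'a@example.com')")

# 1. Expand: add the new column next to the old one.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")


# 2. Dual-write: for one cycle, every write populates both columns.
def insert_user(conn: sqlite3.Connection, user_id: int, email: str) -> None:
    conn.execute(
        "INSERT INTO users (user_id, user_email, email) VALUES (?, ?, ?)",
        (user_id, email, email),
    )


insert_user(conn, 2, "b@example.com")

# 3. Backfill: copy history from the old column into the new one.
conn.execute("UPDATE users SET email = user_email WHERE email IS NULL")

# 4. Cut reads over: consumers now select only the new column.
rows = conn.execute("SELECT user_id, email FROM users ORDER BY user_id").fetchall()

# 5. Contract: dropping user_email comes last, once every reader has moved.
#    It is not executed here because it is the irreversible step.
```

After step 4, `rows` holds both users with a populated `email`, including the row written before the new column existed. History stays in one column instead of splitting across two.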
Nested structure changes
Root-level changes are rare in practice. What actually changes is three levels deep: order.items.variant.pricing.currency flips from a string to an object, or a new nested list appears that you would rather not unpack into its own table without review.
dlt handles this by unpacking nested structures into relational tables (order__items__variant__pricing__currency). That means every nested change is visible at the table level, not hidden inside a JSON blob you find out about when a query fails. It also means contract decisions can be made per nested table, not globally, which matches how the problem actually appears.
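A simplified illustration of why unpacking makes nested changes visible: the sketch below flattens nested dicts into `__`-joined column names and moves nested lists into child tables. It is a toy written for this post, not dlt's normalizer. The `unpack` function is hypothetical and omits the keys a real tool adds to link child rows to their parent.

```python
from typing import Any


def unpack(table: str, record: dict[str, Any], tables=None) -> dict[str, list[dict]]:
    """Flatten nested dicts into columns and nested lists into child tables."""
    tables = {} if tables is None else tables
    rows = tables.setdefault(table, [])
    row: dict[str, Any] = {}

    def flatten(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            for key, inner in value.items():
                flatten(f"{prefix}__{key}" if prefix else key, inner)
        elif isinstance(value, list):
            # Each list becomes its own table, named after its path.
            child = f"{table}__{prefix}"
            for item in value:
                unpack(child, item if isinstance(item, dict) else {"value": item}, tables)
        else:
            row[prefix] = value

    flatten("", record)
    rows.append(row)
    return tables


before = unpack("order", {
    "id": 1,
    "items": [{"sku": "A", "variant": {"pricing": {"currency": "USD"}}}],
})
# The producer turns `currency` from a string into an object.
after = unpack("order", {
    "id": 1,
    "items": [{"sku": "A", "variant": {"pricing": {"currency": {"code": "USD"}}}}],
})
```

In `before`, the `order__items` table has a `variant__pricing__currency` column. In `after`, that column is gone and `variant__pricing__currency__code` has appeared. The change shows up as a column-level difference on one child table, which is where a per-table contract can catch it.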
Schema evolution example in Python with dlt
We saw how schemas break — here's how a dlt pipeline handles it. Two runs. The second run has both an added field and a type change.
import dlt


@dlt.resource(write_disposition="append")
def user_events(data):
    yield data


pipeline = dlt.pipeline(
    pipeline_name="events",
    destination="duckdb",
    dataset_name="raw",
)

# Run 1: baseline
pipeline.run(user_events([
    {"user_id": 1, "event_type": "signup", "amount": 0},
]))

# Run 2: new field (session_id), and amount arrives as string
info = pipeline.run(user_events([
    {"user_id": 2, "event_type": "purchase",
     "amount": "12.50", "session_id": "abc"},
]))

# Inspect what actually changed
for package in info.load_packages:
    for table, update in package.schema_update.items():
        print(table, update)
The second run does the following:
- adds a session_id column, and
- creates an amount__v_text variant column alongside the existing amount.
Your first-run rows are intact. Your second-run rows are intact. No human wrote DDL. No pipeline crashed. The schema_update payload tells you exactly what changed.
That is the junior bar: it runs. The senior bar: does anyone know a variant column was created? If not, you have moved the problem from "pipeline crashed" to "pipeline lied quietly," which is worse. A crashed pipeline gets fixed. A lying one gets quoted in a board meeting.
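One way to meet the senior bar is to scan the update for variant columns explicitly. The sketch below works on a plain dict and makes two assumptions: that each entry in schema_update maps a table name to a dict with a "columns" mapping keyed by column name, and that variant columns can be recognised by the `__v_` marker seen in amount__v_text. The `find_variant_columns` helper and the `example_update` dict are hand-written for illustration, not captured output.

```python
def find_variant_columns(schema_update: dict) -> list[tuple[str, str]]:
    """Return (table, column) pairs whose names carry the __v_ variant marker."""
    return [
        (table, column)
        for table, table_schema in schema_update.items()
        for column in table_schema.get("columns", {})
        if "__v_" in column
    ]


# Hand-written stand-in for the second run's schema_update (assumed shape).
example_update = {
    "user_events": {
        "columns": {
            "session_id": {"data_type": "text"},
            "amount__v_text": {"data_type": "text"},
        }
    }
}

variants = find_variant_columns(example_update)
```

A non-empty `variants` list is the cue to notify someone: a new column is usually benign, while a variant column means a producer changed a type.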
The mechanics are covered in the dlt schema evolution docs. The rest of this post is the part the docs leave to you: policy.
Turning schema changes into a signal
The schema_update dict is not decorative. It is the thing that lets you tell the consumer before they notice it themselves.
import json
from dlt.common.runtime.slack import send_slack_message

info = pipeline.run(user_events(...))
for package in info.load_packages:
    if package.schema_update:
        # Post the structured schema change to a Slack incoming webhook
        send_slack_message(
            "https://hooks.slack.com/services/...",
            f"Schema changed in `{pipeline.pipeline_name}`:\n"
            f"```{json.dumps(package.schema_update, indent=2)}```"
        )

Two things worth naming about this snippet:
- It executes the moment the load completes, as part of your pipeline code. A reactive observability tool like Monte Carlo or Elementary runs on its own scan schedule. By then, the next downstream job may have already run.
- The payload is structured Python, not a stack trace. Route it anywhere: Slack for the team that owns the source, PagerDuty if it is a gold table, a Linear ticket if it is a governance question. The owner changes by layer, and the routing should too.
If you don't name the owner, the owner becomes the incident.
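Routing by layer can be as small as two lookup tables. This is a sketch with assumed names: the `ROUTES` and `TABLE_LAYER` mappings, the channel labels, and `route_schema_change` are all hypothetical, and the function only groups table names. Sending the message is left to whatever client each channel uses.

```python
# Hypothetical ownership maps: which layer a table belongs to,
# and which channel owns schema changes for that layer.
ROUTES = {"bronze": "slack", "silver": "slack", "gold": "pagerduty"}
TABLE_LAYER = {"raw_events": "bronze", "payments": "gold"}


def route_schema_change(
    schema_update: dict,
    table_layer: dict[str, str],
    routes: dict[str, str],
    default: str = "slack",
) -> dict[str, list[str]]:
    """Group changed tables by the channel that owns their layer."""
    by_channel: dict[str, list[str]] = {}
    for table in schema_update:
        # Tables with no registered layer fall back to the default channel.
        channel = routes.get(table_layer.get(table), default)
        by_channel.setdefault(channel, []).append(table)
    return by_channel


routed = route_schema_change(
    {"raw_events": {}, "payments": {}, "unmapped_table": {}}, TABLE_LAYER, ROUTES
)
```

Keeping the ownership maps in version control means that adding a table without naming its layer is visible in review, rather than discovered during an incident.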
And if the change should never have landed?
The same payload is also how you gate a CI run. If the table you're loading is supposed to be locked, an unexpected schema_update is a signal the contract and the code have drifted apart:
info = pipeline.run(payments())
for package in info.load_packages:
    if package.schema_update:
        raise RuntimeError(
            f"Unexpected schema change: {package.schema_update}. "
            "Either update the contract or fix the producer."
        )

Running this against a test fixture on every PR catches the drift before it reaches production. The same check that alerts you in prod gates the merge in CI.
When to stop automatically evolving schemas?
Automatic evolution is the right default for bronze/raw ingestion. It is the wrong default for anything a dashboard, model, or downstream team depends on. The reason is economic: auto-evolve pushes the cost of a surprise onto the consumer, who finds out when a number moves. A data contract stops it at the boundary, closest to where the change originated.
Three cases where contracts are non-negotiable:
- Regulated pipelines. A new column might be PII. Silent evolution at a gold table is a compliance incident.
- Shared gold tables with many consumers. Silent additions are implicit API changes. Every dashboard, model, and report that depends on the table just had its contract mutated without a review.
- Unknown-provenance sources. Public APIs, scraped data, vendor exports. You have no relationship with the producer and no visibility into what they'll send next. The only thing between their schema and your warehouse is a contract.
dlt has four contract modes, applied at the resource level:
@dlt.resource(
    write_disposition="append",
    schema_contract={
        "tables": "freeze",     # no new tables
        "columns": "evolve",    # new columns are fine
        "data_type": "freeze",  # but no type changes
    },
)
def payments():
    ...

The four modes:
- evolve: the permissive default. Accept the change, adjust the schema.
- freeze: reject the load. Raises DataValidationError. Use when breakage is better than silent drift.
- discard_row: drop rows that violate the contract; keep the rest of the load going.
- discard_value: keep the row, null out the offending value.
Granularity matters more than mode choice. You can (and should) mix: {"tables": "freeze", "columns": "evolve", "data_type": "freeze"} means no new tables, no type surprises, new columns are fine. That is a realistic silver-layer policy.
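To see how the four modes differ on the same input, here is a toy simulation of a columns contract in plain Python. It illustrates the semantics described above and is not dlt's implementation: `apply_column_contract` and `ContractViolation` are hypothetical names.

```python
class ContractViolation(Exception):
    """Stand-in for the validation error a freeze contract raises."""


def apply_column_contract(rows: list[dict], known_columns: list[str], mode: str):
    """Apply one contract mode to rows that may carry unknown columns."""
    known = set(known_columns)
    kept = []
    for row in rows:
        extra = [c for c in row if c not in known]
        if not extra or mode == "evolve":
            # evolve: accept the row and widen the schema.
            known.update(extra)
            kept.append(dict(row))
        elif mode == "freeze":
            # freeze: stop the load at the first violation.
            raise ContractViolation(f"Unexpected columns: {extra}")
        elif mode == "discard_row":
            # discard_row: skip the offending row, keep loading the rest.
            continue
        elif mode == "discard_value":
            # discard_value: keep the row, drop only the unknown values.
            kept.append({c: v for c, v in row.items() if c in known})
        else:
            raise ValueError(f"Unknown mode: {mode}")
    return kept, sorted(known)


rows = [{"id": 1}, {"id": 2, "coupon": "X"}]
evolved = apply_column_contract(rows, ["id"], "evolve")
dropped_rows = apply_column_contract(rows, ["id"], "discard_row")
dropped_values = apply_column_contract(rows, ["id"], "discard_value")
```

Only `evolve` changes the column set. `discard_row` loads one row, `discard_value` loads both rows without the coupon, and `freeze` raises before anything is returned.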
| Layer | Consumer | Default Policy |
|---|---|---|
| Raw / Bronze | Debugging, exploration | evolve everywhere |
| Staging / Silver | dbt transforms | columns: evolve, data_type: freeze |
| Marts / Gold | Dashboards, ML, external | tables: freeze, columns: freeze, data_type: freeze + Pydantic model |
| High-volume events with flaky producers | Real-time features | discard_row on the offending nested table |
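
The first three rows of the table can be kept as one lookup so that a resource's contract follows from its layer. This is a sketch: the dict values use the same keys and modes as the schema_contract argument shown earlier, while `LAYER_CONTRACTS`, `contract_for`, and the layer names are hypothetical. The fourth row is left out because which nested table gets discard_row depends on the source.

```python
# Layer policies from the table above, in the shape of a schema_contract dict.
LAYER_CONTRACTS = {
    "bronze": {"tables": "evolve", "columns": "evolve", "data_type": "evolve"},
    "silver": {"columns": "evolve", "data_type": "freeze"},
    "gold": {"tables": "freeze", "columns": "freeze", "data_type": "freeze"},
}


def contract_for(layer: str) -> dict[str, str]:
    """Return a copy of the contract for a layer, failing on unknown layers."""
    try:
        return dict(LAYER_CONTRACTS[layer])
    except KeyError:
        raise ValueError(f"No contract defined for layer {layer!r}") from None


gold_contract = contract_for("gold")
```

Raising on an unknown layer is a deliberate choice in this sketch: a table with no declared layer should not silently inherit the permissive default.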

For gold tables, the contract should live in code, not in configuration. Here's what that contract looks like with Pydantic models:
from pydantic import BaseModel
from typing import Literal


class Payment(BaseModel):
    payment_id: str
    amount: float
    currency: Literal["USD", "EUR", "GBP"]
    customer_id: str


@dlt.resource(
    columns=Payment,
    write_disposition="append",
    schema_contract={"columns": "freeze", "data_type": "freeze"},
)
def payments():
    ...

A new field in the payload that isn't on the model is rejected. A type change is rejected. The model becomes the authoritative description of the table, and it sits in version control next to the pipeline, which means code review catches contract changes the same way it catches any other breaking change. The contract stops being a config string and becomes a reviewable artifact.
One mechanical note: contracts do not apply to the first load of a brand-new table (internally flipped to evolve so the table can be created). If your first-run example uses freeze, it will still succeed; the contract kicks in on subsequent loads. The full reference lives in the schema and data contracts docs.
A second note: discard_value is not supported when validating against a Pydantic model. Use discard_row with Pydantic if row-level rejection is what you want.
Schema evolution best practices
If you are starting fresh:
- Pick the pipeline with the most downstream consumers. That is where enforcement matters most.
- Figure out which of the five failure modes has bitten you in the last quarter. That is the mode to harden against first.
- Apply the medallion rule: bronze evolves, silver partially locks, gold fully locks with a Pydantic model as the authoritative contract.
- Wire schema_update to wherever the on-call engineer actually reads. Slack, Linear, PagerDuty. The routing matters more than the channel.
- Write down who owns each table.
If you are retrofitting an existing pipeline, do step 4 (wiring schema_update to a channel the on-call engineer reads) first. Detection buys you the time to fix the rest.
How other tools handle schema evolution
Schema evolution is an ingestion-layer decision, but the tools around it each own a boundary. The differences are not just whether they handle a schema change. They are who finds out, when, and with enough context to do anything about it.
Take each tool in turn. dbt catches what dlt lets through: its contracts fail the run before a type change reaches a consumer. SQLMesh adds one thing on top of that: its plan command tells you which downstream models break before you promote the change. Kafka's Schema Registry enforces compatibility at the producer; once that stream reaches a warehouse, dlt picks up the schema policy from there. BigQuery lets you add new columns automatically at load time with no DDL required, which helps a class of pipelines but is silent by design. Snowflake's ENABLE_SCHEMA_EVOLUTION and Delta Lake's mergeSchema do the same: new columns land without any announcement. The two popular managed ELT vendors take different approaches. One automatically propagates every upstream schema change by default, with only coarse blocking rather than fine-grained policies. The other varies per connector, so the same upstream change can mean a silent NULL forward in one source and a full resync in another.
The pattern
Each tool guards a different boundary: Kafka owns the wire, dbt and SQLMesh own the transform, BigQuery owns the destination DDL. All of them, managed ELT vendors included, handle the mechanics. None of them handle the decision. dlt is the only point in this stack where the change becomes an event you can route to a human as part of the pipeline run itself.
Further reading
- Schema evolution in dlt docs: the mechanics
- Schema and data contracts reference: all four modes, with Pydantic integration
- Alerting in production: wiring schema_update to Slack, Sentry, and other channels
- Destination tables and lineage: variant column naming and load metadata