Stop Performing SQL Surgery: Implement Row and Batch Contracts with AWAP
Relying on downstream SQL filters assumes ingestion is a harmless, additive activity. In stateful systems, it isn’t. Ingestion is an actionable event; by the time an anomaly triggers a downstream alert, your production state is already corrupted.
In the data engineering community, the debate usually oscillates between two extremes:
Strict Schema Enforcement
This approach treats any upstream drift, like a field renaming from user_id to UserID as a fatal error. While this prevents ‘garbage’ from entering the warehouse, it creates a tight coupling with upstream producers. The result is frequent SLA breaches: the pipeline halts until a human manually updates the schema, often turning the data team into an operational bottleneck.
Permissive Auto-Evolution
This strategy prioritises ‘uptime’ by blindly accepting raw payloads and relying on unchecked schema evolution. This creates immediate technical debt:
- Schema Scars: A single malformed row can force a permanent metadata mutation (e.g., promoting a
FLOATto aSTRING). This breaks downstream models and requires manual DDL reversals to fix. - State Corruption: Semantic errors such as a batch where 90% of values are
NULLcan pass structural validation but silently overwrite historical accuracy. In stateful tables, aMERGEof these values can wipe your SCD2 history, turning a simple bug into a multi-day recovery project.
Fixing these issues after the fact isn't ‘filtering’; it’s performing risky SQL surgery on production tables.
The Middle Ground: AWAP
Mature engineering requires a distinction between recoverable drift and destructive anomalies. You shouldn’t have to choose between a broken pipeline and a corrupted warehouse.
AWAP (Audit-Write-Audit-Publish) introduces a multi-gate validation layer. It provides the flexibility to accept minor upstream changes while maintaining a hard block on data that would compromise system integrity.
To implement AWAP effectively, you have to move away from the idea of a single data quality check. Real-world data corruption happens at different granularities, requiring a two-gate approach that separates Syntactic validation from Semantic validation.
The Two-Gate Architecture: Row vs. Batch
1. The Row-Level Failure (Micro)
These are syntactic violations: structural errors that can be identified row-by-row.
- Examples: Missing primary keys, type mismatches (string in an integer column), or malformed JSON.
- The Threat: The Schema Scar. Modern warehouses often use "Auto-Evolution." If one malformed row forces a DDL change (e.g.,
INT→STRING), the metadata is mutated forever. Even if you delete the row, the type change remains, breaking downstream models and requiring manual intervention.
2. The Batch-Level Failure (Macro)
These are semantic violations: logic problems that only appear when you look at the data in aggregate.
- Examples: A sudden spike in NULL values or a batch where 100% of the timestamps are identical.
- The Threat: State Corruption. Imagine a bug causes an upstream system to send
NULLfor auser_statusfield. Individually, aNULLis valid. But if 90% of your batch is suddenlyNULL, it’s a semantic failure. If youMERGEthis, you effectively "wipe" the history of your existing users, replacing known truth with unverified values.
Evolution: From WAP to AWAP
In data engineering, WAP (Write‑Audit‑Publish) is the standard pattern for protecting production. You write to staging, audit the batch, and then promote it.
Why extend to AWAP?
Standard WAP is insufficient when dealing with untrusted sources or auto-evolving schemas. If you wait until data is in staging to audit it, the Schema Scar is already born.
XXXXXXXXXXX
The Four-Stage Protocol:
- Audit (Pre-Ingest): A stateless filter checking for syntactic integrity (e.g., required keys, type-casting). Prevents malformed rows from triggering DDL changes.
- Write: Ingestion of the verified payload into the staging layer.
- Audit (Post-Staging): A stateful check against the batch (e.g., statistical drift, null-rate thresholds) to identify semantic anomalies.
- Publish: Atomic promotion (e.g.,
ALTER TABLE ... APPENDor View Swap) of the audited batch to production.
Practical Example: The Street Survey System
For the sake of seeing AWAP in action, we’ll use a simplified street‑survey system:
- Field agents use a mobile app to interview people on the street.
- Each submitted survey becomes one row with fields like
survey_id,agent_id,started_at, andsubmitted_atthat hits our pipeline.