Blog /November 11, 2024

Stop Performing SQL Surgery: Implement Row and Batch Contracts with AWAP

Relying on downstream SQL filters assumes ingestion is a harmless, additive activity. In stateful systems, it isn’t. Ingestion is an actionable event; by the time an anomaly triggers a downstream alert, your production state is already corrupted.

In the data engineering community, the debate usually oscillates between two extremes:

Strict Schema Enforcement

This approach treats any upstream drift, like a field renaming from user_id to UserID as a fatal error. While this prevents ‘garbage’ from entering the warehouse, it creates a tight coupling with upstream producers. The result is frequent SLA breaches: the pipeline halts until a human manually updates the schema, often turning the data team into an operational bottleneck.

Permissive Auto-Evolution

This strategy prioritises ‘uptime’ by blindly accepting raw payloads and relying on unchecked schema evolution. This creates immediate technical debt:

Schema Scars: A single malformed row can force a permanent metadata mutation (e.g., promoting a FLOAT to a STRING). This breaks downstream models and requires manual DDL reversals to fix.
State Corruption: Semantic errors such as a batch where 90% of values are NULL can pass structural validation but silently overwrite historical accuracy. In stateful tables, a MERGE of these values can wipe your SCD2 history, turning a simple bug into a multi-day recovery project.

Fixing these issues after the fact isn't ‘filtering’; it’s performing risky SQL surgery on production tables.

The Middle Ground: AWAP

Mature engineering requires a distinction between recoverable drift and destructive anomalies. You shouldn’t have to choose between a broken pipeline and a corrupted warehouse.

AWAP (Audit-Write-Audit-Publish) introduces a multi-gate validation layer. It provides the flexibility to accept minor upstream changes while maintaining a hard block on data that would compromise system integrity.

To implement AWAP effectively, you have to move away from the idea of a single data quality check. Real-world data corruption happens at different granularities, requiring a two-gate approach that separates Syntactic validation from Semantic validation.

The Two-Gate Architecture: Row vs. Batch

1. The Row-Level Failure (Micro)

These are syntactic violations: structural errors that can be identified row-by-row.

Examples: Missing primary keys, type mismatches (string in an integer column), or malformed JSON.
The Threat: The Schema Scar. Modern warehouses often use "Auto-Evolution." If one malformed row forces a DDL change (e.g., INT → STRING), the metadata is mutated forever. Even if you delete the row, the type change remains, breaking downstream models and requiring manual intervention.

2. The Batch-Level Failure (Macro)

These are semantic violations: logic problems that only appear when you look at the data in aggregate.

Examples: A sudden spike in NULL values or a batch where 100% of the timestamps are identical.
The Threat: State Corruption. Imagine a bug causes an upstream system to send NULL for a user_status field. Individually, a NULL is valid. But if 90% of your batch is suddenly NULL, it’s a semantic failure. If you MERGE this, you effectively "wipe" the history of your existing users, replacing known truth with unverified values.

Evolution: From WAP to AWAP

In data engineering, WAP (Write‑Audit‑Publish) is the standard pattern for protecting production. You write to staging, audit the batch, and then promote it.

Why extend to AWAP?

Standard WAP is insufficient when dealing with untrusted sources or auto-evolving schemas. If you wait until data is in staging to audit it, the Schema Scar is already born.

XXXXXXXXXXX

The Four-Stage Protocol:

Audit (Pre-Ingest): A stateless filter checking for syntactic integrity (e.g., required keys, type-casting). Prevents malformed rows from triggering DDL changes.
Write: Ingestion of the verified payload into the staging layer.
Audit (Post-Staging): A stateful check against the batch (e.g., statistical drift, null-rate thresholds) to identify semantic anomalies.
Publish: Atomic promotion (e.g., ALTER TABLE ... APPEND or View Swap) of the audited batch to production.

Practical Example: The Street Survey System

For the sake of seeing AWAP in action, we’ll use a simplified street‑survey system:

Field agents use a mobile app to interview people on the street.
Each submitted survey becomes one row with fields like survey_id, agent_id, started_at, and submitted_at that hits our pipeline.

SQL Benchmarking: comparing data pipeline tools

cognee - Scalable Data Layer for AI Apps - ECL Pipelines

Strict Schema EnforcementLink icon

Permissive Auto-EvolutionLink icon

The Middle Ground: AWAPLink icon

The Two-Gate Architecture: Row vs. BatchLink icon

1. The Row-Level Failure (Micro)Link icon

2. The Batch-Level Failure (Macro)Link icon

Evolution: From WAP to AWAPLink icon

Why extend to AWAP?Link icon

The Four-Stage Protocol:Link icon

Practical Example: The Street Survey SystemLink icon