
Data quality lifecycle

The data quality lifecycle has rarely been achievable in a single tool due to the runtime constraints of traditional ETL vendors.

One library for end-to-end ingestion and transformation, with data quality and lineage

dlt is an open source, Pythonic ingestion library, while dltHub is a commercial extension of dlt that spans transformation and other areas of the data stack.

Because dlt and dltHub together span the entire pipeline, from ingestion through a portable staging layer and into transformation, they uniquely bridge these gaps.

Instead of stitching together four or five separate tools, you write Python code that works across the entire pipeline: no glue scripts, no context lost between systems, and end-to-end lineage and metadata throughout.


The three checkpoints for data quality:

  1. In-flight: Check individual records as data is extracted, before loading it.
  2. Staging: Optionally load the data to a (possibly transient) staging area where you can test it without breaking production.
  3. Destination: Check properties of the full dataset currently written to the destination.
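
To make the three checkpoints concrete, here is a minimal sketch that places a check at each point; the users resource, the age rule, and the DuckDB staging dataset are assumptions for illustration.

```python
import dlt

# Hypothetical resource standing in for any API or database extract.
@dlt.resource(name="users")
def users():
    yield {"id": 1, "age": 32}
    yield {"id": 2, "age": -5}

# 1. In-flight: check individual records as they are extracted, before loading.
clean_users = users().add_filter(lambda row: row["age"] >= 0)

# 2. Staging: load into a separate (possibly transient) dataset and test it there.
staging = dlt.pipeline("dq_demo", destination="duckdb", dataset_name="staging_users")
staging.run(clean_users)

# 3. Destination: check properties of the full dataset that was written.
with staging.sql_client() as client:
    count = client.execute_sql("SELECT count(*) FROM users")[0][0]
    assert count > 0, "No rows made it to the destination"
```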

The five pillars of data quality

dlt addresses quality across five core dimensions and supports implementing these checks throughout the entire data lifecycle.

  1. Structural Integrity: Does the data fit the destination schema and types?
  2. Semantic Validity: Does the data make business sense?
  3. Uniqueness & Relations: Is the dataset consistent with itself?
  4. Privacy & Governance: Is the data safe and compliant?
  5. Operational Health: Is the pipeline running correctly?

1. Structural Integrity

Does the data fit the destination schema and types?

These checks ensure incoming data conforms to the expected shape and technical types before loading, preventing broken pipelines and "garbage" tables.

| Job to be Done | dlt Solution | Learn More | Availability |
| --- | --- | --- | --- |
| Prevent unexpected columns | Schema Contracts (Frozen Mode): Set your schema to frozen to raise an immediate error if the source API adds an undocumented field. | Schema Contracts | dlt |
| Enforce data types | Type Coercion: dlt automatically coerces compatible types (e.g., string "100" to int 100) and rejects non-coercible values to ensure column consistency. | Schema | dlt |
| Fix naming errors | Normalization: dlt automatically cleans table and column names (converting to snake_case) to prevent SQL syntax errors in the destination. | Naming Convention | dlt |
| Enforce required fields | Nullability Constraints: Mark fields as nullable=False in your resource hints to drop or error on records missing critical keys. | Resource | dlt |
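
As a minimal sketch of these structural checks, a resource can declare column types, nullability, and a frozen schema contract directly; the orders resource and its columns below are hypothetical.

```python
import dlt

# Hypothetical 'orders' resource: column hints enforce types and required fields,
# and the frozen contract rejects undeclared columns and non-coercible values.
@dlt.resource(
    name="orders",
    columns={
        "order_id": {"data_type": "bigint", "nullable": False},
        "amount": {"data_type": "double"},
    },
    schema_contract={"columns": "freeze", "data_type": "freeze"},
)
def orders():
    # Compatible values are coerced: "100" -> bigint, "19.99" -> double.
    yield {"order_id": "100", "amount": "19.99"}

pipeline = dlt.pipeline("structural_demo", destination="duckdb")
pipeline.run(orders())
# Once the table schema is established, a record with an undeclared column (or a
# non-coercible value) raises an error instead of silently mutating the destination.
```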

2. Semantic Validity

Does the data make business sense?

These checks verify the content of the data against business logic. While structural checks handle types (is it a number?), semantic checks handle meaning (is it a valid age?).

| Job to be Done | dlt Solution | Learn More | Availability |
| --- | --- | --- | --- |
| Validate logic & ranges | Pydantic Models: Attach Pydantic models to your resources to enforce logic like age > 0 or email format validation in-stream. | Schema Contracts | dlt |
| Filter bad rows | add_filter: Apply a predicate function to exclude records that don't meet criteria (e.g., lambda x: x["status"] != "deleted"). | Transform with add_map | dlt |
| Check batch anomalies | Staging Tests: Use the portable runtime (e.g., Ibis/DuckDB) to query the staging buffer. Example: "Alert if the average order value in this batch is > $10k." | Staging | dlt |
| Built-in data checks | Data Quality Checks: Use built-in checks like is_in(), is_unique(), is_primary_key() with pre-load or post-load execution, plus actions on failure (drop, quarantine, alert). | Data Quality | dlthub |
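
A minimal sketch of in-stream semantic validation, assuming a hypothetical users resource: the Pydantic model enforces ranges and formats (EmailStr requires the email-validator extra), and add_filter drops rows that fail a business rule.

```python
import dlt
from pydantic import BaseModel, EmailStr, PositiveInt

# Hypothetical model; EmailStr needs the 'email-validator' extra installed.
class User(BaseModel):
    id: int
    age: PositiveInt   # business rule: age must be > 0
    email: EmailStr    # business rule: must be a valid e-mail address
    status: str

# Passing a Pydantic model as `columns` makes dlt validate each record in-stream.
@dlt.resource(name="users", columns=User)
def users():
    yield {"id": 1, "age": 32, "email": "ada@example.com", "status": "active"}
    yield {"id": 2, "age": 27, "email": "bob@example.com", "status": "deleted"}

# Exclude rows that fail a business rule before they are loaded.
active_users = users().add_filter(lambda row: row["status"] != "deleted")

dlt.pipeline("semantic_demo", destination="duckdb").run(active_users)
```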

3. Uniqueness & Relations

Is the dataset consistent with itself?

These checks manage duplication and preserve relationships between different tables in your dataset.

| Job to be Done | dlt Solution | Learn More | Availability |
| --- | --- | --- | --- |
| Prevent duplicates | Merge Disposition: Define primary_key and write_disposition='merge' to automatically upsert records. dlt handles the deduping logic for you. | Incremental Loading | dlt |
| Track historical changes | SCD2 Strategy: Use write_disposition={"disposition": "merge", "strategy": "scd2"} to automatically maintain validity windows (_dlt_valid_from, _dlt_valid_to) for dimension tables. | Merge Loading | dlt |
| Link parent/child data | Automatic Lineage: dlt automatically generates foreign keys (_dlt_parent_id) when unnesting complex JSON, preserving the link between parent and child tables. | Destination Tables | dlt |
| Find orphan keys | Post-Load Assertions: Run SQL tests on the destination to identify child records missing a valid parent (e.g., Orders without Customers). | SQL Transformations | dlt |
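
A minimal sketch combining these techniques with hypothetical customers, customer_dim, and orders resources: merge deduplication, SCD2 history, and a post-load orphan check.

```python
import dlt

# Merge disposition: primary_key + 'merge' deduplicates and upserts on load.
@dlt.resource(name="customers", primary_key="customer_id", write_disposition="merge")
def customers():
    yield {"customer_id": 1, "name": "Ada"}

# SCD2 strategy: keep history with _dlt_valid_from / _dlt_valid_to windows.
@dlt.resource(
    name="customer_dim",
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customer_dim():
    yield {"customer_id": 1, "segment": "enterprise"}

# Child table used below to demonstrate an orphan-key check.
@dlt.resource(name="orders")
def orders():
    yield {"order_id": 10, "customer_id": 1}
    yield {"order_id": 11, "customer_id": 2}  # orphan: no customer with id 2

pipeline = dlt.pipeline("relations_demo", destination="duckdb")
pipeline.run([customers(), customer_dim(), orders()])

# Post-load assertion on the destination: find child rows without a parent.
with pipeline.sql_client() as client:
    orphans = client.execute_sql(
        "SELECT count(*) FROM orders o "
        "LEFT JOIN customers c ON o.customer_id = c.customer_id "
        "WHERE c.customer_id IS NULL"
    )
    if orphans[0][0] > 0:
        print(f"Found {orphans[0][0]} orders without a matching customer")
```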

4. Privacy & Governance

Is the data safe and compliant?

Data quality also means compliance. These features ensure sensitive data is handled correctly before it becomes a liability in your warehouse.

| Job to be Done | dlt Solution | Learn More | Availability |
| --- | --- | --- | --- |
| Mask/Hash PII | Transformation Hooks: Use add_map to hash emails or redact names in-stream. Data is sanitized in memory before it ever touches the disk. | Pseudonymizing Columns | dlt |
| Drop sensitive columns | Column Removal: Use add_map to completely remove columns (e.g., ssn, credit_card) before they ever reach the destination. | Removing Columns | dlt |
| Enforce PII Contracts | Pydantic Models: Use Pydantic schemas to strictly define and detect sensitive fields (e.g., EmailStr), ensuring they are caught and hashed before loading. | Schema Contracts | dlt |
| Join on private data | Deterministic Hashing: Use a secret salt via dlt.secrets to deterministically hash IDs, allowing you to join tables on "User ID" without exposing the actual user identity. | Credentials Setup | dlt |
| Track PII through transformations | Column-Level Hint Forwarding: PII hints (e.g., x-annotation-pii) are automatically propagated through SQL transformations, so downstream tables retain knowledge of sensitive origins. | Transformations | dlthub |
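
A minimal sketch of in-memory pseudonymization with add_map, assuming a hypothetical app_users resource and a pii_salt secret configured in .dlt/secrets.toml (or the PII_SALT environment variable).

```python
import hashlib
import dlt

# Hypothetical resource containing PII.
@dlt.resource(name="app_users")
def app_users():
    yield {"user_id": "u-123", "email": "ada@example.com", "ssn": "000-00-0000"}

def pseudonymize(row):
    # Deterministic hashing with a secret salt keeps IDs joinable without exposing them.
    # The 'pii_salt' key is an assumption: set it in .dlt/secrets.toml or as PII_SALT.
    salt = dlt.secrets["pii_salt"]
    for field in ("user_id", "email"):
        row[field] = hashlib.sha256(f"{salt}{row[field]}".encode()).hexdigest()
    row.pop("ssn", None)  # drop the sensitive column before it reaches the destination
    return row

dlt.pipeline("privacy_demo", destination="duckdb").run(app_users().add_map(pseudonymize))
```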

5. Operational Health

Is the pipeline running correctly?

These checks monitor the reliability of the delivery mechanism itself. Even perfectly valid data is "bad quality" if it arrives 24 hours late.

| Job to be Done | dlt Solution | Learn More | Availability |
| --- | --- | --- | --- |
| Detect empty loads | Volume Monitoring: Inspect load_info metrics after a run. Trigger an alert if row_count drops to zero unexpectedly. | Running in Production | dlt |
| Collect custom metrics | Custom Metrics: Track business-specific statistics during extraction (e.g., page counts, API call counts) using dlt.current.resource_metrics(). | Resource | dlt |
| Monitor Freshness (SLA) | Load Metadata: Query the _dlt_loads table in your destination to verify the inserted_at timestamp meets your freshness SLA. | Destination Tables | dlt |
| Audit Schema Drift | Schema History: Even in permissive modes, dlt tracks every schema change. Use the audit trail to see exactly when a new column was introduced. | Schema Evolution | dlt |
| Alert on schema changes | Schema Update Alerts: Inspect load_info.load_packages[].schema_update after each run to detect new tables/columns and trigger alerts (e.g., Slack notification when schema drifts). | Running in Production | dlt |
| Trace Lineage | Load IDs: Every row in your destination is tagged with _dlt_load_id. You can trace any specific record back to the exact pipeline run that produced it. | Destination Tables | dlt |
| Alert on failures | Slack Integration: Send pipeline success/failure notifications via Slack incoming webhooks configured in dlt.secrets. | Alerting | dlt |
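
A minimal sketch of post-run health checks, assuming a hypothetical events resource; the Slack webhook secret name is an assumption, so that call is shown commented out.

```python
import dlt

# Hypothetical resource standing in for any source.
@dlt.resource(name="events")
def events():
    yield {"event_id": 1, "event_type": "click"}

pipeline = dlt.pipeline("ops_demo", destination="duckdb")
load_info = pipeline.run(events())

# Volume monitoring: per-table row counts from the last normalize step.
row_counts = pipeline.last_trace.last_normalize_info.row_counts
if row_counts.get("events", 0) == 0:
    print("Alert: no rows loaded for 'events'")

# Schema drift: each load package carries the schema updates it applied.
for package in load_info.load_packages:
    for table_name, table in package.schema_update.items():
        columns = [column["name"] for column in table["columns"].values()]
        print(f"Schema change in table '{table_name}': columns {columns}")

# Alerting (requires a configured incoming webhook; the secret name is an assumption):
#   from dlt.common.runtime.slack import send_slack_message
#   send_slack_message(dlt.secrets["slack_incoming_hook"], f"Load completed: {load_info}")
```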

Validate data quality during development

Use the dlt Dashboard to interactively inspect your pipeline during development. The dashboard lets you:

  • Query loaded data and verify row counts match expectations
  • Inspect schemas, columns, and all column hints
  • Check incremental state of each resource
  • Review load history and trace information
  • Catch issues like pagination bugs (suspiciously round counts) before they reach production
Launch it with:

```sh
dlt pipeline {pipeline_name} show
```

Get the full lifecycle with dltHub

The features marked dlt in the tables above are available today in the open-source library. dltHub provides a managed runtime and additional data quality capabilities:

  • Run dlt on the dltHub runtime — Execute all your existing dlt pipelines with managed infrastructure, scheduling, and observability built-in.
  • Built-in data quality checks — Use is_in(), is_unique(), is_primary_key(), and more with row-level and batch-level validation.
  • Pre-load and post-load execution — Run checks in staging before data hits your warehouse, or validate after load with full dataset access.
  • Follow-up actions on failure — Bad data quarantine to enable faster debugging.
  • Column-level hint forwarding — Track PII and other sensitive column hints through SQL transformations.
Early Access

Interested in the full data quality lifecycle? Join dltHub early access to get started.

Learn more about dltHub Data Quality →

