
Data quality 🧪

warning

🚧 This feature is under development. Interested in becoming an early tester? Join dltHub early access.

dltHub will allow you to define data validation rules in Python. This ensures your data meets expected quality standards at the ingestion step.

Key features

With dltHub, you will be able to:

  • Define data tests and quality contracts in Python.
  • Apply both row-level and batch-level validation.
  • Enforce constraints on distributions, boundaries, and expected values.

Stay tuned for updates as we expand these capabilities! 🚀

Checks

A data quality check, or check, is a function applied to data that returns a check result, or result (a boolean, integer, float, etc.). The result is then converted into a success / fail check outcome, or outcome (a boolean), based on a decision.

info

A test verifies that code behaves as expected. A check verifies that the data meets some expectations. Code tests enable you to make changes with confidence, and data checks help you monitor your live systems.

For example, the check is_in(column_name, accepted_values) verifies that the column only includes accepted values. Running the check counts successful records to compute the success rate (result). The outcome is a success if the success rate is 100%, i.e., all records passed the check (decision).

The following snippet shows a single is_in() check being run against the orders table in the point_of_sale dataset.

import dlt
from dlt.hub import data_quality as dq

dataset = dlt.dataset("duckdb", "point_of_sale")
checks = [
    dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])
]

# prepare a query to execute all checks
check_plan = dq.prepare_checks(dataset["orders"], checks=checks)
# execute checks and get results in-memory
check_plan.df()

Check level

The check level indicates the granularity of the check result. For instance:

  • Row-level checks produce a result per record. It's possible to inspect which specific records passed / failed the check (see the sketch below).

  • Table-level checks produce a result per table (e.g., result is "the number of unique values" and decision is "is this greater than 5?").

    These checks can often be rewritten as row-level checks (e.g., "is this value unique?").

  • Dataset-level checks produce a result per dataset. This typically involves multiple tables, temporal comparisons, or pipeline information (e.g., "the number of rows in 'orders' is higher than the number of rows in 'customers'").

important

Notice that the check level relates to the result and not the input data of the check. For instance, a row-level check can involve multiple tables as input.
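
To make the row-level case concrete, the sketch below prepares a single row-level check and inspects its per-record results in memory. The shape of the returned dataframe, in particular a per-row boolean outcome column, is an assumption made for illustration rather than a documented contract.

import dlt
from dlt.hub import data_quality as dq

dataset = dlt.dataset("duckdb", "point_of_sale")

# row-level check: one result per record in the orders table
checks = [
    dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])
]
check_plan = dq.prepare_checks(dataset["orders"], checks=checks)

# execute in-memory and inspect which records failed
# (assumption: the result frame exposes a per-row boolean column; the real name may differ)
results = check_plan.df()
failed = results[~results["outcome"]]
print(failed)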

Built-in checks

The dltHub library includes many built-in checks: is_in(), is_unique(), is_primary_key(), and more. The built-in case() and where() simplify custom row-level and table-level checks, respectively.

For example, the following are equivalent:

from dlt.hub import data_quality as dq

dq.checks.is_unique("foo")
dq.checks.case("COUNT(*) OVER (PARTITION BY foo) = 1")
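
For a table-level counterpart, the unique-count example from the previous section could be expressed with where(). Treat the signature as an assumption: the sketch assumes where() takes a SQL boolean expression that is evaluated once per table.

from dlt.hub import data_quality as dq

# table-level: a single result for the whole table
# (assumption: where() accepts a SQL expression evaluated once per table)
dq.checks.where("COUNT(DISTINCT foo) > 5")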

Custom checks (WIP)

Custom checks can be implemented as a dlt.hub.transformation that matches a specific schema, or as a subclass of _BaseCheck for full control. This allows you to use any language supported by transformations, with eager/lazy and in-memory/on-backend execution.

Notes:

  • Should have utilities to statically validate check definitions (especially lazy)
  • Should have testing utilities that make it easy to unit test checks (same utilities as transformations)

Lifecycle

Data quality checks can be executed at different stages of the pipeline lifecycle. This choice has several impacts, including:

  • the input data available for the check
  • the compute resources used
  • the actions available (e.g., drop invalid load)

Post-load

The post-load execution is the simplest option. The pipeline goes through Extract -> Normalize -> Load as usual. Then, the checks are executed on the destination.

Properties:

  • Failed records can't be dropped or quarantined before load. All records must be written, checked, and then handled. This only works with write_disposition="append" or destinations supporting snapshots (e.g. iceberg, ducklake).
  • Checks have access to the full dataset. This includes current and past loads + internal dlt tables.
  • Computed directly on the destination. This scales well with the size of the data and the complexity of the checks.
  • Results and outcome are directly stored on the dataset. No data movement is required.
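
As a sketch of this flow, reusing the duckdb destination and point_of_sale dataset from earlier (orders_source() is a hypothetical source used only for illustration):

import dlt
from dlt.hub import data_quality as dq

# 1. run the pipeline as usual: Extract -> Normalize -> Load
pipeline = dlt.pipeline("pos_pipeline", destination="duckdb", dataset_name="point_of_sale")
pipeline.run(orders_source())  # hypothetical source, for illustration only

# 2. execute the checks directly on the destination, against the full dataset
checks = [dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])]
check_plan = dq.prepare_checks(pipeline.dataset()["orders"], checks=checks)
print(check_plan.df())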

Pre-load (staging)

The pre-load execution via a staging dataset allows you to execute checks on the destination and trigger actions before data is loaded into the final dataset. This effectively runs post-load checks before a second load phase.

info

dlt uses staging datasets for other features such as merge and replace write dispositions.

Properties:

  • Failed records can be dropped or quarantined before load. This works with every write_disposition.
  • Requires a destination that supports staging datasets.
  • Checks have access to the current load.
    • If the staging dataset is on the same destination, checks can access the full dataset.
    • If the staging dataset is on a different destination, accessing the full dataset requires communication between the staging dataset and the main dataset.
  • Computed on the staging destination. This scales well with the size of the data and the complexity of the checks.
  • Data, check results, and outcomes can be safely stored on the staging dataset until review. This enables human-in-the-loop workflows without reprocessing the full pipeline.
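
A minimal sketch of this setup, assuming the staging dataset sits on the same duckdb destination and follows dlt's usual <dataset_name>_staging naming convention; how checks and actions plug into the staging phase is still work in progress:

import dlt
from dlt.hub import data_quality as dq

# access the staging dataset before the second load phase
# (assumption: the staging dataset is named "<dataset_name>_staging" on the same destination)
staging = dlt.dataset("duckdb", "point_of_sale_staging")

checks = [dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])]
check_plan = dq.prepare_checks(staging["orders"], checks=checks)

# review results here; dropping or quarantining failed records would happen
# at this point, before the data is promoted to the final dataset
print(check_plan.df())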

Pre-load (in-memory)

The in-memory pre-load execution runs checks using duckdb against the load packages (i.e., temporary files) stored on the machine that runs dlt. This allows you to trigger actions before data is loaded into the destination.

note

This is equivalent to using a staging destination that is the local filesystem. This section highlights the trade-offs of this choice.

Properties:

  • Failed records can be dropped or quarantined before load. This works with every write_disposition.
  • Checks only have access to the current load. Checking against the full dataset requires communication between the staging destination and the main destination.
  • Computed on the machine running the pipeline. The machine's resources need to match the compute requirements of the checks.
  • Data, check results, and outcomes may be lost if the runtime is ephemeral (e.g., an AWS Lambda timeout). In this case, the pipeline must process the data again.
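
To illustrate the mechanism rather than the dlt.hub API itself, the sketch below runs an is_in-style SQL check with duckdb directly against a local load-package file. The file path and parquet format are assumptions made for illustration.

import duckdb

# query a local load-package file before anything reaches the destination
# (assumption: the normalized load package is a parquet file at this path)
passed, total = duckdb.sql("""
    SELECT
        count(*) FILTER (WHERE payment_methods IN ('card', 'cash', 'voucher')) AS passed,
        count(*) AS total
    FROM 'load_package/orders.parquet'
""").fetchone()

print(f"success rate: {passed / total:.2%}")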

Migration and versioning (WIP)

As the real world changes, data quality checks for your pipeline / dataset may be added, removed, or modified. Tracking these changes is required for proper auditing.

For example, the check is_in("division", ["europe", "america"]) defined in 2024 could evolve to is_in("division", ["europe", "america", "asia"]) in 2026.

Notes:

  • checks need to be serialized and hashed (trivial for lazy checks); see the sketch below
  • checks can be stored on the schema (and consequently on the destination too)
  • this is the same challenge as versioning transformations
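
As a generic illustration of the serialize-and-hash idea (not the dlt.hub mechanism), a check definition can be reduced to a canonical string and hashed, so the 2024 and 2026 versions of the is_in("division", ...) check above get distinct identifiers:

import hashlib

def check_version_hash(check_name: str, *args) -> str:
    # canonical serialization of the check definition, then a stable hash
    canonical = f"{check_name}({', '.join(repr(a) for a in args)})"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# the two versions of the check produce different identifiers
print(check_version_hash("is_in", "division", ["europe", "america"]))
print(check_version_hash("is_in", "division", ["europe", "america", "asia"]))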

Action (WIP)

After running checks, actions can be triggered based on the check result or check outcome.

Notes:

  • actions can be configured globally or per-check
  • planned actions: drop data, quarantine data (move to a special dataset), resolve (e.g., fill value, set default, apply transformation), fail (prevents load), raise/alert (sends notification)
  • This needs to be configurable from outside the source code (e.g., via config.toml). The same checks would require different actions during development vs. production.
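
Until built-in actions land, a fail-style action can be wired by hand from the check results. This is a sketch only, reusing a check_plan prepared as in the earlier snippets; the per-row boolean outcome column is again an assumption:

# `check_plan` prepared as in the earlier snippets
results = check_plan.df()

# hand-rolled "fail" action: abort the run if any record did not pass
# (assumption: a per-row boolean "outcome" column; the real column name may differ)
if not results["outcome"].all():
    raise RuntimeError("data quality checks failed; aborting load")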

Metrics
