
Data quality (advanced)

warning

🚧 This feature is under development. Interested in becoming an early tester? Join dltHub early access.

This page covers more in-depth details about data quality features, such as metrics and checks. It shows the full flexibility available for advanced use cases. It can also serve as an FAQ and helper for debugging.

Note that these APIs are more likely to change than the basics page.

Metrics and checks are DltSource objects

Internally, metrics and checks are defined as dlt.source objects that are executed by the pipeline. You can interact with these objects directly, unlocking the ability to:

  • load data into destination A and write metrics and checks to destination B
  • run data quality checks and metrics independently of the primary pipeline.run() (e.g., on a schedule, after another trigger, etc.)
  • dynamically modify metrics and checks to execute via Python code
  • have a central data quality pipeline that you run for multiple pipelines / datasets

Metrics and checks use a special dlt.Schema

Currently, a dlt.Dataset is associated with a single dlt.Schema. This defines what tables are known and available to the dataset's API (e.g., dataset.table(...)).

But a dlt.Pipeline can be associated with multiple dlt.Schema, such as when loading several sources.

The data quality metrics and checks sources have a dedicated _dlt_data_quality schema with internal tables for metrics and checks results. These tables are not directly visible via the dlt.Dataset API (this may change in the future).

The dq.read_metric() and dq.read_check() functions retrieve data from these internal tables.

Check anatomy: result, decision, outcome, level

In its simplest form, a check is a function that returns an outcome (a boolean indicating success or failure). Under the hood, it involves a result (any type) that is computed and converted to an outcome via a decision function.

For example, the check is_in(column_name, accepted_values) has a success outcome if all values of column_name are in accepted_values.

A more granular implementation could be:

  • count the number of records with valid values (result)
  • count the total number of records (result)
  • check whether the ratio of valid records is higher than 0.95 (decision)
  • if so, the outcome is a success; otherwise, it is a failure (outcome)
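The steps above can be sketched in plain Python (this is a conceptual illustration, not dlt's check API; the function name and result structure are made up for the example):

```python
# Conceptual sketch of a granular is_in check: compute a result,
# apply a decision function, and emit an outcome.

ACCEPTED = {"shipped", "pending", "cancelled"}

def is_in_check(values, accepted, tolerance=0.95):
    valid = sum(1 for v in values if v in accepted)  # result: count of valid records
    total = len(values)                              # result: count of all records
    ratio = valid / total if total else 1.0
    decision = ratio >= tolerance                    # decision: compare ratio to threshold
    return {
        "result": {"valid": valid, "total": total, "ratio": ratio},
        "outcome": "success" if decision else "failure",
    }

rows = ["shipped", "pending", "shipped", "typo", "cancelled"]
print(is_in_check(rows, ACCEPTED))  # ratio 0.8 < 0.95 -> failure
```

Keeping the intermediary result separate from the outcome is what lets the same computation be re-evaluated later with a different tolerance.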

Most built-in checks return the outcome directly, but they can be configured to store intermediary results. Those are most useful for custom checks, or when you want to assign different tolerance thresholds per environment or context.

Check level

So far, we explained how a result is converted into an outcome. The check level describes the granularity of the result.

For instance:

  • Row-level checks produce a result per record. It's possible to inspect which specific records passed or failed the check.

  • Table-level checks produce a result per table (e.g., result is "the number of unique values" and decision is "is this greater than 5?").

    These checks can often be rewritten as row-level checks (e.g., "is this value unique?").

  • Dataset-level checks produce a result per dataset. This typically involves multiple tables, temporal comparisons, or pipeline information (e.g., "the number of rows in orders is higher than the number of rows in customers").
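To make the three levels concrete, here is a plain-Python sketch (not dlt's check API) of the same small dataset inspected at each level; the table contents are invented for the example:

```python
# Sample data: two tables in one dataset.
orders = [
    {"id": 1, "customer_id": 10},
    {"id": 2, "customer_id": 11},
    {"id": 3, "customer_id": 10},
]
customers = [{"id": 10}, {"id": 11}]

# Row level: one result per record (we can see which rows pass).
row_results = [o["customer_id"] is not None for o in orders]

# Table level: one result per table, e.g., count of unique customer_id values.
unique_customers_in_orders = len({o["customer_id"] for o in orders})
table_outcome = unique_customers_in_orders > 5  # decision -> False here

# Dataset level: one result per dataset, comparing tables.
dataset_outcome = len(orders) >= len(customers)  # True here
```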

important

Notice that the check level relates to the result and not the input data of the check. For instance, a row-level check can involve multiple tables as input.
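For example, a referential check can read two tables as input yet still be row-level, because it produces one result per row (sketched in plain Python with invented data, not dlt's check API):

```python
# Row-level check whose *input* spans two tables, but whose
# *result* is still one value per orders row.
orders = [
    {"id": 1, "customer_id": 10},
    {"id": 2, "customer_id": 99},  # dangling reference
]
customers = [{"id": 10}, {"id": 11}]

known_ids = {c["id"] for c in customers}
row_results = {o["id"]: o["customer_id"] in known_ids for o in orders}
# row_results identifies exactly which rows failed: {1: True, 2: False}
```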
