Version: 1.7.0 (latest)

🧪 Python-based transformations

dlt+

This page is for dlt+, which requires a license. Join our early access program for a trial license.

caution

🚧 This feature is under development, and the interface may change in future releases. Interested in becoming an early tester? Join dlt+ early access.

dlt+ allows you to define Arrow-based transformations that operate on a cache. The transformation code itself lives in the ./transformations folder. In this section, you will learn how to define Arrow-based transformations in Python.

Generate template

Since this feature is still under development and documentation is limited, we recommend starting with a template. You can generate one using the following command:

note

Make sure you have configured your cache and transformation in the dlt.yml file before running the command below.

dlt transformation <transformation-name> render-t-layer
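Before running the command, the cache and transformation must be declared in dlt.yml. The exact schema depends on your dlt+ version, so treat the fragment below as an illustrative sketch (all names — my_cache, my_transformation, the dataset and table names — are placeholders; check the template generated for your project for the authoritative layout):

```yaml
# Illustrative dlt.yml excerpt — key names and nesting are assumptions,
# verify against your generated project files.
caches:
  my_cache:
    inputs:
      - dataset: my_input_dataset
    outputs:
      - dataset: my_output_dataset

transformations:
  my_transformation:
    engine: arrow
    cache: my_cache
```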

Running this command will create a new set of transformations inside the ./transformations folder. The generated template includes:

  • Transformation functions that manage incremental loading state based on dlt_load_id.
  • Two transformation functions that implement user-defined transformations.
  • A staging view, which pre-selects only rows eligible for the current transformation run.
  • A main output table, which initially just forwards all incoming rows unchanged.

If you run the generated transformations without modifying them, the run will fail. This happens because your cache expects an aggregated table named after the <transformation-name>, but the freshly generated transformations do not yet produce it. To resolve this, you can either:

  • Update your cache settings to match the new transformation.
  • Implement a transformation that aligns with the expected table structure.

Understanding incremental transformations

The default transformations generated by the scaffolding command work incrementally using the dlt_load_id from the incoming dataset. Here's how it works:

  1. The dlt_loads table is automatically available in the cache.
  2. The transformation layer identifies which load_ids exist in the incoming dataset.
  3. It selects only those load_ids that have not yet been processed (i.e., missing from the processed_load_ids table).
  4. Once all transformations are complete, the processed_load_ids table is updated with the processed load_ids.
  5. The cache saves the processed_load_ids table to the output dataset after each run.
  6. When syncing the input dataset, the cache reloads the processed_load_ids table from the output dataset (if available).

This mechanism allows incremental transformations to function seamlessly, even on ephemeral machines, where the cache is not retained between runs.
