🧪 Python-based transformations
This page is for dlt+, which requires a license. Join our early access program for a trial license.
🚧 This feature is under development, and the interface may change in future releases. Interested in becoming an early tester? Join dlt+ early access.
dlt+ allows you to define Arrow-based transformations that operate on a cache. The actual transformation code is located in the `./transformations` folder.
In this section, you will learn how to define Arrow-based transformations with Python.
Generate template
Since this feature is still under development and documentation is limited, we recommend starting with a template. You can generate one using the following command:
Make sure you have configured your cache and transformation in the `dlt.yml` file before running the command below.

```sh
dlt transformation <transformation-name> render-t-layer
```
Running this command will create a new set of transformations inside the `./transformations` folder. The generated template includes:

- Transformation functions that manage incremental loading state based on `dlt_load_id`.
- Two transformation functions that implement user-defined transformations.
- A staging view, which pre-selects only rows eligible for the current transformation run.
- A main output table, which initially just forwards all incoming rows unchanged.
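To make the roles of the staging view and the main output table concrete, here is a minimal, purely illustrative sketch in plain Python (it is not the generated dlt+ code, and the function and column names are assumptions for illustration): the staging step keeps only rows whose load id is still unprocessed, and the main step forwards staged rows unchanged.

```python
def staging_view(rows, unprocessed_load_ids):
    """Pre-select only rows eligible for the current run,
    i.e., rows whose load id has not been processed yet."""
    return [row for row in rows if row["dlt_load_id"] in unprocessed_load_ids]


def main_output_table(staged_rows):
    """Initially just forward all incoming (staged) rows unchanged."""
    return list(staged_rows)


rows = [
    {"id": 1, "dlt_load_id": "100"},  # already processed in an earlier run
    {"id": 2, "dlt_load_id": "101"},  # new in this run
]
staged = staging_view(rows, unprocessed_load_ids={"101"})
print(main_output_table(staged))  # [{'id': 2, 'dlt_load_id': '101'}]
```

Replacing the body of `main_output_table` with your own logic is where the user-defined transformation goes; the staging filter is what keeps the run incremental.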
If you run the generated transformations without modifying them, the execution will fail. This happens because your cache expects an aggregated table corresponding to the `<transformation-name>`, but the newly created transformations do not include it. To resolve this, you can either:
- Update your cache settings to match the new transformation.
- Implement a transformation that aligns with the expected table structure.
Understanding incremental transformations
The default transformations generated by the scaffolding command work incrementally using the `dlt_load_id` from the incoming dataset. Here's how it works:
- The `dlt_loads` table is automatically available in the cache.
- The transformation layer identifies which `load_id`s exist in the incoming dataset.
- It selects only those `load_id`s that have not yet been processed (i.e., missing from the `processed_load_ids` table).
- Once all transformations are complete, the `processed_load_ids` table is updated with the processed `load_id`s.
- The cache saves the `processed_load_ids` table to the output dataset after each run.
- When syncing the input dataset, the cache reloads the `processed_load_ids` table from the output dataset (if available).
This mechanism allows incremental transformations to function seamlessly, even on ephemeral machines, where the cache is not retained between runs.
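The bookkeeping above can be sketched with standard SQL. The following example uses an in-memory SQLite database purely as a stand-in for the cache (the actual dlt+ implementation and table schemas may differ); it shows selecting the unprocessed `load_id`s and then marking them as processed after the run.

```python
import sqlite3

# Stand-in for the cache: a dlt_loads table listing incoming load ids,
# and a processed_load_ids table tracking what has already been handled.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE dlt_loads (load_id TEXT);
    CREATE TABLE processed_load_ids (load_id TEXT);
    INSERT INTO dlt_loads VALUES ('100'), ('101'), ('102');
    INSERT INTO processed_load_ids VALUES ('100');
    """
)

# Select only the load_ids that are missing from processed_load_ids.
new_ids = [
    row[0]
    for row in con.execute(
        "SELECT load_id FROM dlt_loads "
        "WHERE load_id NOT IN (SELECT load_id FROM processed_load_ids) "
        "ORDER BY load_id"
    )
]
print(new_ids)  # ['101', '102']

# Once all transformations complete, record these load_ids as processed,
# so the next run skips them.
con.executemany(
    "INSERT INTO processed_load_ids VALUES (?)", [(i,) for i in new_ids]
)
```

Because `processed_load_ids` is persisted to the output dataset and reloaded on sync, the same selection logic works even when the local cache starts empty.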