Transforming data in Python with Arrow tables or DataFrames
You can transform your data in Python using Pandas DataFrames or Arrow tables. To get started, please read the dataset docs.
Interactively transforming your data in Python
Using the methods explained in the dataset docs, you can fetch data from your destination into a DataFrame or Arrow table in your local Python process and work with it interactively. This even works for filesystem destinations:
The example below reads GitHub reactions data from the issues table and counts the reaction types.
import dlt

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_reactions",
    dev_mode=True
)
# get a dataframe of all reactions from the dataset
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").df()
# calculate and print out the sum of all reactions
counts = reactions.sum(0).sort_values(0, ascending=False)
print(counts)
# alternatively, you can fetch the data as an arrow table
reactions = pipeline.dataset().issues.select("reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket").arrow()
# ... do transformations on the arrow table
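For example, the same reaction totals can be computed with Arrow compute kernels instead of pandas. A minimal sketch using pyarrow.compute, operating on the reactions table fetched above:

import pyarrow.compute as pc

# sum each reaction column of the arrow table and print the totals, largest first
totals = {name: pc.sum(reactions[name]).as_py() for name in reactions.column_names}
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(name, total)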
Persisting your transformed data
Since dlt supports DataFrames and Arrow tables from resources directly, you can use the same pipeline to load the transformed data back into the destination.
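For instance, a DataFrame (or an Arrow table) can be passed straight to pipeline.run() like any other data. A minimal sketch with made-up sample data and a hypothetical reaction_counts table, assuming a pipeline object like the ones defined on this page:

import pandas as pd

# dlt accepts DataFrames and Arrow tables directly as input data
df = pd.DataFrame({"reaction": ["+1", "laugh"], "count": [10, 3]})
pipeline.run(df, table_name="reaction_counts")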
A simple example
The example below creates a new table from an existing users table, keeping only the columns that do not contain private information. Note that we use the iter_arrow() method on the relation to iterate over the Arrow table in chunks instead of fetching it all at once.
import dlt

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)
# get user relation with only a few columns selected, but omitting email and name
users = pipeline.dataset().users.select("age", "amount_spent", "country")
# load the data into a new table called users_clean in the same dataset
pipeline.run(users.iter_arrow(chunk_size=1000), table_name="users_clean")
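If you prefer to work with DataFrames, the relation also exposes a chunked counterpart, iter_df(). A sketch of the same load using DataFrame chunks:

# same load as above, but iterating over DataFrame chunks instead of Arrow tables
users = pipeline.dataset().users.select("age", "amount_spent", "country")
pipeline.run(users.iter_df(chunk_size=1000), table_name="users_clean")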
A more complex example
The example above could easily be done in SQL. Let's assume you'd instead like to do some Arrow transformations in Python. For this, we will create a resource from which we can yield the modified Arrow tables. The same is possible with DataFrames.
import dlt
import pyarrow.compute as pc

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)
# NOTE: this resource will work like a regular resource and support write_disposition, primary_key, etc.
# NOTE: For selecting only users above 18, we could also use the filter method on the relation with ibis expressions
@dlt.resource(table_name="users_clean")
def users_clean():
    users = pipeline.dataset().users
    for arrow_table in users.iter_arrow(chunk_size=1000):
        # we want to filter out users under 18
        age_filter = pc.greater_equal(arrow_table["age"], 18)
        arrow_table = arrow_table.filter(age_filter)
        # we want to hash the email column
        arrow_table = arrow_table.append_column("email_hash", pc.sha256(arrow_table["email"]))
        # we want to remove the email column and name column
        arrow_table = arrow_table.drop(["email", "name"])
        # yield the transformed arrow table
        yield arrow_table

pipeline.run(users_clean())
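As the second NOTE above mentions, the age filter could also be pushed down into the query itself via the relation's ibis expression support, so under-18 rows are never fetched from the destination at all. A sketch, assuming ibis is installed and using a hypothetical users_adult target table; the exact expression syntax may differ between dlt versions, so check the dataset docs:

users = pipeline.dataset().users
# push the filter into the query instead of filtering arrow tables in Python
adults = users.filter(users.age >= 18)
pipeline.run(adults.iter_arrow(chunk_size=1000), table_name="users_adult")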
Other transforming tools
If you want to transform your data before loading, you can use Python as shown above. If you want to transform the data after loading, you can use Pandas or one of the following:
- dbt (recommended)
- dlt SQL client