
The return of ETL in the Python age

· 6 min read
Adrian Brudaru
info

PSSSST! You do ELT, right? not ETL? asking for a friend...

ETL vs ELT? A vendor-driven story.

One of the earliest tools for "ETL" was Pentaho Kettle. Kettle stands for "Kettle Extraction Transformation Transport Load Environment", signalling that it transforms the data before loading it. In practice, it was usually used to load data that was later transformed in SQL - either via "SQL scripts" while still in the tool, or via database triggers or views outside of it.

Indeed, the tool's creators imagined some folks would write Java to transform data before loading, but the vast majority of data users just wanted to use SQL.

Sounds familiar? This is not so different from today's "ELT", is it?

Why did we call it ELT?

The people

Well, first of all, SQL is much more accessible and very powerful for transforming tables, columns and rows, whereas programming languages shine at handling single values. So before purpose-built tooling existed, data people were already doing the transform in SQL - it just made sense.

The "EL" vendors

In the decade following Pentaho, SaaS solutions started offering pipelines that load data into your database, removing the option to tinker with it before loading. For this reason, they called it "ELT".

The db vendors

The concept also resonated with MPP databases (massively parallel processing), such as Snowflake, Redshift and BigQuery, which were more than happy to encourage doing all the compute on their side.

The "T in ELT" vendors

Another puzzle piece was dbt, a tool purpose-built for SQL transforms. So if there's a question of ETL or ELT, dbt can only answer ELT. In dbt's world view, data starts dirty in your warehouse, where you "rename, cast, join, enrich" - a true ELT. To make the drudgery of data cleaning in SQL easier, dbt offers some Python support for generating some of the typing and renaming SQL. It also offers a little bit of Python support for scalar operations on some database vendors' systems.

What do we really do?

Most of us do a little bit of both: we extract with Python, and the next steps are loading, cleaning and curation. In some cases, cleaning and curation are optional. For example, when we load a report from another platform, we will probably not need to clean or curate anything.

Where do we clean data?

Data cleaning usually refers to normalising the data into correct types, usable names, and so on. Doing this in SQL means writing a lot of manual code that needs to be maintained. On the other hand, structuring data in Python isn't easy either - it's just less technically difficult - and when metadata is missing, it becomes guesswork.

So, technically, the easier place to clean data is Python, but most people will likely do it in SQL, simply because they are more practiced in SQL.
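
To make "cleaning" concrete, here is a small hand-rolled sketch (field names are made up for illustration) of the kind of renaming, casting and flattening that has to be written and maintained by hand when it isn't automated:

# made-up raw record; keys and types are purely illustrative
raw = {'User ID': '42', 'Signup-Date': '2023-06-01', 'meta': {'plan': 'pro'}}

def clean(record: dict) -> dict:
    # rename keys to usable snake_case names and cast strings to proper types;
    # without metadata, picking the right type is guesswork
    return {
        'user_id': int(record['User ID']),
        'signup_date': record['Signup-Date'],  # ideally parsed into a real date type
        'meta__plan': record['meta']['plan'],  # flatten the nested object
    }

print(clean(raw))
# {'user_id': 42, 'signup_date': '2023-06-01', 'meta__plan': 'pro'}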

Where do we transform data?

When it comes to working with tables, SQL is still the better place to be. Joins and aggregations are the core operations that happen here, and they would be much harder to handle scalably in Python.
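
To make the contrast concrete, here is a toy, hand-rolled join plus aggregation in plain Python (tables are made up for illustration) - trivial at this size, but not something you would want to maintain or scale by hand, whereas SQL does it in a single JOIN and GROUP BY:

# made-up toy tables, purely illustrative
users = [{'id': 1, 'country': 'DE'}, {'id': 2, 'country': 'US'}]
orders = [{'user_id': 1, 'amount': 10}, {'user_id': 1, 'amount': 5}, {'user_id': 2, 'amount': 7}]

# "join": index one side by its key
country_by_user = {u['id']: u['country'] for u in users}

# "aggregate": sum order amounts per country
revenue = {}
for order in orders:
    country = country_by_user[order['user_id']]
    revenue[country] = revenue.get(country, 0) + order['amount']

print(revenue)  # {'DE': 15, 'US': 7}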

dlt puts the small t back in EtlT - let's see how.

So, Python is still superior at a few operations:

  • Typing, renaming, normalising, unpacking
  • Complex scalar operations

Meanwhile, we leave the aggregations and joins to the big T: SQL.

Normalisation, typing, unpacking

dlt does this well out of the box. Automatic typing, renaming, flattening and DDL deployment are all handled by the schema inference and evolution engine. This engine is configurable in both how it works and what it does; you can read more here: Normaliser, schema settings.

Here is a usage example (it's built into the pipeline):


import dlt

# Json, dataframes, iterables, all good
# the data will be auto typed and normalised
data = [{'id': 1, 'name': 'John'}]

# open connection
pipe = dlt.pipeline(destination='bigquery',
                    dataset_name='raw_data')

# self-explanatory declarative interface
job_status = pipe.run(data,
                      write_disposition="merge",
                      primary_key="id",
                      table_name="users")

# optionally load schema and metadata
pipe.run([job_status],
         write_disposition="append",
         table_name="loading_status")

Scalar operations

Sometimes we need to edit a column's value in some very specific way for which SQL doesn't quite cut it. Sometimes, we have data we need to pseudonymise before loading for regulatory reasons.

Because dlt is a library, you can easily change how the data stream is produced or ingested. Besides your own customisations, dlt also supports injecting your transform code into the event stream - see an example here.

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

import dlt
import hashlib

@dlt.source
def dummy_source(prefix: str = None):
    @dlt.resource
    def dummy_data():
        for _ in range(3):
            yield {'id': _, 'name': f'Jane Washington {_}'}
    return dummy_data(),

def pseudonymize_name(doc):
    '''
    Pseudonymisation is a deterministic type of PII-obscuring.
    Its role is to allow identifying users by their hash,
    without revealing the underlying info.
    '''
    # add a constant salt to generate a deterministic hash
    salt = 'WI@N57%zZrmk#88c'
    salted_string = doc['name'] + salt
    sh = hashlib.sha256()
    sh.update(salted_string.encode())
    hashed_string = sh.digest().hex()
    doc['name'] = hashed_string
    return doc


# 1. Create an instance of the source so you can edit it.
data_source = dummy_source()
# 2. Modify this source instance's resource
data_resource = data_source.dummy_data().add_map(pseudonymize_name)
# 3. Inspect your result
for row in data_resource:
    print(row)
# {'id': 0, 'name': '96259edb2b28b48bebce8278c550e99fbdc4a3fac8189e6b90f183ecff01c442'}
# {'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'}
# {'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'}

pipeline = dlt.pipeline(pipeline_name='example',
                        destination='bigquery',
                        dataset_name='normalized_data')
load_info = pipeline.run(data_resource)

The big T

Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools. dlt offers a dbt runner to get you started easily with your transformation package.

pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
    dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
dbt = dlt.dbt.package(
    pipeline,
    "pipedrive/dbt_pipedrive/pipedrive",  # or use public git "https://github.com/dbt-labs/jaffle_shop.git"
    venv=venv
)

# run the models and collect any info
# If running fails, the error will be raised with full stack trace
models = dbt.run_all()

# optionally log dbt status
pipeline.run([models],
             write_disposition="append",
             table_name="_models_log")

In conclusion

ETL vs ELT was never really a debate. With some exceptions, almost everyone transforms the data in SQL - but what they call this process depends on who's telling the story.

While it's easier to do most of the transformation in SQL, the tedious part is completely automatable in Python, and the dirty data doesn't need manual normalisation. With dlt, you can do ETL or ELT - or, even better, both, as EtLT.

Or, if you're feeling funny, you can add duckdb in the middle and go full EtLTLT where you have an additional T step in the middle for the kinds of operations that could be done locally. And afterwards you could load to operational systems to add one more L to the name :)
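
As a rough sketch of that extra "t" in the middle - one possible wiring, with made-up pipeline, dataset and table names, and assuming dlt's sql_client interface for querying the local duckdb database:

import dlt

data = [{'id': 1, 'name': 'John'}, {'id': 2, 'name': 'Jane'}]

# Et(L): extract, let dlt normalise, and load into a local duckdb database
local = dlt.pipeline(pipeline_name='local_t', destination='duckdb', dataset_name='staging')
local.run(data, table_name='users')

# t: a local SQL step on the duckdb database dlt just wrote to (interface is an assumption)
with local.sql_client() as client:
    rows = client.execute_sql("SELECT id, upper(name) AS name FROM users")

# LT: load the locally transformed rows onward to the warehouse, where the big T (e.g. dbt) runs
warehouse = dlt.pipeline(pipeline_name='to_warehouse', destination='bigquery', dataset_name='curated')
warehouse.run(({'id': r[0], 'name': r[1]} for r in rows), table_name='users_clean')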

Fundamentally, we all agree it's all ETL, with the flavors simply designating specific sub-types.

Start using dlt today

What are you waiting for?

This demo works on Codespaces, a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through the further steps.
The demo uses the Continue VSCode extension.

Off to Codespaces!
