
The return of ETL in the Python age

· 6 min read
Adrian Brudaru
info

PSSSST! You do ELT, right? not ETL? asking for a friend...

ETL vs ELT? A vendor-driven story.

One of the earliest tools for "ETL" was Pentaho Kettle. Kettle stands for "Kettle Extraction Transformation Transport Load Environment", signalling that it transforms the data before loading it. In practice, it was usually used to load data that was later transformed in SQL - either via "SQL scripts" while still in the tool, or via database triggers or views outside of it.

Indeed, the tool's creators imagined some folks would write Java to transform data before loading, but the vast majority of data users just wanted to use SQL.

Sounds familiar? This is not so different from today's "ELT", is it?

Why did we call it ELT?

The people

Well, first of all, SQL is much more accessible and very powerful for transforming tables, columns and rows, whereas programming languages shine at handling single values. So before purpose-built tooling existed, data people were already doing the transform in SQL - it just made sense.

The "EL" vendors

In the decade following Pentaho, SaaS solutions started offering pipelines that load data into your database, removing the option to tinker with it before loading. For this reason, they called it "ELT".

The db vendors

The concept also resonated with MPP databases (massively parallel processing), such as Snowflake, Redshift and BigQuery, which were more than happy to encourage doing all the compute on their side.

The "T in ELT" vendors

Another puzzle piece was dbt, a tool purpose-built for SQL transforms. So if there's a question of ETL or ELT, dbt can only answer ELT. In dbt's world view, data starts dirty in your warehouse, where you "rename, cast, join, enrich" - a true ELT. To make the drudgery of data cleaning in SQL easier, dbt offers some Python support for generating some of the typing and renaming SQL. It also offers a little bit of Python support for scalar operations on some database vendors' systems.

What do we really do?

Most of us do a little bit of both: we extract with Python, and the next steps are loading, cleaning and curation. In some cases, cleaning and curation are optional. For example, when we load a report from another platform, we will probably not need to clean or curate anything.

Where do we clean data?

Data cleaning usually refers to normalising the data into correct types, usable names, and so on. Doing this in SQL means writing a lot of manual code that needs to be maintained. On the other hand, structuring data in Python isn't easy either - it's just less technically difficult - and when metadata is missing, it becomes guesswork.

So, technically, the easier place to clean data is Python, but most people will likely do it in SQL, simply because they are more practiced in SQL.
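
To make "cleaning" concrete, here is a small hand-rolled sketch (field names are made up for illustration) of the kind of renaming, casting and flattening that has to be written and maintained by hand when it isn't automated:

# made-up raw record; keys and types are purely illustrative
raw = {'User ID': '42', 'Signup-Date': '2023-06-01', 'meta': {'plan': 'pro'}}

def clean(record: dict) -> dict:
    # rename keys to usable snake_case names and cast strings to proper types;
    # without metadata, picking the right type is guesswork
    return {
        'user_id': int(record['User ID']),
        'signup_date': record['Signup-Date'],  # ideally parsed into a real date type
        'meta__plan': record['meta']['plan'],  # flatten the nested object
    }

print(clean(raw))
# {'user_id': 42, 'signup_date': '2023-06-01', 'meta__plan': 'pro'}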

Where do we transform data?

When it comes to working with tables, SQL is still the better place to be. Joins and aggregations are the core operations that happen here, and they would be much harder to handle scalably in Python.
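
To make the contrast concrete, here is a toy, hand-rolled join plus aggregation in plain Python (tables are made up for illustration) - trivial at this size, but not something you would want to maintain or scale by hand, whereas SQL does it in a single JOIN and GROUP BY:

# made-up toy tables, purely illustrative
users = [{'id': 1, 'country': 'DE'}, {'id': 2, 'country': 'US'}]
orders = [{'user_id': 1, 'amount': 10}, {'user_id': 1, 'amount': 5}, {'user_id': 2, 'amount': 7}]

# "join": index one side by its key
country_by_user = {u['id']: u['country'] for u in users}

# "aggregate": sum order amounts per country
revenue = {}
for order in orders:
    country = country_by_user[order['user_id']]
    revenue[country] = revenue.get(country, 0) + order['amount']

print(revenue)  # {'DE': 15, 'US': 7}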

dlt puts the small t back in EtlT - let's see how.

So, Python is still superior at a few operations:

  • Typing, renaming, normalising, unpacking
  • Complex scalar operations

Meanwhile, we leave the aggregations and joins to the big T: SQL.

Normalisation, typing, unpacking

dlt does this well out of the box. Automatic typing, renaming, flattening and DDL deployment are all handled by the schema inference and evolution engine. This engine is configurable in both how it works and what it does; you can read more here: Normaliser, schema settings.

Here is a usage example (it's built into the pipeline):


import dlt

# Json, dataframes, iterables, all good
# the data will be auto typed and normalised
data = [{'id': 1, 'name': 'John'}]

# open connection
pipe = dlt.pipeline(destination='bigquery',
                    dataset_name='raw_data')

# self-explanatory declarative interface
job_status = pipe.run(data,
                      write_disposition="merge",
                      primary_key="id",
                      table_name="users")

# optionally load schema and metadata
pipe.run([job_status],
         write_disposition="append",
         table_name="loading_status")

Scalar operations

Sometimes we need to edit a column's value in some very specific way for which SQL doesn't quite cut it. Sometimes, we have data we need to pseudonymise before loading for regulatory reasons.

Because dlt is a library, you can easily change how the data stream is produced or ingested. Besides your own customisations, dlt also supports injecting your transform code into the event stream - see an example here.

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

import dlt
import hashlib

@dlt.source
def dummy_source(prefix: str = None):
    @dlt.resource
    def dummy_data():
        for _ in range(3):
            yield {'id': _, 'name': f'Jane Washington {_}'}
    return dummy_data(),

def pseudonymize_name(doc):
    '''
    Pseudonymisation is a deterministic type of PII-obscuring.
    Its role is to allow identifying users by their hash,
    without revealing the underlying info.
    '''
    # add a constant salt to generate a deterministic hash
    salt = 'WI@N57%zZrmk#88c'
    salted_string = doc['name'] + salt
    sh = hashlib.sha256()
    sh.update(salted_string.encode())
    hashed_string = sh.digest().hex()
    doc['name'] = hashed_string
    return doc


# 1. Create an instance of the source so you can edit it.
data_source = dummy_source()
# 2. Modify this source instance's resource
data_resource = data_source.dummy_data().add_map(pseudonymize_name)
# 3. Inspect your result
for row in data_resource:
    print(row)
# {'id': 0, 'name': '96259edb2b28b48bebce8278c550e99fbdc4a3fac8189e6b90f183ecff01c442'}
# {'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'}
# {'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'}

pipeline = dlt.pipeline(pipeline_name='example',
                        destination='bigquery',
                        dataset_name='normalized_data')
load_info = pipeline.run(data_resource)

The big T

Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools. dlt offers a dbt runner to get you started easily with your transformation package.

pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
    dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
dbt = dlt.dbt.package(
    pipeline,
    "pipedrive/dbt_pipedrive/pipedrive",  # or use public git "https://github.com/dbt-labs/jaffle_shop.git"
    venv=venv
)

# run the models and collect any info
# If running fails, the error will be raised with full stack trace
models = dbt.run_all()

# optionally log dbt status
pipeline.run([models],
             write_disposition="append",
             table_name="_models_log")

In conclusion

ETL vs ELT was never really a debate. With some exceptions, almost everyone transforms the data in SQL - but what they call this process depends on who's telling the story.

While it's easier to do most of the transformation in SQL, the tedious part is completely automatable in Python, and the dirty data doesn't need manual normalisation. With dlt, you can do ETL or ELT - or, even better, both, as EtLT.

Or, if you're feeling funny, you can add duckdb in the middle and go full EtLTLT where you have an additional T step in the middle for the kinds of operations that could be done locally. And afterwards you could load to operational systems to add one more L to the name :)
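
As a rough sketch of that extra "t" in the middle - one possible wiring, with made-up pipeline, dataset and table names, and assuming dlt's sql_client interface for querying the local duckdb database:

import dlt

data = [{'id': 1, 'name': 'John'}, {'id': 2, 'name': 'Jane'}]

# Et(L): extract, let dlt normalise, and load into a local duckdb database
local = dlt.pipeline(pipeline_name='local_t', destination='duckdb', dataset_name='staging')
local.run(data, table_name='users')

# t: a local SQL step on the duckdb database dlt just wrote to (interface is an assumption)
with local.sql_client() as client:
    rows = client.execute_sql("SELECT id, upper(name) AS name FROM users")

# LT: load the locally transformed rows onward to the warehouse, where the big T (e.g. dbt) runs
warehouse = dlt.pipeline(pipeline_name='to_warehouse', destination='bigquery', dataset_name='curated')
warehouse.run(({'id': r[0], 'name': r[1]} for r in rows), table_name='users_clean')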

Fundamentally, we all agree it's all ETL, with the flavors simply designating specific sub-types.

Start using dlt today

What are you waiting for?

This demo works on Codespaces, a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through the further steps.
The demo uses the Continue VSCode extension.

Off to Codespaces!
