Metadata as Glue: A dlt-dbt generator
- Adrian Brudaru, Co-Founder & CDO
- Enabling portable data stacks, starting with pipelines and continuing with the dev environment.
A story about the modern data stack.
Imagine you go to a burger place and order a cheeseburger. They hand you a paper bag containing the following items:
- A package of ready-to-bake flour. Just add water.
- A raw beef patty.
- A slice of cheese.
- A head of lettuce, a tomato, and an onion.
- A packet of ketchup and mustard.
Technically, you have everything needed to make a cheeseburger. This scenario mirrors the current state of the modern data stack.
Data engineers become the "human middleware," stitching together disparate systems and constantly dealing with integration challenges.
So how do we transform these separate components into a delicious data pipeline? The secret ingredient is the melted cheese: metadata, the glue that binds the layers together and enables tools to communicate and work well together.
Capturing Metadata at Ingestion
Why is this exciting? Leveraging metadata from the start empowers data engineers to create scalable, flexible infrastructures that turn raw data into actionable insights with minimal friction.
As highlighted in "Governance, Democracy, and the Data Mesh", metadata is a strategic asset that can enable things like interoperability and complex governance out of the box.
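To make this concrete, here is a minimal sketch of the metadata dlt captures the moment data is ingested. The pipeline name, destination, and sample rows are illustrative assumptions, not something from this post:

```python
import dlt

# Illustrative pipeline: dlt infers a typed schema from raw data at load time.
pipeline = dlt.pipeline(
    pipeline_name="demo",
    destination="duckdb",
    dataset_name="demo_data",
)

# Nested records: dlt flattens the children into sub-tables and records
# names, types, and lineage in its schema.
rows = [
    {"id": 1, "name": "alice", "orders": [{"sku": "A1", "qty": 2}]},
    {"id": 2, "name": "bob", "orders": [{"sku": "B4", "qty": 1}]},
]
pipeline.run(rows, table_name="customers")

# This inferred schema is the metadata the generator builds on.
print(pipeline.default_schema.to_pretty_yaml())
```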
Let’s look at what dlt can do for a dlt+dbt stack.
Up and running in one command
dlt-dbt-generator <pipeline-name>
Scaffolding first.
dbt is very popular with data professionals, but setting up a project is always a bit of work. Our first step was to create a CLI tool that generates dbt scaffolding from a dlt pipeline. With one command, you get a dbt project scaffold that is ready to be built on, complete with source schemas, a staging layer, and incremental processing columns that help dbt keep track of what was already processed.
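The generator keys off a pipeline that has run at least once, so dlt already knows its schema. A minimal sketch of such a setup, where the pipeline name and sample data are our own assumptions:

```python
import dlt

# Illustrative pipeline; "my_pipeline" is the name you would pass to
# the generator, i.e. `dlt-dbt-generator my_pipeline`.
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="raw_data",
)
pipeline.run(
    [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}],
    table_name="tickets",
)

# Every row dlt loads carries _dlt_id and _dlt_load_id columns; load
# metadata like this is what lets the generated staging layer pick up
# only data that was not already processed.
for table in pipeline.default_schema.data_tables():
    print(table["name"], list(table["columns"]))
```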
Dimensional modelling before loading: Early steps
But we did not stop the experiment there. We like to push the boundaries of what can be done, so we also created a dimensional model generator.
This generator allows you to declare fact tables and their dimensions, and generates the SQL required as a second layer.
This works great if your raw data is already in a conformed (one table per entity) format, but if your data is raw and nested, it will be hard to get a good model on the first pass. Even in the raw-data case, though, it can still help you go from nothing to a full model faster.
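The post does not show the declaration syntax, so the sketch below is purely hypothetical: every name in it is illustrative. It only shows the shape of input such a generator needs, namely facts, measures, dimensions, and join keys:

```python
# Hypothetical declaration: these names do not come from the actual
# generator API; they illustrate what "declare fact tables and their
# dimensions" means in practice.
star_schema = {
    "fact_orders": {
        "source_table": "orders",          # one row per order
        "measures": ["amount", "quantity"],
        "dimensions": {
            "dim_customer": {"join_key": "customer_id"},
            "dim_product": {"join_key": "product_id"},
        },
    },
}

# From a declaration like this, a generator can emit the SQL for a second
# dbt layer: one model per fact and dimension, joined on the declared keys.
```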
Dimensional modelling: Possible next steps
Currently, we generate the model on raw data, which leaves a lot to be desired if the starting data model is indeed very raw. A possible next step is to enable running the modelling on the conformed layer one would generate in dbt. Instead of going raw → staging → modelled, we could go raw → staging → conformed → modelled. This would mean dlt needs to become aware of the dbt schema too.
By adding a conformed layer, we can standardize and harmonize the data before modeling. This layer acts as an intermediary where data from various sources is cleansed, transformed, and aligned to a consistent schema.
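One concrete way dlt could become aware of the dbt schema is by reading dbt's manifest artifact. A minimal sketch, assuming a `conformed_` naming convention that is our invention, not the post's:

```python
import json

# dbt writes compiled project metadata to target/manifest.json.
with open("target/manifest.json") as f:
    manifest = json.load(f)

# Collect the models of the (hypothetical) conformed layer by naming
# convention, so a model generator could target them instead of raw tables.
conformed = [
    node["name"]
    for node in manifest["nodes"].values()
    if node["resource_type"] == "model" and node["name"].startswith("conformed_")
]
print(conformed)
```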
Learn more about this generator:
- Check out a 10-minute usage video: https://www.youtube.com/watch?v=9HWykQd0gO4
- Keep an eye on our dbt hub packages, where we will post some of the dbt projects we are generating. For example, you can see one such package here.
Join the future of data pipelines
Unlock the full potential of your data with the dlt-dbt Generator. Embrace a metadata-driven approach to unify your data pipeline and empower your organization to make smarter, faster decisions.
Note: The dlt-dbt Generator is currently available to select partners. For more information on how to leverage this innovative tool within your organization, please contact our solutions team.