
Metadata as Glue: A dlt-dbt generator

  • Adrian Brudaru,
    Co-Founder & CDO

A story about the modern data stack.

Imagine you go to a burger place and order a cheeseburger. They hand you a paper bag containing the following items:

  • A package of ready-to-bake flour. Just add water.
  • A raw beef patty.
  • A slice of cheese.
  • A head of lettuce, a tomato, and an onion.
  • A packet of ketchup and mustard.

Technically, you have everything needed to make a cheeseburger. This scenario mirrors the current state of the modern data stack.

Data engineers become the "human middleware," stitching together disparate systems and constantly dealing with integration challenges.

So how do we transform these separate components into a delicious data pipeline? The secret ingredient is the melted cheese: metadata, the glue that binds the layers together, enabling tools to communicate and work together well.

Capturing Metadata at Ingestion

Why is this exciting? Leveraging metadata from the start empowers data engineers to create scalable, flexible infrastructures that turn raw data into actionable insights with minimal friction.

As highlighted in "Governance, Democracy, and the Data Mesh", metadata is a strategic asset that can enable things like interoperability and complex governance out of the box.

Let’s look at what dlt can do for the dlt+dbt stack.

Up and running in one command

dlt-dbt-generator <pipeline-name>

Scaffolding first.

dbt is very popular with data professionals, but setting up a project is always a bit of work. Our first step was to create a CLI tool that creates dbt scaffolding from a dlt pipeline. With a single command, you get a dbt project scaffold that is ready to be built on, complete with source schemas, a staging layer, and incremental processing columns that help dbt keep track of what has already been processed.
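To make this concrete, here is a minimal sketch of the flow. The pipeline name, destination, and sample rows are assumptions for illustration; only the generator command itself comes from above.

import dlt

# Run a pipeline at least once so dlt records schema and load
# metadata for the generator to read.
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="raw_data",
)

pipeline.run(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    table_name="users",
)

# Then, from the command line:
#   dlt-dbt-generator my_pipeline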

This is enabled by dlt metadata and would not be possible from the database's table metadata alone; the metadata for the incremental logic, for example, originates in dlt.
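For instance, an incremental hint declared on a dlt resource is stored in the pipeline's schema metadata rather than in the destination database. A minimal sketch, with the resource and sample rows assumed for illustration:

import dlt

# Sample rows standing in for an API or database query.
ROWS = [
    {"id": 1, "updated_at": "2024-06-01"},
    {"id": 2, "updated_at": "2024-06-02"},
]

# The incremental hint below lives in dlt's schema metadata, not in
# the destination database; this is the kind of information the
# generator can translate into incremental logic in the dbt scaffold.
@dlt.resource(primary_key="id", write_disposition="merge")
def users(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):
    yield from (row for row in ROWS if row["updated_at"] >= updated_at.last_value)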

Dimensional modelling before loading: Early steps

But we do not stop the experiment there. We like to push the boundaries of what can be done, so we also created a dimensional model generator.

This generator allows you to declare fact tables and their dimensions, and generates the SQL required as a second layer.
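The declaration interface itself is not shown in this post, so treat the following as a purely hypothetical sketch of the idea, with every name made up for illustration: you point the generator at a fact table, list the dimensions it joins to, and the SQL models come out the other side.

# Hypothetical declaration; the real generator's interface may differ.
fact_orders = {
    "table": "orders",
    "grain": "one row per order line",
    "dimensions": {
        "customer_id": "dim_customers",
        "product_id": "dim_products",
    },
}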

This works great if your raw data is already in a conformed (one table per entity) format, but if your data is raw and nested, it will be hard to get a good model on the first pass. Even in that case, however, the generator can still help you get from nothing to a full model faster.

Dimensional modelling: Possible next steps

Currently, we generate the model on raw data, which leaves a lot to be desired if the starting data model is indeed very raw. A possible next step is to enable running the modelling on the conformed layer one would generate in dbt: instead of going raw → staging → modelled, we could go raw → staging → conformed → modelled. This would mean dlt needs to become aware of the dbt schema too.

By adding a conformed layer, we can standardize and harmonize the data before modeling. This layer acts as an intermediary where data from various sources is cleansed, transformed, and aligned to a consistent schema.

Learn more about this generator:

  • Check out our dbt hub packages, which contain the staging layer, incremental loading, and limited dimensional modelling (currently 7 of the 8 packages are generated, as noted in each package's readme)

Join the future of data pipelines

  • The dlt-dbt Generator is currently available to select partners. For more information on how to leverage this tool within your organization, please contact our solutions team.

  • We are building dlt+; read more or sign up for early access here!