
Stop writing SQL models and let your data pipeline do it automatically

  • Aman Gupta,
    Jr. Data Engineer
Metadata in this article refers to automatically inferred data properties like schema and lineage.

If you’ve ever spent hours debugging SQL models because of manual metadata handling, you know how painful it is. Imagine eliminating a chunk of that manual work. That’s what dlt and SQLMesh do: they automate metadata handling, reduce errors, and give you back three to five hours per pipeline.

Instead of wasting hours translating metadata manually, imagine your pipeline doing it for you, instantly.

The secret sauce - metadata-driven automation

Normally, after building a dlt pipeline, you’d move over and start repeating that logic in SQLMesh. Instead, we can run the dlt pipeline and use the metadata that dlt infers from the run, such as the data’s schema, to create a scaffold of SQL models in SQLMesh.

This means that your data pipeline could write its own SQL models. With a single command:

sqlmesh init -t dlt --dlt-pipeline <pipeline_name> <dialect>

You can automatically generate SQL models that serve as a starting point for your transformations. These models mirror your ingested data structure while remaining fully customizable, saving hours and reducing the risk of errors compared to writing models from scratch.

SQLMesh models

With the sqlmesh init -t dlt command, SQLMesh creates models with proper incremental processing and schema-change awareness. It recognizes breaking changes versus non-breaking changes, a task engineers typically have to manage manually.

The Bluesky API Example

Inspired by David SJ’s post, this approach transforms raw data from the Bluesky API into structured SQL models.
Starting point - Raw data from the Bluesky API:

For this demonstration, we used dlt’s REST API source to pull follower and follow data from the Bluesky API.

import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import JSONResponseCursorPaginator

# Set the Bluesky handle to fetch data for
actor = "chess.com"

# Bluesky API client
bluesky_client = RESTClient(
    base_url="https://public.api.bsky.app/xrpc/",
    paginator=JSONResponseCursorPaginator(cursor_path="cursor", cursor_param="cursor"),
)

# Fetch Followers
@dlt.resource
def get_followers(actor: str):
    for page in bluesky_client.paginate(
        "app.bsky.graph.getFollowers",
        params={"actor": actor, "limit": 100},
    ):
        yield page

# Fetch Follows
@dlt.resource
def get_follows(actor: str):
    for page in bluesky_client.paginate(
        "app.bsky.graph.getFollows",
        params={"actor": actor, "limit": 100},
    ):
        yield page

# Initialize dlt pipeline with DuckDB
pipeline = dlt.pipeline(
    pipeline_name="bluesky_pipeline",
    destination="duckdb",
    dataset_name="blusky_dataset",
)

# Run and store data in DuckDB
pipeline.run(get_followers(actor), table_name="followers", write_disposition="append")
pipeline.run(get_follows(actor), table_name="follows", write_disposition="append")
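
When this script runs, the duckdb destination writes everything to a local database file, by default <pipeline_name>.duckdb in the working directory, so here bluesky_pipeline.duckdb. The snippets later in this post assume that default location.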
Generating SQL models:

After executing the above script, run the following command. SQLMesh will automatically produce versioned SQL models for the follows and followers tables, turning raw API data into SQL models with no manual SQL required.

sqlmesh init -t dlt --dlt-pipeline bluesky_pipeline duckdb
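
From here, a natural next step (not covered further in this post) is to apply the generated models from inside the new SQLMesh project:

sqlmesh plan

This should materialize the models against the DuckDB database that dlt populated.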

Behind the scenes

Metadata-driven architecture

The dlt pipeline automatically captures metadata during ingestion, including the inferred schema of every table. SQLMesh reads this metadata from the pipeline run, and that handover is what makes generating the SQL models possible.
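
You can inspect this captured metadata yourself. Below is a minimal sketch, assuming the Bluesky pipeline above has already run; it uses dlt’s attach and schema helpers to print what was inferred:

import dlt

# Attach to the pipeline state created by the earlier run
pipeline = dlt.attach(pipeline_name="bluesky_pipeline")

# The default schema holds the tables, columns, and data types dlt inferred
schema = pipeline.default_schema

# Print the full schema as YAML, the same information SQLMesh scaffolds models from
print(schema.to_pretty_yaml())

# Or list just the inferred columns of the followers table
for name, column in schema.get_table_columns("followers").items():
    print(name, column["data_type"])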

Why does this matter?

Instead of you handling metadata manually, the dlt + SQLMesh combination streamlines the process.

This means you can eliminate tedious, error-prone tasks and focus on refining your SQL models instead of maintaining them manually.

Technical deep dive

Anatomy of generated models

When SQLMesh auto-generates SQL models from your dlt pipeline’s metadata, it produces ready-to-run SQL for each ingested table. Let's break down a typical generated model and highlight its key features:

The image displays models generated by the sqlmesh init -t dlt command. Data types inferred by dlt are propagated to the SQL model, providing additional type checking and opportunities for managing updates or merges through a staging layer.
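
To see the types that get propagated, you can describe the ingested table directly in DuckDB. A small sketch, assuming the default bluesky_pipeline.duckdb file and the bluesky_dataset schema from the pipeline above:

import duckdb

# Open the DuckDB file the dlt pipeline wrote
con = duckdb.connect("bluesky_pipeline.duckdb")

# Column names and data types inferred by dlt; these are what the
# generated SQLMesh models carry forward
con.sql("DESCRIBE bluesky_dataset.followers").show()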

Incremental processing making life easier

dlt tags every loaded row with a _dlt_load_id, a Unix timestamp identifying the load. The generated SQL models reuse it to keep track of which data increments have already been transformed, so you get more time to focus on adjusting your SQL models instead of setting up incremental loading.
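
A quick way to see this column in action, again a sketch assuming the default file and dataset names:

import duckdb

con = duckdb.connect("bluesky_pipeline.duckdb")

# Each distinct _dlt_load_id corresponds to one pipeline run; casting it
# to a timestamp gives the load time the incremental models filter on
con.sql("""
    SELECT DISTINCT
        _dlt_load_id,
        to_timestamp(CAST(_dlt_load_id AS DOUBLE)) AS load_time
    FROM bluesky_dataset.followers
""").show()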

How is _dlt_load_id propagated?

Usually, only parent tables have _dlt_load_id. The incremental_followers__labels model inherits it from its parent table followers through a join condition matching _dlt_parent_id with _dlt_id (read more about it here). SQLMesh takes care of this inheritance automatically, which enables incremental processing for child tables too.
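
You can reproduce that join by hand as a sanity check. A sketch, assuming the followers__labels child table exists in your run:

import duckdb

con = duckdb.connect("bluesky_pipeline.duckdb")

# Child rows do not carry _dlt_load_id themselves; joining on
# _dlt_parent_id = _dlt_id pulls it in from the parent table
con.sql("""
    SELECT
        child._dlt_id,
        parent._dlt_load_id
    FROM bluesky_dataset.followers__labels AS child
    JOIN bluesky_dataset.followers AS parent
        ON child._dlt_parent_id = parent._dlt_id
""").show()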

You can customize the incremental load window in your config.yaml.

SQLMesh also creates system tables: _auto_restatements, _environments, _intervals, etc. to track orchestration and versioning (learn more here).

Making it work for you

Tired of manually managing metadata? Install dlt & SQLMesh now and free up your time:

pip install dlt
pip install sqlmesh

Start with a simple pipeline. Once it’s set up, run this one command and watch your pipeline generate SQL models automatically:

sqlmesh init -t dlt --dlt-pipeline my_pipeline postgres

This generates SQL models, just like the method described above.
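
If you don’t have a pipeline handy, here is a minimal sketch of what my_pipeline could look like; the resource name and sample rows are made up for illustration, and the duckdb destination keeps it self-contained (swap in your own destination and pass the matching dialect to sqlmesh init):

import dlt

@dlt.resource(table_name="events")
def events():
    # Hypothetical sample rows; replace with your own extraction logic
    yield [
        {"id": 1, "event": "signup", "created_at": "2024-01-01T10:00:00Z"},
        {"id": 2, "event": "login", "created_at": "2024-01-02T12:30:00Z"},
    ]

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset",
)

pipeline.run(events())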

Best Practices

  • Keep your dlt pipeline focused on data extraction and enrichments.
  • Use SQLMesh’s virtual environments to test changes safely before they reach production. Read more about environments here.
  • Version-control your pipeline code.
  • Use the _dlt_load_id column or other suitable columns like created_at or updated_at for incremental loading, avoiding full reloads (see the sketch after this list).
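
For the last point, dlt’s incremental helper can track a cursor column at extraction time. A minimal sketch, assuming a hypothetical users resource with an updated_at field:

import dlt

def fetch_users(since: str):
    # Hypothetical source; replace with a real API call or query
    return [
        {"id": 1, "name": "alice", "updated_at": "2024-02-01T00:00:00Z"},
        {"id": 2, "name": "bob", "updated_at": "2024-03-01T00:00:00Z"},
    ]

@dlt.resource(write_disposition="merge", primary_key="id")
def users(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # dlt remembers the highest updated_at seen and, on later runs,
    # filters out rows it has already processed
    yield from fetch_users(since=updated_at.last_value)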

Common Gotchas

  • Always run your dlt pipeline first to generate the metadata.
  • Nested structures may need manual optimization.
  • Though most schema changes are handled automatically, backward-incompatible modifications, such as renaming a column, still require manual intervention.

Community and open source

Both dlt and SQLMesh are open source and have active GitHub communities:

Check out the SQLMesh examples for more sample projects, and join the growing communities to exchange tips and insights.

Try it Yourself

Want to explore further?

Check out the GitHub repositories for both dlt and SQLMesh. You can also look into the SQLMesh examples repository for several standalone projects that help you get started.

Stop wasting time on manual metadata! Try it now, and tell us in our Slack community how much time it saved you.

We’d love to hear your experience!