Blog /October 10, 2024

dlt-SQLMesh generator: A case of metadata handover

Adrian Brudaru,
Co-Founder & CDO

SQLMesh is a next-generation data transformation and modeling framework. It aims to be easy to use, correct, and efficient and is maintained by the Tobiko Data team. It helps you scalably, reliably and safely modify your data pipelines because it understands SQL and can make intelligent updates instead of stringing scripts together. SQLMesh boasts several future-proof features such as automatic data contracts, virtual data environments and snapshots, extensive change summaries (before updates are applied!) and column-level lineage out of the box.

The Need for an Integrated Data Stack

In the modern data stack, the proliferation of tools has led to a fragmented ecosystem where interoperability is more aspiration than reality. This challenge is aptly described in the Dagster blog post on impedance mismatch in data orchestration, which highlights how disparate tools struggle to communicate effectively, leading to inefficiencies and increased complexity.

dlt (Data Load Tool) emerges as an open-source data ingestion standard aiming to bridge these gaps. What sets a standard like dlt apart isn't just its functionality in isolation but its ability to interplay seamlessly with other tools.

This, combined with SQLMesh (an open-source data transformation tool), means you can begin communicating important metadata information across the data pipeline. Making the lives of data practitioners easier and the pipelines more transparent.

The Modern Data Stack doesn't talk to itself

The current landscape of the modern data stack is like a tower of Babel. Each tool operates in its own silo, and while they might read and write to the same databases, they don't share metadata effectively.

A pipeline made of silos that barely talk to each other

Most tools interact only through the database layer, treating it as a universal translator. However, this approach is limited because it doesn't capture the rich metadata that could enable more intelligent data processing and integration. Without end-to-end metadata flow, the promise of a cohesive data pipeline remains unfulfilled.

As a result, data engineers repeatedly redefine schemas and semantics across different tools, leading to inefficiencies and increased potential for errors.

A better way

Imagine a world where tools within the modern data stack work as one, sharing metadata and orchestrating complex data transformations without manual intervention.

Open-source standards like dlt and SQLMesh make this possible because they are inherently designed to be interoperable and hackable. Why put out a fire with a cup when you could use a hose?

Your pipeline should be more like a hose, passing data and metadata from start to finish without impedance.

By adhering to open standards, tools can expose their internal metadata in a way that other tools can consume and act upon. This fosters an ecosystem where data ingestion, transformation, and analysis tools work in harmony, reducing the burden on data engineers to maintain glue code and manual integrations.

SQLMesh + dlt: Towards an Integrated Data Stack

SQLMesh is a powerful open-source framework that simplifies SQL-based data transformation, making it easier for data platform engineers to build, version, and manage complex data workflows. With features like version control, data lineage tracking, and easy testing, SQLMesh empowers engineers to develop with confidence, speed up iteration, and maintain data quality—all critical for managing and scaling modern data platforms. This makes SQLMesh a favorite among data platform engineers who are looking for a reliable and efficient way to handle transformations without getting bogged down in operational complexity.

The metadata handover in action

The recent pull request on GitHub showcases an integration where SQLMesh can generate project scaffolding by inspecting a dlt pipeline's schema and configuration.

Faster scaffolding

By reading the metadata provided by dlt, SQLMesh can automatically generate the necessary models and configurations required for data transformation tasks. This automation not only accelerates the setup process but also reduces the likelihood of human error. Data engineers no longer need to manually define schemas or write boilerplate code to integrate data from dlt into SQLMesh.

Incremental processing support

One of the key advantages of this integration is the ability to perform incremental data processing and loading. SQLMesh leverages the metadata about data changes captured by dlt to process only new or updated records. This incremental approach is more efficient and scalable, especially with large datasets, as it avoids redundant processing of unchanged data.Benefits of SQLMeshOnce you ingest data from the dlt pipeline into your SQLMesh project, you can leverage several key benefits:

Semantic Understanding: Detects SQL issues at compile time, offers column lineage and multi-engine execution.
Smart Change Categorization: Automatically identifies changes as “breaking” or “non-breaking” to optimize backfill processes.
Automatic DAG Generation: Generates dependency graphs by parsing SQL or Python scripts.
Virtual Data Environments: Utilizes views for easy rollbacks/roll-forwards; includes validation tools like unit tests, audits, and data diff.

Why Use SQLMesh and dlt Together

Using SQLMesh and dlt together brings clear advantages for data engineers. Automation of transformation models and configurations boosts efficiency, letting engineers concentrate on complex tasks instead of repetitive coding. Incremental processing ensures pipelines handle growing data volumes smoothly without performance drops.

Data platform engineers prefer SQLMesh and dlt because they uphold true open-source principles, integrate effortlessly into existing systems, leverage metadata-driven approaches to enhance data quality and consistency, benefit from active community support, and offer the flexibility and control needed to manage data infrastructure effectively.

Try It Yourself

For a deeper dive into how SQLMesh integrates with dlt, refer to the pull request on GitHub. To experiment with the generator and see how it can streamline your data engineering workflows, check out the SQLMesh documentation. If you have questions or need assistance, consider joining the SQLMesh Slack community for support from fellow data engineers.

A message from dltHub

Are you working with portable compute or lakehouses? We are working on a portable data lake[house]. Read more about it here and let us know if you want early access to give feedback and influence our direction.

Portable data lake: A development environment for data lakes

The Need for an Integrated Data StackLink icon

The Modern Data Stack doesn't talk to itselfLink icon

A better wayLink icon

SQLMesh + dlt: Towards an Integrated Data StackLink icon

The metadata handover in actionLink icon

Faster scaffoldingLink icon

Incremental processing supportLink icon

Why Use SQLMesh and dlt TogetherLink icon

Try It YourselfLink icon

A message from dltHubLink icon