What’s next for dlt in 2025: a simpler solution for solving complex problems
- Marcin Rudolf, Co-Founder & CTO

Since the release of the 1.0.0 version of dlt we’ve grown quickly:
- From 1,000 open-source customers in production to over 3,000
- From over 500,000 monthly PyPI downloads to more than 1.4 million

Those numbers clearly show that our users trust us and put dlt in production. Below I’d like to share how we plan to further develop dlt and how it will benefit from a world where LLM-assisted coding is the norm, data platforms are becoming modular and interoperable, and people in data teams are more autonomous and powerful.
Our journey from JSON handling to a comprehensive toolbox for moving data
dlt started as a tool for handling JSON documents. It was meant for the average Python user who does not want to deal with creating and evolving schemas, SQL models, backends, or the data engineers that control them. To build for this persona we had to be extremely focused on how people interact with our tool and how the tool interacts with other tools in the Pythonic data ecosystem. We encoded those early learnings into our product principles, which I’m pretty sure are behind our amazing growth:
- dlt is a library, not a platform. You integrate dlt into your code, not the other way around.
- Multiply, don’t add. We prioritize automation to save users time and eliminate repetitive tasks.
- No black boxes. Our open-source code is clean, intuitive, and hackable.
- We do the work so our users do less. This “empathy principle” drives us to minimize the effort required by our users.
Soon after the initial release we noticed that dlt was getting interest from skilled data professionals who were building custom data platforms, often surprising us with what could be done with such a simple tool. Apparently our principles for creating “normie software” worked for them too! We decided to follow this interest and focused on features and mechanisms meant for production and high-performance use cases.
This culminated in the release of dlt **1.0**, with a focus on stability. It helped dlt get adopted by large organizations and enabled platform engineers to quickly build whatever they want. We have evolved into a comprehensive Python library for moving data.
You can see in our numbers that this paid off enormously. Since the release in September 2024 we have 3x more production deployments, often at very large organizations, and 2x more downloads, which also correlates with production use.
dlt was made for the world that’s coming
There’s one more interesting metric: the number of custom sources our users code each month has skyrocketed, nearly tripling, indicating that something new is happening. Apparently, users who couldn’t use dlt before can now. How is this possible?
- People finally cracked how to use LLMs to write decent code. With Cursor or Continue we are beyond pure chat interfaces; with MCPs and semantic rules we can finally start transferring data engineering knowledge into tools.
- Python developers are now building data platforms and taking control of data infrastructure, opening it up to regular Python users. This is reflected in the growing popularity of lakehouse architectures and the rising interest in Iceberg.
- High-performance Python data libraries have matured and are in common use. They let you process data efficiently and in many different ways, often in a way that’s more accessible for Python users. For example, instead of writing SQL you can write pandas-like Ibis expressions that will be executed as efficiently as SQL on any backend (see the sketch below).
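To make that last bullet concrete, here is a minimal sketch of such an expression. It uses plain Ibis on a small in-memory table with its default DuckDB backend; the table and column names are made up for illustration.

```python
import ibis

# a tiny in-memory table standing in for data loaded by a pipeline
orders = ibis.memtable({"customer": ["a", "a", "b"], "amount": [10.0, 20.0, 5.0]})

# a pandas-like expression instead of hand-written SQL
expr = (
    orders.group_by("customer")
    .aggregate(total=orders.amount.sum())
    .order_by(ibis.desc("total"))
)

print(ibis.to_sql(expr, dialect="duckdb"))  # the SQL the backend will actually run
print(expr.to_pandas())                     # executed on the default DuckDB backend
```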
It is clear to us that the wide adoption of LLM assistants and Python-friendly infrastructure radically increases the autonomy of dlt users. We can now go back to our roots and make dlt accessible to millions of regular Python users while still supporting more and more advanced use cases.
How to make solving complex data engineering problems even simpler
Here’s how we plan to go about it (and it ain’t rocket science): we’ll transfer even more data engineering knowledge (including “how to use dlt” type knowledge) into dlt itself. This way we can make dlt even simpler.
We are a modular, interoperable Python library, so we automatically benefit from the growing ecosystem around us. All improvements in LLM-assisted coding and user experience innovations made by Cursor or Continue benefit dlt users out of the box. DeepSeek was trained on our docs and code base and can generate dlt pipelines pretty well. Quick adoption of Anthropic’s MCP allows users to interact with dlt directly from a wide variety of code editors.
Of course we’re not just sitting around and waiting. From the very beginning we’ve been running our own LLM assistant, dhelp, and collecting interaction data from it. We recently also connected it to our Slack. In the last 4 months we’ve also been working on some things that you can try now:
- MCP tools for raw data exploration and pipeline inspection. Your LLM agent can use this MCP to query schemas and data that was loaded into your destination.
- dlt and dlt+ assistants we built in partnership with Continue. Continue assistants are a convenient way to add prompts, docs, resources and MCP tools (including the one above) into your dlt project.
- A preview of the dlt ai command that helps you develop REST API sources with Cursor.
The two biggest learnings from these early projects also informed our next steps and part of our 2025 roadmap:
- We need to focus on the “Quality of Life” of our users, whether they use LLMs or not. Obvious QoL improvements include self-explanatory error messages with tips on how to handle them, meaningful warnings and logging, an intuitive way to import dependencies, etc. We’ve already identified dozens of these “annoying little things” that make coding with dlt less pleasant and severely hinder the LLM feedback loop when fixing problems.
- Knowledge transfer to LLMs must be broken down into many contexts, each specific to the most popular “jobs to be done”. Ingesting dlt documentation as a whole does not help LLMs generate code. Our job here is to build granular “assistants”: a combination of prompts/rules and regular library code (MCP tools) for doing specific jobs: building REST API sources, syncing databases, moving files around, debugging failed pipelines, setting up credentials, etc.
Overall, our users should be able to activate the right LLM context to get meaningful help when writing dlt code and be able to feed the results back to fix any issues.
The next step is to focus on data engineering tasks. We’ve already built several experimental MCP servers to explore raw data loaded to a destination, one of which we have released. There are more possible “assistants” like that: some will help write code; others, attached to dlt pipelines and datasets, will help run dlt in production. A few ideas:
- Exploration and enrichment of raw data. There are many repetitive tasks that can be encoded in LLM assistants: fixing inconsistent date formats, detecting and annotating PII data, extracting entities from text, reshaping input data to get clean raw schemas, etc.
- Data modelling: a lot of tasks can be automated or encoded in prompts and regular Python code, e.g. generating source schemas for dbt, star schemas or data vaults.
- Observability, pipeline inspection and tracing. dlt generates detailed run traces that can be used to build monitoring dashboards, for alerting, and during incident drill-downs (see the sketch below).
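To illustrate that last item, here is a small sketch of reading a run trace after a load. It assumes a throwaway pipeline loading into DuckDB; the attribute names follow dlt’s current trace objects, and the printed fields are only a starting point, not a full monitoring setup.

```python
import dlt

# throwaway pipeline loading a couple of rows into a local DuckDB file
pipeline = dlt.pipeline(pipeline_name="trace_demo", destination="duckdb")
pipeline.run([{"id": 1}, {"id": 2}], table_name="events")

trace = pipeline.last_trace   # detailed record of the extract/normalize/load steps
print(trace)                  # human-readable summary, handy when debugging

for step in trace.steps:      # programmatic access for dashboards and alerting
    print(step.step, step.started_at, step.finished_at)
```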
At some point every aspect of a data engineer’s work (when writing code and when running data platforms) will receive some form of LLM assistance. We expect hundreds or even thousands of such specialized assistants to appear. They will become a vital part of our vision of dltHub, a place where hundreds of thousands of pipelines can be created, shared, and deployed.
Development Roadmap for 2025
TL;DR:
Our OSS roadmap is shaped by our users’ interest and the traction we observe.
- We will stay strongly focused on supporting the building of data platforms, particularly with lakehouse architectures, high-performance Python libraries and open table formats.
- We are making dlt simpler and accessible to millions of regular Python users by enabling them to use LLMs for writing code and doing data engineering tasks.
Increasing Quality of Life, enabling LLM-assisted coding
As mentioned above, a significant chunk of our effort will go into making dlt easier and more pleasant to use, with and without LLM assists. Initially we’ll focus on the most common use cases: building REST API sources, ingesting databases, debugging pipelines, analyzing run traces and exploring raw data.
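For the first of those use cases, here is a minimal sketch of a declarative REST API source using dlt’s built-in rest_api helper; the base URL and the issues resource are hypothetical.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# declarative source definition; URL and resource name are made up for illustration
source = rest_api_source({
    "client": {"base_url": "https://api.example.com/v1/"},
    "resources": [
        {"name": "issues", "endpoint": {"path": "issues"}},
    ],
})

pipeline = dlt.pipeline(pipeline_name="issues_pipeline", destination="duckdb")
pipeline.run(source)
```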
Accessing and transforming loaded data
dlt datasets already give you unified access to the data you loaded. You can get your data from any destination in exactly the same way and expect a consistent schema and data types.
We’re particularly happy with how we combined duckdb, ibis and arrow to give you this type of access not only to parquet, json or csv files, but also to delta and iceberg tables.
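A minimal sketch of what that access looks like, assuming a throwaway pipeline that has loaded an items table into DuckDB; any other destination is queried the same way.

```python
import dlt

# load a couple of rows so there is something to read back
pipeline = dlt.pipeline(pipeline_name="shop", destination="duckdb", dataset_name="shop_data")
pipeline.run([{"id": 1, "price": 10.5}, {"id": 2, "price": 3.0}], table_name="items")

dataset = pipeline.dataset()   # destination-agnostic access to the loaded data
items = dataset["items"]
print(items.df())              # as a pandas DataFrame
print(items.arrow())           # as an Arrow table
```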
Now we’re turning things around: making the data in the destination a source you can efficiently move elsewhere, transforming it along the way. You’ll be able to work not only with data frames in Python, but also write data models in SQL or Ibis. The whole concept allows various transformation engines, e.g. dbt, SQLMesh or yato, to be plugged into dlt.
Support for nested types
High performance data processing in Python happens in the Arrow format and Arrow represents nested data as structured nested types. Currently nested types cannot be formally defined, they are just “tolerated” and transparently passed along under the json
type. We plan to make nested types a primary citizen, support it wherever possible, enable schema evolution and make our normalizers handle them in a unified way.
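For context, a tiny pyarrow example of such a structured nested type; today a column like customer would travel through dlt as a generic json column:

```python
import pyarrow as pa

# Arrow infers a struct type for the nested "customer" column
orders = pa.table({
    "id": [1, 2],
    "customer": [
        {"name": "a", "country": "DE"},
        {"name": "b", "country": "US"},
    ],
})
print(orders.schema)  # the customer column has type struct<name: string, country: string>
```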
Unifying data normalizers and making them faster
Arrow and Python object normalizers behave differently. With the introduction of fully fledged nested types we’ll need to unify those behaviors. This will require re-writing the current slow relational normalizer. When using duckdb or polars we can make it way faster and multithreaded without significant backwards compatibility issues.
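As a rough illustration of why an engine like duckdb helps here (this is not dlt’s normalizer, just the underlying idea): flattening a nested Arrow column becomes one vectorized SQL statement instead of a Python loop over individual objects.

```python
import duckdb
import pyarrow as pa

# a nested Arrow table; the customer column is a struct
nested = pa.table({
    "id": [1, 2],
    "customer": [{"name": "a"}, {"name": "b"}],
})

con = duckdb.connect()
# DuckDB picks up the local pyarrow table by name and flattens the struct field
flat = con.execute(
    "SELECT id, customer['name'] AS customer__name FROM nested"
).arrow()
print(flat.schema)  # id plus a flat customer__name column
```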
Pipeline state and schema storage abstraction
Currently state and schemas are stored together with the data. I think this is a really good idea that simplifies production setups and improves consistency. In some cases, however, a centralized data catalog or state storage may be useful. We want our users to be able to use their own storage for pipeline state and schemas.
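Purely as an illustration of the direction (none of these names exist in dlt today), such an abstraction could look like a small storage protocol that users implement against their catalog or database of choice:

```python
from typing import Any, Protocol

# Hypothetical interface, not a dlt API: the roadmap item only says users
# should be able to bring their own storage for pipeline state and schemas.
class PipelineStateStore(Protocol):
    def load_state(self, pipeline_name: str) -> dict[str, Any]: ...
    def save_state(self, pipeline_name: str, state: dict[str, Any]) -> None: ...
    def load_schema(self, schema_name: str) -> str: ...
    def save_schema(self, schema_name: str, schema_yaml: str) -> None: ...
```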
Full data lineage and schema abstraction
dlt schemas do not store the original identifiers and data locations of the data source (i.e. the JSONPath for each column of a table). Those are required to enable full data lineage and to change naming conventions without information loss. We are also experimenting with different ways to represent schemas: for example, instead of yaml we want to offer the option to store dlt schemas as Pydantic models or data classes.
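dlt already accepts Pydantic models as column hints on individual resources; here is a minimal sketch of that existing mechanism, which the roadmap item above would extend to representing whole schemas.

```python
import dlt
from pydantic import BaseModel

# the Pydantic model doubles as the column schema for the resource
class Order(BaseModel):
    id: int
    amount: float
    customer: str

@dlt.resource(name="orders", columns=Order)
def orders():
    yield {"id": 1, "amount": 10.5, "customer": "a"}

pipeline = dlt.pipeline(pipeline_name="orders_demo", destination="duckdb")
pipeline.run(orders())
```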