
An early path to make compound AI systems work for data engineering

  • Matthaus Krzykowski,
    Co-Founder & CEO

Today we are announcing our partnership with Continue on the day of the Continue 1.0 launch. Continue 1.0 enables developers to create, share, and use custom AI code assistants with open-source IDE extensions that can now seamlessly leverage a vibrant hub of models, context, and other building blocks.

Over the last couple of months we have worked with Continue on assistants and building blocks that are available today in the Continue Hub:

Our initial two assistants:

  • The dlt assistant is now widely available to help dlt developers. You can now chat with the dlt documentation from the IDE and pass it to the LLM to help you write dlt code. Using tools, your preferred LLM can inspect your pipeline schema and run text-to-SQL queries (see the sketch after this list).
dlt assistant
  • The dlt+ assistant extends the dlt assistant with support for dlt+ Project features. This includes kick-starting dlt+ projects, managing your catalog of sources, destinations, & pipelines, and running pipelines on your behalf for a tight development feedback loop.
dlt+ project assistant
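
To make the tool use above concrete, here is a minimal sketch of the kind of calls such an assistant could make against a local pipeline. It assumes a pipeline named "chess_example" has already run and that it loaded a "players" table; the SQL query itself is illustrative.

```python
import dlt

# Attach to a pipeline that was previously run on this machine.
# The pipeline name and table below are illustrative assumptions.
pipeline = dlt.attach(pipeline_name="chess_example")

# Serialize the inferred schema so it can be passed to the LLM as context.
print(pipeline.default_schema.to_pretty_yaml())

# Run a (possibly LLM-generated) SQL query against the pipeline's destination.
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT name FROM players LIMIT 10")
    print(rows)
```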

Additionally, we are releasing an initial set of building blocks, including two Anthropic MCP servers, that allow developers to build their own custom dlt and dlt+ assistants.

dlt and dlt+ building blocks

Developers can use our building blocks in their next custom AI code assistant

We will talk more about the user experience of these building blocks as well as these two assistants in the near future.

In this post we want to talk about:

  • why we think SaaS connector catalog black box solutions have been a dead end for LLMs
  • what we have been doing so far to build for AI data engineering compound systems
  • our vision for a dlt+ data infrastructure that generates trusted data and will unlock many additional data engineering assistants and building blocks in the future

SaaS connector catalog black box solutions have been a dead end for LLMs

In the era of Large Language Models (LLMs), software development is becoming increasingly automated by AI copilots.

We have not made much progress with copilots in data engineering for two main reasons.

Most copilots to date have been black box SaaS solutions with roughly the same components, which neither developers nor LLMs have much ability to understand or improve.

Additionally, most ELT tools, and how they handle data, have been a black box. A key black box component of data engineering SaaS solutions is the “source”/”connector” catalog. The core is a short-tail catalog market of roughly 20 sources (product database replication, some popular CRMs and ads APIs) with the highest profit margins and intense competition among vendors. Depending on the vendor, the catalog typically extends to up to around 400 sources, with much weaker support. If you google “ETL + any source” you will find many vendor catalog pages optimised for Google SEO. They don’t work for LLMs: LLMs can’t effectively parse connector catalogs because these catalogs are optimized for vendor SEO rather than structured data retrieval, and they lack the metadata and transparency needed for AI-driven workflows.

Building for AI data engineering compound systems

In contrast, what we call “AI copilots” are much more than a single LLM. As the Berkeley Artificial Intelligence Research Lab points out, “state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models”.

From early on, our assumption was that for data engineering these compound systems still have to be invented. With dlt and dlt+ our aim has been to build a foundation for these compound AI systems. We want to provide them with the necessary data and metadata, i.e. everything as code, so they can perform well. Because of that, we are able to develop tooling, such as our dlt assistant, that plugs into platforms like the Continue Hub and enhances data engineers’ development workflows.


Our approach will be able to address a much, much bigger market. Huggingface hosts over 300k datasets as of February ‘25. We at dltHub think that the ‘real’ Pythonic ETL market is a market of hundreds of thousands of APIs and millions of datasets. This market has not been addressed in the SaaS connector catalog black box age.

Like our friends at Continue, we believe in a future where developers are amplified, not automated. As more software development is automated, we are seeing more human engineering time go into monitoring, maintaining, and improving the different components that make up AI software development systems. The data engineering AI compound system needs to be invented by many.

Individually, we have been building AI workflows for SMBs and enterprises since 2016. We initially established the company in 2021 to build bespoke AI data pipelines for Fortune 500 companies as a subcontractor to the AI assistant company Rasa.

In March ‘23 we rebuilt dlt in such a way that it performs well with LLMs. We started building internal tooling and released our initial PoC of an OpenAPI pipeline generator in June ‘23.

By March ‘24 we had started to generate thousands of Python code snippets in our docs. Developers can take these snippets and use them as a starting point for development, and LLMs can use these code snippet pages to generate dlt pipelines.
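
To give a flavor, here is a minimal sketch of the kind of snippet these pages are built around; the pipeline name, dataset name, and sample data are illustrative.

```python
import dlt

# Illustrative pipeline: load a small list of dicts into a local DuckDB file.
pipeline = dlt.pipeline(
    pipeline_name="players_example",
    destination="duckdb",
    dataset_name="players_data",
)

data = [{"id": 1, "name": "magnus"}, {"id": 2, "name": "hikaru"}]

# dlt infers the schema from the data and creates or evolves the table as needed.
load_info = pipeline.run(data, table_name="players")
print(load_info)
```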

In Sep ‘24, OpenAI’s o1-preview was the first LLM to index dlt, unlocking a much better ChatGPT experience.

We also started to hear about the first happy use cases of dlt users pairing LLMs with Cursor. Users were able to reference our docs, external API docs, their existing folders, and dlt itself to get a new pipeline for a new API mostly set up quickly. Combined with dlt's productivity features, AI code editors such as Cursor can help developers ship dlt pipelines faster by handling routine tasks such as boilerplate code, basic error handling, and schema definitions. It became clear to us that “autocomplete” is an important source of productivity gains.
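
As an illustration of that workflow, here is the kind of scaffold an editor typically ends up producing from dlt's REST API source; the base URL and resource names are placeholders that would come from the real API documentation.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Hypothetical API: the base URL and resource names are placeholders.
source = rest_api_source({
    "client": {"base_url": "https://api.example.com/v1/"},
    "resources": ["issues", "comments"],
})

pipeline = dlt.pipeline(
    pipeline_name="example_api",
    destination="duckdb",
    dataset_name="example_api_data",
)

# Load both resources; dlt infers the schema and normalizes nested data.
print(pipeline.run(source))
```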

An early path towards a data infrastructure that generates trusted data

Many challenges remain to make AI data engineering compound systems work. Data engineering involves operational challenges in live systems that we haven’t figured out how to translate to text and give to LLMs.

The result is a lack of trust in data, especially in enterprise deployments of AI.

We partnered with Continue because they are one of the building blocks for a dlt+ data infrastructure that is LLM-friendly and that can generate trusted, production-ready data without hallucinations.

We think of the two initial “dlt assistants” we released on the Continue platform today as early “applications” of the underlying foundation, the data infrastructure that we are building. As we build out this data infrastructure, we will release more assistants and building blocks.

Our vision towards a data infrastructure that generates trusted data

Let’s discuss each of these early building blocks, as it will give the reader some indication of the future AI workflows we are thinking about.

  1. LLMs. LLMs have become powerful enough for coding. The LLM breakthrough is the ability to ingest large amounts of text and reason about it. The simpler syntax and declarative nature of SQL enabled tangible gains in the text-to-SQL field for BI and data science.
  2. AI code editors such as Continue. LLMs need to meet the user where they work. For content teams, it’s Notion; for analysts and data scientists, it’s notebooks and web applications. For developers, it’s the IDE and terminal. For data engineers, we think the “definitive” LLM interface still needs to be invented. We’ve only recently moved beyond the chat interface towards autocomplete and context. Our user base often abuses Streamlit, Gradio, and AI code editors. We therefore think that developers need custom AI code assistants that integrate with their existing workflows and meet the requirements of their development environment. Continue allows developers to glue together some of the building blocks that we think are crucial for an early path towards trusted live data context, such as Anthropic MCP servers, rules, and context. We think these building blocks will evolve quickly, and developers will need to be able to update their custom AI assistants easily.
  3. Anthropic MCP. In short, the Model Context Protocol (MCP) is a way to let your LLM execute code. This is the critical piece for giving LLMs reliable, accurate, and trustworthy context. IDEs typically provide access to code and documents, but don’t have native support for data access. Developing against the MCP standard means you can use that context across tools and even develop a persistence strategy (see the minimal server sketch after this list). We suspect that going forward we will see various vendors launch their own improved versions of context protocols.
  4. Iceberg. Apache Iceberg is an open-source table format used to manage large datasets. It’s designed to handle huge amounts of data in data lakes, allowing for SQL-like queries and support across multiple computing engines. We think Iceberg is an emerging standard in the industry, and a lot of companies with AI/Python-first data teams embrace it. We have many reasons for building out custom Pythonic support for Iceberg. In this context, we think we can pair our Iceberg adoption with a future Anthropic MCP server that gives LLMs reliable, accurate, and trustworthy context about data catalogs.
  5. Data Catalogs. Data catalogs, such as Databricks’ Unity Catalog and Snowflake’s Polaris Catalog, are akin to organized directories for all your data. They store metadata (information about a company’s data), making it easy to search, understand, and manage data assets. Lack of trust in data has been a major factor holding back enterprise deployment of AI. As mentioned before, LLMs don’t work with SaaS black box “source”/”connector” catalogs, and those catalogs serve only a small market. As a reaction, more and more companies write their own in-house Python (or Kafka) data pipelines. A lot of technical decision makers choose dlt because dlt pipelines offer filtered, cleaned, and augmented data. We think that future compound systems need to handle the creation of business-level data aggregates and interact with data catalogs. This is one of our goals in building out dlt+, our commercial framework for running dlt pipelines in production. Because LLMs are nondeterministic, meaning they don’t always deliver the same outputs in response to the same inputs, they won’t be able to solve the problem of trusted data by themselves.
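
To make the MCP point from item 3 concrete, here is a minimal sketch of an MCP server, written with the Python MCP SDK’s FastMCP helper, that exposes pipeline context as a tool. The tool body is illustrative: a real server would attach to an actual dlt pipeline and serialize its inferred schema rather than return a hard-coded string.

```python
from mcp.server.fastmcp import FastMCP

# Minimal sketch of an MCP server that hands pipeline context to an LLM.
mcp = FastMCP("pipeline-context")

@mcp.tool()
def get_pipeline_schema(pipeline_name: str) -> str:
    """Return a pipeline's schema as YAML so the LLM can reason about it."""
    # Illustrative placeholder; a real implementation would read the schema
    # from the named dlt pipeline instead of returning a fixed string.
    return "tables:\n  players:\n    columns:\n      id: bigint\n      name: text"

if __name__ == "__main__":
    # Serve over stdio so an IDE assistant (e.g. Continue) can connect to it.
    mcp.run()
```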

Next steps

  • We are excited about being part of a community that tries to solve some of these problems together. If you are a company that is also part of the Continue Hub and thinks along the lines presented in this post, get in touch!
  • If you are a developer who works with dlt, we encourage you to take a look at our initial OSS dlt assistant at https://hub.continue.dev/dlthub and go from there.
  • If you build your own custom dlt or dlt+ assistant, let us know about it; we are keen to hear your feedback!
    Slack, contact form

We will be talking more about what we are up to in the near future.