

Agents don't hallucinate. They navigate without a map. Ontology engineering is how you build one, and why every team pulling humans out of the loop needs it now.


The dltHub AI Workbench gives Claude Code a structured workflow for building data pipelines. We put it to the test with a real geopolitical question.

dlt handles schema evolution efficiently but silently. Here's how to read dlt's metadata and be informed of what's shifting in your pipeline.


A "Success" exit code only tells you the pipeline ran. Use `load_id` to join `_dlt_loads` with your source table and check if the data is actually fresh.


We're in an LLM-coding junior bubble. "It runs" isn't the senior bar. Lifecycle rigor and dependency management are.


The dlt AI Workbench transforms AI-generated "vibe coding" from an unmanaged process full of hidden risks into a mature engineering workflow that prioritizes security, current documentation, and persistent state by default.


Part of the [dltHub AI Workbench series](https://dlthub.com/blog/ai-workbench)


TL;DR: Cortex Code helps you work with data already in Snowflake. dltHub Pro gets data into Snowflake from any source, especially the ones no ETL tool covers. They operate at different layers of the stack and they are designed to hand off to each other.


Call it the MVC problem: minimum viable context. Too little and it hallucinates your domain. Too much and it drifts from your actual goal. The process has to be controlled.


How are LLMs supposed to know the business logic of how you use Hubspot, Luma and Slack together? How are they supposed to know what a customer means to you?


Today we are introducing the dltHub AI Workbench: an infrastructure layer for dltHub that makes AI-generated dlt pipelines trustworthy enough to run and deploy in production.


Stop PII leaks before they hit your warehouse. By using dlt and Pydantic to enforce data contracts, you can sanitize or quarantine sensitive fields the moment they’re ingested.
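A rough sketch of the pattern; the `User` model, the masking rule, and the resource below are illustrative assumptions, not the exact implementation from the post:

```python
import dlt
from pydantic import BaseModel, field_validator

class User(BaseModel):
    user_id: int
    email: str

    @field_validator("email")
    @classmethod
    def mask_email(cls, value: str) -> str:
        # Keep only the first character of the local part, so raw emails never leave the resource.
        local, _, domain = value.partition("@")
        return f"{local[:1]}***@{domain}"

@dlt.resource(name="users", columns=User)
def users(raw_records):
    for record in raw_records:
        # Validation (and masking) happens here; records that fail the contract raise before loading.
        yield User(**record).model_dump()

pipeline = dlt.pipeline(pipeline_name="pii_demo", destination="duckdb", dataset_name="staging")
pipeline.run(users([{"user_id": 1, "email": "jane.doe@example.com"}]))
```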


In this blog post, I will describe the hard, real-world barriers that make your LLM setup collapse, and propose principles for making your systems work.


Add data quality gates to Microsoft Fabric with dlt. Validate schemas, catch bad records, and mask PII before data reaches your lakehouse and downstream analytics.


Production traces are scattered across databases, log aggregators, and storage buckets, and most of them aren't clean (input, output) pairs you can hand to a training job. This walkthrough shows how to build a dlt pipeline that extracts traces from any source, transforms them into structured conversation formats, and lands them as versioned Parquet on Hugging Face, ready for Distil Labs to generate synthetic training data and deliver a specialist model that beats the LLM you're running today.




From raw data to production ML: load, transform, embed, and publish curated datasets with declarative pipelines powered by dltHub.


Single-gate validation fails to decouple row-level syntax from batch-level semantics. Evolve from WAP to the AWAP protocol with this simple dlt tutorial to stop pipeline corruption at the source.


Trying to force an LLM to reconstruct the 'world' using only a semantic layer is like trying to turn cheese back into milk. The information required to understand the original system was stripped away during the modeling process.


For the more classic data engineering crowd, here’s an explainer of how unstructured AI memory works, through the lens of what we know from working with structured data.


By upgrading only the generative model, we achieved a 3x accuracy boost but hit a hard ceiling, proving that good retrieval takes more than a strong LLM.




I didn't vibe-build a product. I wrote a messy scaffold that runs a pipeline, grabs the schema, and forces an agent to build a star schema. It works shockingly well.


Analyzing UFC greatness by building a full stack (dlt, dbt, Metabase) to transform raw fight stats into a data-driven search for the true GOAT.


Moved 5M rows from DuckDB to MySQL 3.7x faster, reducing time from 344s to 92s by switching from SQLAlchemy’s row-by-row path to Arrow + ADBC’s columnar pipeline.


We were told that democratization meant 'safety,' but all we got were expensive cages. The era of the SaaS hostage is ending; the era of the sovereign Builder has begun.


The “data is oil” era is over. With LLMs, data is plutonium: powerful, toxic. Shift left and secure the reactor with 5 quality pillars.


Our docs RAG was failing quietly. We stopped guessing and built a real-user evaluation: the first baseline we could actually measure and improve.




11 practical, copy-paste data quality recipes for dlt. From schema freezes to alerts, learn how to keep pipelines clean, safe, and production-ready.


Start local with DuckLake, validate your data, then deploy to MotherDuck in minutes. Same pipeline, same code, just switch the destination.
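A hedged sketch of what "same code, just switch the destination" can look like; the resource and the environment-variable switch are placeholders (`duckdb` and `motherduck` are both standard dlt destination names):

```python
import os
import dlt

@dlt.resource(name="events")
def events():
    # Placeholder data; in practice this would be your validated local dataset.
    yield from [{"id": 1, "kind": "signup"}, {"id": 2, "kind": "login"}]

# Develop and validate locally against DuckDB, then flip one string to deploy to MotherDuck.
destination = "motherduck" if os.getenv("DEPLOY_ENV") == "prod" else "duckdb"

pipeline = dlt.pipeline(pipeline_name="events_pipeline", destination=destination, dataset_name="raw")
print(pipeline.run(events()))
```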


Data contracts keep systems predictable by pairing clear rules with checks that catch bad data before it flows downstream.


Most LLM runs don’t fail. They converge fast, and the secret isn’t smarter models but better scaffolds that guide the work instead of forcing it.


Openflow and dltHub represent two distinct but valuable visions for the future of data ingestion.


This is, we’re told, the great democratization of data engineering. The tedious work is gone. The barrier to entry is gone. Everyone can now be a data engineer.


MotherDuck lands in Europe with serverless DuckDB warehousing. dlt adds DuckLake support, giving EU teams a fast, modern data stack.


SAP data is hard to extract. Dominik’s new Python connector replaces pyRFC, enabling faster, chunked ingestion into modern pipelines.


LLM leaders agree: the true win is "scaled mediocrity." We're empowering teams with good enough tools for massive, real-world impact.


For quick tasks, df.to_sql() is perfect. But for production pipelines, it quickly shows its limits when data volume, frequency, and schema change.
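For a feel of the difference, here is a small, hedged comparison; the connection string, table name, and primary key are made-up examples:

```python
import dlt
import pandas as pd
import sqlalchemy as sa

df = pd.DataFrame([{"id": 1, "amount": 9.99}])

# Quick one-off: perfectly fine for small, static data.
engine = sa.create_engine("sqlite:///quick.db")
df.to_sql("orders", engine, if_exists="append", index=False)

# Production-leaning: dlt infers and evolves the schema and merges on a key
# instead of blindly appending duplicates on every run.
pipeline = dlt.pipeline(pipeline_name="orders", destination="duckdb", dataset_name="shop")
pipeline.run(df, table_name="orders", write_disposition="merge", primary_key="id")
```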


Learn how dlt automates SCD2 for nested JSON data without complex SQL headaches. Real BigQuery benchmarks show incremental loading cuts costs by 25-35%.
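A minimal sketch of the setup; the resource data is invented and DuckDB stands in for BigQuery, but the `write_disposition` shape follows dlt's documented merge strategies:

```python
import dlt

@dlt.resource(
    name="customers",
    # SCD2: changed rows are closed out with validity timestamps instead of being overwritten.
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customers():
    # Nested objects are flattened into child columns automatically, no SQL required.
    yield {"customer_id": 42, "plan": "pro", "address": {"city": "Berlin", "zip": "10115"}}

pipeline = dlt.pipeline(pipeline_name="crm", destination="duckdb", dataset_name="warehouse")
pipeline.run(customers())
```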


Emmanuel built a slim framework on top of dlt that levels up the vanilla Kafka source into a production-ready setup. Check it out 🚀


You want connectors, and you want them to be many, high-quality, and customisable? A man can dream? Here's our roadmap to making those dreams a reality, and how you can help us today.




We compared dlt and Sling for data ingestion performance, cost, and flexibility. See how they stack up and which might suit your data needs best.


Ajay Moorjani turned a deceptively simple JSON-to-Snowflake task into a rock-solid pipeline using dlt, dbt, and Airflow, built in less than a coffee break.


Leveraging AI to build a dlt extract-and-load pipeline for Coldplay data from Spotify and visualize it in Visivo.


Built another pipeline just to keep a dashboard alive? Then it broke again? Michael Shoemaker shows how dlt makes API pipelines fix themselves, no drama.


We’re excited to announce that we’re building dltHub, an LLM-native data engineering platform that enables any Python developer to build and run dlt pipelines and deliver valuable end-user-ready reports.


LLM-native scaffolds for 1000+ APIs. The IKEA moment in data engineering is here. Build pipelines with LLMs, faster and cleaner.

Using dlt + Cognee, we took API docs from Slack, PayPal, and TicketMaster and built a knowledge graph.


Dev takes Alena’s dlt course, then uses AI to build a WHOOP sleep-data pipeline, saving the data to Parquet, demonstrating that beginners can master pipelines quickly.


We've been using LanceDB for months at dltHub to build AI systems more quickly. The same setup works locally and in the cloud. Handles structured and vector data in one place.


Mixing Spark, DuckDB, and Snowflake? Iceberg decouples data, Ibis decouples logic: run your analytics anywhere, without rewrites or vendor lock-in.


Taktile cut 70% of data loading costs by shifting ingestion to Iceberg via Lambda + dlt, keeping Snowflake for analytics. Smart layers, big savings.


Singer was Stitch's incomplete competitive response to Fivetran. Meltano completed what Stitch never intended to fully open source. dlt learned from both and built the fitting abstraction for pythonic data teams.


A side-by-side look at Fivetran and dlt, covering cost models, customization, and how each approach affects team workflows as your data needs evolve.


REST API integrations come with hidden costs: pagination, schema drift, rate limits. With dlt + Cursor, you skip the boilerplate and build pipelines in minutes, not days. Less code. Less chaos. More time to build.
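A hedged sketch of that declarative style using dlt's `rest_api_source`; the base URL, endpoint, and pagination settings are placeholders for whatever API you're integrating:

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Pagination, endpoint wiring, and schema inference become configuration, not hand-written boilerplate.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",
        "paginator": {"type": "offset", "limit": 100},
    },
    "resources": [
        {"name": "tickets", "endpoint": {"path": "tickets"}},
    ],
})

pipeline = dlt.pipeline(pipeline_name="tickets", destination="duckdb", dataset_name="support")
pipeline.run(source)
```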


A hands-on guide to combining dlt and Dagster for orchestrating multi-endpoint API ingestion pipelines, with assets materialized into DuckDB. Three patterns. One powerful workflow. Plus, a peek at the new CLI and DuckDB UI.


Data engineering shouldn't require rewriting the same logic multiple times for different environments. dlt's dataset interface gives you one consistent way to work with your data, regardless of where it lives.


Ingesting to Databricks should be simple. With dlt, it finally is. No config files, no staging, just Python and go.


Vibe coding so clean, it will make your old code look bad.


Julian Alves builds reliable, simple data infrastructure. He partners with dlt to help companies create systems that deliver value, not burden.


dlt has grown from 1,000 to over 3,000 open-source users in just six months, with monthly downloads surpassing 1.4 million. This momentum reflects a growing demand for Python-native, modular, and AI-ready data tools — and dlt is building exactly that.


dlt started as a tool for handling JSON documents. It was meant for the average Python user who does not want to deal with creating and evolving schemas, SQL models, and backends, or with the data engineers who control them.


Let's stop reinventing connectors in isolation. Use LLMs to transform scattered integrations into shared, reusable solutions.


As Rakesh was exploring Fabric, dlt kept showing up in Rakesh's stack. Not by design, but because it just worked. Different projects, same ingestion layer, quietly doing its job.


I tried Vibe-coding a Singer tap (Pipedrive) into dlt and it worked, but it needed some user intervention.


Explore four ways to run dlt with Apache Airflow, from PythonOperators to KubernetesPods, and learn which setup scales best for clean, reliable pipelines.


Building pipelines with AI isn’t one task, it's many. In this post, we explore how to split and test them individually, so failures are easier to diagnose and fix.


The Write. Audit. Publish. (WAP) framework brings discipline from software engineering: write in isolation, audit for correctness, quality, and compliance, publish with confidence. But can data engineering really follow suit? Let's discuss.


Modernisation at its finest: from trash to cutting edge in seconds. It works amazingly well, just give it a try and stop paying for tech debt.


In this microblog + video we explore generating Python pipelines (dlt REST API) from Airbyte's low-code YAML spec. TL;DR: it works well.


Want to run SELECT * on your API data without setting up a database? dlt datasets let you query API data with SQL, no database or data warehouse required. They follow the Write-Audit-Publish (WAP) pattern, enabling direct SQL queries while keeping workflows efficient.
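Something like this, assuming a pipeline that has already loaded an `issues` table into a local DuckDB file (the table and columns are made up):

```python
import dlt

# Reconnect to the pipeline that loaded the API data; no warehouse or server involved.
pipeline = dlt.pipeline(pipeline_name="github_issues", destination="duckdb", dataset_name="github")

dataset = pipeline.dataset()
# Plain SQL against the loaded tables, materialized into a DataFrame.
by_state = dataset("SELECT state, COUNT(*) AS n FROM issues GROUP BY state").df()
print(by_state)
```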


Data lakes are broken. Python + Iceberg fixes them. No lock-in. No silos. Just open, AI-ready data. Read on for why and how to switch.


We do a deep dive on the initial assistants and Model Context Protocol (MCP) that we published on the Continue Hub.


Software Engineering Has CI/CD, Data Engineering Has YOLO – Until Now


Today we announce our partnership with Continue and the release of our initial two assistants, including one that lets developers chat with the dlt documentation from the IDE and pass it to the LLM to help them write dlt code. Developers can also access building blocks that allow them to build their own custom assistants. In this post we want to talk about:
- why we think SaaS connector catalog black-box solutions have been a dead end for LLMs
- what we have been doing so far to build compound systems for AI data engineering
- our vision for a dlt+ data infrastructure that generates trusted data and will unlock additional data engineering assistants and building blocks in the future


Software engineers don’t test in production. Why are data engineers still doing it? ELT made loading easy, but debugging in the warehouse is a nightmare. dlt+ Staging fixes that.


We take the next step in our recent journey from dlt to dlt+ by releasing the initial two features of dlt+, our developer framework for running dlt pipelines in production and at scale:
- dlt+ Project: A declarative yaml collaboration point for your team.
- dlt+ Cache: A database-like portable compute layer for developing, testing and running transformations before loading.


Discover how dlt+ Cache gives data engineers a lightning-fast staging environment to test, validate, and debug transformations before they hit production!


Introducing dlt+ Project – the declarative, YAML-powered manifest that transforms data pipeline development!


This post discusses the `sqlmesh init -t dlt` command, which integrates dlt’s metadata with SQLMesh’s modeling capabilities. It automatically generates SQL models that accurately handle incremental processing and schema changes. Inspired by David SJ's post, this approach was demonstrated using the Bluesky API, transforming raw data into structured tables without the need for writing SQL.


When it comes to replicating operational data for analytics, Change Data Capture (CDC) is the gold standard. It offers scalability, near real-time performance, and captures all data modifications, ensuring your analytical datasets are always up-to-date.


Moving data isn’t hard because engineers lack skill. It’s hard because commoditised systems bog us down with complexity disguised as simplicity.


Tired of juggling multiple tools and formats? Discover how a single interface can simplify how you access, transform, and share your data, no matter where it lives.


Data democracy is a beautiful thing. People are more empowered, less dependent and unblocked in terms of data curiosity... However, what breaks this utopian dream is when big curious ideas, several undocumented pipelines (perhaps with the same data) and conflicting dashboards cause confusion and indecision.


AI + dlt = 2x faster pipelines. Mooncoon 🦝 shares how Cursor IDE transforms pipeline dev. AI handles boilerplate; you ship faster. Practical workflows & a live demo inside.



2024 was a remarkable year for dltHub. Together with our users and partners, we streamlined workflows, introduced powerful capabilities, and laid a stronger foundation for the future.


If you are a data engineering consultant or run a data-focused consultancy and want to do more with less, consider joining our partner program.


dltHub is community-driven in partnerships too, featuring an everybody-wins model that optimises client satisfaction.
If you're excited about being part of a collaborative ecosystem that amplifies everyone's strengths while delivering exceptional value to clients, we want to hear from you.


With dlt+ and Tower, anyone who writes a bit of Python can ship production data pipelines in under an hour. Fast, open, and headache-free, this is the future of data engineering.


Europe’s Energiewende data challenge: decentralised, cross-organisational data mesh and environment portability as baseline requirements.


Data changes, let's just accept that. So how do you get in the change loop when the "left" department just won't add you in? Simple: Get yourself in the loop. Instead of shifting the responsibility to the team on the left, shift your ownership left.


SQL is key in data analysis, especially where production databases are used. We benchmarked Meltano, Airbyte, dlt, and Sling.


cognee is an open-source, scalable semantic layer for AI applications. You can now use modular ECL pipelines to connect data and reduce hallucinations.


Transferring data from SQL databases to data warehouses like BigQuery, Redshift, and Snowflake is an important part of modern data workflows. With various tools available, how do you choose the right one for your needs? We conducted a detailed benchmark test to answer this question, comparing popular tools like Fivetran, Stitch, Airbyte, and the data load tool (dlt).


Data mesh or governance is simplified when using a semantic data contract instead of a governance API.


The ecosystem’s progress toward breaking vendor lock-in is best described as “incomplete”. By creating a portable data lake as a kind of framework where components are vendor-agnostic, we are able to take advantage of the next developments quickly.

How Harness chooses dlt + SQLMesh to create an end-to-end next generation data platform.


Imagine you go to a burger place and order a cheeseburger. They hand you a paper bag containing the following items:
- A package of ready-bake flour (just add water).
- A raw beef patty.
- A slice of cheese.
- A head of lettuce, a tomato, and an onion.
- A packet of ketchup and mustard.
Technically, you have everything needed to make a cheeseburger. This scenario mirrors the current state of the modern data stack.