LLM-native workflow
Overview
This guide walks you through a collaborative AI-human workflow for extracting and exploring data from REST API sources using an AI editor/agent of your choice and dlt. It introduces the first workflow available in dltHub workspace — an LLM-native development environment for data engineering tasks.
You will learn:
- How to initialize a dltHub workspace for your source using dltHub’s LLM-context database.
- How to build a REST API source in minutes with AI assistance.
- How to debug a pipeline and explore data using the pipeline dashboard.
- How to start a new notebook and work with the pipeline’s dataset in it.
Prerequisites
Have one of the following AI editors/agents installed:
Concepts used in this guide
Before diving into the workflow, here’s a quick overview of key terms you’ll encounter:
- dlt workspace - An environment where all data engineering tasks, from writing code to maintenance in production, can be executed by a single developer:
  - Develop and test data pipelines locally
  - Run dlt pipelines, transformations, and notebooks with one command
  - Deliver live, production-ready reports with streamlined access to the dataset
  We plan to support more functionality in the future, such as:
  - Deploy and run your data workflows in the cloud without any changes to code and schemas
  - Maintain pipelines with a Runtime Agent, customizable dashboards, and validation tests
  - Deploy live reports without worrying about schema drift or silent failures
- Cursor - An AI-powered code editor that lets you express tasks in natural language for an LLM agent to implement. Cursor is the first AI code editor we’ve integrated with, so the examples use Cursor, but the same workflow applies to Continue, Copilot, Cody, Windsurf, Cline, Claude, Amp, and Codex (only the UI/shortcuts differ).
- LLM-context - A curated collection of prompts, rules, docs, and examples provided to an LLM for specific tasks. A rich context leads to more accurate, bug-free code generation. dltHub provides tailored LLM-contexts for 1,000+ REST API sources, so you can go from idea to working pipeline in under 10 minutes.
Setup
Set up your AI editor/agent
1. Use the right model
For best results, use newer models. For example, in Cursor we’ve found that Claude-4-sonnet performs best (available in the paid version). Older or weaker models often struggle with context comprehension and workflows.
2. Add documentation
AI code editors let you upload documentation and code examples to provide additional context. The exact steps vary by tool, but here are two examples:
- Cursor (guide): Go to Settings > Indexing & Docs to add documentation.
- Continue (guide): In chat, type @Docs and press Enter, then click Add Docs.
For any editor or agent, we recommend adding documentation scoped to a specific task. At minimum, include:
- REST API source as @dlt_rest_api
- Core dlt concepts & usage as @dlt_docs
Install dlt workspace
pip install "dlt[workspace]"
Initialize workspace
dltHub provides prepared contexts for 1,000+ sources, available at https://dlthub.com/workspace. To get started, search for your API and follow the tailored instructions.
To initialize a dltHub workspace, execute the following:
dlt init dlthub:{source_name} duckdb
This command will first prompt you to choose an AI editor/agent. If you pick the wrong one, no problem: after the workspace is initialized, delete the incorrect editor rules and run dlt ai setup to select the editor again. This time it will only load the rules.
The dltHub workspace will be initialized with:
- Files and folder structure you know from dlt init
- Documentation scaffold for the specific source (typically a yaml file) optimized for LLMs
- Rules for dlt, configured for your selected AI editor/agent
- Pipeline script and REST API source definition ({source_name}_pipeline.py) that you'll customize in the next step
If you can't find the source you need, start with a generic REST API source template. Choose any source name that fits your API, for example:
dlt init dlthub:my_internal_fast_api duckdb
This will generate the full pipeline setup, including the script (my_internal_fast_api_pipeline.py) and all the files and folders you’d normally get with a standard dlt init.
To make your source available to the LLM, include its API documentation in the context so the model can understand how to use it.
Create dlt pipeline
Generate code
To get started quickly, we recommend using our pre-defined prompts tailored for each API. Visit https://dlthub.com/workspace and copy the prompt for your selected source. Prompts are adjusted per API to provide the most accurate and relevant context.
Here's a general prompt template you can adapt:
Please generate a REST API source for {source} API, as specified in @{source}-docs.yaml
Start with endpoints {endpoints you want} and skip incremental loading for now.
Place the code in {source}_pipeline.py and name the pipeline {source}_pipeline.
If the file exists, use it as a starting point.
Do not add or modify any other files.
Use @dlt_rest_api as a tutorial.
After adding the endpoints, allow the user to run the pipeline with python {source}_pipeline.py and await further instructions.
In this prompt, we use @ references to link source specifications and documentation. Make sure Cursor (or whichever AI editor/agent you use) recognizes the referenced docs. For example, see Cursor’s guide to @ references.
- @{source}-docs.yaml contains the source specification and describes the source with endpoints, parameters, and other details.
- @dlt_rest_api contains the documentation for dlt's REST API source.
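For orientation, the kind of code such a prompt tends to produce can be sketched as follows. This is a minimal sketch, not the output of any particular prompt: the base URL https://api.example.com/, the endpoint names posts and comments, and the helper names build_config and load are placeholders invented for illustration. It assumes dlt's declarative REST API source (rest_api_source in dlt.sources.rest_api) and imports dlt lazily, so the configuration can be inspected without running a load.

```python
from typing import Any, Dict, List


def build_config(base_url: str, endpoints: List[str]) -> Dict[str, Any]:
    """Build a declarative REST API config with one resource per endpoint."""
    return {
        "client": {"base_url": base_url},
        # A plain string is the shortest resource form: the name is also
        # used as the endpoint path and as the destination table name.
        "resources": list(endpoints),
    }


def load(source_name: str, config: Dict[str, Any]):
    """Run the pipeline into DuckDB. Requires dlt to be installed."""
    import dlt
    from dlt.sources.rest_api import rest_api_source

    pipeline = dlt.pipeline(
        pipeline_name=f"{source_name}_pipeline",
        destination="duckdb",
        dataset_name=f"{source_name}_data",
    )
    return pipeline.run(rest_api_source(config))


# Hypothetical API, used only to show the shape of the config.
config = build_config("https://api.example.com/", ["posts", "comments"])
```

Calling load("my_source", config) would then run the load. Real APIs usually also need authentication and pagination settings in the client section, which is where the source-specific docs in your context come in.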
Add credentials
Ask the LLM for credential setup instructions, then add the credentials to your workspace secrets file, .dlt/secrets.toml.
Run the pipeline
Run your pipeline:
python {source}_pipeline.py
Expected output:
Pipeline {source} load step completed in 0.26 seconds
1 load package(s) were loaded to destination duckdb and into dataset {source}_data
The duckdb destination used duckdb:/{source}.duckdb location to store data
Load package 1749667187.541553 is LOADED and contains no failed jobs
If the pipeline fails, pass error messages to the LLM. Restart after 4-8 failed attempts.
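One way to collect the error message for the LLM is to run the pipeline script from Python and keep only the tail of its stderr, where the traceback usually ends. The following is a minimal sketch using only the standard library; run_and_capture and tail_lines are hypothetical names, and the deliberately failing command stands in for running {source}_pipeline.py.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_capture(cmd: List[str], tail_lines: int = 40) -> Tuple[int, str]:
    """Run a command and return its exit code plus the last lines of stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    tail = "\n".join(result.stderr.splitlines()[-tail_lines:])
    return result.returncode, tail


# Stand-in for [sys.executable, "{source}_pipeline.py"]: a command that fails.
code, error_tail = run_and_capture(
    [sys.executable, "-c", "raise ValueError('bad paginator')"]
)
if code != 0:
    # Paste this text into the chat with your AI editor/agent.
    print(error_tail)
```

Keeping only the tail keeps the pasted context short while preserving the exception type and message.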
Validate with pipeline dashboard
Launch the dashboard to validate your pipeline:
dlt pipeline {source}_pipeline show
The dashboard shows:
- Pipeline overview with state and metrics
- Data schema (tables, columns, types)
- The data itself, which you can also explore with custom queries
The dashboard helps detect silent failures due to pagination errors, schema drift, or incremental load misconfigurations.
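A quick complement to eyeballing the dashboard is a row-count sanity check. The sketch below is a heuristic, not a dlt feature: suspicious_tables is a hypothetical helper, the counts are made-up numbers, and the page size of 100 is an assumption about the API. It flags empty tables and tables whose row count is an exact multiple of the page size, which can hint that pagination stopped early.

```python
from typing import Dict, List


def suspicious_tables(row_counts: Dict[str, int], page_size: int) -> List[str]:
    """Flag tables that are empty or whose size is a multiple of the page size."""
    flagged = []
    for table, count in sorted(row_counts.items()):
        if count == 0:
            flagged.append(f"{table}: empty")
        elif count % page_size == 0:
            flagged.append(
                f"{table}: {count} rows is a multiple of page size {page_size}"
            )
    return flagged


# Hypothetical counts, as you might read them off the dashboard.
counts = {"posts": 100, "comments": 437, "users": 0}
report = suspicious_tables(counts, page_size=100)
for line in report:
    print(line)
```

A flagged table is only a prompt to look closer: an endpoint can legitimately return exactly one full page or no rows at all.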
Use the data in a notebook
With the pipeline and data validated, you can continue with custom data explorations and reports. You can use your preferred environment, for example, Jupyter Notebook, Marimo Notebook, or a plain Python file.
For an optimized data exploration experience, we recommend using a Marimo notebook. Check out the detailed guide on using dlt with Marimo.
To access the data, you can use the dataset() method:
import dlt
my_data = dlt.pipeline("{source}_pipeline").dataset()
# get any table as Pandas frame
# my_data.{table_name}.df().head()
For more, see the dataset access guide.
Next steps: production deployment
Addon: bring your own LLM scaffold
LLMs can infer a REST API source definition from various types of input, and in many cases, it’s easy to provide what’s needed.
Here are a few effective ways to scaffold your source:
- FastAPI (Internal APIs). If you're using FastAPI, simply add a file with the autogenerated OpenAPI spec to your workspace and reference it in your prompt.
- Legacy code in any programming language. Add the relevant code files to your workspace and reference them directly in your prompt. LLMs can extract useful structure even from older codebases.
- Human-readable documentation. Well-written documentation works too. You can add it to your AI editor docs and reference it in your prompt for context.
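For the FastAPI case, the spec file can be written with a few lines of Python. This is a minimal sketch: it relies on FastAPI applications exposing an openapi() method that returns the spec as a dictionary, while write_openapi_spec, the HasOpenAPI protocol, and the file name openapi.json are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Protocol


class HasOpenAPI(Protocol):
    """Anything with an openapi() method, such as a FastAPI application."""

    def openapi(self) -> Dict[str, Any]: ...


def write_openapi_spec(app: HasOpenAPI, path: str) -> Path:
    """Dump the app's OpenAPI spec as JSON so it can be referenced in a prompt."""
    target = Path(path)
    target.write_text(json.dumps(app.openapi(), indent=2), encoding="utf-8")
    return target
```

With a real service you would call write_openapi_spec(app, "openapi.json") from the workspace root, where app is your FastAPI instance, and then reference the file in your prompt.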