
From Airbyte YAML to Scalable Python Pipelines with dlt (dltHub), Cursor and LLMs

  • Adrian Brudaru,
    Co-Founder & CDO
tldr: This is the first article in a series where we explore how to generate pipelines with LLMs. We will summarise our learnings and make them available to you, possibly in our docs.


Watch the video here:


🧩 Problem:

LLM-generated data pipelines sound promising, but in practice they struggle.

In a LinkedIn poll last week, more than two-thirds of the audience reported using LLMs to help build data pipelines. However, generating a working pipeline from documentation alone is not yet reliable, because:

  • Key parameters like pagination, authentication, primary keys, and incremental strategies are often missing from API documentation.
  • LLMs can’t always infer this missing information correctly.
  • Even resources like OpenAPI specs don’t contain everything needed, especially not semantic details like how to load incrementally or how tables are linked.
So the fundamental problem is this: parameter extraction is hard, and without all the pieces, LLMs can’t reliably generate working pipeline code.

🛠️ Solution Approach:

To test the limits of LLM-based codegen, I wanted to isolate the problem parts, starting with the easiest possible feature extraction, to see if the LLM would be able to build a pipeline with it.

In the following explorations, we will increase the complexity of the feature extraction to understand where the current limits are.

The steps I took for this scenario:

  1. Used an Airbyte YAML source as a starting point — because it already includes the missing pieces (primary key, incremental strategy).
  2. Fed this YAML into a Cursor project configured with custom LLM prompts and documentation context.
  3. Used GPT to convert the YAML into a Python pipeline using dlt’s REST API source.

Along the way:

  • I prepared and added the relevant REST API documentation manually, since the LLM couldn’t parse the official docs reliably. On my first attempt, the LLM was unable to read information about incremental loading that I could see existed in the docs. The manual preparation consisted of passing the original docs to an LLM and asking for an "LLM-friendly, compressed version".

🚀 Results:

  • On a simple example API, the LLM successfully generated the dlt pipeline on the first try, including incremental loading and authentication (a sketch of the result follows below).
  • This was surprisingly successful, given the complexity of the task.
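
For illustration, here is a minimal sketch of what such a generated pipeline can look like for the Guardian content API. The endpoint path, parameter and field names, and the secret name guardian_api_key are assumptions drawn from the public Guardian API docs and the dlt REST API source docs, not the verbatim LLM output.

import dlt
from dlt.sources.rest_api import rest_api_source

# Hypothetical configuration for the Guardian "search" endpoint
guardian_source = rest_api_source({
    "client": {
        "base_url": "https://content.guardianapis.com/",
        # The Guardian API expects the key as an "api-key" query parameter
        "auth": {
            "type": "api_key",
            "name": "api-key",
            "api_key": dlt.secrets["guardian_api_key"],
            "location": "query",
        },
        # Page-number pagination; total page count comes from response.pages
        "paginator": {
            "type": "page_number",
            "base_page": 1,
            "page_param": "page",
            "total_path": "response.pages",
        },
    },
    "resources": [
        {
            "name": "content",
            "endpoint": {
                "path": "search",
                "data_selector": "response.results",
                "params": {
                    # Incremental loading keyed on the publication date
                    "from-date": {
                        "type": "incremental",
                        "cursor_path": "webPublicationDate",
                        "initial_value": "2024-01-01",
                    },
                },
            },
            "primary_key": "id",
            "write_disposition": "merge",
        }
    ],
})

pipeline = dlt.pipeline(pipeline_name="guardian", destination="duckdb", dataset_name="guardian_data")
pipeline.run(guardian_source)

The merge write disposition together with the primary key means re-loaded pages are deduplicated on the article id instead of creating duplicate rows.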

🔑 Key Learnings:

- LLMs aren’t good at guessing missing schema logic; give them as much structured context as possible.

- Preparation of docs matters. Compress and rewrite API documentation in LLM-friendly formats to improve results.

- Custom prompts and instructions make or break the outcome. Telling the LLM “you are a data engineer, do this, don't do that” leads to much better behaviour than generic requests.

- The final mile might still be human. Some information is just not available in the docs, and some APIs have gotchas.

- Cursor + dlt + LLMs can work surprisingly well, especially when grounded in existing YAML configs like Airbyte.

Resources

The source: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-the-guardian-api/manifest.yaml

The Cursor rule (a crude experiment; I am sure you can do better):

You are an expert data engineer specializing in building data pipelines using the dltHub REST API source. Your primary task is to analyze documentation or other types of templates about REST APIs, and fill in dltHub REST API templates efficiently. Follow these guidelines:

1. Prioritize using from dlt.sources.rest_api import rest_api_source; try not to write custom Python code, just use the REST API source templates.
2. Create resources for each endpoint.
3. Consider the authentication and pagination for the client.
4. Configure resources with the data path and pagination info.
5. Configure incremental loading on resources where possible. Set the appropriate write_disposition (append, replace, merge) based on the data loading requirements.
6. Use dlt.pipeline to configure the pipeline, specifying the destination and dataset name.
7. Use dlt secrets for storing and accessing sensitive information like API keys.
8. Provide clear comments explaining the purpose of each resource and any complex logic.

When presented with documentation or requirements, analyze them to extract relevant information for building the pipeline. Fill in the dltHub REST API templates with the appropriate configuration, ensuring optimal performance and data integrity.

The LLM-friendly REST API documentation that I converted:

REST API Source - dlt Documentation (LLM-Optimized)

Overview

The REST API source extracts JSON data from RESTful APIs using declarative configurations. Configure endpoints, pagination, incremental loading, authentication, and data selection in a clear, structured way.

Quick Start Example

import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/",
        "auth": {"token": dlt.secrets["api_token"]},
        "paginator": {"type": "json_link", "next_url_path": "paging.next"},
    },
    "resources": ["posts", {"name": "comments", "endpoint": {"path": "posts/{resources.posts.id}/comments"}}]
})

pipeline = dlt.pipeline(pipeline_name="example", destination="duckdb", dataset_name="api_data")
pipeline.run(source)

Key Configuration Sections

Client Configuration

Defines connection details to the REST API:

base_url (str): API root URL.

headers (dict, optional): Additional HTTP headers.

auth (dict/object, optional): Authentication details.

paginator (dict/object, optional): Pagination method.
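
A minimal client section, using placeholder values, might look like this:

"client": {
    "base_url": "https://api.example.com/",
    "headers": {"Accept": "application/json"},
    "auth": {"type": "bearer", "token": dlt.secrets["api_token"]},
    "paginator": {"type": "offset", "limit": 100},
}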

Resources

List of API endpoints to load data from. Each resource defines:

name: Resource/table name.

endpoint: Endpoint details (path, params, method).

primary_key (optional): Primary key for merging data.

write_disposition (optional): Merge/append behavior.

processing_steps (optional): Data filtering and transformation steps.
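
Putting these fields together, a single resource entry might look like this (endpoint path, param and key names are placeholders):

"resources": [
    {
        "name": "posts",
        "endpoint": {"path": "posts", "params": {"per_page": 100}},
        "primary_key": "id",
        "write_disposition": "merge",
    }
]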

Pagination Methods

The REST API source supports common pagination patterns:

json_link: Next URL from JSON response (next_url_path).

header_link: Next page URL from HTTP headers.

offset: Numeric offsets (limit, offset_param).

page_number: Incremental page numbers (base_page, page_param, total_path).

cursor: Cursor-based pagination (cursor_path, cursor_param).

Custom paginators: Extendable for specialized cases.

Example Configuration:

"paginator": {"type": "page_number", "base_page": 1, "page_param": "page", "total_path": "response.pages"}

Incremental Loading

Load only new/updated data using timestamps or IDs:

Simple Incremental Example:

"params": {
    "since": {"type": "incremental", "cursor_path": "updated_at", "initial_value": "2024-01-01T00:00:00Z"}
}

Authentication

Supported methods:

Bearer Token

HTTP Basic

API Key (header/query)

OAuth2 Client Credentials

Custom authentication classes

Bearer Token Example:

"auth": {"type": "bearer", "token": dlt.secrets["api_token"]}

Data Selection (JSONPath)

Explicitly specify data locations in JSON responses:

Example:

"endpoint": {"path": "posts", "data_selector": "response.items"}

Resource Relationships

Fetch related resources using placeholders referencing parent fields:

Path Parameter Example:

{"path": "posts/{resources.posts.id}/comments"}

Query Parameter Example:

{"params": {"post_id": "{resources.posts.id}"}}

Processing Steps (Filter/Transform)

Apply transformations before loading:

"processing_steps": [
    {"filter": "lambda x: x['id'] < 10"},
    {"map": "lambda x: {**x, 'title': x['title'].lower()}"}
]

Troubleshooting

Validation Errors: Check resource structure (endpoint paths, params).

Incorrect Data: Verify JSONPaths (data_selector).

Pagination Issues: Explicitly set paginator type; check total_path correctness.

Authentication Issues: Verify credentials; ensure correct auth method.

Call to action:

You can try the above in under an hour. Stop wasting your time hand-writing pipelines and give it a try.