
Initialize a pipeline


This page describes a dltHub feature, which requires a license. Join our early access program for a trial license.

This guide walks you through creating and initializing a dlt pipeline in dltHub Workspace — whether manually, with LLM assistance, or from one of the verified sources maintained by the dltHub team.

Overview

A dlt pipeline moves data from a source (like an API or database) into a destination (like DuckDB, Snowflake, or Iceberg). Initializing a pipeline is the first step in the data workflow. You can create one in three CLI-based ways:

| Method | Command | Best for |
| --- | --- | --- |
| Manual | dlt init <source> <destination> | Developers who prefer manual setup |
| LLM-native | dlt init dlthub:<source> <destination> | AI-assisted development with editors like Cursor |
| Verified source | dlt init <verified_source> <destination> | Prebuilt, tested connectors from the community and the dltHub team |

Step 0: Install dlt with workspace support

To use workspace functionality, install dlt with the workspace extra:

pip install "dlt[workspace]"

This adds support for AI-assisted workflows and the dlt ai command.

dlt Workspace is a unified environment for developing, running, and maintaining data pipelines — from local development to production.

More about dlt Workspace ->

Step 1: Initialize a custom pipeline

Manual setup (standard workflow)

A lightweight, code-first approach ideal for developers comfortable with Python.

dlt init {source_name} duckdb

For example:

dlt init my_github_pipeline duckdb

It scaffolds the pipeline template — a minimal starter project with a single Python script that shows three quick ways to load data into DuckDB using dlt:

  • fetch JSON from a public REST API (Chess.com as an example) with requests,
  • read a public CSV with pandas, and
  • pull rows from a SQL database via SQLAlchemy.

The file also includes an optional GitHub REST client example (a @dlt.resource + @dlt.source) that can use a token from .dlt/secrets.toml, but will work unauthenticated at low rate limits. It’s meant as a hands-on playground you can immediately run and then adapt into a real pipeline.
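For orientation, here is a minimal sketch of what the template's REST API example boils down to. The resource name, pipeline name, and Chess.com endpoint below are illustrative; the generated script differs in detail:

import dlt
import requests

@dlt.resource(name="player_games", write_disposition="replace")
def chess_games():
    # Fetch JSON from a public REST API (endpoint and username are placeholders)
    url = "https://api.chess.com/pub/player/magnuscarlsen/games/2024/01"
    response = requests.get(url, headers={"User-Agent": "dlt-quickstart"})
    response.raise_for_status()
    # Yield individual game records so dlt can normalize them into a table
    yield from response.json()["games"]

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="duckdb",
    dataset_name="chess_data",
)
print(pipeline.run(chess_games()))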

Learn how to build your own dlt pipeline with the dlt Fundamentals course.

LLM-native setup

A collaborative AI-human workflow that integrates dlt with AI editors and agents such as Cursor.

Initialize your first workspace pipeline

dltHub provides prepared contexts for thousands of different sources, available at https://dlthub.com/workspace. To get started, search for your API and follow the tailored instructions.


To initialize a dltHub workspace, execute the following:

dlt init dlthub:{source_name} duckdb

For example:

dlt init dlthub:github duckdb

The command scaffolds a workspace-ready REST API pipeline project with AI-assisted development support.

It creates:

  • A {source_name}_pipeline.py file containing a placeholder REST API source (@dlt.source) using RESTAPIConfig and rest_api_resources, preconfigured for the DuckDB destination.
  • A .dlt/secrets.toml file where you can store API credentials and tokens.
  • Dependency instructions suggesting adding dlt[duckdb]>=1.18.0a0 to your pyproject.toml.
  • AI assistant rule files that enable dlt ai workflows.
  • A {source_name}-docs.yaml file providing source-specific context for the LLM.

The command first prompts you to choose an AI editor/agent. If you pick the wrong one, no problem: after initializing the workspace, you can delete the incorrect editor rules and run dlt ai setup to select the editor again.
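For a rough idea of what the generated placeholder looks like, here is a hedged sketch of a REST API source built with RESTAPIConfig and rest_api_resources. The base URL, auth type, and endpoint are assumptions for GitHub; the scaffolded file for your source lists its own endpoints:

import dlt
from dlt.sources.rest_api import RESTAPIConfig, rest_api_resources

@dlt.source
def github_source(access_token: str = dlt.secrets.value):
    config: RESTAPIConfig = {
        "client": {
            "base_url": "https://api.github.com/",
            "auth": {"type": "bearer", "token": access_token},
        },
        "resources": [
            # Illustrative endpoint; the generated file is preconfigured per source
            {"name": "issues", "endpoint": {"path": "repos/dlt-hub/dlt/issues"}},
        ],
    }
    yield from rest_api_resources(config)

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
)
print(pipeline.run(github_source()))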

Generate code

To get started quickly, we recommend using our pre-defined prompts tailored for each API. Visit https://dlthub.com/workspace and copy the prompt for your selected source. Prompts are adjusted per API to provide the most accurate and relevant context.

Here's a general prompt template you can adapt:

Please generate a REST API source for {source} API, as specified in @{source}-docs.yaml
Start with endpoints {endpoints you want} and skip incremental loading for now.
Place the code in {source}_pipeline.py and name the pipeline {source}_pipeline.
If the file exists, use it as a starting point.
Do not add or modify any other files.
Use @dlt_rest_api as a tutorial.
After adding the endpoints, allow the user to run the pipeline with python {source}_pipeline.py and await further instructions.

In this prompt, we use @ references to link source specifications and documentation. Make sure Cursor (or whichever AI editor/agent you use) recognizes the referenced docs. For example, see Cursor’s guide to @ references.

  • @{source}-docs.yaml contains the source specification and describes the source with endpoints, parameters, and other details.
  • @dlt_rest_api contains the documentation for dlt's REST API source.

For more on the workspace concept, see LLM-native workflow ->

Verified source setup (community connectors)

You can also initialize a verified source — prebuilt connectors contributed and maintained by the dlt team and community.

List and select a verified source

List available sources:

dlt init -l

Pick one, for example:

dlt init github duckdb

Project structure

This command creates a project like:

├── .dlt/
│   ├── config.toml
│   └── secrets.toml
├── github/
│   ├── __init__.py
│   ├── helpers.py
│   ├── queries.py
│   ├── README.md
│   └── settings.py
└── github_pipeline.py

Follow the command output to install dependencies and add secrets.
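The scaffolded github_pipeline.py typically imports source functions from the github/ package and runs them with dlt.pipeline. A simplified sketch, assuming the source exposes a github_reactions source taking an owner and repository name (check the generated script for the actual functions and arguments):

import dlt
from github import github_reactions  # function names come from the generated github/ package

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
)
load_info = pipeline.run(github_reactions("duckdb", "duckdb"))
print(load_info)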

General process

To initialize any verified source:

dlt init {source_name} {destination_name}

For example:

dlt init google_ads duckdb
dlt init mongodb bigquery

After running the command:

  • The project directory and required files are created.
  • You’ll be prompted to install dependencies.
  • Add your credentials to .dlt/secrets.toml.

Update or customize verified sources

You can modify an existing verified source in place.

  • If your changes are generally useful, consider contributing them back via PR.
  • If they’re specific to your use case, make them modular so you can still pull upstream updates.
info

dlt includes several powerful, built-in sources for extracting data from different systems:

  • rest_api — extract data from any REST API using a declarative configuration for endpoints, pagination, and authentication.
  • sql_database — load data from 30+ SQL databases via SQLAlchemy, PyArrow, pandas, or ConnectorX. Supports automatic table reflection and all major SQL dialects.
  • filesystem — load files from local or cloud storage (S3, GCS, Azure Blob, Google Drive, SFTP). Natively supports CSV, Parquet, and JSONL formats.

Together, these sources cover the most common data ingestion scenarios — from APIs and databases to files.
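As an illustration, here is a minimal sketch that uses the built-in sql_database and filesystem sources; the connection string, bucket URL, and table names are placeholders:

import dlt
from dlt.sources.sql_database import sql_database
from dlt.sources.filesystem import filesystem, read_csv

pipeline = dlt.pipeline(pipeline_name="ingest_demo", destination="duckdb", dataset_name="demo")

# Reflect and load two tables from a SQL database (placeholder credentials)
db_source = sql_database("postgresql://user:password@localhost:5432/mydb").with_resources("orders", "customers")
pipeline.run(db_source)

# Load CSV files from a bucket (placeholder URL), parsed with the read_csv transformer
csv_files = filesystem(bucket_url="s3://my-bucket/data/", file_glob="*.csv") | read_csv()
pipeline.run(csv_files.with_name("csv_rows"))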

Read more about verified sources ->

Step 2: Add credentials

Most pipelines require authentication or connection details such as API keys, passwords, or database credentials. dlt retrieves these values automatically through config providers, which it checks in order when your pipeline runs.

Provider priority:

  1. Environment variables – the highest priority.

    export SOURCES__GITHUB__API_SECRET_KEY="<github_personal_access_token>"
    export DESTINATION__DUCKDB__CREDENTIALS="duckdb:///_storage/github_data.duckdb"
  2. .dlt/secrets.toml and .dlt/config.toml – created automatically when initializing a pipeline.

    • secrets.toml → for sensitive values (API tokens, passwords)
    • config.toml → for non-sensitive configuration

    Example:
    [sources.github]
    api_secret_key = "<github_personal_access_token>"

    [destination.duckdb]
    credentials = "duckdb:///_storage/github_data.duckdb"
  3. Vaults – such as Google Secret Manager, Azure Key Vault, AWS Secrets Manager, or Airflow Variables.

  4. Custom providers – added via register_provider() for your own configuration formats.

  5. Default values – from your function signatures.

Using credentials in code

dlt automatically injects secrets into your functions when you call them. For example:

import dlt

@dlt.source
def github_api_source(api_secret_key: str = dlt.secrets.value):
    # github_api_resource is defined elsewhere in the pipeline script
    return github_api_resource(api_secret_key=api_secret_key)

You don’t need to load secrets manually — dlt resolves them from any of the above providers.
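If you need a resolved value outside of function injection, you can also read it explicitly through dlt.secrets; the keys below are examples matching the TOML snippet above:

import dlt

api_key = dlt.secrets["sources.github.api_secret_key"]
duckdb_credentials = dlt.secrets["destination.duckdb.credentials"]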

Read more about setting credentials ->

Step 3: Run a pipeline

Run your script

Run the pipeline to verify that everything works correctly:

python {source_name}_pipeline.py

This executes your pipeline — fetching data from the source, normalizing it, and loading it into your chosen destination.

What you should see

A printed load_info summary similar to:

Pipeline github_api_pipeline completed in 0.7 seconds
1 load package(s) were loaded to destination duckdb and into dataset github_data
Load package 1749667187.541553 is COMPLETED and contains no failed jobs
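In code, load_info can be inspected directly, for example to fail loudly if any job did not load (pipeline and source names here follow the earlier sketches):

load_info = pipeline.run(github_source())
print(load_info)
# Raise an exception if any load job failed
load_info.raise_on_failed_jobs()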

Monitor progress (optional)

pip install enlighten
PROGRESS=enlighten python {source_name}_pipeline.py

Alternatives: tqdm, alive_progress, or PROGRESS=log.
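The progress collector can also be set in code when the pipeline is created, instead of via the PROGRESS environment variable:

import dlt

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
    progress="enlighten",  # or "tqdm", "alive_progress", "log"
)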

See monitor loading progress ->

Inspect loads & trace

dlt pipeline {pipeline_name} info               # overview
dlt pipeline {pipeline_name} load-package       # latest package
dlt pipeline -v {pipeline_name} load-package    # with schema changes
dlt pipeline {pipeline_name} trace              # last run trace & errors
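The same information is available from Python by attaching to a local pipeline by name; treat this as a sketch, with the pipeline name as a placeholder:

import dlt

pipeline = dlt.attach(pipeline_name="github_pipeline")
print(pipeline.last_trace)                      # trace of the last run, including errors
print(pipeline.last_trace.last_normalize_info)  # row counts per table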

Read more about running a pipeline ->

Next steps: Deploy and scale

Once your pipeline runs locally, you're ready to deploy and scale it.

