Version: 1.5.0 (latest)

Create a pipeline

This guide walks you through creating a pipeline that uses our REST API Client to connect to DuckDB.

tip

We're using DuckDB as a destination here, but you can adapt the steps to any source and destination by using the command dlt init <source> <destination> and tweaking the pipeline accordingly.

Please make sure you have installed dlt before following the steps below.

Task overview

Imagine you want to analyze issues from a GitHub project locally. To achieve this, you need to write code that accomplishes the following:

  1. Constructs a correct request.
  2. Authenticates your request.
  3. Fetches and handles paginated issue data.
  4. Stores the data for analysis.

This may sound complicated, but dlt provides a REST API Client that allows you to focus more on your data rather than on managing API interactions.
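To get a feel for what the REST API Client saves you from, here is a rough, simplified sketch of the Link-header pagination logic it handles for you. This is an illustration only, not dlt's actual implementation:

```python
def parse_next_link(link_header):
    """Extract the rel="next" URL from a GitHub-style Link header, or return None.

    GitHub returns a Link header such as:
    <https://api.github.com/...?page=2>; rel="next", <https://...?page=5>; rel="last"
    """
    if not link_header:
        return None
    for part in link_header.split(","):
        url_part, _, rel_part = part.partition(";")
        if 'rel="next"' in rel_part:
            # strip whitespace and the surrounding angle brackets
            return url_part.strip().strip("<>")
    return None
```

Without dlt, you would loop on this function yourself, re-requesting each `next` URL until it returns None; the REST API Client's HeaderLinkPaginator does this for you.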

1. Initialize project

Create a new empty directory for your dlt project by running:

mkdir github_api_duckdb && cd github_api_duckdb

Start a dlt project with a pipeline template that loads data to DuckDB by running:

dlt init github_api duckdb

Install the dependencies necessary for DuckDB:

pip install -r requirements.txt

2. Obtain and add API credentials from GitHub

You will need to sign in to your GitHub account and create your access token via the Personal access tokens page.

Copy your new access token over to .dlt/secrets.toml:

[sources]
api_secret_key = '<api key value>'

This token will be used by github_api_source() to authenticate requests.

The secret name corresponds to the argument name in the source function. Below, api_secret_key will get its value from secrets.toml when github_api_source() is called.

@dlt.source
def github_api_source(api_secret_key: str = dlt.secrets.value):
    return github_api_resource(api_secret_key=api_secret_key)

Run the github_api.py pipeline script to test that authentication headers look fine:

python github_api.py

Your API key should be printed out to stdout along with some test data.

3. Request project issues from the GitHub API

tip

We will use the dlt repository (https://github.com/dlt-hub/dlt) as an example GitHub project; feel free to replace it with your own repository.

Modify github_api_resource in github_api.py to request issues data from your GitHub project's API:

from dlt.sources.helpers.rest_client import paginate
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

@dlt.resource(write_disposition="replace")
def github_api_resource(api_secret_key: str = dlt.secrets.value):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues"

    for page in paginate(
        url,
        auth=BearerTokenAuth(api_secret_key),  # type: ignore
        paginator=HeaderLinkPaginator(),
        params={"state": "open"},
    ):
        yield page

4. Load the data

Uncomment the code in the main function of github_api.py so that running python github_api.py now also runs the pipeline:

if __name__ == "__main__":
    # configure the pipeline with your destination details
    pipeline = dlt.pipeline(
        pipeline_name="github_api_pipeline",
        destination="duckdb",
        dataset_name="github_api_data",
    )

    # run the resource and collect the data it yields
    data = list(github_api_resource())

    # print the data yielded from the resource
    print(data)

    # run the pipeline with your parameters
    load_info = pipeline.run(github_api_source())

    # pretty print information about the data that was loaded
    print(load_info)

Run the github_api.py pipeline script to test that the API call works:

python github_api.py

This should print out JSON data containing the issues in the GitHub project.

It also prints the load_info object.

Let's explore the loaded data with the command dlt pipeline <pipeline_name> show.

info

Make sure you have streamlit installed: pip install streamlit

dlt pipeline github_api_pipeline show

This will open a Streamlit app that gives you an overview of the data loaded.

5. Next steps

With a functioning pipeline, consider exploring:

Create a pipeline with GPT-4

Create a dlt pipeline from the data source of your liking and let GPT-4 write the resource functions and help you debug the code. This demo runs on GitHub Codespaces, a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through further steps. The demo uses the Continue VSCode extension.

Off to codespaces!
