Version: devel

Create a pipeline

This guide walks you through creating a pipeline that uses our REST API Client to fetch data from the GitHub API and load it into DuckDB.

tip

We're using DuckDB as a destination here, but you can adapt the steps to any source and destination by using the command dlt init <source> <destination> and tweaking the pipeline accordingly.

Please make sure you have installed dlt before following the steps below.

Task overview

Imagine you want to analyze issues from a GitHub project locally. To achieve this, you need to write code that accomplishes the following:

Constructs a correct request.
Authenticates your request.
Fetches and handles paginated issue data.
Stores the data for analysis.

This may sound complicated, but dlt provides a REST API Client that allows you to focus more on your data rather than on managing API interactions.
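To give a sense of what the first two steps involve without the REST API Client, here is a minimal sketch that uses only the Python standard library. It builds an authenticated request for the issues endpoint but does not send it. The function name build_issues_request and the placeholder token are assumptions made for illustration; they are not part of dlt or the pipeline template.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_issues_request(owner: str, repo: str, token: str, state: str = "open") -> Request:
    """Build (but do not send) a GET request for a repository's issues."""
    # Step 1: construct the URL, including the query string.
    query = urlencode({"state": state})
    url = f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"
    # Step 2: authenticate by attaching the token as a bearer credential.
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return Request(url, headers=headers)


# The token below is a placeholder, not a real credential.
request = build_issues_request("dlt-hub", "dlt", token="<api key value>")
print(request.get_method(), request.full_url)
```

Pagination and storage would still have to be handled on top of this, which is the part the rest of this guide delegates to dlt.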

1. Initialize project

Create a new empty directory for your dlt project by running:

mkdir github_api_duckdb && cd github_api_duckdb

Start a dlt project with a pipeline template that loads data to DuckDB by running:

dlt init github_api duckdb

Install the dependencies necessary for DuckDB:

pip install -r requirements.txt

2. Obtain and add API credentials from GitHub

Sign in to your GitHub account and create an access token on the Personal access tokens page.

Copy your new access token over to .dlt/secrets.toml:

[sources]
api_secret_key = '<api key value>'

This token will be used by github_api_source() to authenticate requests.

The secret name corresponds to the argument name in the source function. Below, api_secret_key will get its value from secrets.toml when github_api_source() is called.

@dlt.source
def github_api_source(api_secret_key: str = dlt.secrets.value):
    return github_api_resource(api_secret_key=api_secret_key)
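As an alternative to secrets.toml, dlt can also pick up the same value from an environment variable whose name is the TOML path in upper case, with sections joined by double underscores. The sketch below only illustrates that naming rule; env_var_name is a hypothetical helper, not a dlt function, and the token is a placeholder.

```python
import os


def env_var_name(*path: str) -> str:
    """Join TOML sections and key into an upper-case, double-underscore name."""
    return "__".join(part.upper() for part in path)


# [sources] api_secret_key  ->  SOURCES__API_SECRET_KEY
name = env_var_name("sources", "api_secret_key")

# Set the variable for the current process only (placeholder value).
os.environ[name] = "<api key value>"
print(name)
```

Setting the variable in your shell before running the pipeline script has the same effect and keeps the token out of files in the project directory.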

Run the github_api_pipeline.py pipeline script to test that authentication headers look fine:

python github_api_pipeline.py

Your API key should be printed out to stdout along with some test data.

3. Request project issues from the GitHub API

tip

We will use the dlt repository (https://github.com/dlt-hub/dlt) as an example GitHub project. Feel free to replace it with your own repository.

Modify github_api_resource in github_api_pipeline.py to request issues data from your GitHub project's API:

import dlt
from dlt.sources.helpers.rest_client import paginate
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

@dlt.resource(write_disposition="replace")
def github_api_resource(api_secret_key: str = dlt.secrets.value):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues"

    for page in paginate(
        url,
        auth=BearerTokenAuth(api_secret_key), # type: ignore
        paginator=HeaderLinkPaginator(),
        params={"state": "open"}
    ):
        yield page
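The resource above relies on header-link pagination: each GitHub response carries a Link header, and the entry marked rel="next" points at the following page. The sketch below shows the general idea with the standard library only. It is not how HeaderLinkPaginator is implemented; next_link, iter_pages, and the in-memory FAKE_API (with its example.test URLs) are hypothetical names and data used so the example runs without network access.

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Matches one entry of a Link header: <url>; rel="name"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')

Page = List[dict]


def next_link(link_header: Optional[str]) -> Optional[str]:
    """Return the URL marked rel="next", or None if there is no next page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


def iter_pages(
    first_url: str, fetch: Callable[[str], Tuple[Page, Optional[str]]]
) -> Iterator[Page]:
    """Yield pages one by one, following rel="next" links until none is left."""
    url: Optional[str] = first_url
    while url:
        items, link_header = fetch(url)
        yield items
        url = next_link(link_header)


# Made-up two-page API: URL -> (items, Link header value).
FAKE_API: Dict[str, Tuple[Page, Optional[str]]] = {
    "https://example.test/issues?page=1": (
        [{"number": 1}, {"number": 2}],
        '<https://example.test/issues?page=2>; rel="next", '
        '<https://example.test/issues?page=2>; rel="last"',
    ),
    "https://example.test/issues?page=2": (
        [{"number": 3}],
        '<https://example.test/issues?page=1>; rel="prev", '
        '<https://example.test/issues?page=1>; rel="first"',
    ),
}

pages = list(iter_pages("https://example.test/issues?page=1", FAKE_API.__getitem__))
print(pages)
```

Yielding one page at a time, as the resource does, means the full issue list never has to be held in memory at once.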

4. Load the data

Uncomment the commented-out code in the main function in github_api_pipeline.py, so that running the python github_api_pipeline.py command will now also run the pipeline:

if __name__=='__main__':
    # configure the pipeline with your destination details
    pipeline = dlt.pipeline(
        pipeline_name='github_api_pipeline',
        destination='duckdb',
        dataset_name='github_api_data'
    )

    # print credentials by running the resource
    data = list(github_api_resource())

    # print the data yielded from resource
    print(data)

    # run the pipeline with your parameters
    load_info = pipeline.run(github_api_source())

    # pretty print the information on data that was loaded
    print(load_info)

Run the github_api_pipeline.py pipeline script to test that the API call works:

python github_api_pipeline.py

This should print out JSON data containing the issues in the GitHub project.

It also prints the load_info object.

Let's explore the loaded data with the command dlt pipeline <pipeline_name> show.

info

You will need to install the workspace extra first: pip install "dlt[workspace]"

dlt pipeline github_api_pipeline show

This will open the workspace dashboard app that gives you an overview of the data loaded.
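When browsing the loaded tables, it may help to know that dlt generally flattens nested JSON objects into columns whose names are joined with double underscores, and moves nested lists into separate child tables. The sketch below is a simplified illustration of the column-naming part only, not dlt's actual normalizer; the flatten helper and the sample issue are made up for this example.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with '__'."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}__{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Made-up issue with a nested "user" object.
sample_issue = {"number": 1, "title": "Example", "user": {"login": "octocat", "id": 5}}
row = flatten(sample_issue)
print(row)
```

Under this convention, the author of an issue would be found in a column such as user__login rather than in a nested user field.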

5. Next steps

With a functioning pipeline, consider exploring:

Learn more about our REST Client.
Deploy this pipeline with GitHub Actions, so that the data is automatically loaded on a schedule.
Transform the loaded data with dbt or in Pandas DataFrames.
Learn how to run, monitor, and alert when you put your pipeline in production.
Try loading data to a different destination like Google BigQuery, Amazon Redshift, or Postgres.
