Create a pipeline
This guide walks you through creating a pipeline that uses our REST API Client to fetch issues from the GitHub API and load them into DuckDB.
We're using DuckDB as the destination here, but you can adapt the steps to any source and destination by running the command dlt init <source> <destination> and tweaking the pipeline accordingly.
Please make sure you have installed dlt before following the steps below.
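If you haven't installed it yet, dlt is available from PyPI:
pip install dlt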
Task overview
Imagine you want to analyze issues from a GitHub project locally. To achieve this, you need to write code that accomplishes the following:
- Constructs a correct request.
- Authenticates your request.
- Fetches and handles paginated issue data.
- Stores the data for analysis.
This may sound complicated, but dlt provides a REST API Client that allows you to focus more on your data rather than on managing API interactions.
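As a preview of the pattern, here is a minimal sketch of a resource built on the REST API Client (unauthenticated and without a pipeline yet; the full working version is built step by step below using the same GitHub project):

import dlt
from dlt.sources.helpers.rest_client import paginate

@dlt.resource
def example_issues():
    # paginate() sends the request, follows the API's pagination,
    # and yields one page of results at a time
    for page in paginate("https://api.github.com/repos/dlt-hub/dlt/issues"):
        yield page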
1. Initialize project
Create a new empty directory for your dlt project by running:
mkdir github_api_duckdb && cd github_api_duckdb
Start a dlt project with a pipeline template that loads data to DuckDB by running:
dlt init github_api duckdb
Install the dependencies necessary for DuckDB:
pip install -r requirements.txt
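The init command scaffolds the files used in the rest of this guide. You should see at least the following (the exact layout may differ slightly between dlt versions):

github_api_duckdb/
├── .dlt/
│   └── secrets.toml
├── github_api.py
└── requirements.txt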
2. Obtain and add API credentials from GitHub
You will need to sign in to your GitHub account and create your access token via the Personal access tokens page.
Copy your new access token over to .dlt/secrets.toml:
[sources]
api_secret_key = '<api key value>'
This token will be used by github_api_source() to authenticate requests.
The secret name corresponds to the argument name in the source function.
Below, api_secret_key will get its value from secrets.toml when github_api_source() is called.
@dlt.source
def github_api_source(api_secret_key: str = dlt.secrets.value):
    return github_api_resource(api_secret_key=api_secret_key)
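If you prefer not to keep the token in a file, dlt can also pick it up from an environment variable. The section and key are joined with double underscores, so the secrets.toml entry above is equivalent to the following (a sketch, shown in Python only for illustration):

import os

# equivalent to the [sources] api_secret_key entry in .dlt/secrets.toml
os.environ["SOURCES__API_SECRET_KEY"] = "<api key value>"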
Run the github_api.py pipeline script to test that authentication headers look fine:
python github_api.py
Your API key should be printed out to stdout along with some test data.
3. Request project issues from the GitHub API
We will use the dlt repository (https://github.com/dlt-hub/dlt) as an example GitHub project; feel free to replace it with your own repository.
Modify github_api_resource in github_api.py to request issues data from your GitHub project's API:
from dlt.sources.helpers.rest_client import paginate
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator


@dlt.resource(write_disposition="replace")
def github_api_resource(api_secret_key: str = dlt.secrets.value):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues"

    # BearerTokenAuth adds the Authorization header; HeaderLinkPaginator
    # follows GitHub's Link response headers to fetch every page
    for page in paginate(
        url,
        auth=BearerTokenAuth(api_secret_key),  # type: ignore
        paginator=HeaderLinkPaginator(),
        params={"state": "open"},
    ):
        yield page
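Note that paginate() yields whole pages (lists of issues) and dlt loads the items on each page as individual rows. If you want more than the open issues, you can adjust the standard GitHub query parameters, for example (an optional variation, not needed for this guide):

# include closed issues and request larger pages
params={"state": "all", "per_page": 100}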
4. Load the data
Uncomment the commented-out code in the main function in github_api.py, so that running the python github_api.py command will now also run the pipeline:
if __name__ == '__main__':
    # configure the pipeline with your destination details
    pipeline = dlt.pipeline(
        pipeline_name='github_api_pipeline',
        destination='duckdb',
        dataset_name='github_api_data'
    )

    # run the resource on its own and print the data it yields
    data = list(github_api_resource())
    print(data)

    # run the pipeline with your parameters
    load_info = pipeline.run(github_api_source())

    # pretty print the information on data that was loaded
    print(load_info)
Run the github_api.py pipeline script to test that the API call works:
python github_api.py
This should print out JSON data containing the issues in the GitHub project. It also prints the load_info object.
Let's explore the loaded data with the command dlt pipeline <pipeline_name> show.
Make sure you have streamlit installed:
pip install streamlit
dlt pipeline github_api_pipeline show
This will open a Streamlit app that gives you an overview of the data loaded.
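If you prefer to inspect the data in code instead, you can query the DuckDB file directly. A minimal sketch, assuming the default file name (<pipeline_name>.duckdb in the working directory) and the dataset name configured above:

import duckdb

# the duckdb destination writes to github_api_pipeline.duckdb by default
conn = duckdb.connect("github_api_pipeline.duckdb")

# issues land in a table named after the resource, inside the dataset schema
print(conn.sql("SELECT title, state FROM github_api_data.github_api_resource LIMIT 5"))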
5. Next steps
With a functioning pipeline, consider exploring:
- Our REST Client.
- Deploy this pipeline with GitHub Actions, so that the data is automatically loaded on a schedule.
- Transform the loaded data with dbt or in Pandas DataFrames.
- Learn how to run, monitor, and alert when you put your pipeline in production.
- Try loading data to a different destination like Google BigQuery, Amazon Redshift, or Postgres.