Load data from a REST API
This tutorial demonstrates how to extract data from a REST API using dlt's REST API source and load it into a destination. You will learn how to build a data pipeline that loads data from the Pokemon and GitHub APIs into a local DuckDB database.
Extracting data from an API is straightforward with dlt: provide the base URL, define the resources you want to fetch, and dlt will handle the pagination, authentication, and data loading.
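As a rough illustration of "provide the base URL, define the resources", here is a minimal sketch of a declarative source configuration. It assumes a dlt version in which rest_api_source can be imported from dlt.sources.rest_api; the PokeAPI base URL, the three resource names, and the page-size parameter are assumptions chosen for illustration, and the helper names build_pokemon_config and run_pipeline are hypothetical. The script generated later in this tutorial may differ in detail.

```python
from typing import Any, Dict


def build_pokemon_config() -> Dict[str, Any]:
    """Return a declarative REST API source configuration as a plain dict."""
    return {
        # Base URL shared by every resource below (assumed PokeAPI endpoint).
        "client": {"base_url": "https://pokeapi.co/api/v2/"},
        # Defaults applied to each resource; the page size of 1000 is an
        # assumption chosen for illustration.
        "resource_defaults": {"endpoint": {"params": {"limit": 1000}}},
        # Each name serves as both the resource name and the endpoint path.
        "resources": ["pokemon", "berry", "location"],
    }


def run_pipeline() -> None:
    """Load the configured resources into a local DuckDB database."""
    # Imported lazily so the configuration above can be inspected without dlt.
    import dlt
    from dlt.sources.rest_api import rest_api_source

    pipeline = dlt.pipeline(
        pipeline_name="rest_api_pokemon",
        destination="duckdb",
        dataset_name="rest_api_data",
    )
    load_info = pipeline.run(rest_api_source(build_pokemon_config()))
    print(load_info)
```

Calling run_pipeline() once dlt and DuckDB are installed should execute the load. Keeping the configuration in a separate function makes it easy to inspect or adjust before any request is sent.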
What you will learn
- How to set up a REST API source
- Configuration basics for API endpoints
- Configuring the destination database
- Relationships between different resources
- How to append, replace, and merge data in the destination
- Loading data incrementally by fetching only new or updated data
Prerequisites
- Python 3.9 or higher installed
- Virtual environment set up
Installing dlt
Before we start, make sure you have a Python virtual environment set up. Follow the instructions in the installation guide to create a new virtual environment and install dlt.
Verify that dlt is installed by running the following command in your terminal:
dlt --version
If you see the version number (such as "dlt 0.5.3"), you're ready to proceed.
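The same check can be done from Python, which is handy inside a script or notebook. This is a small sketch using only the standard library; the helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("dlt")
print(f"dlt {version}" if version else "dlt is not installed in this environment")
```

If the message says dlt is not installed, the active interpreter is probably not the one from your virtual environment.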
Setting up a new project
Initialize a new dlt project with a REST API source and DuckDB destination:
dlt init rest_api duckdb
dlt init creates multiple files and a directory for your project. Let's take a look at the project structure:
rest_api_pipeline.py
requirements.txt
.dlt/
    config.toml
    secrets.toml
Here's what each file and directory contains:
- rest_api_pipeline.py: This is the main script where you'll define your data pipeline. It contains two basic pipeline examples for Pokemon and GitHub APIs. You can modify or rename this file as needed.
- requirements.txt: This file lists all the Python dependencies required for your project.
- .dlt/: This directory contains the configuration files for your project:
  - secrets.toml: This file stores your API keys, tokens, and other sensitive information.
  - config.toml: This file contains the configuration settings for your dlt project.
Installing dependencies
Before we proceed, let's install the required dependencies for this tutorial. Run the following command to install the dependencies listed in the requirements.txt file:
pip install -r requirements.txt
Running the pipeline
Let's verify that the pipeline is working as expected. Run the following command to execute the pipeline:
python rest_api_pipeline.py
You should see the output of the pipeline execution in the terminal. The output will also display the location of the DuckDB database file where the data is stored:
Pipeline rest_api_pokemon load step completed in 1.08 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_api_data
The duckdb destination used duckdb:////home/user-name/quick_start/rest_api_pokemon.duckdb location to store data
Load package 1692364844.9254808 is LOADED and contains no failed jobs
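To confirm what was loaded, you can open the database file named in the output and list the tables in the dataset. The sketch below assumes the duckdb Python package is installed and that the location in the output follows the "duckdb:///" plus file path convention; the helper names db_path_from_url and list_tables are hypothetical.

```python
from typing import List

DUCKDB_URL_PREFIX = "duckdb:///"


def db_path_from_url(url: str) -> str:
    """Strip the 'duckdb:///' prefix from a location URL and return the file path."""
    if not url.startswith(DUCKDB_URL_PREFIX):
        raise ValueError(f"not a duckdb URL: {url!r}")
    return url[len(DUCKDB_URL_PREFIX):]


def list_tables(db_path: str, dataset: str) -> List[str]:
    """Return the table names that exist in one dataset (schema) of the database."""
    # Imported lazily so the path helper works without duckdb installed.
    import duckdb

    conn = duckdb.connect(db_path, read_only=True)
    try:
        rows = conn.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = ? ORDER BY table_name",
            [dataset],
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For the output above, list_tables(db_path_from_url("duckdb:////home/user-name/quick_start/rest_api_pokemon.duckdb"), "rest_api_data") would query the rest_api_data dataset. Substitute the location printed on your own machine.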