Query your API’s data with SQL, no data warehouse needed
- Aman Gupta, Jr. Data Engineer
Ever wanted to run a SELECT * on your API data without setting up a database or data warehouse?
Building ETL pipelines from REST APIs involves pagination, schema drift, rate limits, and incremental loading. dlt’s REST API source handles all of these concerns automatically, making data ingestion straightforward.
But what if your goal is even simpler? Do you just want to query or audit data quickly using SQL directly on your filesystem without infrastructure overhead?
Enter dlt datasets
dlt datasets provide direct SQL access to your staged API data without requiring any separate database setup. They follow the Write-Audit-Publish (WAP) pattern, allowing you to:
- Query data directly from a filesystem or cloud storage.
- Interact programmatically using classes like ReadableDataset (datasets) and ReadableRelation (tables).
- Move data to other destinations after queries, seamlessly integrating with your data workflows.
Think of dlt datasets as a lazy database: data is only loaded when explicitly queried, making your workflow efficient and straightforward.
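As a quick sketch of what that looks like in code (assuming a pipeline object that has already loaded an issues table, like the one built in the demo below):
# dataset(): the ReadableDataset wrapping everything the pipeline has staged
dataset = pipeline.dataset()
# Attribute access returns a ReadableRelation for a single table;
# nothing is read from storage until you materialize it
issues = dataset.issues
print(issues.limit(5).df())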
Quick demo: Querying GitHub API data using SQL
Reason for notebook: I want to query APIs with SQL. I don’t want to make requests, handle the responses, and load the results somewhere else myself. Perhaps I don’t even know much Python beyond a few tutorials.
Path taken: dlt’s config-based REST API source plus its dataset access, which spins up an in-memory database on top of the filesystem whenever the data is queried.
Result: I can access and review the data using datasets.
Steps:
1. Install dependencies:
!pip install -q "dlt[filesystem]" "dlt[pyiceberg]"
2. Define and run your pipeline:
import dlt
from dlt.common.pendulum import pendulum
from dlt.sources.rest_api import rest_api_resources

# Create a source that reads issues from the dlt GitHub repo
@dlt.source(name="github")
def github_source():
    yield from rest_api_resources({
        "client": {"base_url": "https://api.github.com/repos/dlt-hub/dlt/"},
        "resources": [{
            "name": "issues",
            "endpoint": {
                "path": "issues",
                "params": {
                    "per_page": 100,
                    "state": "all",
                    # Incremental loading: only fetch issues updated in the last 6 months
                    "since": {
                        "type": "incremental",
                        "cursor_path": "updated_at",
                        "initial_value": pendulum.today().subtract(months=6).to_iso8601_string(),
                    },
                },
            },
            "table_format": "iceberg",
        }],
    })

# Create a pipeline that stages the data on the filesystem
pipeline = dlt.pipeline(
    pipeline_name="rest_api_github",
    destination="filesystem",
    dataset_name="rest_api_data",
)

print(pipeline.run(github_source()))
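A note in passing: the filesystem destination needs to know where to write. It can be configured via config.toml or environment variables; for quick local experiments you can also pass a bucket_url directly when creating the pipeline (the _data folder below is just an example path):
# Alternative: point the filesystem destination at an explicit local folder
pipeline = dlt.pipeline(
    pipeline_name="rest_api_github",
    destination=dlt.destinations.filesystem(bucket_url="_data"),  # example path, adjust as needed
    dataset_name="rest_api_data",
)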
3. Query your data using SQL:
# Retrieve the dataset object from the pipeline
dataset = pipeline.dataset()
# Execute SQL query to aggregate issues by month and state
sql_code = dataset(
"""
SELECT
strftime(created_at, '%Y-%m') AS issue_month,
state,
COUNT(*) AS issue_count
FROM issues
WHERE created_at >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY 1, 2
ORDER BY 1 DESC, 2
"""
)
# Display the first few rows of the query result DataFrame
print(sql_code.df().head())
You'll get a summary like this:

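The relation returned by the query isn’t limited to pandas, by the way. A small sketch using the same sql_code object as above, in case you prefer Arrow tables or plain Python tuples:
# Same query result, different output formats
arrow_table = sql_code.arrow()   # pyarrow.Table
rows = sql_code.fetchall()       # list of tuples, DB-API style
print(arrow_table.num_rows, len(rows))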
4. Visualize the results (optional):
import matplotlib.pyplot as plt
import seaborn as sns

df = sql_code.df()  # Get the DataFrame

plt.figure(figsize=(10, 5))
ax = sns.barplot(x="issue_month", y="issue_count", hue="state", data=df)

# Add labels on top of each bar
for p in ax.patches:
    height = int(p.get_height())
    if height > 0:  # Only annotate non-zero bars
        ax.annotate(f'{height}',
                    (p.get_x() + p.get_width() / 2, p.get_height()),
                    ha='center', va='bottom', fontsize=10, fontweight='bold')

# plt.xticks(rotation=45)
plt.title("GitHub Issues (Open vs Closed) - Last 6 Months")
plt.show()

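And if the audit looks good and you want to promote the data somewhere more permanent (the “publish” step of the WAP pattern), you can load the query result onward with another pipeline. A minimal sketch, assuming DuckDB as the target (any dlt destination would do; the pipeline and table names here are just examples):
# Publish the audited result to another destination
# (requires the duckdb extra: pip install "dlt[duckdb]")
publish_pipeline = dlt.pipeline(
    pipeline_name="github_publish",
    destination="duckdb",
    dataset_name="issue_stats",
)
# Passing a pandas DataFrame to run() creates and loads the table for us
publish_pipeline.run(sql_code.df(), table_name="issues_by_month")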
Not every situation calls for a cloud warehouse. When simplicity is key, just query your API data directly, no extra infrastructure needed.
Want more?
Try it now! Query APIs in SQL without setting up infrastructure. → Google Colab demo.
Or read more about the REST API source and dataset access in our docs.