Access loaded data in Python
This guide explains how to access and manipulate data that has been loaded into your destination using the dlt Python library. After running your pipeline and loading data, you can use pipeline.dataset() together with dataframe expressions, Ibis, or SQL to query the data and read it as records, Pandas DataFrames, or Arrow tables.
Quick start example
Here's a full example of how to retrieve data from a pipeline and load it into a Pandas DataFrame or a PyArrow Table.
# Assuming you have a Pipeline object named 'pipeline'. You can create one with the dlt CLI: dlt init fruitshop duckdb
# and you have loaded the data of the fruitshop example source into the destination
# the tables available in the destination are:
# - customers
# - inventory
# - purchases
# Step 1: Get the readable dataset from the pipeline
dataset = pipeline.dataset()
# Step 2: Access a table as a ReadableRelation
customers_relation = dataset.customers # Or dataset["customers"]
# Step 3: Fetch the entire table as a Pandas DataFrame
df = customers_relation.df() # or customers_relation.df(chunk_size=50)
# Alternatively, fetch as a PyArrow Table
arrow_table = customers_relation.arrow()
Getting started
Assuming you have a Pipeline object (let's call it pipeline), you can obtain a Dataset, which contains the credentials and schema of your destination dataset. You can construct a query and execute it on the dataset to obtain a Relation, which you can then use to retrieve data from the Dataset.
Note: The Dataset and Relation objects are lazy-loading. They will only query and retrieve data when you perform an action that requires it, such as fetching data into a DataFrame or iterating over the data. This means that simply creating these objects does not load data into memory, making your code more efficient.
Access the dataset
# Get the readable dataset from the pipeline
dataset = pipeline.dataset()
# print the row counts of all tables in the destination as dataframe
print(dataset.row_counts().df())
Access tables as relations
The simplest way of getting a Relation from a Dataset is to get a full table relation:
# Using attribute access
customers_relation = dataset.customers
# Using item access
customers_relation = dataset["customers"]
Creating relations with SQL query strings
# Join 'customers' and 'purchases' tables and filter by quantity
query = """
SELECT *
FROM customers
JOIN purchases
ON customers.id = purchases.customer_id
WHERE purchases.quantity > 1
"""
joined_relation = dataset(query)
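A relation created from a query behaves like a table relation, so you can read its result the same way:
# Execute the query and read the joined result as a DataFrame
joined_df = joined_relation.df()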
Reading data
Once you have a Relation, you can read data in various formats and sizes.
Fetch the entire table
Loading full tables into memory without limiting or iterating over them can consume a large amount of memory and may cause your program to crash if the table is too large. It's recommended to use chunked iteration or apply limits when dealing with large datasets.
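For example, you can cap how much data is fetched with limit(), which is covered under Modifying queries below:
# Fetch at most 1000 rows instead of the entire table
df = customers_relation.limit(1000).df()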
As a Pandas DataFrame
df = customers_relation.df()
As a PyArrow Table
arrow_table = customers_relation.arrow()
As a list of Python tuples
items_list = customers_relation.fetchall()
Lazy loading behavior
The Dataset and Relation objects are lazy-loading. This means that they do not immediately fetch data when you create them. Data is only retrieved when you perform an action that requires it, such as calling .df(), .arrow(), or iterating over the data. This approach optimizes performance and reduces unnecessary data loading.
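A minimal sketch of what this laziness means in practice, using only calls shown in this guide:
# Nothing is fetched here; this only creates a lazy relation
customers_relation = dataset.customers
# The query runs and data is retrieved only now
df = customers_relation.df()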
Iterating over data in chunks
To handle large datasets efficiently, you can process data in smaller chunks.
Iterate as Pandas DataFrames
for df_chunk in customers_relation.iter_df(chunk_size=5):
    # Process each DataFrame chunk
    pass
Iterate as PyArrow Tables
for arrow_chunk in customers_relation.iter_arrow(chunk_size=5):
    # Process each PyArrow chunk
    pass
Iterate as lists of tuples
for items_chunk in customers_relation.iter_fetch(chunk_size=5):
    # Process each chunk of tuples
    pass
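For example, you can compute an aggregate chunk by chunk without ever holding the full table in memory. This sketch uses the purchases table and its quantity column from the quick start example:
# Sum the 'quantity' column of 'purchases' one chunk at a time
total_quantity = 0
for df_chunk in dataset.purchases.iter_df(chunk_size=1000):
    total_quantity += df_chunk["quantity"].sum()
print(total_quantity)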
The methods available on the Relation correspond to the methods available on the cursor returned by the SQL client. Please refer to the SQL client guide for more information.
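Since these map to standard DB-API cursor methods, fetching a single row or a fixed batch should also be possible; a sketch, assuming fetchone() and fetchmany() are exposed alongside fetchall():
# Assumption: fetchone() and fetchmany() mirror the underlying cursor API
first_row = customers_relation.fetchone()
next_rows = customers_relation.fetchmany(10)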
Connection handling
For every call that actually fetches data from the destination, such as df(), arrow(), or fetchall(), the dataset will open a connection and close it after the data has been retrieved or the iterator is exhausted. You can keep the connection open across multiple requests with the dataset context manager:
# the dataset context manager will keep the connection open
# and close it after the with block is exited
with dataset as dataset_:
    print(dataset_.customers.limit(50).arrow())
    print(dataset_.purchases.arrow())
Special queries
You can use the row_counts method to get the row counts of all tables in the destination as a DataFrame.
# print the row counts of all tables in the destination as dataframe
print(dataset.row_counts().df())
# or as tuples
print(dataset.row_counts().fetchall())
Modifying queries
You can refine your data retrieval by limiting the number of records, selecting specific columns, sorting the results, filtering rows, aggregating minimum and maximum values on a specific column, or chaining these operations.
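A sketch of what chaining these operations could look like; the method names where(), select(), and order_by() are assumptions about the Relation API here, so verify them against the reference for your dlt version (limit() is shown below):
# Hypothetical chained query; where/select/order_by are assumed method names
top_purchases = (
    dataset.purchases
    .where("quantity", ">", 1)
    .select("customer_id", "quantity")
    .order_by("quantity")
    .limit(10)
)
print(top_purchases.df())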
Limit the number of records
# Get the first 50 items as a PyArrow table
arrow_table = customers_relation.limit(50).arrow()
Using head() to get the first 5 records
df = customers_relation.head().df()