
The SQL client

note

This page contains technical details about the implementation of the SQL client as well as information on how to use low-level APIs. If you simply want to query your data, it's advised to read the pages in this section on accessing data via dlt datasets, Streamlit, or Ibis.

Most dlt destinations use an implementation of the SqlClientBase class to connect to the physical destination to which your data is loaded. DDL statements, data insert or update commands, as well as SQL merge and replace queries, are executed via a connection on this client. It is also used to read data for the Streamlit app and for data access via dlt datasets.

All SQL destinations make use of an SQL client; additionally, the filesystem destination has a special implementation of the SQL client, which you can read about below.

Executing a query on the SQL client

You can access the SQL client of your destination via the sql_client method on your pipeline. The code below shows how to use the SQL client to execute a query.

pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm")
with pipeline.sql_client() as client:
with client.execute_query(
"SELECT id, name, email FROM customers WHERE id = %s",
10
) as cursor:
# get all data from the cursor as a list of tuples
print(cursor.fetchall())

Retrieving the data in different formats

The cursor returned by execute_query has several methods for retrieving the data. The supported formats are Python tuples, Pandas DataFrame, and Arrow table.

The code below shows how to retrieve the data as a Pandas DataFrame and then manipulate it in memory:

import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
with pipeline.sql_client() as client:
    with client.execute_query(
        'SELECT "reactions__+1", "reactions__-1", reactions__laugh, reactions__hooray, reactions__rocket FROM issues'
    ) as cursor:
        # calling `df` on a cursor returns the data as a pandas DataFrame
        reactions = cursor.df()
        counts = reactions.sum(0).sort_values(0, ascending=False)
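
Retrieving the result as an Arrow table works the same way via the arrow() method listed below. A minimal sketch, assuming the same issues table as in the example above:

import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
with pipeline.sql_client() as client:
    with client.execute_query(
        "SELECT reactions__laugh, reactions__hooray FROM issues"
    ) as cursor:
        # calling `arrow` on a cursor returns the data as a PyArrow table
        table = cursor.arrow()
        print(table.num_rows, table.column_names)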

Supported methods on the cursor

  • fetchall(): returns all rows as a list of tuples;
  • fetchone(): returns a single row as a tuple;
  • fetchmany(size=None): returns a number of rows as a list of tuples; if no size is provided, all rows are returned;
  • df(chunk_size=None, **kwargs): returns the data as a Pandas DataFrame; if chunk_size is provided, the data is retrieved in chunks of the given size;
  • arrow(chunk_size=None, **kwargs): returns the data as an Arrow table; if chunk_size is provided, the data is retrieved in chunks of the given size;
  • iter_fetch(chunk_size: int): iterates over the data in chunks of the given size as lists of tuples;
  • iter_df(chunk_size: int): iterates over the data in chunks of the given size as Pandas DataFrames;
  • iter_arrow(chunk_size: int): iterates over the data in chunks of the given size as Arrow tables.
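
For results that do not fit into memory, the iter_* methods let you consume the data in chunks instead of materializing it all at once. A minimal sketch using iter_df, again assuming the issues table from the example above:

import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM issues") as cursor:
        # process the result 500 rows at a time as pandas DataFrames
        for chunk in cursor.iter_df(chunk_size=500):
            print(len(chunk))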
info

Which retrieval method you should use depends very much on your use case and on the destination you are using. Some vendor-provided drivers natively support Arrow tables or Pandas DataFrames; in these cases, dlt uses that native interface. If they do not, dlt converts lists of tuples into these formats.

The filesystem SQL client

The filesystem destination implements a special but extremely useful version of the SQL client. During a normal pipeline run, the filesystem destination does not use an SQL client at all; it simply copies the files produced by a load into the folder or bucket you have specified. Nevertheless, you can query this data with SQL via this client: dlt spins up an in-memory DuckDB database instance and makes your filesystem tables available as views on it. For the most part, you can use the filesystem SQL client just like any other SQL client. dlt uses sqlglot to discover which tables your query accesses and, as mentioned above, DuckDB to make them queryable.

The code below shows how to use the filesystem SQL client to query the data:

pipeline = dlt.pipeline(destination="filesystem", dataset_name="my_dataset")
with pipeline.sql_client() as client:
with client.execute_query("SELECT * FROM my_table") as cursor:
print(cursor.fetchall())

A few things to know or keep in mind when using the filesystem SQL client:

  • The SQL database you are actually querying is an in-memory database, so any mutating queries you run will not be persisted to your folder or bucket.
  • You must have loaded your data as JSONL or Parquet files for this SQL client to work. For optimal performance, use Parquet files, since DuckDB can then read only the bytes needed to execute your query from the folder or bucket (see the sketch after this list).
  • For large tables, any filtering, sorting, or full table scan via this SQL client may force the in-memory DuckDB instance to download and query a large amount of data from your bucket or folder.
  • If you are accessing data on a bucket, dlt will temporarily store your credentials in DuckDB to let it connect to the bucket.
  • Some combinations of buckets and table formats may not be fully supported at this time.
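
If you want Parquet files for the filesystem SQL client, you can request them explicitly when running the pipeline via the loader_file_format argument. A minimal sketch; my_source() is a hypothetical source, substitute your own resources:

import dlt

pipeline = dlt.pipeline(destination="filesystem", dataset_name="my_dataset")
# loader_file_format="parquet" writes Parquet instead of the default JSONL,
# letting DuckDB read only the bytes a query needs
pipeline.run(my_source(), loader_file_format="parquet")  # my_source() is hypothetical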
