DuckLake
DuckLake is a lakehouse-style destination that builds on the DuckDB engine with the ducklake extension. It stores your dlt
tables as files on a filesystem or object store while keeping table metadata in a separate SQL catalog.
In order to use ducklake you must provide the following infrastructure:
- catalog: a SQL database that stores table/partition metadata (sqlite, duckdb, postgres, mysql are supported)
- storage: a filesystem or object store holding table files (local files, s3, gcs, abfss, etc.)
If you are looking for managed ducklake infrastructure, check the Motherduck Ducklake support. dlt can also set up a local ducklake with a sqlite catalog fully automatically.
Quick start
- Install dlt with DuckDB dependencies:
pip install "dlt[ducklake]"
- Initialize a new test pipeline:
dlt init foo ducklake
dlt init will create a sample secrets.toml for a postgres catalog and s3 bucket storage. For the local automatic setup, comment out the catalog and storage entries:
[destination.ducklake.credentials]
catalog_name="lake_catalog" # we recommend explicit catalog name
- Run a test pipeline that writes to a local DuckLake:
import dlt
pipeline = dlt.pipeline(
    pipeline_name="foo",
    destination="ducklake",
    dataset_name="lake_schema",
    dev_mode=True,
)
info = pipeline.run(
    [{"foo": 1}, {"foo": 2}],
    table_name="table_foo",
)
print(info)
print(pipeline.dataset().table_foo["foo"].df())
The console output will point you to where the sqlite catalog database and the data store were created:
- lake_catalog.sqlite catalog in the current working directory
- lake_catalog.files folder with a lake_schema subfolder for the dataset
Configure Ducklake
Pick your catalog_name as described above. This name is used:
- as the attach name for the ducklake: each ducklake connection starts with a :memory: connection to which we ATTACH the ducklake
- to set the default folder name of the local filesystem storage and the database file name for sqlite and duckdb (if no explicit configuration is provided)
- as the postgres schema name where catalog tables will be created (if postgres is configured)
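For example, the attach name appears in fully qualified table names when you query the lake directly. A minimal sketch, assuming the quick start pipeline above and the lake_catalog name configured earlier:
import dlt

pipeline = dlt.pipeline(pipeline_name="foo", destination="ducklake", dataset_name="lake_schema")

# the configured catalog_name ("lake_catalog") is used as the ATTACH alias,
# so tables are addressable as <catalog_name>.<dataset_name>.<table_name>
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT * FROM lake_catalog.lake_schema.table_foo LIMIT 3"))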
Configure catalog
You have the following options when configuring the catalog:
- sqlite: very fast local catalog. You can set it up as follows:
[destination.ducklake.credentials]
catalog="sqlite:///catalog_x.db"
The snippet above stores the catalog in catalog_x.db in the current working directory. Refer to the sqlite configuration in the sqlalchemy destination, which reuses the same configuration structure.
Note that we were not able to set up sqlite to write in parallel, even with WAL journaling. Parallel writes produce conflicts on practically every catalog transaction, so we had to put the loader in sequential mode.
- duckdb: pretty fast and, like sqlite, works only in sequential mode. Parallel loads generate page faults and corrupt the catalog database.
[destination.ducklake.credentials]
catalog="duckdb:///catalog_y.duckdb"
Refer to duckdb configuration for more options.
- postgres: currently the only catalog that can be considered production-grade with full parallelism support.
[destination.ducklake.credentials]
catalog="postgres://loader:pass@localhost:5432/dlt_data"
ducklake will use a postgres schema named after the catalog_name config option and create the required tables automatically.
- 🧪 mysql: uses the same code path as postgres, but we have never tested it.
- 🧪 motherduck: theoretically you could use Motherduck as the catalog database. We were able to establish a connection, but unfortunately ducklake 1.2 segfaults when the catalog is being attached.
[destination.ducklake.credentials]
catalog="md:///dlt_data"
Make sure that you have a Motherduck token in your environment. Hopefully the situation improves when duckdb 1.4 is supported.
Configure storage
The storage config reuses the configuration of the filesystem destination. You can pick one of the following options:
- Local files: file:///path or a plain relative path
- S3: s3://bucket/prefix
- GCS: gs://bucket/prefix or gcs://bucket/prefix (uses fsspec fallback)
- Azure ADLS Gen2: abfss://container@account.dfs.core.windows.net/prefix (uses fsspec fallback)
Example s3 configuration:
[destination.ducklake.credentials]
catalog_name="lake_catalog"
catalog="postgres://loader:pass@localhost:5432/dlt_data"
[destination.ducklake.credentials.storage]
bucket_url="s3://dlt-ci-bucket"
[destination.ducklake.credentials.storage.credentials]
aws_access_key_id = "<configure me>" # fill this in!
aws_secret_access_key = "<configure me>" # fill this in!
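The same pattern applies to other stores. A sketch for Azure ADLS Gen2, assuming placeholder account and container names and the Azure credential fields of the filesystem destination:
[destination.ducklake.credentials]
catalog_name="lake_catalog"
catalog="postgres://loader:pass@localhost:5432/dlt_data"
[destination.ducklake.credentials.storage]
bucket_url="abfss://container@account.dfs.core.windows.net/lake"
[destination.ducklake.credentials.storage.credentials]
azure_storage_account_name = "<configure me>" # fill this in!
azure_storage_account_key = "<configure me>" # fill this in!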
Configure additional connection options, pragmas and extensions
You can set additional connection options, pragmas and extensions; the ducklake configuration reuses the duckdb configuration:
[destination.ducklake.credentials.global_config]
ducklake_max_retry_count=100
Configure in code
You can create a ducklake destination instance and configure it in code. In most cases you will just set additional options while still relying on the configuration:
import dlt
# force parallel loads on sqlite
ducklake = dlt.destinations.ducklake(loader_parallelism_strategy="parallel")
pipeline = dlt.pipeline("test_factory", destination=ducklake, dataset_name="foo")
Above we force parallel loading on the (default) sqlite catalog.
DuckLakeCredentials has a friendly constructor where you can pass the catalog and storage credentials as shorthand strings or objects:
import dlt
from dlt.destinations.impl.ducklake.configuration import DuckLakeCredentials

# set ducklake credentials using shorthands; the s3 bucket requires secrets in config
credentials = DuckLakeCredentials(
    "lake_catalog",
    catalog="postgresql://loader:pass@localhost:5432/dlt_data",
    storage="s3://dlt-ci-test-bucket/lake",
)
ducklake = dlt.destinations.ducklake(credentials=credentials)
import dlt
from dlt.sources.credentials import ConnectionStringCredentials
from dlt.destinations.impl.ducklake.configuration import DuckLakeCredentials

# set the catalog using connection string credentials
catalog_credentials = ConnectionStringCredentials()
# use duckdb with the default name
catalog_credentials.drivername = "duckdb"
credentials = DuckLakeCredentials(
    "lake_catalog",
    catalog=catalog_credentials,
)
As mentioned above, filesystem and ducklake storage share the same configuration object, so the configuration of a filesystem destination can be reused:
# `filesystem` below is a pipeline with configured filesystem destination
destination = dlt.destinations.ducklake(
    credentials=DuckLakeCredentials(
        "lake_catalog",
        storage=filesystem.destination_client().config,
    )
)
Maintain ducklake
Data access
You have read and write access to the data in ducklake. You can obtain a native duckdb connection, with the catalog attached and the storage authenticated, via sql_client. This is demonstrated in the examples below.
Dataset access and ibis handover are fully supported.
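For example, a minimal sketch (assuming the table_foo table loaded in the quick start) that reads through the native connection and through the dataset interface:
import dlt
import duckdb

pipeline = dlt.pipeline(pipeline_name="foo", destination="ducklake", dataset_name="lake_schema")

# native duckdb connection with the ducklake catalog attached and the storage authenticated
with pipeline.sql_client() as client:
    con: duckdb.DuckDBPyConnection = client.native_connection
    table = client.make_qualified_table_name("table_foo")
    con.sql(f"SELECT COUNT(*) FROM {table}").show()

# the same data through the dataset interface
print(pipeline.dataset().table_foo.df())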
Set catalog options
Certain ducklake options are persisted in the catalog and are set differently from connection options. You can set them from code:
import dlt
import duckdb
pipeline = dlt.pipeline(pipeline_name="foo", destination="ducklake", dataset_name="lake_schema")
# set per thread output option before pipeline runs so options are applied
with pipeline.sql_client() as client:
    con: duckdb.DuckDBPyConnection = client.native_connection
    # set the option on the `lake_catalog` we configured above
    con.sql("CALL lake_catalog.set_option('per_thread_output', true)")
Above we set per_thread_output (duckdb 1.4.x only) before the pipeline runs.
Table maintenance
dlt has a standard interface to access open tables and catalogs, but it is not implemented for ducklake (yet). In the case of ducklake, however, you just need a configured and authorized connection, which you can get after the pipeline runs, to do the maintenance.
# pipeline.run(...)
with pipeline.sql_client() as client:
    # maintenance functions are called on the attached catalog (lake_catalog as configured above)
    print(client.execute_sql("CALL lake_catalog.merge_adjacent_files()"))
Write disposition
All write dispositions are supported. upsert is supported on duckdb 1.4.x (without hard deletes for now).
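A minimal sketch of an upsert load; the users resource, its primary_key, and the table name are made up for illustration:
import dlt

@dlt.resource(
    table_name="users",
    write_disposition={"disposition": "merge", "strategy": "upsert"},
    primary_key="id",
)
def users():
    yield [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

pipeline = dlt.pipeline(pipeline_name="upsert_demo", destination="ducklake", dataset_name="lake_schema")
pipeline.run(users())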
Data loading
By default, Parquet files and the COPY command are used to move local files to the remote storage.
The INSERT format is also supported and will execute large INSERT queries directly against the remote database. This method is significantly slower and may exceed the maximum query size, so it is not advised.
The partition hint on a column is supported and works on duckdb 1.4.x. Simple identity partitions are created. Partition evolution is not supported.
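A minimal sketch of setting the partition hint via column hints; the events resource and the event_date column are made up for illustration:
import dlt

@dlt.resource(
    table_name="events",
    columns={"event_date": {"partition": True}},
)
def events():
    yield [{"event_date": "2024-01-01", "value": 1}]

pipeline = dlt.pipeline(pipeline_name="partition_demo", destination="ducklake", dataset_name="lake_schema")
pipeline.run(events())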
Parallel loading is supported via a thread pool for the postgres catalog (and probably mysql). We could not use the recommended method because the threads were (dead)locking, so we open a separate in-memory database for each thread and attach the catalog to it.
dbt support
Not supported. We would need to hand over secrets and the ATTACH command, which is not planned at the moment.
Syncing of dlt state
This destination fully supports dlt state sync.
ToDo
- open table interface for table maintenance like we have for iceberg and delta.
- better partitioning support.
- Motherduck as catalog if possible.
- support additional ATTACH options like OVERRIDE_DATA_PATH.
- implement callbacks that will be called on creation of the :memory: database and the ATTACH command so those can be fully customized.