dlt.destinations.impl.ducklake.ducklake
DuckDB object hierarchy
Here are short definitions of DuckDB objects and the relationships between them. This should help disambiguate names used in DuckDB, DuckLake, and dlt.
TL;DR:
- scalar < column < table < schema (dataset) < database = catalog
- Typically, in duckdb, you have one catalog = one database = one file
- When using `ATTACH`, you're adding a `Catalog` to your `Database` (sketched below)
- Though if you do `SHOW ALL TABLES`, the result column "database" should be "catalog" to be precise
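For example, a minimal sketch with the duckdb Python package (file names are illustrative):

```python
import duckdb

# one file = one Database = one Catalog
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE events (id INTEGER)")

# ATTACH adds another Catalog to the running duckdb instance
con.execute("ATTACH 'other.duckdb' AS other")
con.execute("CREATE TABLE other.main.users (id INTEGER)")

# the "database" column lists the attached catalogs ("analytics", "other")
con.sql("SHOW ALL TABLES").show()
```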
Hierarchy:
- A `Table` can have many `Column`
- A `Schema` can have many `Table`
- A `Database` can have many `Schema` (corresponds to dataset in dlt)
- A `Database` is a single physical file (e.g., `db.duckdb`)
- A `Database` has a single `Catalog`
- A `Catalog` is the internal metadata structure of everything found in the database
- Using `ATTACH` adds a `Catalog` to the `Database`
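A minimal sketch of that hierarchy with the duckdb Python package (file, schema, and table names are illustrative):

```python
import duckdb

con = duckdb.connect("shop.duckdb")   # one Database (file) with a single Catalog
con.execute("CREATE SCHEMA staging")  # a Database can have many Schemas
con.execute("CREATE TABLE staging.orders (id INTEGER, amount DOUBLE)")

# the Catalog tracks database > schema > table > column metadata
print(con.sql("SELECT database_name, schema_name, table_name FROM duckdb_tables()").fetchall())
```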
In dlt:
- dlt creates a duckdb `Database` per pipeline when using `dlt.pipeline(..., destination="duckdb")`
- dlt stores the data inside a `Schema` that matches the name of the `dlt.Dataset`
- when setting the pipeline destination to a specific duckdb `Database`, you can store multiple `dlt.Dataset` inside the same instance (each with its own duckdb `Schema`); see the sketch below
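A minimal sketch of that last point, assuming a local file path and illustrative pipeline and dataset names:

```python
import dlt

# both pipelines target the same duckdb Database (one file) ...
dest = dlt.destinations.duckdb("analytics.duckdb")

# ... but each dlt.Dataset lands in its own duckdb Schema ("raw" and "staging")
dlt.pipeline("load_raw", destination=dest, dataset_name="raw").run(
    [{"id": 1}], table_name="events"
)
dlt.pipeline("load_staging", destination=dest, dataset_name="staging").run(
    [{"id": 1, "is_clean": True}], table_name="events"
)
```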
DuckLake object hierarchy
TL;DR:
- scalar < column < table < schema < snapshot < database = catalog
Hierarchy:
- A `Catalog` is an SQL database to store metadata
  - In duckdb terms, it's a duckdb `Database` that implements the duckdb `Catalog` for the DuckLake
- A `Catalog` has many Schemas (namespaces if you compare it to Iceberg) that correspond to `dlt.Dataset`
- A `Storage` is a file system or object store that can store parquet files
- A `Snapshot` references the `Catalog` at a particular point in time
  - This places `Snapshot` at the top of the hierarchy because it scopes other constructs
Using the `ducklake` extension, the following command in duckdb adds the ducklake `Catalog` to your duckdb database:

`ATTACH 'ducklake:{catalog_database}' (DATA_PATH '{data_storage}');`
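A minimal sketch of that command through the duckdb Python package; the file and alias names are illustrative, and the snapshot listing assumes the `ducklake_snapshots()` function from the DuckLake docs:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# catalog_database: a local duckdb file holding the metadata
# data_storage: a directory (or object store) holding the parquet files
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

con.execute("CREATE TABLE lake.main.events AS SELECT 1 AS id")

# every committed change creates a Snapshot scoping the Catalog at a point in time
con.sql("SELECT * FROM ducklake_snapshots('lake')").show()
```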
DuckLakeCopyJob Objects
class DuckLakeCopyJob(DuckDbCopyJob)
metrics
def metrics() -> Optional[LoadJobMetrics]
Generate remote URL metrics that point to the table in storage
DuckLakeClient Objects
class DuckLakeClient(DuckDbClient)
Destination client to interact with a DuckLake
A DuckLake has 3 components:
- ducklake client: this is a `duckdb` instance with the `ducklake` extension
- catalog: this is an SQL database storing metadata. It can be a duckdb instance (typically the ducklake client) or a remote database (sqlite, postgres, mysql)
- storage: this is a filesystem where data is stored in files
The dlt DuckLake destination gives access to the "ducklake client". You never have to manage the catalog and storage directly; this is done through the ducklake client.
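For example, a minimal sketch assuming the ducklake destination's catalog and storage are resolved from dlt configuration (or local defaults); pipeline, dataset, and table names are illustrative:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="lake_demo",
    destination="ducklake",  # the ducklake client manages catalog and storage for you
    dataset_name="raw",
)
pipeline.run([{"id": 1, "name": "alice"}], table_name="users")

# query through the ducklake client (a duckdb connection with the ducklake extension)
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT * FROM users"))
```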