dlt.destinations.impl.ducklake.ducklake
DuckDB object hierarchy
Here are short definitions of DuckDB objects and the relationships between them. This should help disambiguate names used in DuckDB, DuckLake, and dlt.
TL;DR:
- scalar < column < table < schema (dataset) < database = catalog
- Typically, in DuckDB, you have one catalog = one database = one file
- When using `ATTACH`, you're adding a `Catalog` to your `Database`
- Though if you do `SHOW ALL TABLES`, the result column "database" should, to be precise, be called "catalog"
Hierarchy:
- A `Table` can have many `Column`s
- A `Schema` can have many `Table`s
- A `Database` can have many `Schema`s (a `Schema` corresponds to a dataset in dlt)
- A `Database` is a single physical file (e.g., `db.duckdb`)
- A `Database` has a single `Catalog`
- A `Catalog` is the internal metadata structure of everything found in the database
- Using `ATTACH` adds a `Catalog` to the `Database`
In dlt:
- dlt creates a duckdb `Database` per pipeline when using `dlt.pipeline(..., destination="duckdb")`
- dlt stores the data inside a `Schema` that matches the name of the `dlt.Dataset`
- when setting the pipeline destination to a specific duckdb `Database`, you can store multiple `dlt.Dataset`s inside the same instance (each with its own duckdb `Schema`)
DuckLake object hierarchy
TL;DR:
- scalar < column < table < schema < snapshot < database = catalog
Hierarchy:
- A `Catalog` is an SQL database used to store metadata
  - In duckdb terms, it's a duckdb `Database` that implements the duckdb `Catalog` for the DuckLake
- A `Catalog` has many `Schema`s (namespaces, if you compare it to Iceberg) that correspond to `dlt.Dataset`s
- A `Storage` is a file system or object store that can store parquet files
- A `Snapshot` references the `Catalog` at a particular point in time
  - This places `Snapshot` at the top of the hierarchy because it scopes the other constructs
Using the ducklake extension, the following command in DuckDB adds the DuckLake `Catalog` to your DuckDB `Database`:

```sql
ATTACH 'ducklake:{catalog_database}' (DATA_PATH '{data_storage}');
```
DuckLakeCopyJob Objects
```python
class DuckLakeCopyJob(DuckDbCopyJob)
```
metrics
```python
def metrics() -> Optional[LoadJobMetrics]
```

Generate remote URL metrics that point to the table in storage.
DuckLakeClient Objects
```python
class DuckLakeClient(DuckDbClient)
```

Destination client to interact with a DuckLake.
A DuckLake has 3 components:
- ducklake client: this is a `duckdb` instance with the `ducklake` extension loaded
- catalog: this is an SQL database storing metadata. It can be a duckdb instance (typically the ducklake client itself) or a remote database (sqlite, postgres, mysql)
- storage: this is a filesystem where data is stored in files
The dlt DuckLake destination gives access to the "ducklake client". You never have to manage the catalog and storage directly; this is done through the ducklake client.