dlt.destinations.impl.ducklake.ducklake
DuckDB object hierarchy
Here are short definitions of DuckDB objects and the relationships between them. This should help disambiguate names used in DuckDB, DuckLake, and dlt.
TL;DR:
- scalar < column < table < schema (dataset) < database = catalog
- Typically, in duckdb, you have one catalog = one database = one file
- When using `ATTACH`, you're adding a `Catalog` to your `Database` (sketched below)
- Though if you do `SHOW ALL TABLES`, the result column "database" should be "catalog" to be precise
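For example, a minimal sketch with the duckdb Python package (file names are illustrative):

```python
import duckdb

# one file = one Database = one Catalog
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE events (id INTEGER)")

# ATTACH adds another Catalog to the running duckdb instance
con.execute("ATTACH 'other.duckdb' AS other")
con.execute("CREATE TABLE other.main.users (id INTEGER)")

# the "database" column lists the attached catalogs ("analytics", "other")
con.sql("SHOW ALL TABLES").show()
```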
Hierarchy:
- A `Table` can have many `Column`
- A `Schema` can have many `Table`
- A `Database` can have many `Schema` (corresponds to dataset in dlt)
- A `Database` is a single physical file (e.g., `db.duckdb`)
- A `Database` has a single `Catalog`
- A `Catalog` is the internal metadata structure of everything found in the database
- Using `ATTACH` adds a `Catalog` to the `Database`
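A minimal sketch of that hierarchy with the duckdb Python package (file, schema, and table names are illustrative):

```python
import duckdb

con = duckdb.connect("shop.duckdb")   # one Database (file) with a single Catalog
con.execute("CREATE SCHEMA staging")  # a Database can have many Schemas
con.execute("CREATE TABLE staging.orders (id INTEGER, amount DOUBLE)")

# the Catalog tracks database > schema > table > column metadata
print(con.sql("SELECT database_name, schema_name, table_name FROM duckdb_tables()").fetchall())
```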
In dlt:
- dlt creates a duckdb `Database` per pipeline when using `dlt.pipeline(..., destination="duckdb")`
- dlt stores the data inside a `Schema` that matches the name of the `dlt.Dataset`
- when setting the pipeline destination to a specific duckdb `Database`, you can store multiple `dlt.Dataset` inside the same instance (each with its own duckdb `Schema`); see the sketch below
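A minimal sketch of that last point, assuming a local file path and illustrative pipeline and dataset names:

```python
import dlt

# both pipelines target the same duckdb Database (one file) ...
dest = dlt.destinations.duckdb("analytics.duckdb")

# ... but each dlt.Dataset lands in its own duckdb Schema ("raw" and "staging")
dlt.pipeline("load_raw", destination=dest, dataset_name="raw").run(
    [{"id": 1}], table_name="events"
)
dlt.pipeline("load_staging", destination=dest, dataset_name="staging").run(
    [{"id": 1, "is_clean": True}], table_name="events"
)
```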
DuckLake object hierarchy
TL;DR:
- scalar < column < table < schema < snapshot < database = catalog
Hierarchy:
- A `Catalog` is an SQL database to store metadata
  - In duckdb terms, it's a duckdb `Database` that implements the duckdb `Catalog` for the DuckLake
- A `Catalog` has many Schemas (namespaces if you compare it to Iceberg) that correspond to `dlt.Dataset`
- A `Storage` is a file system or object store that can store parquet files
- A `Snapshot` references the `Catalog` at a particular point in time
  - This places `Snapshot` at the top of the hierarchy because it scopes other constructs
Using the `ducklake` extension, the following command in duckdb adds the ducklake `Catalog` to your duckdb database:

`ATTACH 'ducklake:{catalog_database}' (DATA_PATH '{data_storage}');`
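A minimal sketch of that command through the duckdb Python package; the file and alias names are illustrative, and the snapshot listing assumes the `ducklake_snapshots()` function from the DuckLake docs:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# catalog_database: a local duckdb file holding the metadata
# data_storage: a directory (or object store) holding the parquet files
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

con.execute("CREATE TABLE lake.main.events AS SELECT 1 AS id")

# every committed change creates a Snapshot scoping the Catalog at a point in time
con.sql("SELECT * FROM ducklake_snapshots('lake')").show()
```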
DuckLakeCopyJob Objects
class DuckLakeCopyJob(DuckDbCopyJob)
metrics
def metrics() -> Optional[LoadJobMetrics]
Generate remote URL metrics that point to the table in storage
DuckLakeClient Objects
class DuckLakeClient(DuckDbClient)
Destination client to interact with a DuckLake
A DuckLake has 3 components:
- ducklake client: this is a `duckdb` instance with the `ducklake` extension
- catalog: this is an SQL database storing metadata. It can be a duckdb instance (typically the ducklake client) or a remote database (sqlite, postgres, mysql)
- storage: this is a filesystem where data is stored in files
The dlt DuckLake destination gives access to the "ducklake client". You never have to manage the catalog and storage directly; this is done through the ducklake client.
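For example, a minimal sketch assuming the ducklake destination's catalog and storage are resolved from dlt configuration (or local defaults); pipeline, dataset, and table names are illustrative:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="lake_demo",
    destination="ducklake",  # the ducklake client manages catalog and storage for you
    dataset_name="raw",
)
pipeline.run([{"id": 1, "name": "alice"}], table_name="users")

# query through the ducklake client (a duckdb connection with the ducklake extension)
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT * FROM users"))
```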