
dlt.destinations.impl.ducklake.ducklake

DuckDB object hierarchy

Here are short definitions of DuckDB objects and the relationships between them. This should help disambiguate names used in DuckDB, DuckLake, and dlt.

TL;DR:

  • scalar < column < table < schema (dataset) < database = catalog
  • Typically, in duckdb, you have one catalog = one database = one file
  • When using ATTACH, you're adding a Catalog to your Database
    • Though if you run SHOW ALL TABLES, the result column "database" should, to be precise, be called "catalog"

Hierarchy:

  • A Table can have many Columns
  • A Schema can have many Tables
  • A Database can have many Schemas (a Schema corresponds to a dataset in dlt)
  • A Database is a single physical file (e.g., db.duckdb)
  • A Database has a single Catalog
  • A Catalog is the internal metadata structure of everything found in the database
  • Using ATTACH adds a Catalog to the Database (see the sketch below)
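
A minimal sketch of this hierarchy using the duckdb Python client (file, schema, and table names are illustrative):

import duckdb

# one duckdb file = one Database = one Catalog
con = duckdb.connect("db.duckdb")

# a Schema groups Tables inside the Database (a dlt dataset maps to a Schema)
con.execute("CREATE SCHEMA IF NOT EXISTS my_dataset")
con.execute("CREATE TABLE IF NOT EXISTS my_dataset.users (id INTEGER, name VARCHAR)")

# ATTACH adds another Catalog to the running duckdb instance
con.execute("ATTACH 'other.duckdb' AS other")

# the "database" column in the result actually names the Catalog each table belongs to
print(con.execute("SHOW ALL TABLES").fetchall())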

In dlt:

  • dlt creates a duckdb Database per pipeline when using dlt.pipeline(..., destination="duckdb")
  • dlt stores the data inside a Schema that matches the name of the dlt.Dataset
  • when setting the pipeline destination to a specific duckdb Database, you can store multiple dlt.Dataset instances inside the same database (each with its own duckdb Schema), as sketched below.
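
A minimal sketch of sharing one duckdb Database between two pipelines (pipeline, dataset, and file names are illustrative; assumes dlt and duckdb are installed):

import dlt

# both pipelines write to the same duckdb file,
# so each dataset becomes its own Schema inside one Database
destination = dlt.destinations.duckdb("shared.duckdb")

pipeline_a = dlt.pipeline("pipeline_a", destination=destination, dataset_name="dataset_a")
pipeline_b = dlt.pipeline("pipeline_b", destination=destination, dataset_name="dataset_b")

pipeline_a.run([{"id": 1}], table_name="items")  # lands in Schema "dataset_a"
pipeline_b.run([{"id": 2}], table_name="items")  # lands in Schema "dataset_b"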

DuckLake object hierarchy

TL;DR:

  • scalar < column < table < schema < snapshot < database = catalog

Hierarchy:

  • A Catalog is an SQL database that stores metadata
    • In duckdb terms, it's a duckdb Database that implements the duckdb Catalog for the DuckLake
  • A Catalog has many Schemas (namespaces, if you compare it to Iceberg) that correspond to dlt.Dataset
  • A Storage is a file system or object store that can store parquet files
  • A Snapshot references the Catalog at a particular point in time
    • This places Snapshot at the top of the hierarchy because it scopes other constructs

Using the ducklake extension, the following command in duckdb

ATTACH 'ducklake:{catalog_database}' (DATA_PATH '{data_storage}');

adds the ducklake Catalog to your duckdb database.
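
A hedged sketch of the same steps from the duckdb Python client (the catalog file and data directory below are illustrative stand-ins for {catalog_database} and {data_storage}):

import duckdb

con = duckdb.connect()

# install and load the ducklake extension (downloaded on first use)
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# the catalog database stores metadata; DATA_PATH is where parquet files are written
con.execute("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_files/')")
con.execute("USE my_lake")
con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER)")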

DuckLakeCopyJob Objects

class DuckLakeCopyJob(DuckDbCopyJob)


metrics

def metrics() -> Optional[LoadJobMetrics]


Generate remote URL metrics that point to the table in storage.

DuckLakeClient Objects

class DuckLakeClient(DuckDbClient)


Destination client to interact with a DuckLake

A DuckLake has 3 components:

  • ducklake client: this is a duckdb instance with the ducklake extension
  • catalog: this is an SQL database storing metadata. It can be a duckdb instance (typically the ducklake client) or a remote database (sqlite, postgres, mysql)
  • storage: this is a filesystem where data is stored in files

The dlt DuckLake destination gives access to the "ducklake client". You never have to manage the catalog and storage directly; this is done through the ducklake client.
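
A minimal usage sketch, assuming the destination is registered under the name "ducklake" and that catalog and storage locations are supplied via dlt config/secrets (not shown; pipeline, dataset, and table names are illustrative):

import dlt

pipeline = dlt.pipeline(
    pipeline_name="lake_demo",
    destination="ducklake",
    dataset_name="raw_events",
)

# data is written to parquet files in storage; the catalog is updated through the ducklake client
load_info = pipeline.run([{"id": 1, "event": "signup"}], table_name="events")
print(load_info)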
