
Iceberg

dlt+

This page is for dlt+, which requires a license. Join our early access program for a trial license.

Apache Iceberg is an open table format designed for high-performance analytics on large datasets. It supports ACID transactions, schema evolution, and time travel.

The Iceberg destination in dlt allows you to load data into Iceberg tables using the pyiceberg library. It supports multiple catalog types and both local and cloud storage backends.

Features

  • Compatible with SQL and REST catalogs (Lakekeeper, Polaris)
  • Automatic schema evolution and table creation
  • All write dispositions supported
  • Works with local filesystems and cloud storage (S3, Azure, GCS)
  • Exposes data via DuckDB views using pipeline.dataset()

Prerequisites

Make sure you have installed the necessary dependencies:

pip install "dlt[filesystem,pyiceberg]>=1.9.1"
pip install "dlt-plus>=0.9.1"

Configuration

Overview

To configure the Iceberg destination, you need to choose and configure a catalog. The role of the Iceberg catalog is to:

  • store metadata and coordinate transactions (required)
  • generate and hand credentials to the pyiceberg client (credential vending)
  • generate and hand out locations for newly created tables (REST catalogs)

Currently, the Iceberg destination supports two catalog types:

  • SQL-based catalog. Ideal for local development; stores metadata in SQLite or PostgreSQL
  • REST catalog. Used in production with systems like Lakekeeper or Polaris

SQL catalog

The SQL catalog is ideal for development and testing. It does not provide credential or location vending, so these must be configured manually. It stores catalog metadata in a SQLAlchemy-compatible database (for example, a file-based SQLite database) and is typically used with local filesystems.

To configure a SQL catalog, provide the following parameters:

# we recommend putting sensitive parameters in secrets.toml
destinations:
  iceberg_lake:
    type: iceberg
    catalog_type: sql
    credentials: "sqlite:///catalog.db" # connection string for accessing the catalog database
    filesystem:
      bucket_url: "path_to_data" # table location
      # the credentials section below is only needed if you're using cloud storage (not local disk)
      # we recommend putting sensitive parameters in secrets.toml
      credentials:
        aws_access_key_id: "please set me up!" # only if needed
        aws_secret_access_key: "please set me up!" # only if needed
    capabilities:
      # will register tables found in storage but not in the catalog (backward compatibility)
      register_new_tables: True
      table_location_layout: "{dataset_name}/{table_name}"
  • catalog_type=sql - indicates that you will use a SQL-based catalog.
  • credentials=dialect+database_type://username:password@server:port/database_name - the connection string for your catalog database. This can be any SQLAlchemy-compatible database such as SQLite or PostgreSQL. For local development, a simple SQLite file like sqlite:///catalog.db works well. dlt will create it automatically if it doesn't exist.
  • filesystem.bucket_url - the physical location where Iceberg table data is stored. This can be a local directory or any cloud storage supported by the filesystem destination. If you’re using cloud storage, be sure to include the appropriate credentials as explained in the credentials setup guide. For local filesystems, no additional credentials are needed.
  • capabilities.register_new_tables=true - enables automatic registration of tables found in storage but missing in the catalog.
  • capabilities.table_location_layout - controls the directory structure for Iceberg table files. It supports two modes:
    • absolute - you provide a full URI that matches the catalog’s warehouse path, optionally including deeper subpaths.
    • relative - a path that’s appended to the catalog’s warehouse root. This is especially useful with catalogs like Lakekeeper.
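
With the destination above defined in your dlt+ project, a pipeline run looks like any other dlt pipeline. The following is a minimal sketch, assuming the iceberg_lake destination from dlt.yml can be referenced by name; the resource and column names are made up for the example:

import dlt

# hypothetical sample data; any resource or source works the same way
@dlt.resource(table_name="orders")
def orders():
    yield [
        {"id": 1, "customer": "alice", "amount": 12.5},
        {"id": 2, "customer": "bob", "amount": 7.0},
    ]

# "iceberg_lake" refers to the destination declared in dlt.yml above (dlt+ project context assumed)
pipeline = dlt.pipeline(
    pipeline_name="jaffle_shop",
    destination="iceberg_lake",
    dataset_name="jaffle_shop_dataset",
)

load_info = pipeline.run(orders())
print(load_info)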

The SQL catalog stores a single table with the following schema:

| catalog_name | table_namespace     | table_name | metadata_location | previous_metadata_location |
|--------------|---------------------|------------|-------------------|----------------------------|
| default      | jaffle_shop_dataset | orders     | path/to/files     | path/to/files              |
| default      | jaffle_shop_dataset | _dlt_loads | path/to/files     | path/to/files              |
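
If you want to inspect this catalog outside of dlt, you can open the same SQLite database directly with pyiceberg. This is a sketch only; the connection string and warehouse path are the example values from the configuration above:

from pyiceberg.catalog import load_catalog

# open the same SQL catalog that dlt writes to (paths are example values from above)
catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": "sqlite:///catalog.db",       # catalog database from `credentials`
        "warehouse": "file://path_to_data",  # table location from `filesystem.bucket_url`
    },
)

# list namespaces (datasets) and the tables registered in them
for namespace in catalog.list_namespaces():
    print(namespace, catalog.list_tables(namespace))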

Lakekeeper catalog

Lakekeeper is an open-source, production-grade Iceberg catalog. It’s easy to set up, plays well with any cloud storage, and lets you build real data platforms without needing to set up heavy-duty infrastructure. Lakekeeper also supports credential vending, removing the need to pass long-lived secrets directly to dlt.

To configure Lakekeeper, you need to specify both catalog and storage parameters. The catalog handles metadata and credential vending, while the bucket_url must align with the warehouse configured in Lakekeeper.

# we recommend putting sensitive configurations in secrets.toml
destinations:
  iceberg_lake:
    type: iceberg
    catalog_type: rest
    credentials:
      # we recommend putting sensitive configurations in secrets.toml
      credential: my_lakekeeper_key
      uri: https://lakekeeper.path.to.host/catalog
      warehouse: warehouse
      properties:
        scope: lakekeeper
        oauth2-server-uri: https://keycloak.path.to.host/realms/master/protocol/openid-connect/token
    filesystem:
      # bucket for s3 tables - must match the Lakekeeper warehouse if defined
      bucket_url: "s3://warehouse/"
    capabilities:
      table_root_layout: "lakekeeper-warehouse/dlt_plus_demo/lakekeeper_demo/{dataset_name}/{table_name}"
  • catalog_type=rest - specifies that you're using a REST-based catalog implementation.
  • credentials.credential - your Lakekeeper key or token used to authenticate with the catalog.
  • credentials.uri - the URL of your Lakekeeper catalog endpoint.
  • credentials.warehouse - the name of the warehouse configured in Lakekeeper, which defines the root location for all data tables.
  • credentials.properties.scope=lakekeeper - the scope required for authentication.
  • credentials.properties.oauth2-server-uri - the URL of your OAuth2 token endpoint used for Lakekeeper authentication.
  • filesystem.bucket_url - the physical storage location for Iceberg table files. This can be any supported cloud storage backend listed in the filesystem destination.
  • capabilities.table_location_layout - controls the directory structure for Iceberg table files. It supports two modes:
    • absolute - you provide a full URI that matches the catalog’s warehouse path, optionally including deeper subpaths.
    • relative - a path that’s appended to the catalog’s warehouse root. This is especially useful with catalogs like Lakekeeper.

warning

Currently, the following bucket and credential combinations are well-tested:

  • S3: STS and signer, S3 Express
  • Azure: both access key and tenant-id (principal) based auth
  • Google Cloud Storage
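
To verify connectivity or browse registered tables independently of dlt, the same Lakekeeper catalog can be opened with pyiceberg’s REST client. The sketch below reuses the example endpoint, credential, and OAuth2 settings from the configuration above; replace them with your own:

from pyiceberg.catalog import load_catalog

# connect to the Lakekeeper REST catalog with the same settings as in the dlt configuration
catalog = load_catalog(
    "lakekeeper",
    **{
        "type": "rest",
        "uri": "https://lakekeeper.path.to.host/catalog",
        "credential": "my_lakekeeper_key",
        "warehouse": "warehouse",
        "scope": "lakekeeper",
        "oauth2-server-uri": "https://keycloak.path.to.host/realms/master/protocol/openid-connect/token",
    },
)

# list namespaces (datasets) and the tables registered in them
for namespace in catalog.list_namespaces():
    print(namespace, catalog.list_tables(namespace))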

Polaris catalog

Polaris is an open-source, fully featured catalog for Iceberg. Its configuration is similar to Lakekeeper’s, with some differences in credential scopes and the URI.

# we recommend putting sensitive configurations in secrets.toml
destinations:
  iceberg_lake:
    type: iceberg
    catalog_type: rest
    credentials:
      # we recommend putting sensitive configurations in secrets.toml
      credential: my_polaris_key
      uri: https://account.snowflakecomputing.com/polaris/api/catalog
      warehouse: warehouse
      properties:
        scope: PRINCIPAL_ROLE:ALL
    filesystem:
      # bucket for s3 tables - must match the storage location configured in Polaris if defined
      bucket_url: "s3://warehouse"
    capabilities:
      table_root_layout: "{dataset_name}/{table_name}"

For more information, refer to the Lakekeeper section above.

Write dispositions

All write dispositions are supported.
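
For example, append, replace, and merge behave as with any other dlt destination. The sketch below assumes the iceberg_lake destination from the configuration sections above and a made-up customers resource with an id primary key:

import dlt

# merge (upsert) into an Iceberg table on the `id` primary key
@dlt.resource(write_disposition="merge", primary_key="id")
def customers():
    yield [
        {"id": 1, "name": "alice"},
        {"id": 2, "name": "bob"},
    ]

pipeline = dlt.pipeline("jaffle_shop", destination="iceberg_lake", dataset_name="jaffle_shop_dataset")
pipeline.run(customers())

# replace the full table contents instead
pipeline.run(customers(), write_disposition="replace")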

Data access

The Iceberg destination integrates with pipeline.dataset() to give users queryable access to their data. When invoked, this creates an in-memory DuckDB database with views pointing to Iceberg tables.

The created views reflect the latest available snapshot. To ensure fresh data during development, use the always_refresh_views option. Views are materialized only on demand, based on query usage.
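
For example, continuing with the pipeline from the sketches above, the data can be read back as follows (table and column names are the illustrative ones used earlier):

# build a dataset object backed by an in-memory DuckDB database with views over the Iceberg tables
dataset = pipeline.dataset()

# read a table into a pandas DataFrame
orders_df = dataset.orders.df()
print(orders_df.head())

# or fetch a limited number of rows through the relation API
rows = dataset.orders.limit(10).fetchall()
print(rows)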

Credentials for data access

By default, credentials for accessing data are vended by the catalog, and per-table secrets are created automatically. This works best with cloud storage providers like AWS S3 using STS credentials. However, due to potential performance limitations with temporary credentials, we recommend defining the filesystem explicitly when working with dataset() or dlt+ transformations. This approach allows for native DuckDB filesystem access, persistent secrets, and faster data access. For example, when using AWS S3 as the storage location for your Iceberg tables, you can provide explicit credentials in the filesystem section of the destination configuration:

[destination.iceberg.filesystem.credentials]
aws_access_key_id = "please set me up!"
aws_secret_access_key = "please set me up!"
