Iceberg table format

dlt supports writing Iceberg tables when using the filesystem destination.

How it works​

dlt uses the PyIceberg library to write Iceberg tables. One or more Parquet files are prepared during the extract and normalize steps. In the load step, these Parquet files are exposed as an Arrow data structure and fed into pyiceberg.

Iceberg single-user ephemeral catalog​

dlt uses single-table, ephemeral, in-memory, SQLite-based Iceberg catalogs. These catalogs are created "on demand" when a pipeline is run, and do not persist afterwards. If a table already exists in the filesystem, it gets registered into the catalog using its latest metadata file. This allows for a serverless setup. It is currently not possible to connect your own Iceberg catalog.

caution

While ephemeral catalogs make it easy to get started with Iceberg, they come with limitations:

  • concurrent writes are not handled and may lead to a corrupt table state
  • reads that run concurrently with writes are not guaranteed to be clean
  • the latest manifest file needs to be found via file listing, which can become slow for large tables, especially in cloud object stores
dlt+

If you're interested in a multi-user cloud experience and integration with vendor catalogs, such as Polaris or Unity Catalog, check out dlt+.

Iceberg dependencies​

You need Python version 3.9 or higher and the pyiceberg package to use this format:

pip install "dlt[pyiceberg]"

You also need sqlalchemy>=2.0.18:

pip install 'sqlalchemy>=2.0.18'

Additional permissions for Iceberg​

When using Iceberg with object stores like S3, additional permissions may be required for operations like multipart uploads and tagging. Make sure your IAM role or user has the following permissions:

[
"s3:ListBucketMultipartUploads",
"s3:GetBucketLocation",
"s3:AbortMultipartUpload",
"s3:PutObjectTagging",
"s3:GetObjectTagging"
]
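
For example, in addition to the standard read/write permissions the filesystem destination already needs, these actions could be granted with an IAM policy like the following sketch (the bucket name and Resource ARNs are placeholders; the first statement covers bucket-level actions, the second object-level actions):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DltIcebergBucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Sid": "DltIcebergObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:PutObjectTagging",
        "s3:GetObjectTagging"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}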

Set table format​

Set the table_format argument to iceberg when defining your resource:

@dlt.resource(table_format="iceberg")
def my_iceberg_resource():
    ...

or when calling run on your pipeline:

pipeline.run(my_resource, table_format="iceberg")

note

dlt always uses Parquet as loader_file_format when using the iceberg table format. Any setting of loader_file_format is disregarded.
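
Putting it together, here is a minimal end-to-end sketch (the pipeline name and sample data are placeholders; the filesystem destination's bucket_url and credentials are configured via config/secrets as usual):

import dlt

@dlt.resource(table_format="iceberg")
def my_iceberg_resource():
    # sample rows; in practice, yield data from your source
    yield [{"foo": 1}, {"foo": 2}]

pipeline = dlt.pipeline(
    pipeline_name="iceberg_demo",  # placeholder name
    destination="filesystem",
)

load_info = pipeline.run(my_iceberg_resource)
print(load_info)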

Table format partitioning​

Iceberg tables can be partitioned by specifying one or more partition column hints. This example partitions an Iceberg table by the foo column:

@dlt.resource(
    table_format="iceberg",
    columns={"foo": {"partition": True}}
)
def my_iceberg_resource():
    ...

note

Iceberg uses hidden partitioning: partition values are derived from source column values, so you do not query or maintain extra partition columns yourself.

caution

Partition evolution (changing partition columns after a table has been created) is not supported.
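
The partition hint can also be applied at runtime, before the table is first created, for example with apply_hints (a sketch, reusing the resource defined above):

# apply the partition hint without changing the resource definition
my_iceberg_resource.apply_hints(columns={"foo": {"partition": True}})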

Table access helper functions​

You can use the get_iceberg_tables helper function to access native table objects. These are pyiceberg Table objects.

from dlt.common.libs.pyiceberg import get_iceberg_tables

# get a dictionary mapping table names to pyiceberg Table objects
iceberg_tables = get_iceberg_tables(pipeline)

# execute operations on the Table objects, e.g. read one into an
# Arrow table (assuming a table named "my_iceberg_resource")
arrow_table = iceberg_tables["my_iceberg_resource"].scan().to_arrow()

Google Cloud Storage authentication​

Not all authentication methods are supported when using Iceberg on Google Cloud Storage:

note

The S3-compatible interface for Google Cloud Storage is not supported when using iceberg.
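
For reference, a typical service account configuration for the filesystem destination on Google Cloud Storage looks like this sketch (all values are placeholders):

[destination.filesystem]
bucket_url = "gs://my-bucket"

[destination.filesystem.credentials]
project_id = "my-project"
private_key = "-----BEGIN PRIVATE KEY-----\n..."
client_email = "my-sa@my-project.iam.gserviceaccount.com"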

Iceberg Azure scheme​

The az scheme is not supported when using the iceberg table format. Please use the abfss scheme instead. This is because pyiceberg, which dlt uses under the hood, currently does not support az.
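
For example, a bucket URL using the abfss scheme looks like this (the container, storage account, and path are placeholders):

[destination.filesystem]
bucket_url = "abfss://my-container@my-storage-account.dfs.core.windows.net/path"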

Table format merge support​

The merge write disposition is not supported for Iceberg and falls back to append. If you're interested in support for the merge write disposition with Iceberg, check out the dlt+ Iceberg destination.
