
Hugging Face Datasets

The Hugging Face destination loads data into Hugging Face Datasets repositories. It is built on top of the filesystem destination and uses the hf:// protocol to write Parquet files to the Hugging Face Hub.

note

Because this destination extends filesystem, most filesystem concepts (file layouts, write dispositions, dlt state sync, etc.) apply here. This page covers the Hugging Face-specific behavior and configuration. Refer to the filesystem destination docs for the full feature set.

Install dlt with Hugging Face support

pip install "dlt[hf]"

This installs the huggingface_hub package alongside dlt.

Initialize the dlt project

dlt init chess filesystem

This creates a sample pipeline with the filesystem destination. To use Hugging Face, update the bucket_url in .dlt/secrets.toml to use the hf:// scheme as shown below.

Set up credentials and destination

bucket_url

The bucket_url uses the hf://datasets/<namespace> scheme, where <namespace> is your Hugging Face username or organization name.

Each dlt dataset becomes a separate Hugging Face dataset repository under that namespace. For example, loading to the dataset name chess_data with namespace my-org creates the repository my-org/chess_data.

[destination.filesystem]
bucket_url = "hf://datasets/my-org" # replace "my-org" with your username or org
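The mapping from bucket_url and dataset_name to a repository id can be sketched in a few lines. The hf_repo_id helper below is hypothetical, not part of dlt; it only illustrates how the namespace and dataset name combine:

```python
from urllib.parse import urlparse

def hf_repo_id(bucket_url: str, dataset_name: str) -> str:
    """Derive the Hugging Face dataset repository id from a
    hf://datasets/<namespace> bucket_url and a dlt dataset_name."""
    parsed = urlparse(bucket_url)  # scheme="hf", netloc="datasets", path="/<namespace>"
    if parsed.scheme != "hf" or parsed.netloc != "datasets":
        raise ValueError("bucket_url must look like hf://datasets/<namespace>")
    namespace = parsed.path.strip("/")
    return f"{namespace}/{dataset_name}"

print(hf_repo_id("hf://datasets/my-org", "chess_data"))  # my-org/chess_data
```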

Authentication

Configure your Hugging Face User Access Token in .dlt/secrets.toml:

[destination.filesystem.credentials]
hf_token = "hf_..." # replace with your Hugging Face User Access Token

Instead of setting hf_token in the configuration, you can authenticate by setting the HF_TOKEN environment variable or by logging in with huggingface-cli login, which saves a token locally.

Authentication is attempted in this order of priority: hf_token config → HF_TOKEN env var → locally saved token.
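This lookup order can be sketched as a small stdlib-only helper. The function below is hypothetical (the real resolution happens inside dlt and huggingface_hub), and the token path shown assumes the default cache location used by huggingface-cli login:

```python
import os
from pathlib import Path
from typing import Optional

def resolve_hf_token(config_token: Optional[str] = None) -> Optional[str]:
    """Resolve a token following the documented priority:
    explicit config value, then HF_TOKEN env var, then the
    token saved locally by `huggingface-cli login`."""
    if config_token:
        return config_token
    if os.environ.get("HF_TOKEN"):
        return os.environ["HF_TOKEN"]
    token_path = Path.home() / ".cache" / "huggingface" / "token"
    if token_path.exists():
        return token_path.read_text().strip()
    return None
```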

Private Hub endpoint

By default, https://huggingface.co is used as the API endpoint. To use a Private Hub or a self-hosted endpoint, set the HF_ENDPOINT environment variable:

export HF_ENDPOINT="https://your-private-hub.example.com"
note

dlt also supports hf_endpoint in the configuration, but this only configures the filesystem and API clients, not the dataset card operations. Use HF_ENDPOINT to ensure all operations target the correct endpoint.

Full example configuration

[destination.filesystem]
bucket_url = "hf://datasets/my-org"

[destination.filesystem.credentials]
hf_token = "hf_..."

Write disposition

The Hugging Face destination supports two write dispositions:

  • append: new data files are added to the dataset repository
  • replace: existing data files for the table are deleted, then the new files are added
warning

merge write disposition is not supported for the Hugging Face destination. Pipelines using merge will fall back to append with a warning.
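The difference between the two dispositions can be illustrated with plain local file operations. This is an illustrative sketch only; dlt performs the equivalent via commits to the Hub:

```python
import os
import tempfile

def load_files(table_dir: str, new_files: dict, write_disposition: str = "append"):
    """Write new data files for a table. With "replace", existing files
    in the table folder are removed first; with "append", they are kept
    and the new files are added alongside."""
    os.makedirs(table_dir, exist_ok=True)
    if write_disposition == "replace":
        for name in os.listdir(table_dir):
            os.remove(os.path.join(table_dir, name))
    for name, data in new_files.items():
        with open(os.path.join(table_dir, name), "w") as f:
            f.write(data)

tmp = tempfile.mkdtemp()
load_files(tmp, {"a.parquet": "1"}, "append")
load_files(tmp, {"b.parquet": "2"}, "append")   # a.parquet and b.parquet coexist
load_files(tmp, {"c.parquet": "3"}, "replace")  # only c.parquet remains
```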

File format

The Hugging Face destination always uses Parquet as the file format, regardless of other configuration. This is required because the Hugging Face dataset viewer needs Parquet files to preview datasets on the Hub.

The Parquet files are written with a page index and content-defined chunking for efficient versioned storage on the Hub.

Table formats

The Hugging Face destination does not support Delta or Iceberg table formats.

Files layout

The Hugging Face destination uses the same layout system as the filesystem destination. The default layout is:

{table_name}/{load_id}.{file_id}.{ext}

You can customize the layout using the same layout and extra_placeholders settings as the filesystem destination. See Files layout for all available placeholders and examples.

[destination.filesystem]
bucket_url = "hf://datasets/my-org"
layout = "{table_name}/{load_id}.{file_id}.{ext}"
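The placeholder substitution can be sketched with plain string formatting. The render_layout helper is hypothetical and simplified (the real layout engine supports more placeholders and callable extra_placeholders):

```python
def render_layout(layout: str, **placeholders: str) -> str:
    """Fill layout placeholders for a single data file,
    the way the filesystem layout engine does (simplified)."""
    return layout.format(**placeholders)

path = render_layout(
    "{table_name}/{load_id}.{file_id}.{ext}",
    table_name="my_data",
    load_id="1726000000.123",
    file_id="0",
    ext="parquet",
)
print(path)  # my_data/1726000000.123.0.parquet
```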

Hugging Face-specific behavior

Dataset repositories

Each dlt dataset creates or updates a Hugging Face dataset repository (not a directory). The repository name is <namespace>/<dataset_name>, where <namespace> comes from the bucket_url and <dataset_name> is the pipeline's dataset_name.

Dataset repository visibility

The Hugging Face dataset repositories created by dlt are public, unless your Hugging Face organization's default is private.

Dataset card

dlt creates a dataset card (the repository's README.md) that contains only YAML metadata, with no prose content. You can manually edit the dataset card to, for example, add a dataset description.

note

Do not manually change the configurations specified in the YAML metadata section. This metadata defines the subsets (see next section) and is managed by dlt.

Subsets and split

dlt creates a subset for each table in the dataset, so the dataset viewer displays each table properly. All data is loaded into the train split.
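For example, for a dataset containing tables my_data and my_other_table, the managed YAML metadata might look roughly like this (an illustrative sketch following the Hub's dataset card conventions; the exact content is written by dlt):

```yaml
configs:
- config_name: my_data
  data_files:
  - split: train
    path: my_data/*.parquet
- config_name: my_other_table
  data_files:
  - split: train
    path: my_other_table/*.parquet
```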

Disabling dataset card management

To disable automatic dataset card creation and metadata updates (e.g. to reduce API calls and avoid rate limits), set hf_dataset_card to false:

[destination.filesystem]
hf_dataset_card = false

When disabled, the dataset viewer will not display table subsets, but data loading is unaffected.

Atomic commits

All data files for a table and its child tables are committed to the Hub in a single git commit via the HfApi client. This minimizes the number of commits, avoids hitting Hugging Face rate limits, and prevents commit conflicts.
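Conceptually, the batching can be sketched as follows. This is a simplified stdlib-only sketch with a hypothetical build_commit helper; the actual implementation commits through the HfApi client:

```python
from typing import Dict, List

def build_commit(table_files: Dict[str, List[str]]) -> dict:
    """Group all files for a table and its child tables into one
    commit payload instead of issuing one commit per file."""
    operations = [
        {"op": "add", "path": path}
        for table, paths in sorted(table_files.items())
        for path in paths
    ]
    return {
        "message": f"dlt load: {', '.join(sorted(table_files))}",
        "operations": operations,
    }

commit = build_commit({
    "my_data": ["my_data/1726.0.parquet"],
    "my_data__children": ["my_data__children/1726.0.parquet"],
})
print(len(commit["operations"]))  # 2
```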

No table formats

The hf protocol does not support the iceberg and delta table formats.

Parquet with page index and CDC

The hf protocol defaults to parquet file format and always writes files with page index and content-defined chunking for efficient versioned storage on the Hub. This enables efficient column statistics and skipping, and is required for the Hugging Face dataset viewer to preview datasets on the Hub.

Dual client

dlt uses two Hugging Face clients together:

  • HfApi: used for repository management (create/delete repo) and atomic commits
  • HfFileSystem (fsspec): used for file reads and DuckDB data access

The fsspec cache is invalidated after each HfApi mutation to keep the two clients consistent.

Syncing dlt state

The Hugging Face destination fully supports dlt state sync. Special files and folders (e.g., _dlt_loads, _dlt_pipeline_state) are created in the dataset repository to track pipeline state, schemas, and completed loads. These folders do not follow the layout setting.

By default, the last 100 pipeline state files are retained. You can configure this with max_state_files:

[destination.filesystem]
max_state_files = 100 # set to 0 or negative to disable cleanup
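The retention logic can be sketched as follows. The prune_state_files helper is hypothetical, and it assumes state file names sort chronologically (which holds when names embed monotonically increasing load ids of equal width):

```python
def prune_state_files(state_files: list, max_state_files: int) -> list:
    """Return the files to delete so that only the newest
    max_state_files remain; 0 or negative disables cleanup."""
    if max_state_files <= 0:
        return []
    newest_first = sorted(state_files, reverse=True)
    return newest_first[max_state_files:]

files = [f"_dlt_pipeline_state/{i}.jsonl" for i in range(1726000000, 1726000105)]
to_delete = prune_state_files(files, 100)
print(len(to_delete))  # 5 oldest files are pruned
```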

Hugging Face environment variables

You can set Hugging Face environment variables to configure the huggingface_hub library that dlt uses under the hood. For example, to increase the download timeout:

HF_HUB_DOWNLOAD_TIMEOUT="30" python run_my_dlt_pipe.py

Data access

The Hugging Face destination inherits the sql_client from the filesystem destination, which provides read-only SQL access to Parquet files using a DuckDB dialect. This also enables pipeline.dataset(), giving you Python-native access to loaded data as Pandas DataFrames, PyArrow tables, or Python tuples. See filesystem data access for details.

Example pipeline

import dlt

@dlt.resource
def my_data():
    yield [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Requires bucket_url = "hf://datasets/<namespace>" in .dlt/secrets.toml
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="filesystem",
    dataset_name="my_dataset",
)

pipeline.run(my_data(), write_disposition="append")

With the hf:// bucket URL configured, this creates or updates the my-org/my_dataset repository on Hugging Face with a Parquet file under my_data/.

Troubleshooting

Rate limit errors

Hugging Face enforces rate limits on repository commits. The dlt HF client automatically batches all file writes for a table into a single commit to minimize commit frequency. If you still hit rate limits, consider reducing the pipeline frequency or upgrading your Hugging Face plan.

Authentication errors

If you see authentication errors, verify that:

  1. Your token has write access to the target namespace.
  2. The token is correctly set in hf_token, HF_TOKEN, or via huggingface-cli login.
  3. If using a Private Hub, HF_ENDPOINT is set to the correct URL.
  4. The dataset repository for your dataset_name exists under the target namespace, or your token has permission to create it.
