
Hugging Face Datasets

The Hugging Face destination loads data into Hugging Face Datasets repositories. It is built on top of the filesystem destination and uses the hf:// protocol to write Parquet files to the Hugging Face Hub.

note

Because this destination extends filesystem, most filesystem concepts (file layouts, write dispositions, dlt state sync, etc.) apply here. This page covers the Hugging Face-specific behavior and configuration. Refer to the filesystem destination docs for the full feature set.

Install dlt with Hugging Face support

pip install "dlt[hf]"

This installs the huggingface_hub package alongside dlt.

Initialize the dlt project

dlt init chess filesystem

This creates a sample pipeline with the filesystem destination. To use Hugging Face, update the bucket_url in .dlt/secrets.toml to use the hf:// scheme as shown below.

Set up credentials and destination

bucket_url

The bucket_url uses the hf://datasets/<namespace> scheme, where <namespace> is your Hugging Face username or organization name.

Each dlt dataset becomes a separate Hugging Face dataset repository under that namespace. For example, loading to the dataset name chess_data with namespace my-org creates the repository my-org/chess_data.

[destination.filesystem]
bucket_url = "hf://datasets/my-org" # replace "my-org" with your username or org
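The mapping from bucket_url and dataset_name to a repository id can be sketched in a few lines. The hf_repo_id helper below is hypothetical, not part of dlt; it only illustrates how the namespace and dataset name combine:

```python
from urllib.parse import urlparse

def hf_repo_id(bucket_url: str, dataset_name: str) -> str:
    """Derive the Hugging Face dataset repository id from a
    hf://datasets/<namespace> bucket_url and a dlt dataset_name."""
    parsed = urlparse(bucket_url)  # scheme="hf", netloc="datasets", path="/<namespace>"
    if parsed.scheme != "hf" or parsed.netloc != "datasets":
        raise ValueError("bucket_url must look like hf://datasets/<namespace>")
    namespace = parsed.path.strip("/")
    return f"{namespace}/{dataset_name}"

print(hf_repo_id("hf://datasets/my-org", "chess_data"))  # my-org/chess_data
```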

Authentication

Configure your Hugging Face User Access Token in .dlt/secrets.toml:

[destination.filesystem.credentials]
hf_token = "hf_..." # replace with your Hugging Face User Access Token

Instead of setting hf_token in the configuration, you can authenticate by setting the HF_TOKEN environment variable or by logging in with huggingface-cli login, which saves a token locally.

Authentication is attempted in this order of priority: hf_token config → HF_TOKEN env var → locally saved token.
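This lookup order can be sketched as a small stdlib-only helper. The function below is hypothetical (the real resolution happens inside dlt and huggingface_hub), and the token path shown assumes the default cache location used by huggingface-cli login:

```python
import os
from pathlib import Path
from typing import Optional

def resolve_hf_token(config_token: Optional[str] = None) -> Optional[str]:
    """Resolve a token following the documented priority:
    explicit config value, then HF_TOKEN env var, then the
    token saved locally by `huggingface-cli login`."""
    if config_token:
        return config_token
    if os.environ.get("HF_TOKEN"):
        return os.environ["HF_TOKEN"]
    token_path = Path.home() / ".cache" / "huggingface" / "token"
    if token_path.exists():
        return token_path.read_text().strip()
    return None
```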

Private Hub endpoint

By default, https://huggingface.co is used as the API endpoint. To use a Private Hub or a self-hosted endpoint, set the HF_ENDPOINT environment variable:

export HF_ENDPOINT="https://your-private-hub.example.com"
note

dlt also supports hf_endpoint in the configuration, but this only configures the filesystem and API clients, not the dataset card operations. Use HF_ENDPOINT to ensure all operations target the correct endpoint.

Full example configuration

[destination.filesystem]
bucket_url = "hf://datasets/my-org"

[destination.filesystem.credentials]
hf_token = "hf_..."

Write disposition

The Hugging Face destination supports two write dispositions:

  • append: new data files are added to the dataset repository
  • replace: existing data files for the table are deleted, then the new files are added
warning

merge write disposition is not supported for the Hugging Face destination. Pipelines using merge will fall back to append with a warning.
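The difference between the two dispositions can be illustrated with plain local file operations. This is an illustrative sketch only; dlt performs the equivalent via commits to the Hub:

```python
import os
import tempfile

def load_files(table_dir: str, new_files: dict, write_disposition: str = "append"):
    """Write new data files for a table. With "replace", existing files
    in the table folder are removed first; with "append", they are kept
    and the new files are added alongside."""
    os.makedirs(table_dir, exist_ok=True)
    if write_disposition == "replace":
        for name in os.listdir(table_dir):
            os.remove(os.path.join(table_dir, name))
    for name, data in new_files.items():
        with open(os.path.join(table_dir, name), "w") as f:
            f.write(data)

tmp = tempfile.mkdtemp()
load_files(tmp, {"a.parquet": "1"}, "append")
load_files(tmp, {"b.parquet": "2"}, "append")   # a.parquet and b.parquet coexist
load_files(tmp, {"c.parquet": "3"}, "replace")  # only c.parquet remains
```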

File format

The Hugging Face destination always uses Parquet as the file format, regardless of other configuration. This is required because the Hugging Face dataset viewer needs Parquet files to preview datasets on the Hub.

The Parquet files are written with a page index and content-defined chunking for efficient versioned storage on the Hub.

Table formats

The Hugging Face destination does not support Delta or Iceberg table formats.

Files layout

The Hugging Face destination uses the same layout system as the filesystem destination. The default layout is:

{table_name}/{load_id}.{file_id}.{ext}

You can customize the layout using the same layout and extra_placeholders settings as the filesystem destination. See Files layout for all available placeholders and examples.

[destination.filesystem]
bucket_url = "hf://datasets/my-org"
layout = "{table_name}/{load_id}.{file_id}.{ext}"
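The placeholder substitution can be sketched with plain string formatting. The render_layout helper is hypothetical and simplified (the real layout engine supports more placeholders and callable extra_placeholders):

```python
def render_layout(layout: str, **placeholders: str) -> str:
    """Fill layout placeholders for a single data file,
    the way the filesystem layout engine does (simplified)."""
    return layout.format(**placeholders)

path = render_layout(
    "{table_name}/{load_id}.{file_id}.{ext}",
    table_name="my_data",
    load_id="1726000000.123",
    file_id="0",
    ext="parquet",
)
print(path)  # my_data/1726000000.123.0.parquet
```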

Hugging Face-specific behavior

Dataset repositories

Each dlt dataset creates or updates a Hugging Face dataset repository (not a directory). The repository name is <namespace>/<dataset_name>, where <namespace> comes from the bucket_url and <dataset_name> is the pipeline's dataset_name.

Dataset repository visibility

The Hugging Face dataset repositories created by dlt are public, unless your Hugging Face organization's default is private.

Dataset card

dlt creates a dataset card (the repository's README.md) that contains only YAML metadata, with no prose content. You can manually edit the dataset card to, for example, add a dataset description.

note

Do not manually change the configurations specified in the YAML metadata section. This metadata defines the subsets (see next section) and is managed by dlt.

Subsets and split

dlt creates a subset for each table in the dataset, so the dataset viewer displays each table properly. All data is loaded into the train split.
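For example, for a dataset containing tables my_data and my_other_table, the managed YAML metadata might look roughly like this (an illustrative sketch following the Hub's dataset card conventions; the exact content is written by dlt):

```yaml
configs:
- config_name: my_data
  data_files:
  - split: train
    path: my_data/*.parquet
- config_name: my_other_table
  data_files:
  - split: train
    path: my_other_table/*.parquet
```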

Disabling dataset card management

To disable automatic dataset card creation and metadata updates (e.g. to reduce API calls and avoid rate limits), set hf_dataset_card to false:

[destination.filesystem]
hf_dataset_card = false

When disabled, the dataset viewer will not display table subsets, but data loading is unaffected.

Atomic commits

All data files for a table and its child tables are committed to the Hub in a single git commit via the HfApi client. This minimizes the number of commits, avoids hitting Hugging Face rate limits, and prevents commit conflicts.
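Conceptually, the batching can be sketched as follows. This is a simplified stdlib-only sketch with a hypothetical build_commit helper; the actual implementation commits through the HfApi client:

```python
from typing import Dict, List

def build_commit(table_files: Dict[str, List[str]]) -> dict:
    """Group all files for a table and its child tables into one
    commit payload instead of issuing one commit per file."""
    operations = [
        {"op": "add", "path": path}
        for table, paths in sorted(table_files.items())
        for path in paths
    ]
    return {
        "message": f"dlt load: {', '.join(sorted(table_files))}",
        "operations": operations,
    }

commit = build_commit({
    "my_data": ["my_data/1726.0.parquet"],
    "my_data__children": ["my_data__children/1726.0.parquet"],
})
print(len(commit["operations"]))  # 2
```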

No table formats

The hf protocol does not support the iceberg and delta table formats.

Parquet with page index and CDC

The hf protocol defaults to parquet file format and always writes files with page index and content-defined chunking for efficient versioned storage on the Hub. This enables efficient column statistics and skipping, and is required for the Hugging Face dataset viewer to preview datasets on the Hub.

Dual client

dlt uses two Hugging Face clients together:

  • HfApi: used for repository management (create/delete repo) and atomic commits
  • HfFileSystem (fsspec): used for file reads and DuckDB data access

The fsspec cache is invalidated after each HfApi mutation to keep the two clients consistent.

Syncing dlt state

The Hugging Face destination fully supports dlt state sync. Special files and folders (e.g., _dlt_loads, _dlt_pipeline_state) are created in the dataset repository to track pipeline state, schemas, and completed loads. These folders do not follow the layout setting.

By default, the last 100 pipeline state files are retained. You can configure this with max_state_files:

[destination.filesystem]
max_state_files = 100 # set to 0 or negative to disable cleanup
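The retention logic can be sketched as follows. The prune_state_files helper is hypothetical, and it assumes state file names sort chronologically (which holds when names embed monotonically increasing load ids of equal width):

```python
def prune_state_files(state_files: list, max_state_files: int) -> list:
    """Return the files to delete so that only the newest
    max_state_files remain; 0 or negative disables cleanup."""
    if max_state_files <= 0:
        return []
    newest_first = sorted(state_files, reverse=True)
    return newest_first[max_state_files:]

files = [f"_dlt_pipeline_state/{i}.jsonl" for i in range(1726000000, 1726000105)]
to_delete = prune_state_files(files, 100)
print(len(to_delete))  # 5 oldest files are pruned
```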

Hugging Face environment variables

You can set Hugging Face environment variables to configure the huggingface_hub library that dlt uses under the hood. For example, to increase the download timeout:

HF_HUB_DOWNLOAD_TIMEOUT="30" python run_my_dlt_pipe.py

Data access

The Hugging Face destination inherits the sql_client from the filesystem destination, which provides read-only SQL access to Parquet files using a DuckDB dialect. This also enables pipeline.dataset(), giving you Python-native access to loaded data as Pandas DataFrames, PyArrow tables, or Python tuples. See filesystem data access for details.

Example pipeline

import dlt

@dlt.resource
def my_data():
    yield [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Requires bucket_url = "hf://datasets/<namespace>" in .dlt/secrets.toml
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="filesystem",
    dataset_name="my_dataset",
)

pipeline.run(my_data(), write_disposition="append")

With the hf:// bucket URL configured, this creates or updates the my-org/my_dataset repository on Hugging Face with a Parquet file under my_data/.

Troubleshooting

Rate limit errors

Hugging Face enforces rate limits on repository commits. The dlt HF client automatically batches all file writes for a table into a single commit to minimize commit frequency. If you still hit rate limits, consider reducing the pipeline frequency or upgrading your Hugging Face plan.

Authentication errors

If you see authentication errors, verify that:

  1. Your token has write access to the target namespace.
  2. The token is correctly set in hf_token, HF_TOKEN, or via huggingface-cli login.
  3. If using a Private Hub, HF_ENDPOINT is set to the correct URL.
  4. The dataset repository for your dataset_name exists under the target namespace, or your token has permission to create it.
