# Hugging Face Datasets
The Hugging Face destination loads data into Hugging Face Datasets repositories. It is built on top of the filesystem destination and uses the hf:// protocol to write Parquet files to the Hugging Face Hub.
Because this destination extends filesystem, most filesystem concepts (file layouts, write dispositions, dlt state sync, etc.) apply here. This page covers the Hugging Face-specific behavior and configuration. Refer to the filesystem destination docs for the full feature set.
## Install dlt with Hugging Face support

```sh
pip install "dlt[hf]"
```

This installs the `huggingface_hub` package alongside dlt.
## Initialize the dlt project

```sh
dlt init chess filesystem
```

This creates a sample pipeline with the filesystem destination. To use Hugging Face, update the `bucket_url` in `.dlt/secrets.toml` to use the `hf://` scheme as shown below.
## Set up credentials and destination

### bucket_url

The `bucket_url` uses the `hf://datasets/<namespace>` scheme, where `<namespace>` is your Hugging Face username or organization name.

Each dlt dataset becomes a separate Hugging Face dataset repository under that namespace. For example, loading to the dataset name `chess_data` with namespace `my-org` creates the repository `my-org/chess_data`.
```toml
[destination.filesystem]
bucket_url = "hf://datasets/my-org" # replace "my-org" with your username or org
```
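To make the naming rule concrete, here is a small sketch of how the namespace in `bucket_url` combines with the dataset name. The helper function is purely illustrative and not part of dlt's API; dlt performs this mapping internally.

```python
def hf_repo_id(bucket_url: str, dataset_name: str) -> str:
    # Hypothetical helper, not part of dlt's API.
    # "hf://datasets/my-org" -> namespace "my-org"
    namespace = bucket_url.removeprefix("hf://datasets/").strip("/")
    return f"{namespace}/{dataset_name}"

print(hf_repo_id("hf://datasets/my-org", "chess_data"))  # my-org/chess_data
```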
### Authentication

Configure your Hugging Face User Access Token in `.dlt/secrets.toml`:

```toml
[destination.filesystem.credentials]
hf_token = "hf_..." # replace with your Hugging Face User Access Token
```
Instead of setting `hf_token` in config, you can authenticate by:
- Setting the `HF_TOKEN` environment variable
- Using a locally saved token created with `huggingface-cli login`

Authentication is attempted in this order of priority: `hf_token` config → `HF_TOKEN` env var → locally saved token.
### Private Hub endpoint

By default, https://huggingface.co is used as the API endpoint. To use a Private Hub or a self-hosted endpoint, set the `HF_ENDPOINT` environment variable:

```sh
export HF_ENDPOINT="https://your-private-hub.example.com"
```

dlt also supports `hf_endpoint` in the configuration, but this only configures the filesystem and API clients, not the dataset card operations. Use `HF_ENDPOINT` to ensure all operations target the correct endpoint.
### Full example configuration

```toml
[destination.filesystem]
bucket_url = "hf://datasets/my-org"

[destination.filesystem.credentials]
hf_token = "hf_..."
```
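The same configuration can also be supplied via environment variables, following dlt's standard convention of uppercasing section names and joining them with double underscores (the values below are placeholders):

```sh
export DESTINATION__FILESYSTEM__BUCKET_URL="hf://datasets/my-org"
export DESTINATION__FILESYSTEM__CREDENTIALS__HF_TOKEN="hf_..."
```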
## Write disposition

The Hugging Face destination supports two write dispositions:
- `append`: new data files are added to the dataset repository
- `replace`: existing data files for the table are deleted, then the new files are added
The `merge` write disposition is not supported for the Hugging Face destination; pipelines using `merge` fall back to `append` with a warning.
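The effect of the two dispositions on the files stored for a table can be modeled in a few lines. This is a toy model of the semantics described above, not dlt code:

```python
def apply_load(existing: set[str], new_files: set[str], disposition: str) -> set[str]:
    """Toy model: files stored for a table after a load."""
    if disposition == "append":
        # New data files are added alongside the existing ones.
        return existing | new_files
    if disposition == "replace":
        # Existing files for the table are deleted, then new files added.
        return set(new_files)
    raise ValueError(f"unsupported write disposition: {disposition}")

print(apply_load({"players/1.parquet"}, {"players/2.parquet"}, "append"))
print(apply_load({"players/1.parquet"}, {"players/2.parquet"}, "replace"))
```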
## File format
The Hugging Face destination always uses Parquet as the file format, regardless of other configuration. This is required because the Hugging Face dataset viewer needs Parquet files to preview datasets on the Hub.
The Parquet files are written with:
- Page index (Apache Parquet page index) for efficient column statistics and skipping
- Content-defined chunking for efficient versioned storage on the Hub
## Table formats
The Hugging Face destination does not support Delta or Iceberg table formats.
## Files layout

The Hugging Face destination uses the same layout system as the filesystem destination. The default layout is:

```text
{table_name}/{load_id}.{file_id}.{ext}
```
You can customize the layout using the same layout and extra_placeholders settings as the filesystem destination. See Files layout for all available placeholders and examples.
```toml
[destination.filesystem]
bucket_url = "hf://datasets/my-org"
layout = "{table_name}/{load_id}.{file_id}.{ext}"
```
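The layout is an ordinary placeholder template. As an illustration, here is how the default layout resolves to an object path; the values below are made up, since dlt fills them in at load time:

```python
# Default layout rendered with made-up values (dlt supplies real ones).
layout = "{table_name}/{load_id}.{file_id}.{ext}"
path = layout.format(
    table_name="players",
    load_id="1700000000.1234",
    file_id="a1b2c3",
    ext="parquet",
)
print(path)  # players/1700000000.1234.a1b2c3.parquet
```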
## Hugging Face-specific behavior
### Dataset repositories

Each dlt dataset creates or updates a Hugging Face dataset repository (not a directory). The repository name is `<namespace>/<dataset_name>`, where `<namespace>` comes from the `bucket_url` and `<dataset_name>` is the pipeline's `dataset_name`.
### Dataset repository visibility
The Hugging Face dataset repositories created by dlt are public, unless your Hugging Face organization's default is private.
### Dataset card

dlt creates a dataset card (the repository's README.md) with YAML metadata but no body content. You can manually edit the dataset card, for example to add a dataset description.

Do not manually change the configurations in the YAML metadata section: this metadata defines the subsets (see next section) and is managed by dlt.
### Subsets and split
dlt creates a subset for each table in the dataset, so the dataset viewer displays each table properly. All data is loaded into the train split.
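The managed YAML metadata follows the Hub's standard `configs` convention for dataset cards. A rough sketch of what such a section can look like for a dataset with a `players` table; the field values are illustrative, as the exact content is generated by dlt:

```yaml
configs:
- config_name: players        # one subset per table
  data_files:
  - split: train              # all data is loaded into the train split
    path: players/*.parquet
```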
### Disabling dataset card management

To disable automatic dataset card creation and metadata updates (e.g. to reduce API calls and avoid rate limits), set `hf_dataset_card` to false:

```toml
[destination.filesystem]
hf_dataset_card = false
```
When disabled, the dataset viewer will not display table subsets, but data loading is unaffected.
### Atomic commits
All data files for a table and its child tables are committed to the Hub in a single git commit via the HfApi client. This minimizes the number of commits, avoids hitting Hugging Face rate limits, and prevents commit conflicts.