Hugging Face Datasets
The Hugging Face destination loads data into Hugging Face Datasets repositories. It is built on top of the filesystem destination and uses the hf:// protocol to write Parquet files to the Hugging Face Hub.
Because this destination extends filesystem, most filesystem concepts (file layouts, write dispositions, dlt state sync, etc.) apply here. This page covers the Hugging Face-specific behavior and configuration. Refer to the filesystem destination docs for the full feature set.
Install dlt with Hugging Face support
pip install "dlt[hf]"
This installs the huggingface_hub package alongside dlt.
Initialize the dlt project
dlt init chess filesystem
This creates a sample pipeline with the filesystem destination. To use Hugging Face, update the bucket_url in .dlt/secrets.toml to use the hf:// scheme as shown below.
Set up credentials and destination
bucket_url
The bucket_url uses the hf://datasets/<namespace> scheme, where <namespace> is your Hugging Face username or organization name.
Each dlt dataset becomes a separate Hugging Face dataset repository under that namespace. For example, loading to the dataset name chess_data with namespace my-org creates the repository my-org/chess_data.
[destination.filesystem]
bucket_url = "hf://datasets/my-org" # replace "my-org" with your username or org
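As an illustration of this mapping (not dlt's actual internals), the repository id can be thought of as the namespace from bucket_url joined with the dlt dataset name:

```python
# Illustrative sketch: how a bucket_url namespace and a dlt dataset name
# map to a Hugging Face dataset repository id. Not dlt's actual code.
def hf_repo_id(bucket_url: str, dataset_name: str) -> str:
    # bucket_url looks like "hf://datasets/<namespace>"
    namespace = bucket_url.removeprefix("hf://datasets/").strip("/")
    return f"{namespace}/{dataset_name}"

print(hf_repo_id("hf://datasets/my-org", "chess_data"))  # my-org/chess_data
```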
Authentication
Configure your Hugging Face User Access Token in .dlt/secrets.toml:
[destination.filesystem.credentials]
hf_token = "hf_..." # replace with your Hugging Face User Access Token
Instead of setting hf_token in config, you can authenticate by:
- Setting the HF_TOKEN environment variable
- Using a locally saved token created with huggingface-cli login
Authentication is attempted in that order of priority: hf_token config, then the HF_TOKEN environment variable, then the locally saved token.
Private Hub endpoint
By default, https://huggingface.co is used as the API endpoint. To use a Private Hub or a self-hosted endpoint, set the HF_ENDPOINT environment variable:
export HF_ENDPOINT="https://your-private-hub.example.com"
dlt also supports hf_endpoint in the configuration, but this only configures the filesystem and API clients, not the dataset card operations. Use HF_ENDPOINT to ensure all operations target the correct endpoint.
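You can also set the variable from Python, as long as it happens before huggingface_hub is imported (the library reads it at import time). A minimal sketch, where the endpoint URL is a placeholder:

```python
import os

# Point all Hugging Face operations at a Private Hub. Set this before
# huggingface_hub is imported, i.e. at the very top of your script.
# "https://your-private-hub.example.com" is a placeholder URL.
os.environ["HF_ENDPOINT"] = "https://your-private-hub.example.com"

# ... then import dlt and define your pipeline below.
```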
Full example configuration
[destination.filesystem]
bucket_url = "hf://datasets/my-org"
[destination.filesystem.credentials]
hf_token = "hf_..."
Write disposition
The Hugging Face destination supports two write dispositions:
- append: new data files are added to the dataset repository
- replace: existing data files for the table are deleted, then the new files are added
The merge write disposition is not supported for the Hugging Face destination. Pipelines using merge fall back to append with a warning.
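The fallback behavior can be sketched as follows (illustrative only, not dlt's implementation):

```python
import warnings

# Illustrative sketch of the fallback: merge is not supported on this
# destination, so it is demoted to append with a warning. Not dlt's code.
SUPPORTED = {"append", "replace"}

def effective_disposition(requested: str) -> str:
    if requested == "merge":
        warnings.warn(
            "merge is not supported on the Hugging Face destination; "
            "falling back to append"
        )
        return "append"
    if requested not in SUPPORTED:
        raise ValueError(f"unknown write disposition: {requested}")
    return requested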
File format
The Hugging Face destination always uses Parquet as the file format, regardless of other configuration. This is required because the Hugging Face dataset viewer needs Parquet files to preview datasets on the Hub.
The Parquet files are written with:
- Page index (Apache Parquet page index) for efficient column statistics and skipping
- Content-defined chunking for efficient versioned storage on the Hub
Table formats
The Hugging Face destination does not support Delta or Iceberg table formats.
Files layout
The Hugging Face destination uses the same layout system as the filesystem destination. The default layout is:
{table_name}/{load_id}.{file_id}.{ext}
You can customize the layout using the same layout and extra_placeholders settings as the filesystem destination. See Files layout for all available placeholders and examples.
[destination.filesystem]
bucket_url = "hf://datasets/my-org"
layout = "{table_name}/{load_id}.{file_id}.{ext}"
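As a simplified illustration of how the placeholders expand into a file path (the values below are made up; dlt fills them in during the load, and its real implementation supports more placeholders):

```python
# Simplified illustration of how layout placeholders expand into a file path.
# The values here are made up; dlt fills them in during the load.
layout = "{table_name}/{load_id}.{file_id}.{ext}"
path = layout.format(
    table_name="my_data",
    load_id="1700000000.123",
    file_id="a1b2c3",
    ext="parquet",
)
print(path)  # my_data/1700000000.123.a1b2c3.parquet
```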
Hugging Face-specific behavior
Dataset repositories
Each dlt dataset creates or updates a Hugging Face dataset repository (not a directory). The repository name is <namespace>/<dataset_name>, where <namespace> comes from the bucket_url and <dataset_name> is the pipeline's dataset_name.
Dataset repository visibility
The Hugging Face dataset repositories created by dlt are public unless your Hugging Face organization's default visibility is private.
Dataset card
dlt creates a dataset card (the repo's README.md) that contains metadata but no descriptive content. You can manually update the dataset card, e.g., to add a dataset description.
Do not manually change the configurations specified in the YAML metadata section. This metadata defines the subsets (see next section) and is managed by dlt.
Subsets and split
dlt creates a subset for each table in the dataset, so the dataset viewer displays each table properly. All data is loaded into the train split.
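The viewer discovers subsets through the configs section of the dataset card's YAML metadata. An illustrative shape of that mapping, expressed as a Python dict (the table names are examples and the exact fields dlt writes may differ):

```python
# Illustrative shape of the dataset card metadata that maps each table to a
# viewer subset. Table names are examples; the exact fields dlt writes may
# differ.
card_metadata = {
    "configs": [
        {"config_name": "my_data", "data_files": "my_data/*.parquet"},
        {"config_name": "other_table", "data_files": "other_table/*.parquet"},
    ]
}
```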
Disabling dataset card management
To disable automatic dataset card creation and metadata updates (e.g. to reduce API calls and avoid rate limits), set hf_dataset_card to false:
[destination.filesystem]
hf_dataset_card = false
When disabled, the dataset viewer will not display table subsets, but data loading is unaffected.
Atomic commits
All data files for a table and its child tables are committed to the Hub in a single git commit via the HfApi client. This minimizes the number of commits, avoids hitting Hugging Face rate limits, and prevents commit conflicts.
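Conceptually, the load files are grouped per top-level table so that each group becomes a single commit. A rough sketch of the grouping step (the actual commit is made through the HfApi client; this is not dlt's code):

```python
from collections import defaultdict

# Rough sketch: group load files per top-level table so each table's files
# can go into a single commit. Each value list would then become one
# HfApi.create_commit call with one operation per file. Not dlt's code.
def group_files_by_table(paths: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        table = path.split("/", 1)[0]  # layout starts with {table_name}/
        groups[table].append(path)
    return dict(groups)
```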
No table formats
The hf protocol does not support the iceberg and delta table formats.
Parquet with page index and CDC
The hf protocol defaults to parquet file format and always writes files with page index and content-defined chunking for efficient versioned storage on the Hub. This enables efficient column statistics and skipping, and is required for the Hugging Face dataset viewer to preview datasets on the Hub.
Dual client
dlt uses two Hugging Face clients together:
- HfApi: used for repository management (create/delete repos) and atomic commits
- HfFileSystem (fsspec): used for file reads and DuckDB data access
The fsspec cache is invalidated after each HfApi mutation to keep the two clients consistent.
Syncing dlt state
The Hugging Face destination fully supports dlt state sync. Special files and folders (e.g., _dlt_loads, _dlt_pipeline_state) are created in the dataset repository to track pipeline state, schemas, and completed loads. These folders do not follow the layout setting.
By default, the last 100 pipeline state files are retained. You can configure this with max_state_files:
[destination.filesystem]
max_state_files = 100 # set to 0 or negative to disable cleanup
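The retention rule keeps the newest files and removes the rest. A simplified sketch, assuming state file names sort chronologically (not dlt's actual implementation):

```python
# Simplified sketch of state-file retention: keep the newest max_state_files,
# delete the rest. Assumes file names sort chronologically. Not dlt's code.
def files_to_delete(state_files: list[str], max_state_files: int) -> list[str]:
    if max_state_files <= 0:          # 0 or negative disables cleanup
        return []
    ordered = sorted(state_files)     # oldest first
    if len(ordered) <= max_state_files:
        return []
    return ordered[:-max_state_files]
```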
Hugging Face environment variables
You can set Hugging Face environment variables to configure the huggingface_hub library that dlt uses under the hood. For example, to increase the download timeout:
HF_HUB_DOWNLOAD_TIMEOUT="30" python run_my_dlt_pipe.py
Data access
The Hugging Face destination inherits the sql_client from the filesystem destination, which provides read-only SQL access to Parquet files using a DuckDB dialect. This also enables pipeline.dataset(), giving you Python-native access to loaded data as Pandas DataFrames, PyArrow tables, or Python tuples. See filesystem data access for details.
Example pipeline
import dlt
@dlt.resource
def my_data():
yield [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
# Requires bucket_url = "hf://datasets/<namespace>" in .dlt/secrets.toml
pipeline = dlt.pipeline(
pipeline_name="my_pipeline",
destination="filesystem",
dataset_name="my_dataset",
)
pipeline.run(my_data(), write_disposition="append")
With the hf:// bucket URL configured, this creates or updates the my-org/my_dataset repository on Hugging Face with a Parquet file under my_data/.
Troubleshooting
Rate limit errors
Hugging Face enforces rate limits on repository commits. The dlt HF client automatically batches all file writes for a table into a single commit to minimize commit frequency. If you still hit rate limits, consider reducing the pipeline frequency or upgrading your Hugging Face plan.
Authentication errors
If you see authentication errors, verify that:
- Your token has write access to the target namespace.
- The token is correctly set in hf_token, the HF_TOKEN environment variable, or via huggingface-cli login.
- If using a Private Hub, HF_ENDPOINT is set to the correct URL.
- The dataset dataset_name exists.