Hugging Face x dltHub: The missing data layer for ML practitioners
Elvis Kahoro, DevX & Ecosystem Lead
Thierry Jean, Senior AI Engineer
Quentin Lhoest, Open Source & ML/Data Engineer
## Why dltHub's new Hugging Face integration matters
Hugging Face's Dataset Hub excels at distributing and versioning datasets for machine learning, while lakehouses excel at governance, cross-dataset analysis, and SQL-heavy analytics.
Bridging these worlds has traditionally required sophisticated internal tooling and ad-hoc scripts for new projects.
Our new Hugging Face integration closes this gap and positions [dlt](http://github.com/dlt-hub/dlt), an open-source Python library for data movement (ETL), as the go-to tool for developers (and their agents) building data pipelines.
With our new Hugging Face dataset writer, pushing processed data back into the Hub is easier than ever!
Publish and share your AI-ready data regardless of where it currently resides: BigQuery, Snowflake, LanceDB, etc.
This makes training data pipelines reproducible, destination-agnostic, and fully traceable to specific dataset revisions.
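That traceability comes from the `hf://` protocol's `@revision` syntax, which pins a dataset reference to a branch, tag, or commit (the dataset path below is a placeholder):

```
hf://datasets/my-org/my_dataset/**/*.parquet           # default branch
hf://datasets/my-org/my_dataset@v1.0/**/*.parquet      # tag or branch
hf://datasets/my-org/my_dataset@a1b2c3d/**/*.parquet   # commit hash
```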
Check out our case study where we walk through how [Distil Labs](https://distillabs.ai) trained [a small (0.6B) language model from production traces](http://todo-link-to-case-study) that outperforms a 120B parameter LLM by 29 points!
We envision this being one of many steps in forging a Builder Stack that centers developers (+ their agents): 10x builders who can now operate independently of platform teams and leverage AI to ship at the speed of thought.
The Builder Stack is our answer to the "Modern Data Stack" of the past.
The old world gave us fragmented GUIs, vendor lock-in, and opaque abstractions that required dedicated platform teams to stitch together.
The Builder Stack is the opposite: AI-native, code-first, and interoperable—built on open standards, fully under your control.
## Feature breakdown
### Load data using Hugging Face's DuckDB integration
[Hugging Face's DuckDB integration](https://huggingface.co/docs/hub/en/datasets-duckdb) combined with dltHub provides a powerful, memory-efficient way to load datasets to any destination by querying Hugging Face Parquet files directly via the `hf://` protocol.
Easily filter and select only the columns you need at the source—avoiding unnecessary data transfer.
Batched fetching ensures that even massive datasets can be processed without exhausting memory, while dlt handles schema inference, incremental loading, and seamless delivery into your destination of choice.
In this example, we load video captions from the [OpenVID dataset into LanceDB](https://huggingface.co/datasets/lance-format/openvid-lance) for vector search.
```py
import dlt
import duckdb
from dlt.destinations.adapters import lancedb_adapter

HF_PARQUET_URL = "hf://datasets/lance-format/openvid-lance@~parquet/**/*.parquet"

@dlt.resource(write_disposition="replace")
def openvid_dataset(limit, batch_size):
    conn = duckdb.connect()
    # Project only the columns we need; DuckDB reads the remote Parquet lazily
    query = f"SELECT caption, video_id FROM '{HF_PARQUET_URL}' LIMIT {limit}"
    # Stream results as Arrow record batches to keep memory usage bounded
    yield from conn.sql(query).fetch_arrow_reader(batch_size)

pipeline = dlt.pipeline(
    pipeline_name="openvid_pipeline",
    destination="lancedb",
    dataset_name="openvid",
)

load_info = pipeline.run(
    lancedb_adapter(
        openvid_dataset(limit=10000, batch_size=100),
        embed=["caption"],
    ),
    table_name="videos",
)
print(load_info)
```
Switching to another [dltHub supported destination](https://dlthub.com/docs/dlt-ecosystem/destinations) would only require changing a single line of code.
This pattern is especially valuable for ML practitioners who need to prepare training data across different environments—whether prototyping locally with DuckDB, running experiments with LanceDB embeddings, or deploying to production data warehouses.
This newfound flexibility means you can prototype, experiment, and scale your data workflows without vendor lock-in or rewriting your ingestion logic.
### Explore, prep, and curate ML datasets all in Python
Once data lands in a destination, we provide primitives for exploration and analysis with "datasets"!
Easily query loaded data as PyArrow, Pandas, and Polars DataFrames:
```py
# Attach to an existing pipeline
pipeline = dlt.attach(
    pipeline_name="openvid_pipeline",
    destination="lancedb",
    dataset_name="openvid",
)

# List all tables in the dataset
pipeline.dataset().tables

# Query as PyArrow
pipeline.dataset().table("videos").arrow()
```
This unified interface lets you explore and iterate on loaded data without leaving Python or writing destination-specific queries.
Visualize how your data is structured by generating Mermaid diagrams:
```py
# Get the schema as a Mermaid diagram
mermaid_code = pipeline.default_schema.to_mermaid()
print(mermaid_code)
```
Understanding your schema is just the first step—before feeding data into a training run, you need to ensure it meets quality standards.
Data quality checks let you systematically validate datasets before training:
```py
import dlthub.data_quality as dq

videos_checks = [
    dq.checks.is_not_null("video_path"),
    dq.checks.is_in("fps", [24, 30, 60]),
    dq.checks.case("motion_score > 0"),
    dq.checks.case("LENGTH(caption) >= 10"),
]

# Run checks at the table level
dq.prepare_checks(
    pipeline.dataset().videos,
    videos_checks,
    level="table",
).arrow()

# Or create a check suite for the entire dataset
check_suite = dq.CheckSuite(
    pipeline.dataset(),
    checks={"videos": videos_checks},
)
```
By catching data issues early, you can avoid costly training failures and ensure your models learn from clean, consistent data.
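To make the check semantics concrete, here is a plain-Python sketch of what the four predicates above assert per row (illustrative only—this is not the `dlthub.data_quality` implementation, just the logic the checks express):

```python
# Row-level predicates mirroring the four checks above
rows = [
    {"video_path": "a.mp4", "fps": 30, "motion_score": 12.5,
     "caption": "A dog runs across a sunny park."},
    {"video_path": None, "fps": 25, "motion_score": 0.0, "caption": "short"},
]

checks = {
    "video_path is not null": lambda r: r["video_path"] is not None,
    "fps in {24, 30, 60}": lambda r: r["fps"] in (24, 30, 60),
    "motion_score > 0": lambda r: r["motion_score"] > 0,
    "len(caption) >= 10": lambda r: len(r["caption"]) >= 10,
}

# A row passes only if every check holds
passing = [r for r in rows if all(check(r) for check in checks.values())]
# Group failing rows by the check they violate
failures = {name: [r for r in rows if not check(r)]
            for name, check in checks.items()}
```

Here only the first row survives; the second trips every check and would be flagged before it reaches a training run.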
Once your data passes quality checks, the next step is curating what goes into training; our native Ibis integration enables you to do exactly that without leaving your pipeline.
By calling `.to_ibis()` on any dataset table, you can start lazily building filters and running aggregations.
```py
# Works the same whether your destination is DuckDB, BigQuery, or LanceDB
videos_ibis = pipeline.dataset().videos.to_ibis()

# This filter compiles to native SQL/queries for your destination
filtered = videos_ibis.filter(
    (videos_ibis.aesthetic_score >= 4.0)
    & (videos_ibis.motion_score >= 10.0)
)
```
The key insight here is that computation runs where your data lives.
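To illustrate the deferred-evaluation idea with a toy sketch (this is not Ibis internals, just the pattern): a lazy table records predicates as you chain filters, and only renders SQL when results are actually requested, so the destination engine does the heavy lifting.

```python
# Toy sketch of deferred filtering: predicates accumulate, SQL renders lazily
class LazyTable:
    def __init__(self, name, predicates=()):
        self.name = name
        self.predicates = list(predicates)

    def filter(self, predicate: str) -> "LazyTable":
        # No data is touched here; we only record the predicate
        return LazyTable(self.name, [*self.predicates, predicate])

    def compile(self) -> str:
        # Only now is SQL rendered, for the destination engine to execute
        where = " AND ".join(f"({p})" for p in self.predicates)
        return f"SELECT * FROM {self.name}" + (f" WHERE {where}" if where else "")

videos = LazyTable("videos")
curated = videos.filter("aesthetic_score >= 4.0").filter("motion_score >= 10.0")
print(curated.compile())
# SELECT * FROM videos WHERE (aesthetic_score >= 4.0) AND (motion_score >= 10.0)
```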
And with [marimo](https://marimo.io)—a reactive Python notebook we've partnered with—you can turn this into an interactive curation workflow where sliders and filters automatically recompute Ibis queries in real-time.
marimo also supports building charts directly from your data—histograms, scatter plots, and more—making it easy to visually explore distributions and spot outliers before training.
```py
import ibis

stats = videos_ibis.aggregate(
    total=ibis._.caption.count(),
    avg_aesthetic=ibis._.aesthetic_score.mean(),
    min_aesthetic=ibis._.aesthetic_score.min(),
    max_aesthetic=ibis._.aesthetic_score.max(),
)
stats.to_pyarrow()
```
These primitives unlock powerful workflows for managing ML data at scale:
- Combine public benchmarks with proprietary logs, user feedback, or production traces in SQL
- Use your warehouse's compute for aggregations, distributions, and quality checks across millions of rows
- Build reactive notebooks with [marimo](https://marimo.io) using sliders and filters for exploring and curating training data interactively
Sharing your curated datasets with collaborators or the community should be just as easy as loading them.
That's exactly what the Hugging Face destination enables.
### Publishing datasets back to the Hub
Push data from any source to the Hub as Parquet files—built on dlt's [filesystem destination](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem) using the `hf://` protocol:
```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="publish_to_hf",
    destination="filesystem",
    dataset_name="my_training_data",
)

# With bucket_url = "hf://datasets/my-org" configured in .dlt/secrets.toml
# "dataset" stands for any dlt source or resource you've defined
pipeline.run(dataset(), write_disposition="replace")
```
Each dataset becomes a separate Hugging Face repository under your namespace, e.g. `my-org/my_training_data`:
- All data files for a table are committed in a single git commit, avoiding rate limits and commit conflicts
- dlt creates and maintains the repo's README.md with proper metadata
- Each table becomes a subset in the dataset viewer, so the Hub displays your data properly
- Explore your data in the Dataset Viewer before running your training experiments
- The dataset is ready to use in most training frameworks, including `transformers` and `unsloth`, and many evaluation frameworks like `lighteval` or `inspect_ai`
- Use `append` to add new data or `replace` to overwrite existing files
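Conceptually, the two dispositions behave like this (a plain-Python sketch of the semantics, not dlt's implementation):

```python
# Conceptual semantics of dlt write dispositions on a table of rows
def apply_write_disposition(existing, new_rows, disposition):
    if disposition == "replace":
        # Drop the existing files; keep only the new load
        return list(new_rows)
    if disposition == "append":
        # Add the new files next to the existing ones
        return [*existing, *new_rows]
    raise ValueError(f"unknown disposition: {disposition}")

table = [{"id": 1}, {"id": 2}]
print(apply_write_disposition(table, [{"id": 3}], "append"))   # 3 rows
print(apply_write_disposition(table, [{"id": 3}], "replace"))  # 1 row
```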
Configuration is minimal—add your credentials to `.dlt/secrets.toml`:
```
[destination.filesystem]
bucket_url = "hf://datasets/my-org"

[destination.filesystem.credentials]
hf_token = "hf_..."  # Your Hugging Face User Access Token
```
This effectively closes the loop: extract data from any source with dlt, transform and enrich it in Python, then publish curated datasets back to the Hub for sharing with collaborators or the community.
## Benefits of dltHub for Hugging Face workflows
### Python-first: pip install dlt, run everywhere, and load anywhere (20+ destinations)
At its core, dlt is just a Python library: no Helm charts to meddle with, no Docker Compose to run!
You can run dlt wherever you're already running Python and integrate it with your existing stack: Jupyter notebooks, Airflow tasks, serverless functions, and [CI pipelines like GitHub Actions](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-github-actions).
This fits naturally into Hugging Face workflows where practitioners already work in Python notebooks and scripts.
Instead of adopting an entire new platform, dlt slots in as a lightweight ingestion layer that speaks the same language as the rest of your ML stack—while still giving you the flexibility to write to 20+ destinations out of the box. Changing where your data lands is a one-line config swap, no refactors needed.
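Because credentials live in config rather than code, you can keep several destinations configured side by side in `.dlt/secrets.toml` and switch between them with just the `destination=` argument (the values below are placeholders):

```
[destination.lancedb]
lance_uri = ".lancedb"

[destination.bigquery.credentials]
project_id = "my-project"  # placeholder
```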
### Compute and store embeddings alongside your data with reproducible versioned pipelines
Easily integrate and generate embeddings as part of your existing data workflow—no separate infrastructure or orchestration required.
Call Hugging Face models (like `sentence-transformers/all-mpnet-base-v2`) inside a [@dlt.resource transformation](https://dlthub.com/docs/general-usage/resource).
Even simpler, use our pipeline adapters to declaratively specify which columns need to be embedded:
```py
load_info = pipeline.run(
    lancedb_adapter(
        openvid_dataset(limit=10000, batch_size=100),
        embed=["caption"],
    ),
    table_name="videos",
)
```
The embedding model provider is configured in `.dlt/secrets.toml`, giving you flexibility to choose from providers like OpenAI, Cohere, Ollama, Sentence Transformers, Hugging Face, and more:
```
[destination.lancedb]
lance_uri = ".lancedb"
embedding_model_provider = "sentence-transformers"  # or "openai", "cohere", "ollama", "huggingface", etc.
embedding_model = "all-mpnet-base-v2"

[destination.lancedb.credentials]
embedding_model_provider_api_key = "your_api_key"  # Not needed for local providers like ollama or sentence-transformers
```
For self-hosted models (like Ollama), you can also specify a custom endpoint:
```
[destination.lancedb]
embedding_model_provider = "ollama"
embedding_model = "mxbai-embed-large"
embedding_model_provider_host = "http://localhost:11434"
```
Now that your Hugging Face dataset has been loaded into your destination—complete with freshly computed embeddings—you can start querying, joining, and validating your data with our Builder Stack tools that we showcased above.
And because dlt is Python-first, it's not just accessible to practitioners—it's also a natural fit for LLMs, opening the door for AI-assisted pipeline development.
### LLM-native development: Build and debug pipelines with prompts and agents
The docs, scaffolding commands, and REST source tutorials are structured for AI-assisted development.
Use `dlt init <source> <destination>` to create a template pipeline, and `dlt ai` to download skills and prompts that give your coding assistant the context it needs to help you build, debug, and extend pipelines.
Hugging Face hosts tens of thousands of datasets—far too many to manually write pipelines for each one.
LLM-assisted development means you can quickly scaffold pipelines for any of them without writing boilerplate from scratch.
And because HF datasets come with rich metadata—dataset cards, tags, splits, and configs—an LLM can use this context to generate appropriate pipeline configurations tailored to each dataset's structure.
The result: adding a new HF dataset to your pipeline becomes a conversation, not a coding task.