
Hugging Face x dltHub: The missing data layer for ML practitioners

  • Elvis Kahoro,
    DevX & Ecosystem Lead
  • Thierry Jean,
    Senior AI Engineer
  • Quentin Lhoest,
    Open Source & ML/Data Engineer

Why dltHub's new Hugging Face integration matters

ML practitioners need to know exactly what data trained their models—for reproducibility, compliance, and debugging. Hugging Face's Dataset Hub excels at distributing and versioning datasets for machine learning. But ML data doesn't live in one place—it's scattered across production systems, warehouses, and internal stores. To wrangle this complexity, many teams have turned to lakehouses as their single source of truth. And for good reason—lakehouses excel at governance, cross-dataset analysis, and SQL-heavy analytics.

These two worlds have remained completely siloed. Connecting them has always meant building sophisticated internal tooling accompanied by ad-hoc scripts—with no reusable solution in sight. Our new Hugging Face integration closes this gap and positions dlt, an open-source Python library for data movement (ETL), as the go-to tool for developers (and their agents) building data pipelines.

With our new Hugging Face dataset writer, pushing processed data back into the Hub is easier than ever! Publish and share your AI-ready data regardless of where it currently resides: BigQuery, Snowflake, LanceDB, etc. This makes training data pipelines reproducible, destination-agnostic, and fully traceable to specific dataset revisions. Check out our case study, where we walk through how Distil Labs used production traces to train a small (0.6B) language model that outperforms a 120B-parameter LLM by 29 points!

We envision this being one of many steps in forging a Builder Stack that centers developers (+ their agents): 10x builders who can now operate independently of platform teams and leverage AI to ship at the speed of thought.

Feature breakdown

Load data using Hugging Face's DuckDB integration

Hugging Face's DuckDB integration combined with dltHub provides a powerful, memory-efficient way to load datasets to any destination by querying Hugging Face Parquet files directly via the hf:// protocol. Easily filter and select only the columns you need at the source—avoiding unnecessary data transfer.

Batched fetching ensures that even massive datasets can be processed without exhausting memory, while dlt handles schema inference, incremental loading, and seamless delivery into your destination of choice. In this example, we load video captions from the OpenVID dataset into LanceDB for vector search.

import dlt
import duckdb

from dlt.destinations.adapters import lancedb_adapter

HF_PARQUET_URL = "hf://datasets/lance-format/openvid-lance@~parquet/**/*.parquet"

@dlt.resource(write_disposition="replace")
def openvid_dataset(limit, batch_size):
    conn = duckdb.connect()
    query = f"SELECT caption, video_id FROM '{HF_PARQUET_URL}' LIMIT {limit}"
    # Stream the result as Arrow record batches so large results
    # never have to fit in memory at once
    for batch in conn.execute(query).fetch_record_batch(batch_size):
        yield batch
pipeline = dlt.pipeline(
    pipeline_name="openvid_pipeline",
    destination="lancedb",
    dataset_name="openvid",
)
pipeline.run(
    lancedb_adapter(
        openvid_dataset(limit=10000, batch_size=100),
        embed=["caption"],
    ),
    table_name="videos",
)

Switching to another dltHub-supported destination only requires changing a single line of code. This pattern is especially valuable for ML practitioners who need to prepare training data across different environments—whether prototyping locally with DuckDB, running experiments with LanceDB embeddings, or deploying to production data warehouses.

This newfound flexibility enables you to scale your data workflows without vendor lock-in or rewriting your ingestion logic.
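
To make that one-line swap concrete, here is a minimal sketch. The destination names are illustrative; any of dlt's 20+ destinations works the same way, and only the `destination` argument changes:

import dlt

def make_pipeline(destination: str) -> dlt.Pipeline:
    # The resource and run call stay identical across environments;
    # only the destination string differs.
    return dlt.pipeline(
        pipeline_name="openvid_pipeline",
        destination=destination,
        dataset_name="openvid",
    )

local = make_pipeline("duckdb")        # prototype locally
# prod = make_pipeline("bigquery")     # promote to a warehouse later

The ingestion logic you attach to either pipeline is unchanged; credentials for the new destination come from .dlt/secrets.toml.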

Explore, prep, and curate ML datasets all in Python

Once data lands in a destination, we provide primitives for exploration and analysis with pipeline.dataset()! Easily query loaded data as PyArrow tables or Pandas and Polars DataFrames:

# Attach to an existing pipeline
pipeline = dlt.attach(
    pipeline_name="openvid_pipeline",
    destination="lancedb",
    dataset_name="openvid",
)

# List all tables in the dataset
pipeline.dataset().tables
# Query as PyArrow
pipeline.dataset().table("videos").arrow()

This unified interface lets you explore and iterate on loaded data without leaving Python or writing destination-specific queries. Easily visualize how your data is structured by generating Mermaid diagrams:

# Get the schema as a Mermaid diagram
mermaid_code = pipeline.default_schema.to_mermaid()
print(mermaid_code)

Understanding your schema is just the first step—before feeding data into a training run, you need to ensure it meets quality standards. Data quality checks let you systematically validate datasets before training:

import dlthub.data_quality as dq

videos_checks = [
    dq.checks.is_not_null("video_path"),
    dq.checks.is_in("fps", [24, 30, 60]),
    dq.checks.case("motion_score > 0"),
    dq.checks.case("LENGTH(caption) >= 10"),
]

# Run checks at the table level
dq.prepare_checks(
    pipeline.dataset().videos,
    videos_checks,
    level="table",
).arrow()

# Or create a check suite for the entire dataset
check_suite = dq.CheckSuite(
    pipeline.dataset(),
    checks={"videos": videos_checks}
)

By catching data issues early, you can avoid costly training failures and ensure your models learn from clean, consistent data.
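
The same predicates can also be applied directly in SQL to quarantine failing rows before training. Here is a plain-DuckDB sketch, with hypothetical sample rows standing in for the loaded videos table:

import duckdb

conn = duckdb.connect()
# Hypothetical sample rows standing in for the loaded `videos` table
conn.execute("""
    CREATE TABLE videos AS SELECT * FROM (VALUES
        ('a.mp4', 24, 12.5, 'a long descriptive caption'),
        (NULL,    30,  3.0, 'short'),
        ('b.mp4', 60, 15.0, 'another descriptive caption')
    ) AS t(video_path, fps, motion_score, caption)
""")
# Keep only rows that satisfy the same predicates as the checks above
clean = conn.execute("""
    SELECT * FROM videos
    WHERE video_path IS NOT NULL
      AND fps IN (24, 30, 60)
      AND motion_score > 0
      AND LENGTH(caption) >= 10
""").arrow()
print(clean.num_rows)  # 2: the NULL-path row is filtered out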

Once your data passes quality checks, the next step is curating what goes into training; our native Ibis integration enables you to do exactly that without leaving your pipeline.

By calling .to_ibis() on any dataset table, you can start lazily building filters and running aggregations.

# Works the same whether your destination is DuckDB, BigQuery, or LanceDB
videos_ibis = pipeline.dataset().videos.to_ibis()

# This filter compiles to native SQL/queries for your destination
filtered = videos_ibis.filter(
    (videos_ibis.aesthetic_score >= 4.0)
    & (videos_ibis.motion_score >= 10.0)
)

The key insight here is that computation runs where your data lives.
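
You can see this by compiling an Ibis expression without executing it. A small sketch, assuming the ibis-framework package with its default DuckDB backend and an in-memory table standing in for videos:

import ibis

# An in-memory table standing in for the `videos` table above
t = ibis.memtable({
    "caption": ["a", "b", "c"],
    "aesthetic_score": [3.5, 4.2, 5.0],
    "motion_score": [5.0, 12.0, 20.0],
})
expr = t.filter((t.aesthetic_score >= 4.0) & (t.motion_score >= 10.0))

# Nothing has run yet; the expression compiles to the backend's SQL dialect
print(ibis.to_sql(expr))
# Execution happens only when you materialize the result
print(expr.to_pandas())

Against BigQuery or another backend, the same expression compiles to that engine's dialect and runs there, so no rows leave the warehouse until you materialize.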

And with marimo—a reactive Python notebook we've partnered with—you can turn this into an interactive curation workflow where sliders and filters automatically recompute Ibis queries in real time. Even better, Hugging Face Spaces natively support marimo notebooks, so you can deploy your notebooks as shareable web apps—perfect for collaborating on dataset exploration with your team and the community.

import ibis

stats = videos_ibis.aggregate(
    total=ibis._.caption.count(),
    avg_aesthetic=ibis._.aesthetic_score.mean(),
    min_aesthetic=ibis._.aesthetic_score.min(),
    max_aesthetic=ibis._.aesthetic_score.max(),
)
stats.to_pyarrow()

These primitives unlock powerful workflows for managing ML data at scale:

  • Combine public benchmarks with proprietary logs, user feedback, or production traces in SQL
  • Use your warehouse's compute for aggregations, distributions, and quality checks across millions of rows
  • Build reactive notebooks with marimo using sliders and filters for exploring and curating training data interactively
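
As a sketch of the first workflow: in practice the public side could be read straight from the Hub (e.g. SELECT * FROM 'hf://datasets/.../**/*.parquet'); here both sides are hypothetical in-memory tables joined with DuckDB:

import duckdb

conn = duckdb.connect()
# Hypothetical public benchmark rows (could come from hf:// Parquet files)
conn.execute("""
    CREATE TABLE benchmark AS SELECT * FROM (VALUES
        ('v1', 'public caption one'),
        ('v2', 'public caption two')
    ) AS t(video_id, caption)
""")
# Hypothetical proprietary production traces
conn.execute("""
    CREATE TABLE prod_traces AS SELECT * FROM (VALUES
        ('v1', 0.91),
        ('v3', 0.42)
    ) AS t(video_id, user_score)
""")
# Enrich the public benchmark with internal signals in one SQL join
joined = conn.execute("""
    SELECT b.video_id, b.caption, p.user_score
    FROM benchmark b
    JOIN prod_traces p USING (video_id)
""").arrow()
print(joined.num_rows)  # 1: only v1 appears in both tables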

Sharing your curated datasets with collaborators or the community should be just as easy as loading them. That's exactly what the Hugging Face destination enables.

Publishing datasets back to the Hub

Push data from any source to the Hub as Parquet files—built on dlt's filesystem destination using the hf:// protocol:

import dlt

pipeline = dlt.pipeline(
    pipeline_name="publish_to_hf",
    destination="filesystem",
    dataset_name="my_training_data",
)
# With bucket_url = "hf://datasets/my-org" configured in .dlt/secrets.toml
pipeline.run(dataset(), write_disposition="replace")

Each dataset becomes a separate Hugging Face repository under your namespace, e.g., my-org/my_training_data:

  • All data files for a table are committed in a single git commit, avoiding rate limits and commit conflicts
  • dlt creates and maintains the repo's README.md with proper metadata
  • Each table becomes a subset in the dataset viewer, so the Hub displays your data properly
  • Explore your data in the Dataset Viewer before running your training experiments
  • The dataset is ready to use in most training frameworks, including transformers and unsloth, and many evaluation frameworks like lighteval or inspect_ai
  • Use append to add new data or replace to overwrite existing files
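
A minimal sketch of the two dispositions, using a local DuckDB destination to stand in for the hf:// filesystem destination (the semantics are the same):

import dlt

pipeline = dlt.pipeline(
    pipeline_name="disposition_demo",
    destination="duckdb",  # stands in for the hf:// filesystem destination
    dataset_name="demo",
)
rows = [{"id": 1}, {"id": 2}]

pipeline.run(rows, table_name="items", write_disposition="replace")
pipeline.run(rows, table_name="items", write_disposition="append")
# append adds to the 2 rows loaded by the first run, giving 4 total
print(pipeline.dataset().table("items").arrow().num_rows)

pipeline.run(rows, table_name="items", write_disposition="replace")
# replace overwrites the table, back down to 2 rows
print(pipeline.dataset().table("items").arrow().num_rows)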

Configuration is minimal—add your credentials to .dlt/secrets.toml:

[destination.filesystem]
bucket_url = "hf://datasets/my-org"

[destination.filesystem.credentials]
hf_token = "hf_..."  # Your Hugging Face User Access Token

This effectively closes the loop: extract data from any source with dlt, transform and enrich it in Python, then publish curated datasets back to the Hub for sharing with collaborators or the community.

Benefits of dltHub for Hugging Face workflows

Python-first: pip install dlt, run everywhere, and load anywhere (20+ destinations)

At its core, dlt is just a Python library; there are no Helm charts to wrangle and no Docker Compose stacks to run. You can run dlt wherever you're already running Python and integrate it with your existing stack: Jupyter notebooks, Airflow tasks, serverless functions, and CI pipelines like GitHub Actions. This fits naturally into Hugging Face workflows, where practitioners already work in Python notebooks and scripts.

Instead of adopting an entirely new platform, dlt slots in as a lightweight ingestion layer that speaks the same language as the rest of your ML stack—while still giving you the flexibility to write to 20+ destinations out of the box. Changing where your data lands is a one-line config swap, with no refactors needed.

Compute and store embeddings alongside your data with reproducible versioned pipelines

Easily integrate and generate embeddings as part of your existing data workflow—no separate infrastructure or orchestration required. Call Hugging Face models (like sentence-transformers/all-mpnet-base-v2) inside a @dlt.resource transformation. Even simpler, use our pipeline adapters to declaratively specify which columns need to be embedded. The lancedb_adapter wraps a dlt resource and tells LanceDB which text columns should be converted into vector embeddings during loading:

from dlt.destinations.adapters import lancedb_adapter

# Wrap your resource with lancedb_adapter to enable automatic embedding generation.
# The `embed` parameter accepts a column name (or list of column names) whose text
# content will be vectorized using the model configured in .dlt/secrets.toml.
load_info = pipeline.run(
    lancedb_adapter(
        openvid_dataset(limit=10000, batch_size=100),  # Your dlt resource
        embed=["caption"],  # Column(s) to generate embeddings for
    ),
    table_name="videos",
)

The embedding model provider is configured in .dlt/secrets.toml, giving you flexibility to choose from providers like OpenAI, Cohere, Ollama, Sentence Transformers, Hugging Face, and more:

[destination.lancedb]
lance_uri = ".lancedb"
embedding_model_provider = "sentence-transformers"  # or "openai", "cohere", "ollama", "huggingface", etc.
embedding_model = "all-mpnet-base-v2"

[destination.lancedb.credentials]
embedding_model_provider_api_key = "your_api_key"  # Not needed for local providers like ollama or sentence-transformers

For self-hosted models (like Ollama), you can also specify a custom endpoint:

[destination.lancedb]
embedding_model_provider = "ollama"
embedding_model = "mxbai-embed-large"
embedding_model_provider_host = "http://localhost:11434"

Now that your Hugging Face dataset has been loaded into your destination—complete with freshly computed embeddings—you can start querying, joining, and validating your data with our Builder Stack tools that we showcased above.

And because dlt is Python-first, it's not just accessible to practitioners—it's also a natural fit for LLMs, opening the door to AI-assisted pipeline development.

LLM-native development: Build and debug pipelines with prompts and agents

Coding assistants don't know how to build data pipelines—until they're given the right tools. We built an AI workbench with structured, guided workflows for each phase of the data engineering lifecycle. Instead of generating ad-hoc code, your assistant follows a defined sequence of steps from start to finish: ingesting from a REST API, exploring loaded data, or deploying to production.

Each toolkit contains skills, commands, rules, and an MCP server that exposes your pipelines, schemas, and tables as tools the assistant can call—so it can answer questions like "what tables were loaded?" or "show me the last pipeline trace" without you having to run the investigation yourself and copy-paste output into the chat.

Get started with dlt ai init to install toolkits for your coding assistant (Claude Code, Cursor, or Codex), then kick off a workflow:

  • Ingest data by scaffolding, debugging, and validating pipelines that pull from APIs, databases, and files — "load data from the Stripe API into DuckDB"
  • Explore loaded data, run queries, and generate interactive marimo dashboards — "show me what's in the orders table"
  • Deploy pipelines to production on the dltHub platform — "deploy my pipeline to dltHub"

This iterative approach—build locally, explore and validate, then deploy—fits naturally into ML workflows where you're constantly refining what data goes into training. Hugging Face hosts hundreds of thousands of datasets—far too many to manually write pipelines for each one. AI-driven development means you can quickly scaffold pipelines for any of them without writing boilerplate from scratch. And because HF datasets come with rich metadata—dataset cards, tags, splits, and configs—an agent can use this context to generate appropriate pipeline configurations tailored to each dataset's structure.

The result: adding a new HF dataset to your pipeline becomes a conversation, not a coding task.

What's next

Verified dlt source for native Hugging Face dataset loading

While the DuckDB + hf:// protocol approach shown above is powerful, we're building something even more seamless: a verified dlt source that wraps Hugging Face's datasets library directly. This will let you load any of the 200,000+ datasets on the Hub with a single function call—no DuckDB queries required:

import dlt
from dlt.sources.huggingface import hf_dataset  # the verified Hugging Face dataset source

# Set up the dlt pipeline with the desired configuration
pipeline = dlt.pipeline(
    pipeline_name="hf_ingestion",
    destination="lancedb",  # or postgres, duckdb, bigquery, etc.
    dataset_name="openvid",
)
# Run the pipeline and ingest the dataset using the verified source
pipeline.run(
    hf_dataset("lance-format/openvid-lance", split="train")
)

The verified source will handle authentication, pagination, and schema inference automatically—while still giving you the flexibility to swap destinations with a one-line config change. This is especially valuable for ML practitioners who need to prototype with local DuckDB, experiment with LanceDB embeddings, and deploy to production warehouses—all without rewriting ingestion logic.

From prototype to production with dltHub Pro

Everything we've shown in this post—loading Hugging Face datasets, computing embeddings, validating data quality, curating with Ibis, and publishing back to the Hub—can be done today with dlt. But taking these workflows to production still means stitching together scheduling, monitoring, and deployment infrastructure on your own.

We built dltHub Pro to remove the operational overhead of standing up infrastructure and orchestrators so you can focus on what matters: building and iterating on your data pipelines. Simply put, your working local pipeline runs as-is on managed, enterprise-grade infrastructure: swap out your local DuckDB for a production data warehouse, managed lakehouse (Iceberg, DuckLake), or custom destination, and deploy without rewriting a single line of code.

This is the Builder Stack in action. Where the Modern Data Stack gave us fragmented GUIs and vendor lock-in, dltHub Pro gives us a code-first, AI-native workspace where a single developer can deliver what previously required an entire platform team. Whether you're a consultant prototyping a client connector in 20 minutes, an ML engineer curating training data across multiple sources, or a solo data practitioner who is the data team—dltHub Pro meets you where you already work: in Python, in your terminal, building alongside AI.

Sign up for early access and deploy your first pipeline in minutes.