dltHub for Frontier Labs

Your Training Data Quality Is Your Problem to Solve.

Frontier labs use dltHub to load agent traces, curate multilingual corpora, and push clean, training-ready datasets to HuggingFace — all in Python, all in CI/CD. Shift yourself left: catch data quality issues before they reach your training run.

50x faster prototyping
20 min end-to-end
Invite-only program
Trusted by Tasman Analytics

Join the Frontier Labs Program

Tell us about your model training workflow. We'll reach out to set you up.

By submitting this form, you agree to the collection and use of your personal information by dltHub in accordance with our Privacy Policy. We value your privacy and are committed to protecting your data.

"[Placeholder] Your data quality is your own problem to solve. The tools exist to shift yourself left — test your pipelines locally, catch issues before they reach production, and stop waiting for someone else to fix it."

Josh Wills · Senior Data Engineer, creator of Apache Crunch & co-author of Advanced Analytics with Spark

Every Phase of Your Training Pipeline, Accelerated

dltHub handles the unglamorous work — ingestion, schema enforcement, deduplication, incremental loads — so your team can focus on models, not plumbing.

Dataset Curation & HuggingFace

Load any source — web crawls, APIs, databases — clean and normalize in-flight, and push directly to HuggingFace Hub as training-ready datasets. Native dlt destination: no custom scripts, no ad-hoc glue.
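
A minimal sketch of that flow, loading to local DuckDB while the HuggingFace destination is in alpha; the source records, the cleaning step, and all names here are illustrative:

import dlt

@dlt.resource(table_name="corpus", write_disposition="merge", primary_key="doc_id")
def crawl_records():
    # Stand-in source: any Python iterable works -- files, API pages, DB cursors.
    yield {"doc_id": 1, "text": "  Hello world \n", "lang": "en"}

def clean(doc):
    # In-flight normalization: strip whitespace before anything hits the destination.
    doc["text"] = doc["text"].strip()
    return doc

pipeline = dlt.pipeline(
    pipeline_name="corpus_curation",
    destination="duckdb",  # swap for the HuggingFace destination once you have alpha access
    dataset_name="training_corpus",
)
print(pipeline.run(crawl_records().add_map(clean)))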

Agent Trace Ingestion for RLHF

Pull eval traces from Langfuse, Arize, or any REST API. Load as structured Python dicts, normalize nested JSON automatically, and push to HuggingFace or your warehouse for fine-tuning workflows.
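
A hedged sketch using dlt's built-in REST API source; the base URL, resource name, and secret key below are placeholders for whatever trace endpoint you use:

import dlt
from dlt.sources.rest_api import rest_api_source

traces = rest_api_source({
    "client": {
        "base_url": "https://your-observability-host.example/api/",
        "auth": {"type": "bearer", "token": dlt.secrets["traces_api_token"]},
    },
    # dlt detects common pagination styles and flattens nested JSON automatically
    "resources": ["traces"],
})

pipeline = dlt.pipeline(
    pipeline_name="agent_traces",
    destination="duckdb",  # or your warehouse; the HuggingFace destination is in alpha
    dataset_name="rlhf_traces",
)
pipeline.run(traces)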

Data Quality in CI/CD

Catch schema drift, duplicates, and bad batches before they corrupt a training run. Run integration tests locally with DuckDB — no expensive warehouse, no surprises in production. As Josh Wills put it, "shift yourself left": own your data quality, don't delegate it.
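
In practice that can be a plain pytest check in CI; the table and column names here are hypothetical:

import dlt

def test_no_duplicate_prompts():
    # Runs the full load against a local DuckDB file -- cheap enough for every PR.
    pipeline = dlt.pipeline(
        pipeline_name="quality_gate",
        destination="duckdb",
        dataset_name="staging",
    )
    batch = [{"prompt_id": 1, "text": "hi"}, {"prompt_id": 2, "text": "yo"}]  # stand-in batch
    pipeline.run(batch, table_name="prompts")
    with pipeline.sql_client() as client:
        dupes = client.execute_sql(
            "SELECT prompt_id, COUNT(*) FROM prompts GROUP BY prompt_id HAVING COUNT(*) > 1"
        )
    assert dupes == [], f"duplicate prompt_ids: {dupes}"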

Multimodal & Incremental Loads with LanceDB

Incremental ingestion means you only process new data — critical for large corpora and real-time feedback loops. Use dlt with LanceDB as a high-performance multimodal storage layer for vectors, images, audio, and structured data. dltHub handles pagination, deduplication, and state tracking so your pipelines stay lean as datasets scale to billions of tokens, and LanceDB keeps them queryable without a separate vector DB.
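
A sketch of that incremental pattern with the LanceDB destination; fetch_since and the updated_at cursor field are stand-ins for your own source:

import dlt

def fetch_since(cursor):
    # Stand-in for your API or database call; returns only rows newer than `cursor`.
    return [{"id": 1, "updated_at": "2026-02-01T00:00:00Z", "score": 0.9}]

@dlt.resource(primary_key="id", write_disposition="merge")
def feedback(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2026-01-01T00:00:00Z")
):
    # dlt persists updated_at.last_value between runs, so reruns process only new rows.
    yield from fetch_since(updated_at.last_value)

pipeline = dlt.pipeline(
    pipeline_name="multimodal_feedback",
    destination="lancedb",  # assumes `pip install "dlt[lancedb]"` and LanceDB credentials configured
    dataset_name="feedback",
)
print(pipeline.run(feedback()))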

Featured Destination Partner

LanceDB

Multimodal ELT with dltHub + LanceDB

LanceDB is the open-source multimodal vector database for AI — storing vectors, images, audio, and structured data in a single columnar format. Pair it with dlt for incremental ingestion, automatic schema normalization, and deduplication. Featured in the PyAI 2026 multimodal pipeline talk.

Explore LanceDB →

Featured Design Partner

Distil Labs

How Distil Labs Builds Task-Specific SLMs with dlt + HuggingFace

Distil Labs is a developer platform for building task-specific small language models via knowledge distillation — training SLMs with LLM-level accuracy in hours, from as few as 10 examples. Backed by AWS, NVIDIA, Google, and Project A.

Their pipeline with dlt: ingest agent traces and task data via REST API connector → normalize and deduplicate in-flight → push training-ready datasets to HuggingFace Hub → run knowledge distillation → deploy a sub-8B model matching frontier model accuracy at a fraction of the cost. dlt replaced ad-hoc ingestion scripts with a single, testable Python pipeline in CI/CD — so every training run starts from clean, versioned data.

“[Placeholder — Jacek Golebiowski quote pending] dlt gave us a clean, testable path from raw task data to HuggingFace datasets. We went from ad-hoc scripts to a production pipeline in an afternoon — and our distillation runs are cleaner for it.” — Jacek Golebiowski, CTO, Distil Labs (ex-Amazon Research)

Featured Destination Partner

Hugging Face

dltHub's Native HuggingFace Hub Destination

Push training-ready datasets directly from any dlt pipeline to HuggingFace Hub — schema-enforced, deduplicated, and version-controlled. No upload scripts. No manual wrangling. One Python pipeline from raw source to Hub dataset.

View HuggingFace Destination Docs →
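
Usage should look like any other dlt destination; the destination identifier and Hub repo id below are assumptions until the alpha docs confirm them:

import dlt

pipeline = dlt.pipeline(
    pipeline_name="to_hub",
    destination="huggingface",  # assumed alpha destination name -- confirm with your onboarding docs
    dataset_name="my-org/training-set",  # hypothetical Hub repo id
)
pipeline.run(
    [{"prompt": "What is 2+2?", "completion": "4"}],
    table_name="sft_pairs",
)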

Program Benefits

HuggingFace Destination (Alpha)

Early access to the native dlt → HuggingFace Hub destination. Push AI-ready datasets directly from any source to HuggingFace in a single pipeline, with schema enforcement and deduplication built in.

Dedicated Engineering Support

A private Slack channel with the dltHub engineering team. Your use case shapes the product roadmap. Joint go-to-market with dltHub and HuggingFace.

Prototype in 20 Minutes

From dlt init to a running pipeline in minutes. Connect to any REST API, normalize nested JSON, and push to your destination — no custom infrastructure, no boilerplate.

Why Data Quality Is Your Problem to Solve

Josh Wills at our SF meetup: stop waiting for someone else to fix data quality — shift yourself left with dlt + DuckDB + CI/CD.

Featured Talk · PyAI Conference · March 10, 2026 · San Francisco

Building Production-Ready Multimodal ELT Pipelines for Hugging Face with dlt, Lance, and Ibis

Matthaus Krzykowski · CEO & Co-founder, dltHub

AI agents can generate pipeline scripts in seconds — but generating a script isn't the same as running one in production. This talk demos a production-ready multimodal ELT pipeline using dlt for standardized extraction, LanceDB as a high-performance multimodal storage layer, Ibis for complex preprocessing, and dltHub data quality checks — with automated publishing to Hugging Face.

View Talk Details →
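
A rough sketch of the Ibis preprocessing step from the talk, assuming a dlt pipeline already loaded a traces table into a local DuckDB file (file, schema, and column names are illustrative):

import ibis

con = ibis.duckdb.connect("agent_traces.duckdb")
traces = con.table("traces")  # pass database="rlhf_traces" if dlt loaded into a named schema
clean = (
    traces.filter(traces.text.length() > 0)  # drop empty records
    .distinct()                               # cheap dedup before publishing
)
df = clean.execute()  # materialize for the Hugging Face publish step
print(len(df))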

Frequently Asked Questions

What is the HuggingFace destination and when is it available?

The dlt HuggingFace destination lets you push structured, training-ready datasets directly to HuggingFace Hub from any dlt pipeline. It’s in alpha — Frontier Labs program members get early access.

What data sources can I load from?

Any REST API, database, file store, or Python object. dlt ships 70+ pre-built sources and a REST API connector that lets you prototype a new source in minutes. Common frontier lab sources: Langfuse, Arize, web crawl outputs, custom model APIs.

How does dlt help with data quality for training?

dlt enforces schema at ingestion time, auto-detects and flattens nested JSON, handles deduplication and incremental loads, and provides detailed pipeline logs. Test your full pipeline locally with DuckDB in CI/CD — catch bad batches before they reach a training run.

What happens after I apply?

We’ll review your application and reach out within a few business days. You can also book a call directly after applying. High-engagement teams get onboarded within a week with a dedicated dltHub engineer.