dltHub Blog

Your Traces Aren't Training Data Yet. Here's the Pipeline That Makes Them.

  • Alena Astrakhantseva,
    DevRel
  • Martin Bach,
    Commercial GTM Lead

If your team runs an LLM-powered agent in production, every request it handles is a test case you never had to write. That trace data describes your problem space better than any hand-crafted dataset: the vocabulary your users actually use, the edge cases that show up in real traffic, the real distribution of requests.

This article shows how to build a data pipeline with dlt that extracts and normalizes those traces from wherever they live, lands them as a versioned Parquet dataset on Hugging Face, and hands them off to Distil Labs to train a compact expert model, one that outperforms the general-purpose LLM you're running today, at a fraction of the cost.

The two problems that block most fine-tuning projects

The first problem is access. Production traces are scattered across databases, log aggregators, and cloud storage buckets in incompatible formats. Getting them into a single clean, structured dataset requires real extraction and normalization work, exactly what dlt was built for.

The second problem is that raw traces are not training data. They're noisy: imbalanced class distributions, malformed outputs from bad inference runs, responses that were technically logged but factually wrong. To turn them into a working model you still need to curate seed examples, generate synthetic training data at scale, fine-tune, evaluate, and deploy.

dlt solves the first problem. It connects to any data source (Postgres, S3, BigQuery, REST APIs, log aggregators) and delivers clean, structured traces to Hugging Face as a destination in a consistent format, regardless of where they originated. Distil Labs solves the second: it takes those traces as domain context, uses them to steer a large teacher model's synthetic data generation, and produces a fine-tuned specialist ready for deployment.

How the pipeline works: dlt → Hugging Face → Distil Labs

Three tools, each doing one job. The handoff between them is clean and the contract is simple: structured Parquet on Hugging Face.

dlt: extract, normalize, load to Hugging Face

dlt connects to your production data store (any database, cloud storage bucket, API, or log aggregator) and writes cleaned, structured traces to Hugging Face as a versioned Parquet dataset. You write a standard dlt pipeline. The destination is Hugging Face. Everything downstream stays the same regardless of where your data lives. Swap the source connector and the rest of the pipeline doesn't change.

This is the key dlt pattern here: you can load data from any of dlt's verified sources (Postgres, Snowflake, S3, BigQuery, local files, REST APIs) and land it directly on Hugging Face. The source connector is the only thing that changes between projects; the transformation logic and the HF destination remain the same. That means you get a reusable, production-grade pipeline template for any trace extraction job.
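A minimal sketch of that pattern, assuming the newly announced Hugging Face destination is registered under the name "huggingface" (check the dlt docs for the exact destination name and credential setup). The `iot_traces` resource below is a placeholder; in a real project you would swap it for one of dlt's verified sources without touching the pipeline wiring:

```python
import dlt

# Placeholder resource: swap this for any verified source
# (sql_database, filesystem, rest_api, ...) without changing the rest.
@dlt.resource(name="iot_traces", write_disposition="append")
def iot_traces():
    # A real pipeline would yield rows read from your trace store.
    yield {
        "utterance": "turn on the kitchen lights",
        "intent": "iot_hue_lighton",
        "slots": {"house_place": "kitchen"},
    }

pipeline = dlt.pipeline(
    pipeline_name="trace_extraction",
    destination="huggingface",  # assumption: destination name per the announcement
    dataset_name="iot_conversation_traces",
)

if __name__ == "__main__":
    load_info = pipeline.run(iot_traces)
    print(load_info)
```

Because the destination and dataset name live in the pipeline definition, only the resource changes between projects, which is what makes the template reusable.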

Hugging Face: the shared hub

Hugging Face acts as the handoff point between the data pipeline and the training step. The cleaned trace dataset lands there after the dlt step, versioned and immediately accessible. The trained model gets published back there after training. Both artifacts, dataset and model, are available to the rest of your stack via the HF API.

Hugging Face and dltHub just announced the official dlt Hugging Face destination.

Distil Labs: from traces to a deployed specialist

Distil Labs reads the trace dataset from Hugging Face and uses it as domain context. The important thing: it's not training on your raw traces directly. It feeds them to a large teacher model as context, so the synthetic training data it generates reflects your vocabulary, your function schemas, and your users' phrasing patterns, not just the model's generic priors. The student is then fine-tuned on that synthetic dataset and published back to Hugging Face.

The dlt pipeline in detail

The core pattern is straightforward and reusable across any trace extraction project.

In our walkthrough we used the Amazon MASSIVE dataset as a stand-in for production traffic: 16,000+ natural-language utterances across 60 intents. We filtered to an IoT scenario (commands like "turn on the kitchen lights" or "make me a coffee at 7am") covering 9 functions. But the pipeline pattern is the same whether your traces come from a Postgres database, an S3 bucket full of JSONL files, or a REST API.

The dlt pipeline does three things:

  1. Connects to the source and filters to the relevant traces (in our case, IoT-scenario utterances)
  2. Formats each record as an OpenAI function-calling conversation trace, a structured (input, output) pair that downstream tools can consume
  3. Loads the result to Hugging Face as a versioned Parquet dataset using dlt's Hugging Face destination
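Step 2, the per-record formatting, might look like the sketch below. The field names (`utterance`, `intent`, `slots`) are assumptions about the source schema, not the actual walkthrough code:

```python
import json

def to_function_call_trace(record: dict) -> dict:
    """Format one raw trace record as an OpenAI-style function-calling
    conversation: a user message in, an assistant tool call out."""
    return {
        "messages": [
            {"role": "user", "content": record["utterance"]},
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "type": "function",
                    "function": {
                        "name": record["intent"],
                        # OpenAI tool calls carry arguments as a JSON string
                        "arguments": json.dumps(record.get("slots", {})),
                    },
                }],
            },
        ]
    }

trace = to_function_call_trace({
    "utterance": "turn on the kitchen lights",
    "intent": "iot_hue_lighton",
    "slots": {"house_place": "kitchen"},
})
```

Each record becomes a structured (input, output) pair that any OpenAI-compatible fine-tuning or evaluation tool can consume directly.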

The result in our walkthrough: 1,107 IoT conversation traces, versioned and accessible on Hugging Face, ready for the next stage.

The Hugging Face destination in dlt handles Parquet serialization, dataset versioning, and upload in one step. No manual conversion or HF API calls required. Check the dlt docs for the full list of verified sources you can connect to.

From traces to training data to deployed model

Once the traces are on Hugging Face, Distil Labs takes over in two stages:

Seed curation

An LLM judge scores each trace on inference clarity and utterance coherence, keeping only perfect-scoring examples as seeds. The remaining traces go into an unstructured context file for synthetic data generation. This step prevents data leakage by excluding sampled seed examples from the context set.
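The split logic can be sketched as follows. The scoring scale and criteria names are assumptions, and a stub stands in for the real LLM judge:

```python
def split_seeds_and_context(traces, judge):
    """Keep only perfect-scoring traces as seeds; everything else
    becomes unstructured context for synthetic data generation.
    Seeds never enter the context set, which prevents leakage."""
    seeds, context = [], []
    for trace in traces:
        scores = judge(trace)  # e.g. {"clarity": 5, "coherence": 4}
        if all(score == 5 for score in scores.values()):
            seeds.append(trace)
        else:
            context.append(trace)
    return seeds, context

# Stub judge for illustration only; real scores come from an LLM.
def stub_judge(trace):
    return {"clarity": 5 if "lights" in trace else 3, "coherence": 5}

seeds, context = split_seeds_and_context(
    ["turn on the kitchen lights", "make me a coffee at 7am"],
    stub_judge,
)
```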

Training

The Distil CLI uploads the seed data and unstructured traces. A teacher model (GPT-OSS-120B) reads the traces as domain context and generates ~10,000 synthetic training examples, each validated and filtered. A student model (Qwen3-0.6B) is fine-tuned on the result. Training completes in under 12 hours. The trained model is published straight back to Hugging Face.

Results

Distil Labs trained a Qwen3-0.6B student distilled from an openai.gpt-oss-120B teacher, using traces extracted by dlt, stored on Hugging Face, and used by Distil Labs as domain context for synthetic data generation.

| Model | Tool Call Equivalence (↑) | Parameters |
|---|---|---|
| Teacher (openai.gpt-oss-120B) | 50.6% | 120B |
| Base student (Qwen3-0.6B, no fine-tuning) | 9.6% | 0.6B |
| Tuned student (Qwen3-0.6B, Distil Labs) | 78.3% | 0.6B |

The tuned 0.6B model beats the 120B teacher by 28 points on exact structured match. The teacher scores lower because it's a general-purpose model that has never seen your specific function schemas or phrasing patterns. The student, trained on synthetic data grounded in real traffic extracted by dlt, is a specialist in exactly this task.

The efficiency gains stack on top of that accuracy advantage: 200x smaller than the teacher model, under 50ms local inference vs. 400-700ms cloud API calls, and zero manual annotation; the LLM judge handled seed curation automatically.

The model achieves 78.3% exact match, which means roughly 1 in 5 queries may need a fallback step. For production deployments, consider adding a confidence threshold and routing low-confidence predictions to a larger model.
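One way to sketch that fallback routing. The 0.8 threshold and the model interfaces here are illustrative assumptions, not part of the Distil Labs API:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumption: tune on a held-out set

def route(query, small_model, large_model):
    """Serve the tuned specialist when it is confident;
    route low-confidence queries to a larger fallback model."""
    call, confidence = small_model(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return call, "specialist"
    return large_model(query), "fallback"

# Stub models for illustration only.
small = lambda q: (("iot_hue_lighton", {}), 0.95 if "lights" in q else 0.4)
large = lambda q: ("generic_answer", {})

result_hi, source_hi = route("turn on the kitchen lights", small, large)
result_lo, source_lo = route("make me a coffee at 7am", small, large)
```

With this shape, the cheap local model handles the ~80% of traffic it is confident about, and only the remainder pays the latency and cost of a cloud call.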

What comes next

The pattern here (extract traces, clean them, land them on Hugging Face, train a specialist) is the first step toward a continuous optimization loop for small models inside your agents.

Here’s what you can build on top of it:

1. Load traces from real observability providers

Use dlt to pull traces from:

  • LLM observability providers (e.g., Langfuse, Arize, Snowflake Cortex) via their REST APIs
  • OTel‑based sources like Dash0

All of these can be normalized and landed on Hugging Face using the same destination.
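For a REST-based observability provider, dlt's declarative REST API source handles authentication and pagination from a config dict. A sketch under assumed endpoint and auth details (the base URL, path, and secret key are placeholders, not a real provider's API):

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Hypothetical observability API; endpoint path and auth are placeholders.
traces_source = rest_api_source({
    "client": {
        "base_url": "https://api.example-observability.com/v1/",
        "auth": {"token": dlt.secrets["observability_api_token"]},
    },
    "resources": [
        {
            "name": "traces",
            "endpoint": {"path": "traces", "params": {"limit": 100}},
        },
    ],
})

pipeline = dlt.pipeline(
    pipeline_name="observability_traces",
    destination="huggingface",  # assumption: destination name per the announcement
    dataset_name="agent_traces",
)

if __name__ == "__main__":
    pipeline.run(traces_source)
</imports and the rest of the pipeline are unchanged from the walkthrough.
```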

2. Build the pipeline from a conversation

dltHub’s REST API Toolkit lets you bootstrap a complete dlt pipeline by describing your source API to an LLM (e.g., Claude or Codex):

  • Authentication
  • Pagination
  • Incremental loading

You can go from an observability provider’s API docs to a running trace extraction pipeline in a single session, without boilerplate or manual doc spelunking.

3. Inspect traces before training

dltHub’s Dataset Browser gives you visibility into the extracted data inside the pipeline:

  • Column mismatches
  • Unexpected nesting
  • Schema anomalies

This is especially important when AI generates your extraction code: you can inspect what actually came back before trusting it in a training job.

4. Production observability for trace pipelines

Running a trace extraction pipeline once is a demo. Running it on a schedule against live traffic is production.

With dltHub’s native logging you get:

  • Real‑time progress per resource
  • Parallelization visibility
  • Detailed traces in CI/CD

This lets you diagnose failures in AI‑generated pipelines before they corrupt a training run.

5. Close the loop: continuous fine‑tuning from live traffic

The long‑term goal is a scheduled feedback loop:

  1. Incremental dlt loads pull fresh traces from production.
  2. Distil Labs runs automated training jobs using the new traces as context.
  3. New model versions are published to Hugging Face.
  4. Your agents deploy updated specialists that track evolving traffic patterns.

Over time, your small models keep improving as your agent sees new use cases and edge cases.

Try it yourself

Your production LLM agent has been describing your problem space in detail every time it handles a request. dlt makes that data accessible from wherever it lives and lands it on Hugging Face in a format that's ready for downstream consumption. Distil Labs turns it into a specialist model that outperforms the general-purpose system you're already running.

The full pipeline is open source. Clone the demo repository, swap in your own dlt source connector, and run the same stages against your own traces. The dlt part is fully reusable: if you can write a dlt pipeline to your data source, you can land those traces on Hugging Face and kick off training.

Join the dltHub Pro design partnership
