
dltHub AI Workbench: Ontology driven data modelling toolkit preview

  • Hiba Jamal,
    Junior Data & AI Manager

Every data team has a version of this horror story. Yours might star different systems, but the plot is always the same: three sources, zero shared keys, and a stakeholder who needs answers yesterday.

We built the AI Workbench transformation toolkit to kill that story at its root. Not with more generated code, but with a better, simpler way to think about your data before you even touch any tables.

The Three-Headed Customer

Here's the setup. You have a Slack export, an Event database, and a HubSpot instance. Three systems, three worldviews, zero overlap in naming. Then the VP of Growth walks over and asks:

"Which Slack members who joined in Q1 became 'Qualified Leads' after attending a couple of our events?"

You open the schemas and the nightmare begins.

Slack has "Members." The Event App has "Guests." HubSpot has "Contacts." Same people, different masks. The fields don't line up. The IDs don't match. The timestamps are in different formats because of course they are.

And then, when you're merging it all, how do you actually do it? Which record takes precedence? Matched on what ID? And for the remaining fields, what's your coalesce logic?

So you do what everyone does: you hammer out a SQL script with fuzzy joins and nested CTEs. You match on email where you can, fall back to name-matching where you can't, and pray that nobody at your company has a common name. It works today. The VP is happy today.
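That ad-hoc merge logic usually boils down to something like the following minimal Python sketch. The source names, field names, and precedence order (HubSpot over Event App over Slack) are assumptions for illustration, not anything the toolkit produces:

```python
# Illustrative sketch of the ad-hoc merge described above.
# Sources, fields, and precedence order are assumptions for this example.

def normalize_name(name: str) -> str:
    """Crude name normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def merge_people(slack_members, event_guests, hubspot_contacts):
    """Merge three source lists into one dict keyed by email,
    falling back to normalized name where email is missing.
    Conflicting fields resolve by source precedence:
    HubSpot > Event App > Slack (an assumption)."""
    merged = {}
    # Lower-precedence sources first, so higher-precedence ones overwrite.
    for source in (slack_members, event_guests, hubspot_contacts):
        for record in source:
            key = record.get("email") or normalize_name(record.get("name", ""))
            if not key:
                continue  # no usable identity at all; silently dropped (the scary part)
            merged.setdefault(key, {}).update(
                {k: v for k, v in record.items() if v}
            )
    return merged
```

The fragility is visible right in the sketch: the precedence order, the name normalization, and the silent drop of unmatchable records are all undocumented decisions that live only in this one script.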

But the moment they ask for one more interaction-based metric, say, "how many of those leads also joined our onboarding slack channel?", you're back at the drawing board, duct-taping another fragile query onto the pile. Each new question is a new expedition into the same uncharted swamp.

This is the horror story of working without a canonical data model. It's not a single catastrophic failure, but a slow death by a thousand bespoke ad-hoc queries whose numbers will never match.

Why "Just Use an LLM" Fails

At this point someone on the team says: "Can't we just throw this at an LLM?"

Fair question. LLMs are good at understanding schemas. They can read column names, infer relationships, even generate decent SQL. But how are they supposed to know the business logic of how you use the three apps to manage customers? How are they supposed to know what a customer means to you? And if that logic is highly complex and specific, how are they supposed to come up with it?

The main problems you will stumble into

  • Identity - What are these contacts, members, and guests? Are they all customers? Leads? Or maybe they're marketplace providers, B2B contacts, or something else entirely.
  • Intent - What kind of problems are you looking to solve with this data? Without a "problem space," we cannot formulate the actors and the actions we need to represent.

The what and the how: Taxonomy and Ontology

The “bridge of intent” connects the raw data to intended business usage of the data. Before writing a single line of transformation code, you need two layers of structure. Think of them as giving the AI a map and a mission briefing.

The Process

The Taxonomy — what things ARE. This is where you define the fundamental entities in your world. You explicitly declare that a "Guest," a "Contact," and a "Member" are all just a Person. An "Event RSVP," a "HubSpot Activity," a "Slack message," a "Slack channel join," and a "Slack emoji reaction" are all types of Interaction. This collapses hundreds of confusing, system-specific columns into a handful of canonical concepts, reducing follow-up query complexity from "data engineering rocket science" to "even the VP of Growth can do it in a pivot table".
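A taxonomy like this can be expressed as a plain mapping from source-specific entities to canonical concepts. The names below are assumptions drawn from the example, not the toolkit's actual format:

```python
# Hypothetical taxonomy: (source system, entity) -> canonical concept.
# Names are illustrative assumptions based on the example in the text.
TAXONOMY = {
    # All of these are just a Person.
    ("slack", "member"): "Person",
    ("event_app", "guest"): "Person",
    ("hubspot", "contact"): "Person",
    # All of these are types of Interaction.
    ("event_app", "rsvp"): "Interaction",
    ("hubspot", "activity"): "Interaction",
    ("slack", "message"): "Interaction",
    ("slack", "channel_join"): "Interaction",
    ("slack", "emoji_reaction"): "Interaction",
}

def canonical_concept(source: str, entity: str) -> str:
    """Resolve a source-specific entity to its canonical concept."""
    return TAXONOMY.get((source, entity), "Unknown")
```

The point is the shape, not the implementation: three systems' worth of vocabulary collapses into two canonical concepts.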

The Business Ontology — how things BEHAVE. This is the brain. It's where your business rules and goals actually live. The ontology encodes the logic: a Guest is a Person who interacted with an Event. A Qualified Lead is a Person who hit a specific engagement threshold in HubSpot. It maps the journey from Slack Member to HubSpot Lead based on your specific use cases. This ensures the LLM understands how your source systems join together into a single table, or how various columns, events or subqueries end up creating an activity table.
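The ontology's rules can be sketched as predicates over those canonical concepts. The engagement threshold and field names here are invented for illustration; your actual ontology would encode your own definitions:

```python
# Hypothetical ontology rules over canonical Person / Interaction records.
# The threshold and field names are illustrative assumptions.
QUALIFIED_LEAD_THRESHOLD = 3  # e.g. a minimum HubSpot engagement score

def is_guest(person: dict, interactions: list) -> bool:
    """A Guest is a Person who interacted with an Event."""
    return any(
        i["kind"] == "event_rsvp" and i["person_id"] == person["id"]
        for i in interactions
    )

def is_qualified_lead(person: dict) -> bool:
    """A Qualified Lead is a Person who hit a specific
    engagement threshold in HubSpot."""
    return person.get("hubspot_engagement_score", 0) >= QUALIFIED_LEAD_THRESHOLD
```

Written this way, "Guest" and "Qualified Lead" stop being tribal knowledge and become definitions an LLM (or a new hire) can execute against.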

Together, the taxonomy and ontology form a structured intent layer. The taxonomy tells the system what exists. The ontology tells it what matters and how things connect. When an LLM operates downstream of this layer, it's executing against a well-defined model of your business instead of making stuff up.

From Ontology to a Canonical Data Model

With the ontology locked in, the toolkit does the thing that normally takes weeks of meetings, whiteboard sessions, and heated Slack threads about whether to call it user_id or person_id: it generates a Canonical Data Model.

The CDM is a technology-neutral common language in which all your sources finally speak the same tongue: standardized naming — "Person," not "User" or "Guest" or "Contact." It's a system-neutral representation of your business entities, without the "star schema" denormalization.
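A canonical model along these lines could be sketched with plain dataclasses. These type and field names are an assumed illustration of the idea, not what the toolkit actually emits:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical canonical data model: system-neutral entities,
# not a denormalized star schema. Field names are illustrative assumptions.

@dataclass
class Person:
    person_id: str
    email: str
    full_name: str
    # Which source systems this person was resolved from,
    # e.g. {"slack": "U123", "hubspot": "987"}.
    source_ids: dict = field(default_factory=dict)

@dataclass
class Interaction:
    interaction_id: str
    person_id: str       # references Person.person_id
    kind: str            # e.g. "event_rsvp", "slack_channel_join"
    occurred_at: datetime
```

Note what the model keeps that a star schema would flatten away: the entity boundaries, the source lineage, and the explicit relationship between a Person and their Interactions.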

Why not a star schema? Star schemas, through pre-joining and denormalisation, reduce the amount of semantic meaning in the data. That makes them a poor "definition layer" and better suited as a "performance and access layer" downstream of the canonical data model.

This is what the dlt AI Workbench transformation toolkit produces. You feed it your sources and your use cases. The toolkit annotates your source tables, builds the ontology, and generates a data model that captures the meaning of your data.

Why This Matters: Definition First, Code as Consequence

Here's the thing most data teams get backwards: they start with code and try to reverse-engineer meaning from it. Write the SQL, build the dashboard, then months later attempt to document what any of it actually represents. The ontology-first approach flips that entirely. You start with definition — precise, structured, reusable definition — and the code becomes a consequence.

This isn't just philosophically cleaner. It's dramatically faster and higher quality.

This means something that was previously impractical becomes feasible: starting with a high-quality data model from day one. Not the "let's ship something and figure it out later" model that never gets cleaned up. Not the "good enough for now" star schema that becomes load-bearing infrastructure before anyone notices. An actual, well-thought-out canonical model, generated fast enough that you don't have to choose between quality and shipping speed.

But the real payoff goes beyond the initial modeling sprint. The ontology is a thinking substrate.

Slime molds take graph-like shapes when nutrition is provided as nodes, similar to how an LLM uses a graph to think.

The tribal knowledge about which field maps to what, which used to live in one engineer's head, is now documented for everyone. For agents and humans alike, retrieving the data and understanding how it relates suddenly becomes possible.

Once you have a precise, machine-readable definition of your business domain — entities, relationships, rules, goals — that ontology becomes a thinking substrate for everything downstream. An agent that needs to answer the VP's next question doesn't start from raw schemas. It starts from a structured representation of what your business actually cares about. It knows that Members, Guests, and Contacts are People. It knows what a Qualified Lead means in your context. It can reason about new questions against a foundation that already encodes the hard-won knowledge of your domain.
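To make that concrete: against the canonical model, the VP's original question becomes a short, readable query instead of a fuzzy-join expedition. This is a sketch over hypothetical, already-resolved canonical records; the field names, the attendance threshold, and the Q1 date range are all assumptions for illustration:

```python
from datetime import datetime

def slack_members_turned_qualified(people, interactions, min_events=2):
    """Which Slack members who joined in Q1 became Qualified Leads
    after attending at least `min_events` events?
    Operates on hypothetical canonical records; field names are
    illustrative assumptions, not the toolkit's actual schema."""
    q1_start, q1_end = datetime(2024, 1, 1), datetime(2024, 4, 1)
    result = []
    for p in people:
        joined = p.get("slack_joined_at")
        if not (joined and q1_start <= joined < q1_end):
            continue  # not a Q1 Slack joiner
        attended = sum(
            1 for i in interactions
            if i["person_id"] == p["person_id"] and i["kind"] == "event_attendance"
        )
        if attended >= min_events and p.get("lifecycle_stage") == "Qualified Lead":
            result.append(p["person_id"])
    return result
```

No coalesce logic, no identity matching, no precedence rules: all of that was settled once, upstream, in the ontology and the CDM.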

Is this like a semantic layer? No. A semantic layer only describes how to retrieve "revenue" from your data, not what the data means. It's the difference between someone who can build dashboards without understanding what the data means, and an actual analyst.

With the ontology and the CDM, you've now solved the AI data literacy problem: you mapped how your business world relates to the data, and captured the definition for reuse.

Try It

The transformation toolkit is part of the dltHub AI Workbench — an open collection of toolkits that plug into your AI coding assistant (Claude Code, Cursor, or Codex) and give it the skills to build, explore, and transform dlt pipelines.

To get started, install dlt with hub support and initialize the workbench:

uv pip install "dlt[hub]"
uv run dlt ai init
uv run dlt ai toolkit transformations install

Or if you're already in a Claude Code session:

/plugin marketplace add dlt-hub/dlthub-ai-workbench 
/plugin install transformations@dlthub-ai-workbench --scope project

Then ask your assistant to annotate-sources — that's the entry point. It'll walk you through your existing pipelines, map your schemas to canonical concepts, and kick off the ontology → CDM workflow from there.

The full workbench includes toolkits for REST API ingestion, data exploration, and production deployment too — so you can go from raw API to deployed, well-modeled pipeline without leaving your editor.

The ontology-driven data modelling toolkit is part of the dltHub AI Workbench, available in dltHub Pro, due for release in Q2 and currently in the design partnership stage. It's already being leveraged by commercial data engineering agencies that benefit from standardisation and acceleration. If you're interested, apply for the design partnership!