The LLM got the right answer for the wrong reason
Schema alone scored 3/10. An ontology scored 10/10. A benchmark across two datasets showing exactly where the gap is, including cases where the model gets the right answer for the wrong reason.
Roshni Melwani, Working Student
I ran a benchmark to test whether an ontology improves how well an LLM answers questions about business data.
On a proprietary SaaS dataset, an LLM with just the data and a schema scored 3/10.
The same model scored 10/10 when it had an ontology too.
An ontology is the layer of business rules that tells the model what your data means. Schema alone wasn't enough.
If the concept is new, this post covers what an ontology is in a data engineering context.
The first experiment
The first dataset was FDA adverse event data: public, well-structured, a good starting point.
I asked the LLM the same 10 questions twice: once with the ontology in context, and once without.
Both scored 10/10. That made me suspicious because in other projects where I’d added an ontology, it had changed the answers noticeably.
A public dataset scoring perfectly without one didn’t seem right. Either the questions weren’t hard enough, or something in the setup was off.
The 10/10 was a tell
So I ran the same questions with no data attached at all. The model still answered correctly: it had been trained on FDA data and didn't need mine. That's the more dangerous failure. If the model is answering from training rather than from your data, the output won't tell you, and you won't notice when the two diverge.
Then I looked more carefully at `CDM.dbml`. The column notes had business rules written directly into them: the FDA definition of "serious" was in the note for `serious_event_count`. The model wasn't getting data and a bare schema. It was getting the ontology, hidden inside `CDM.dbml`.
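The no-data check can be made routine by scoring a no-data run next to the normal run and flagging every question that passes in both. A minimal sketch, assuming per-question pass/fail results are already available; the function name, question ids, and result values below are hypothetical and not taken from the benchmark repo.

```python
def answered_from_memory(with_data: dict[str, bool], no_data: dict[str, bool]) -> list[str]:
    """Return question ids that pass both with data attached and with no data at all.

    A pass in the no-data run means the model did not need your data to answer,
    so the matching with-data pass says nothing about grounding.
    """
    return sorted(q for q, ok in with_data.items() if ok and no_data.get(q, False))


# Hypothetical results, for illustration only.
with_data = {"q1": True, "q2": True, "q3": False}
no_data = {"q1": True, "q2": False, "q3": False}
print(answered_from_memory(with_data, no_data))  # ['q1']
```

Any id this returns is a question the benchmark cannot use as evidence that the model read the data.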
Stripping the schema bare
So I redesigned the test. I stripped `CDM.dbml` down to just column names and types, with no annotations, and saved it as `CDM_bare.dbml`. I also needed data the model hadn't been trained on: FDA data is public, but internal business logic isn't. A synthetic SaaS churn dataset fit, because every rule in it exists only inside that company's system.
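Stripping a DBML file down to names and types can be scripted. The sketch below is one possible approach, not the script used for the benchmark. It assumes column settings sit in a single trailing `[...]` block and table notes fit on one line, so multi-line notes or notes containing `]` would need a real DBML parser. The table and column names in the example are hypothetical.

```python
import re

# Trailing [ ... ] settings block on a column line (pk, not null, note: '...').
SETTINGS = re.compile(r"\s*\[[^\]]*\]\s*$")
# Single-line table-level note such as  Note: 'text'
TABLE_NOTE = re.compile(r"^\s*note\s*:", re.IGNORECASE)


def strip_dbml(dbml: str) -> str:
    """Drop column settings and single-line table notes, keeping names and types."""
    kept = []
    for line in dbml.splitlines():
        if TABLE_NOTE.match(line):
            continue  # remove the whole table-level note line
        kept.append(SETTINGS.sub("", line))
    return "\n".join(kept)


# Hypothetical annotated schema, for illustration only.
annotated = """Table accounts {
  account_id int [pk]
  status varchar [note: 'hypothetical business rule text']
  Note: 'hypothetical table note'
}"""
print(strip_dbml(annotated))
```

Because this removes every setting, it also drops `pk` and `not null` markers, which matches the "column names and types only" goal.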
Three conditions, run across both datasets:
- A - raw tables only.
- B - data plus `CDM_bare.dbml`.
- C - data, `CDM_bare.dbml`, and `ontology.md`.
Claude Sonnet (claude-sonnet-4-6) via the Anthropic API. 10 questions per dataset per condition, 60 API calls total.
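The run matrix is small enough to write out. This sketch only enumerates the planned runs and makes no API calls. The dataset labels and the `plan_runs` name are hypothetical, and the context lists simply restate the three conditions above.

```python
from itertools import product

# Which artifacts go into the prompt for each condition.
CONDITIONS = {
    "A": ["data"],
    "B": ["data", "CDM_bare.dbml"],
    "C": ["data", "CDM_bare.dbml", "ontology.md"],
}
DATASETS = ["saas_churn", "openfda"]  # hypothetical labels
QUESTIONS_PER_DATASET = 10


def plan_runs() -> list[dict]:
    """Enumerate one run per (dataset, condition, question number)."""
    return [
        {"dataset": d, "condition": c, "question": q, "context": CONDITIONS[c]}
        for d, c, q in product(DATASETS, CONDITIONS, range(1, QUESTIONS_PER_DATASET + 1))
    ]


runs = plan_runs()
print(len(runs))  # 60
```

Keeping the question fixed and varying only the context list is what makes the A, B, and C scores comparable.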
The results
The full model responses are in responses.md in the repo.
SaaS churn, A = 3/10 · B = 3/10 · C = 10/10
A and B scored identically. Column names and types tell the model what fields exist and not what they mean.
| Question | A | B | C |
|---|---|---|---|
| sq02, How many seats at-risk? | ✅ | ✅ | ✅ |
| sq04, 55% at-risk seats = churned? | ⚠️ | ⚠️ | ✅ |
| sq08, Resubscribe after 45 days = recovered or new? | ❌ | ❌ | ✅ |
sq02 is a count.
It’s in the data, so all three read it.
sq04 is the quiet failure.
A and B said no, which was correct, but they reasoned from `status=active` with no knowledge the 60% threshold exists. Ask the same question about 65% and they’d still say no, for the same reason. C said no because 55% is below 60%. That reason generalises; theirs doesn’t.
sq08 is the contradiction case.
A and B assumed any resubscription is a recovery, because that’s the reasonable prior. The actual rule: after 30 days it resets to a new customer. The rule contradicts the prior, so only condition C got it.
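Written as code, the two rules behind sq04 and sq08 are tiny, which is the point: they are easy to state and impossible to guess. A minimal sketch, assuming the 60% threshold is inclusive and that exactly 30 days still counts as a recovery. The post does not state either boundary, and the function names and the "days since cancellation" framing are hypothetical.

```python
CHURN_THRESHOLD = 0.60       # at-risk seat share threshold named in sq04
RECOVERY_WINDOW_DAYS = 30    # after this, a resubscription is a new customer


def is_churned(at_risk_share: float) -> bool:
    # Assumption: the threshold is inclusive.
    return at_risk_share >= CHURN_THRESHOLD


def resubscription_type(days_since_cancel: int) -> str:
    # Assumption: day 30 itself still counts as a recovery.
    return "recovered" if days_since_cancel <= RECOVERY_WINDOW_DAYS else "new"


print(is_churned(0.55))         # False (sq04)
print(is_churned(0.65))         # True (the variation A and B would still answer "no" to)
print(resubscription_type(45))  # new (sq08)
```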
The full 10-question table and all model responses are in the repo.
OpenFDA drugs, A = 8/10 · B = 9/10 · C = 10/10
The scores look close. And that is the trap.
| Question | A | B | C |
|---|---|---|---|
| fq01, Ozempic serious adverse events | ✅ | ✅ | ✅ |
| fq04, Indications for Tylenol | ❌ | ✅ | ✅ |
| fq08, Which field determines where warnings come from? | ❌ | ❌ | ✅ |
On fq01, condition A answered from memory. It would have returned 5,186 whether the data was attached or not. It wasn't reading the pipeline; it was reciting training knowledge.
fq04 and fq08 are the only questions that needed pipeline-specific knowledge. On fq04, the label in the data was for an IV formulation, not the OTC product. A missed it entirely.
On fq08, the pipeline runs `COALESCE(warnings_and_cautions, warnings)`, and that logic lived only in `ontology.md`. A and B both said no such field exists, even though it does.
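For readers less familiar with SQL, `COALESCE` returns its first non-NULL argument. A rough Python equivalent of the fq08 logic is sketched below. The two field names come from the expression above; the helper names and the sample label dictionaries are hypothetical.

```python
def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def warnings_text(label: dict):
    # Mirrors COALESCE(warnings_and_cautions, warnings): prefer the first field,
    # fall back to the second only when the first is missing.
    return coalesce(label.get("warnings_and_cautions"), label.get("warnings"))


# Hypothetical label records, for illustration only.
print(warnings_text({"warnings_and_cautions": "text A", "warnings": "text B"}))  # text A
print(warnings_text({"warnings": "text B"}))                                     # text B
print(warnings_text({}))                                                         # None
```

Nothing in a bare schema says which of the two columns wins, which is why the answer had to come from the ontology.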
The gap hides in well-known domains
On SaaS, condition A is wrong on 7 of 10 and you can see it.
On FDA, it sounds grounded on almost everything, but for most of those questions it would give the same answer whether your data was there or not. If your data ever diverged from what the model learned in training, the output would never tell you.
Three SaaS questions are marked ⚠️ instead of ❌. A and B gave the right surface answer but for the wrong reason, reasoning from `status=active` with no knowledge the actual thresholds exist.
A model that gets the right answer accidentally will fail silently on edge cases, and there’s no way to spot it from the output alone.
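The output of a single question can't reveal this, but a set of variations can. One way to probe for it is to ask the same question on both sides of a known threshold and check whether the answer flips where the rule says it should. A minimal sketch: `ungrounded_model` is a stand-in for a model that always reasons from `status=active`, the 60% rule is assumed to be inclusive, and all names are hypothetical.

```python
from typing import Callable


def probe_threshold(
    answer: Callable[[float], bool],
    rule: Callable[[float], bool],
    probes: list[float],
) -> list[float]:
    """Return the probe values where the model's answer disagrees with the rule."""
    return [p for p in probes if answer(p) != rule(p)]


def ungrounded_model(at_risk_share: float) -> bool:
    # Stand-in: always says "not churned", regardless of the input.
    return False


def churn_rule(at_risk_share: float) -> bool:
    return at_risk_share >= 0.60  # assumption: inclusive threshold


print(probe_threshold(ungrounded_model, churn_rule, [0.55, 0.65]))  # [0.65]
```

An empty list means the answers track the rule across the boundary; any value returned is a variation where the "right answer" would have broken.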
Schema tells the model what exists, not what it means
The gap is biggest where your rules contradict what the model already knows. The SaaS rules were designed to do exactly that: every one contradicts a reasonable prior. In a public domain like FDA, the model's priors are mostly right, so the gap is smaller.
A correct answer doesn’t mean the model is grounded in your data. In well-known domains especially, it’s worth checking whether the model is reasoning from your data or from training knowledge. They look identical in the output.
The right answer for the wrong reason is a reliability risk, not a pass. It will hold up on the question as asked and fail on the next variation.
The ontology came out of the modeling, not on top of it
The `ontology.md` wasn’t a separate deliverable. It came out of the modeling workflow in the AI Workbench.
When the Workbench builds a CDM, it asks clarifying questions about your domain and captures the answers as structured business rules.
Try it:
- Full benchmark, 60 model responses, the SaaS ontology, the bare schema: repo
- Build the ontology for your own data: AI Workbench