
Debugging Our Docs RAG, Part 2: Testing New Generation Models

  • Aashish Nair, Working Student, Data & AI

In Part 1, we established a baseline. Using a small evaluation set built from real user questions, we measured how our production RAG system actually performed. The result was clear and uncomfortable: only 3 out of 14 questions were answered correctly.

At that point, we knew the system was broken, but not why. RAG failures generally fall into two broad categories:

  1. Retrieval failures, where the right information is never surfaced.
  2. Generation failures, where the right information is present, but the model fails to use it correctly.

Before touching ingestion, chunking, or embeddings, we wanted to isolate the simplest variable in the system: the generative model.

Isolating the Generation Step

To keep the experiment controlled, we froze everything else:

  • Same retrieval pipeline
  • Same chunking strategy
  • Same prompts
  • Same evaluation dataset

The only thing we changed was the LLM used for generation.

This allowed us to answer a narrow question:

If retrieval stays exactly the same, how much can better models compensate for a noisy context?
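To make the setup concrete, here is a minimal sketch of what this looks like in code. The retrieve() and generate() helpers and the prompt template are hypothetical placeholders standing in for our pipeline, not real APIs:

```python
# Minimal sketch of the controlled setup: everything except `model` is held fixed.
from typing import Callable

PROMPT_TEMPLATE = """Answer the question using only the documentation excerpts below.

Documentation:
{context}

Question: {question}
"""

def answer(
    question: str,
    retrieve: Callable[[str], list[str]],   # frozen retrieval step (placeholder)
    generate: Callable[[str, str], str],    # placeholder: (model, prompt) -> answer
    model: str,                             # the only experimental variable
) -> str:
    # Same retriever, same chunks, same ordering for every model under test.
    chunks = retrieve(question)
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    return generate(model, prompt)
```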

The Models We Tested

The production system was still running on a legacy model selected in 2023. We compared it against several newer models:

  • GPT-5
  • GPT-5.2
  • Claude 4.5
  • Gemini 2.5
  • Gemini 3

All models were evaluated on the same 14-question dataset, using identical retrieved chunks, and scored manually using the same criteria defined in Part 1.
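In practice this is a small comparison loop: retrieve once per question, then replay the identical chunks against every candidate model and store the answers for manual scoring. A rough sketch, with placeholder helpers and illustrative model identifiers:

```python
# Hypothetical comparison harness; `retrieve` and `generate` are placeholders,
# and the model identifiers below are illustrative labels, not real API names.
MODELS = ["legacy-2023", "gpt-5", "gpt-5.2", "claude-4.5", "gemini-2.5", "gemini-3"]

def run_comparison(questions: list[str], retrieve, generate) -> list[dict]:
    results = []
    for question in questions:
        # Retrieve once so every model answers from identical chunks.
        chunks = retrieve(question)
        for model in MODELS:
            results.append({
                "question": question,
                "model": model,
                "answer": generate(model, question, chunks),
            })
    return results

# The collected answers are then scored by hand against the Part 1 criteria.
```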

Results: Clear Gains, Same Ceiling

The improvement over the legacy model was immediate.

  • Legacy production model: 3 out of 14 correct
  • Top-performing models (Gemini 3, GPT-5.2): 10 out of 14 correct

This represents more than a 3x improvement, achieved without touching retrieval at all.

One important qualitative difference stood out. The newer models performed much better on the baseline sanity checks, which the original production model often failed outright. Simple definitional questions were answered more reliably and with fewer hallucinations.

Persistent Failure Modes

Despite the improvement, no model achieved perfect performance. Several failure patterns persisted across all tested models.

  • Needle-in-a-Haystack Retrieval Failures: Questions whose answers lived in a single sentence buried deep in a long documentation page continued to fail frequently. Even strong models struggled to locate the relevant detail inside large, noisy chunks.
  • “Multiple Choice” Hallucinations: A common pattern was the model presenting several possible solutions, where:
    • one option was correct
    • the others were partially wrong or entirely fabricated
    These answers may look helpful at first glance, but from an evaluation standpoint they are still failures. A support tool that forces the user to guess which option is correct is not doing its job.
  • Omission of Critical Details: Another recurring issue was partial correctness. Some answers were directionally correct but omitted configuration flags, constraints, or caveats that would matter in practice.

For the purposes of this evaluation, partially correct answers were counted as wrong. In reality, many of these responses were still higher quality than what the legacy system produced, but they did not meet the bar for correctness we want from a documentation assistant.

What This Round Actually Measured

It is worth being explicit about what we evaluated, and what we did not.

Typical RAG evaluations break the problem into two separate checks:

  • Retrieval quality, meaning whether the retrieved chunks are relevant to the question.
  • Generation quality, meaning whether the final answer is accurate and useful given that context.

In this round, we intentionally focused only on generation quality. Retrieval was held constant across all experiments. When answers failed, we did not attempt to determine whether the root cause was retrieval or generation.
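For concreteness, once retrieval quality is scored as well, a per-question record could separate the two judgments explicitly. The schema below is illustrative only, not what we currently store:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question: str
    retrieved_chunks: list[str]
    answer: str
    retrieval_relevant: bool | None  # were the right chunks surfaced? (not judged this round)
    answer_correct: bool             # the only label we assigned in this round
```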

That deeper analysis will come later, when we turn to the retrieval component itself.

What This Tells Us

This round confirmed two important things.

First, upgrading the generative model is low-hanging fruit. Modern models are dramatically better at handling noisy context and answering straightforward questions correctly.

Second, there is a clear performance ceiling. Even the best models plateaued at 10 out of 14 correct, and the remaining failures were consistent across vendors. This strongly suggests that further gains will not come from generation alone.

What’s Next

With generation largely understood, the next round of iteration moves one step upstream.

The next low-hanging fruit is embedding models. Better embeddings may help differentiate highly similar documentation pages and surface more specific chunks, especially for the needle-in-a-haystack cases.
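As a rough illustration, one quick probe is to check whether an embedding model can tell near-duplicate documentation pages apart at all. The library, model name, and example sentences below are placeholders chosen for the sketch, not what we run in production:

```python
# Sketch: how strongly does an embedding model separate two near-duplicate pages?
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

page_a = "How to configure the filesystem destination for local development."
page_b = "How to configure the filesystem destination for staging on S3."

emb = model.encode([page_a, page_b])
print(util.cos_sim(emb[0], emb[1]).item())
# If near-duplicate pages score almost identically, retrieval struggles to surface
# the more specific one; a stronger embedding model should widen that gap.
```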

In future rounds, we will also expand the evaluation to explicitly score retrieval quality alongside generation. For now, this experiment gave us exactly what we needed: a clear signal that model upgrades help, but they are not the whole solution.

Iteration beats intuition. On to the next knob! Stay tuned for the next part of the series!