
Debugging Our Docs RAG, Part 1: Evaluating a Production RAG System

  • Aashish Nair, Working Student, Data & AI

A few months ago, around the same time the broader community started questioning whether GPT-4 quality was degrading, we noticed a clear decline in the performance of our own documentation helper.

dhelp is an internal chatbot built by the dlt team to answer questions about running dlt pipelines. Like many GenAI tools created in 2023, it was built against the assumptions and model capabilities of that time. As usage grew and expectations increased, those assumptions began to show their limits.

Certain failure patterns became easier to identify: hallucinated APIs, invented parameters, and answers that were technically plausible but completely irrelevant to the user’s actual problem.

Since the degradation happened without any explicit change on our side, the most tempting response was obvious: just upgrade the model. But swapping components in a probabilistic system without understanding the failure modes is rarely good engineering. Before touching anything, we wanted to answer a more basic question:

What exactly is broken, and why?

This post covers how we built a small but representative evaluation dataset, how the production system performed against it, and what that baseline revealed.

The Evaluation Problem

RAG quality is notoriously hard to measure. There is no unit test for “helpfulness,” and synthetic datasets often optimize for the wrong things.

Instead of generating artificial questions, we built our evaluation set from real user questions. Drawing from historical support conversations, the dlt team selected 14 representative queries that reflect common, high-impact user pain points.

Each question was chosen to probe a specific failure mode:

  1. Baseline sanity checks: simple definitional questions that any functioning docs assistant should answer correctly.
  2. Needle-in-a-haystack retrieval: questions where the answer exists in the docs, but only as a single sentence buried deep inside a large page.
  3. Out-of-domain queries: questions whose answers do not exist in the public documentation, where the correct behavior is to say “I don’t know,” not to hallucinate.

For this first iteration, we intentionally avoided contrived edge cases. The goal was not to maximize coverage, but to create a small, trustworthy baseline we could iterate on quickly.
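To make the structure concrete, here is a minimal sketch of how an evaluation set like this could be encoded. The questions, category names, and expected behaviors below are illustrative placeholders, not our actual internal cases.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str           # a real user question, taken verbatim from support history
    category: str           # "sanity_check", "needle_in_haystack", or "out_of_domain"
    expected_behavior: str  # human-written ground truth used by the reviewer

# Illustrative entries only; the real set contains 14 questions.
EVAL_SET = [
    EvalCase(
        question="What is a dlt resource?",
        category="sanity_check",
        expected_behavior="A correct definition, consistent with the docs.",
    ),
    EvalCase(
        question="How do I load data into a destination dlt does not support?",
        category="out_of_domain",
        expected_behavior="The assistant should say it does not know.",
    ),
]
```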

Evaluation was done manually by the dlt team, using our own support experience as ground truth. While we plan to automate this process later, for example with an “LLM as a judge” setup (more on that in a future article), human verification was essential to establish an initial reference point we could trust.

Baseline Results: Production Performance

Once the evaluation dataset was ready, the next step was straightforward: see how the production version of dhelp performed against these questions.

Under this setup, the production model resolved 3/14 cases to our satisfaction.

The results aligned with the patterns we had been observing, while making the system’s limitations more explicit: even relatively simple questions exposed weaknesses, and more targeted queries frequently triggered the failure modes we had previously only noticed anecdotally in production.

Failure Analysis: What Went Wrong?

Looking at the failed answers, several clear patterns emerged.

1. Hallucinations Were the Dominant Failure Mode

Many responses contained:

  • Made-up functions or classes
  • Nonexistent configuration parameters
  • Confident but false statements

In some cases, the answers sounded convincing enough to lead users into fruitless debugging loops.
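In this round we caught these by hand, but the category is mechanical enough that a crude automated check is conceivable. The sketch below is an illustration rather than part of our pipeline: it flags code-like identifiers that an answer mentions but the documentation corpus never does. The regex and the plain-text `docs_text` corpus are both assumptions.

```python
import re

# Code-like tokens: function calls such as "pipeline(" and dotted names such as "dlt.pipeline".
IDENTIFIER = re.compile(r"\b\w+(?=\()|\b[a-z_]+\.[a-z_]+\b")

def possibly_hallucinated(answer: str, docs_text: str) -> set[str]:
    """Return code-like identifiers mentioned in the answer that never appear in the docs."""
    mentioned = {match.group(0) for match in IDENTIFIER.finditer(answer)}
    return {name for name in mentioned if name not in docs_text}
```

A check like this would not catch every hallucination, but it targets exactly the kind we saw most: invented functions and parameters stated with full confidence.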

2. Retrieval vs. Generation Was Unclear

In other cases, the model produced vague or partially relevant answers. This raised an important ambiguity:

  • Did the retriever fail to surface the correct documentation chunks?
  • Or did the generator fail to locate the relevant information inside the retrieved context?

With the current setup, these two failure modes were impossible to disentangle.
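One way to make the two separable in later iterations is to record the retrieved chunks next to every answer, so a reviewer can see whether the relevant passage was ever in the context at all. A minimal sketch, assuming generic `retriever` and `generator` objects rather than dhelp’s actual components:

```python
import json

def answer_with_trace(question: str, retriever, generator, log_path: str) -> str:
    """Answer a question and append the retrieved sources to a JSONL trace file."""
    chunks = retriever.retrieve(question)           # assumed: list of {"text": ..., "source": ...}
    answer = generator.generate(question, chunks)   # assumed: prompt built from the chunks
    with open(log_path, "a") as f:
        record = {
            "question": question,
            "retrieved_sources": [chunk["source"] for chunk in chunks],
            "answer": answer,
        }
        f.write(json.dumps(record) + "\n")
    return answer
```

With a trace like this, “the right page was never retrieved” and “the right page was retrieved but ignored” become two distinct, checkable claims.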

3. “Technically Correct, But Unhelpful”

A smaller but still important category involved answers that were technically accurate but failed to address the user’s specific situation. They missed required constraints, environment details, or caveats.

From a user’s perspective, these answers are indistinguishable from being wrong.

What the Baseline Tells Us

This evaluation made one thing very clear: the system is not failing for a single reason.

The observed performance suggests that meaningful improvements are possible through RAG configuration alone, without changing the product surface or adding new data. In particular, there are several obvious levers to explore:

  • Generative model choice
  • Embedding model choice
  • Chunking and retrieval settings, such as chunk size, number of retrieved chunks, and the similarity threshold
  • System prompt design

Crucially, the evaluation also gave us a way to measure progress. Any change we make going forward can be tested against the same 14 questions, allowing us to move from intuition to evidence.
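To keep those comparisons honest, each experiment can be described as a single configuration and scored against the same fixed question set. Here is a minimal sketch under that assumption, reusing the `EvalCase` structure from the earlier sketch; the field names, defaults, and the `run_rag` and `judge` callables are placeholders, not dhelp’s actual settings (and in this first round the “judge” was a human reviewer).

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    generation_model: str                 # e.g. a chat-completion model identifier
    embedding_model: str                  # e.g. an embedding model identifier
    chunk_size: int = 512                 # illustrative default
    top_k: int = 5                        # number of chunks passed to the generator
    similarity_threshold: float = 0.7     # minimum retrieval similarity
    system_prompt: str = "Answer only from the provided dlt documentation."

def evaluate(config: RagConfig, eval_set, run_rag, judge) -> float:
    """Score one configuration: run every case and return the pass rate."""
    passed = sum(judge(case, run_rag(case.question, config)) for case in eval_set)
    return passed / len(eval_set)
```

Anything that beats the production configuration’s 3/14 on these same questions counts as measurable progress.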

What’s Next

With a baseline of 3/14 established, the next step is to focus on low-hanging fruit: changes that are easy to implement and likely to produce large gains.

In the next part of this series, we will start by isolating the generation layer. We will test several newer models from OpenAI, Google, and Anthropic against the same evaluation set and compare their performance directly to the current production setup.

Only once we understand the limits of better generation will we move on to deeper retrieval and ingestion changes.

For the first time, the problem is no longer “it feels broken.”

It is “we know how broken it is, and we know how to measure fixes.”