Towards a Benchmark for AI-Generated Data Pipelines
- Adrian Brudaru, Co-Founder & CDO
In previous articles, I demonstrated how LLMs paired with Cursor and dlt can rapidly modernize legacy ETL scripts and even turn Airbyte YAML files into robust Python pipelines. But as with all things AI, understanding precisely where these models succeed or fail requires methodical testing.
Why a Benchmark?
When using AI for pipeline generation, it's helpful to think of it in three distinct parts:
- Feature extraction: Understanding the quirks and specific details of an API.
- Pipeline code generation: Actually writing and structuring pipeline code.
- Memory and intuition: The AI’s inherent understanding or predictions about common patterns from its training or prior experiences.
Clearly separating these tasks allows us to isolate problems effectively and makes troubleshooting manageable.
Testing Memory
To test the LLM's inherent "memory" alone, without providing any additional documentation, I tried building a pipeline for Pipedrive purely from the model's training knowledge. I wanted it to fail so we could test what would make it better, and it did indeed fail:
- Endpoints: The LLM lacked awareness of all necessary endpoints.
- Authentication: Failed to recognize Pipedrive's non-standard API-key authentication (see the sketch after this list).
- Incremental Loading: Unable to correctly implement incremental loading via the "recents" endpoint.
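To make the authentication point concrete: Pipedrive's v1 API expects the key as an api_token query parameter rather than a standard Authorization header, which is exactly the kind of detail the model failed to recall. A minimal sketch of a raw call, assuming the public v1 base URL:

```python
import requests

# Sketch only: Pipedrive's v1 API takes the key as an `api_token` query parameter,
# not a Bearer/Authorization header. Endpoint and params chosen for illustration.
resp = requests.get(
    "https://api.pipedrive.com/v1/deals",
    params={"api_token": "<your-api-token>", "limit": 100},
)
resp.raise_for_status()
payload = resp.json()
print(payload.get("success"), (payload.get("data") or [])[:1])  # peek at the first record
```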
Testing a Structured Extraction Prompt
To clearly evaluate the feature extraction step, I developed a structured extraction prompt designed explicitly for dlt's rest_api_source. It instructs an LLM precisely on which parameters to extract from API documentation, such as:
- API base URLs, authentication specifics, headers, and pagination.
- Endpoint-specific configurations, including paths, methods, query parameters, incremental loading, pagination patterns, and response transformations.
I created this prompt by giving the model our LLM-optimised REST API documentation and asking it for a prompt that extracts the required info from code or documents containing it.
Here's a simplified version of the prompt:
"Given API documentation or a sample response, extract the necessary configuration parameters in a structured JSON format, including client configuration, pagination methods, incremental logic, and resource dependencies."
And the structured format we request:

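In essence, the requested shape mirrors the config dict that dlt's rest_api_source consumes. A rough sketch with illustrative values (the actual prompt asks for this as JSON):

```python
# Illustrative sketch of the requested structure; endpoint names and values are made up.
config = {
    "client": {
        "base_url": "https://api.example.com/v1/",
        "auth": {"type": "api_key", "name": "api_token", "location": "query"},
        "paginator": {"type": "offset", "limit": 100},
    },
    "resource_defaults": {
        "primary_key": "id",
        "write_disposition": "merge",
    },
    "resources": [
        {
            "name": "deals",
            "endpoint": {
                "path": "deals",
                "data_selector": "data",  # where the records live in the response
                "params": {
                    "updated_since": {  # incremental loading via a query parameter
                        "type": "incremental",
                        "cursor_path": "update_time",
                        "initial_value": "2024-01-01 00:00:00",
                    },
                },
            },
        },
    ],
}
```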
Reality Check: Testing the Extraction Prompt
I tested this extraction prompt by having Perplexity and OpenAI pull the relevant details directly from publicly available API documentation. I also tried letting Cursor index the docs and read from them. The results were mixed:
- Success with clearly structured docs: Top-level docs worked well, and so did specific snippets.
- Failure with complex or nested docs: Formatting with "folded" elements like nested JSON examples, or an ambiguous structure, caused consistent confusion, resulting in incorrect or incomplete extraction. Specifically, LLMs got reliably confused about authentication (perhaps not found, perhaps confused with this page) and about incremental loading.
Three examples of docs issues
After the failures, I went to the docs to try to understand why LLMs reliably failed to grab the useful info. Here's what I noticed, and what I believe is tripping up the information retrieval.
1. Authentication documentation is hard to reach. The API docs at developers.pipedrive.com do not seem to contain this information; instead, a different set of docs on another DOMAIN (pipedrive.readme.io) contains it.
2. Possibly for the same reason, the recents endpoint was not fully understood: the required parameter (since_timestamp) was often not detected on the first pass.
3. On the same page as above, the response, which contains the "cursor field" for incremental loading, was not detected. This is possibly due to the way the code example is folded; LLMs may simply ignore it.

This clearly indicates that documentation quality and structure significantly impact an LLM's ability to perform accurate feature extraction. Nested or unclear documentation structures reliably disrupt extraction accuracy.
Patching the last mile
Since the relevant feature extraction did not work on the first pass, we finished building the pipeline by taking the relevant snippets from the docs (for auth and the response format) and giving them to the LLM, or by calling out missing pieces (like required parameters) to trigger it to look again with a more targeted search.
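For illustration, the patched piece ended up in the config roughly along these lines (a sketch, not the exact code from the experiment; the cursor field and initial value are assumptions):

```python
# Sketch of the patched Pipedrive "recents" resource; cursor field and initial value are assumptions.
recents_resource = {
    "name": "recents",
    "endpoint": {
        "path": "recents",
        "data_selector": "data",
        "params": {
            # The required parameter the first-pass extraction kept missing:
            "since_timestamp": {
                "type": "incremental",
                "cursor_path": "update_time",  # assumed cursor field from the response
                "initial_value": "2024-01-01 00:00:00",
            },
        },
    },
}
```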
Takeaways of our experiment
- Relying purely on an LLM’s memory and intuition is optimistic and unrealistic.
- When multiple models fail to extract the same information, it is a sign that the documentation is reliably tripping up the LLMs.
- For information that's both unavailable (like API response data and pagination paths) and hard to guess, running the partially-built pipeline and inspecting responses will be key to getting everything needed to finish it (see the sketch below).
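One low-effort way to do that is to hit the endpoint directly with dlt's RESTClient and look at the raw payload before finalizing the config. A sketch, with the auth setup and parameter values as assumptions:

```python
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import APIKeyAuth

# Poke at the endpoint to see where the records, cursor fields and pagination
# metadata actually live in the response (base URL, auth and params are assumptions).
client = RESTClient(
    base_url="https://api.pipedrive.com/v1",
    auth=APIKeyAuth(name="api_token", api_key=dlt.secrets["pipedrive_api_token"], location="query"),
)
response = client.get("recents", params={"since_timestamp": "2024-01-01 00:00:00"})
print(response.json())
```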
Evaluating Pipedrive as a Benchmark
Pipedrive clearly demonstrates the limitations of memory-based guessing and surfaces some docs issues.
It's also useful for highlighting docs-related feature extraction failures, but those failures are so consistent that they leave too little to experiment with.
What we really want is a benchmark case with more possible things that could go wrong, so we can inch towards finding solutions to those problems.
What's Next?
Ideally, the next benchmark should encompass broader, commonly challenging scenarios:
- Cursor-based pagination.
- Nested resource dependencies.
- Complex authentication (OAuth2, multi-step).
Strong candidates include APIs like Stripe, Jira Cloud, or GitHub, known for complexity and rich documentation.
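For reference, the first two of those scenarios map onto the rest_api config roughly like this (a sketch against a hypothetical API; field names follow dlt's config format, values are made up):

```python
# Hypothetical API, used only to illustrate cursor pagination and a parent/child dependency.
config_sketch = {
    "client": {
        "base_url": "https://api.example.com/v2/",
        "paginator": {
            "type": "cursor",                   # cursor-based pagination
            "cursor_path": "meta.next_cursor",  # where the next cursor appears in the response
            "cursor_param": "cursor",           # query param that carries it on the next request
        },
    },
    "resources": [
        {"name": "projects", "endpoint": {"path": "projects"}},
        {
            "name": "issues",  # nested resource: needs a project id from the parent
            "endpoint": {
                "path": "projects/{project_id}/issues",
                "params": {
                    "project_id": {"type": "resolve", "resource": "projects", "field": "id"},
                },
            },
        },
    ],
}
```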
Key Takeaways:
- Clearly separating extraction, code generation, and memory-based intuition simplifies troubleshooting.
- Structured extraction prompts give us a way to evaluate whether a pipeline can feasibly be generated from an API's documentation.
- API documentation formatting strongly influences extraction accuracy.
Call to Action:
Try applying the structured extraction prompt to your APIs. Let's collaboratively pinpoint exactly where LLMs succeed or fail, and establish a definitive benchmark for AI-generated data pipelines.