
The REST API integration cost: How AI + dlt is finally making it bearable

  • Aman Gupta,
    Jr. Data Engineer

There's a hidden cost in every REST API integration. No, it's not on your cloud bill. You won't see it in a KPI dashboard either. There's no budget review or Jira epic tracking this.

It's the REST API integration overhead. This cost quietly steals your focus, one schema mismatch at a time.

Every time you want to move data from SaaS Tool A to Warehouse B, you pay it. And it's never a one-click checkout. It's an unavoidable detour involving pagination, unpredictable schema changes, and rate limiting.

The root cause is semi-structured JSON data (with unknown schemas), a weakly typed interchange format that clashes with strongly typed analytical destinations. Maintenance adds another layer: homebrewed scripts and custom code that's often messy, undocumented, and challenging to decipher.
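
To make that clash concrete, here's a minimal sketch (with hypothetical data and table names) of how dlt infers a typed schema from weakly typed JSON and evolves it when a new field shows up:

import dlt

# Hypothetical JSON records, as an API might return them
rows = [
    {"id": 1, "ordered_at": "2017-01-01T00:00:00Z", "amount": 9.99},
    # A new field appears mid-stream - a classic source of schema mismatches
    {"id": 2, "ordered_at": "2017-01-02T12:30:00Z", "amount": 4.5, "coupon": "JAFFLE10"},
]

pipeline = dlt.pipeline(
    pipeline_name="schema_demo",
    destination="duckdb",
    dataset_name="demo",
)
pipeline.run(rows, table_name="orders")

# The inferred schema has typed columns, including the late-arriving 'coupon'
print(pipeline.default_schema.to_pretty_yaml())

Instead of you hand-writing DDL and patching it on every change, the typed schema is inferred and migrated for you.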

This hidden cost isn't just about initial setup; it's the ongoing cognitive load required to manage continuous uncertainty.

Removing the cognitive load and complexity with dlt + AI

Here’s an example of how you can build faster, with less complexity and no tech debt: use an LLM to accelerate your development, and let dlt self-maintain the pipelines.

Here’s the step-by-step:

1. Refer to the dlt docs for setting up Cursor.

2. Run dlt ai setup cursor to auto-generate the applicable project rules for REST API sources. This comes in handy when working from OpenAPI specs or legacy code.

You can review the rules in .cursor/rules or via Cursor Settings → Rules.

Here’s a short video of how we did it.


3. Next, we used the following prompt:

Hey!

I need a dlt REST API pipeline built using this OpenAPI spec:

👉 https://fast-api-jaffle-shop-jz2mh.ondigitalocean.app/openapi.json

The base URL is:

https://jaffle-shop.dlthub.com/api/v1

Can you include the following endpoints in the pipeline?

- customers, orders, products, supplies, stores

A few things to note:

- Check how pagination works for each endpoint. Use curl to hit the endpoints and see what kind of paginator they're using.
- If there's a way to load data incrementally (like using a created_at or updated_at filter), definitely take advantage of it.
- I only need data for the first 10 days of January 2017, so the pipeline should pull just that slice.

Let’s call the final Python file my_pipeline.py.

Ping me if anything’s unclear or if an endpoint behaves weirdly.
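
As an aside, the pagination check that the prompt asks for boils down to a probe like the one below (a sketch using Python's requests rather than curl; the exact Link header contents depend on the API):

import requests

# Probe one endpoint to see how it paginates. If the API uses RFC 5988
# Link headers, the response will carry a rel="next" URL.
resp = requests.get(
    "https://jaffle-shop.dlthub.com/api/v1/customers",
    params={"page_size": 5},
)
print(resp.headers.get("Link"))  # e.g. '<...page=2>; rel="next"' for header-link pagination
print(resp.json()[:1])           # first record of the top-level JSON array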


4. With a few iterations, Cursor produced code that worked as expected:

  • Correct pagination logic based on the API.
  • Data loaded only for the specified time frame.
  • Incremental loading correctly implemented on the orders endpoint.

Here’s the code it produced:

import dlt
from dlt.sources.rest_api import rest_api_source

# Define the base URL for the Jaffle Shop API
BASE_URL = "https://jaffle-shop.dlthub.com/api/v1"

# Configure the dlt REST API source
# No authentication required for this API
jaffle_shop_source = rest_api_source(
    {
        "client": {
            "base_url": BASE_URL,
            # Default pagination: Header Link Paginator
            # This is used by all endpoints in this API
            "paginator": {
                "type": "header_link",
                "links_next_key": "next",  # Default value, matches the API's Link header
            },
            # The default page size parameter is set in resource_defaults below
        },
        "resource_defaults": {
            "primary_key": "id",  # Assuming 'id' is the primary key for most resources
            "write_disposition": "merge", # Merge on 'id'
            "endpoint": {
                # No data selector needed as data is the top-level array
                 "data_selector": "$", # Select the whole list
                 # page_param is handled by the header_link paginator internally
                 # Default request parameters for all resources
                 "params": {
                    "page_size": 100 # Default from OpenAPI spec
                 }
            }
        },
        "resources": [
            {
                "name": "customers",
                "endpoint": {
                    "path": "customers",
                    # Inherits default pagination and params
                }
            },
            {
                "name": "orders",
                "endpoint": {
                    "path": "orders",
                    "params": {
                        # Apply the specific date range filter for this resource
                        "start_date": "2017-01-01",
                        "end_date": "2017-01-10",
                         # Keep default page_size - inherited from resource_defaults now
                    },
                    # Incremental loading configuration for orders based on 'ordered_at'
                    # Uses the 'start_date' parameter which we also use for the initial filter
                    "incremental": {
                        "cursor_path": "ordered_at",
                        # Set initial value to the start of the desired range
                        # Note: The API might ignore this if start_date is also present?
                        # Let's keep it aligned with the filter for clarity.
                        "initial_value": "2017-01-01T00:00:00Z", # ISO format often preferred
                        "start_param": "start_date", # API parameter for filtering
                        # end_param could be 'end_date' but we are setting a fixed window initially
                    },
                }
            },
            {
                "name": "products",
                "primary_key": "sku",  # Products use 'sku' as their identifier
                "endpoint": {
                    "path": "products",
                    # Inherits default pagination and params
                }
            },
            {
                "name": "supplies",
                "endpoint": {
                    "path": "supplies",
                    # Inherits default pagination and params
                }
            },
            {
                "name": "stores",
                "endpoint": {
                    "path": "stores",
                    # Stores have 'opened_at', but the spec exposes no filtering
                    # params, so no incremental loading is configured here.
                    # Inherits default pagination and params
                }
            },
        ]
    }
)

if __name__ == "__main__":
    # Configure the pipeline
    pipeline = dlt.pipeline(
        pipeline_name="jaffle_shop_rest_api",
        destination="duckdb",  # Example destination, can be changed
        dataset_name="jaffle_data",
        progress="log", # Add progress logging
        dev_mode=True # Recommended for development
    )

    # Run the pipeline with the configured source
    load_info = pipeline.run(jaffle_shop_source)

    # Print the outcome
    print(load_info) 
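
To sanity-check the load, you can query the resulting duckdb tables through the pipeline's SQL client. A minimal sketch, assuming my_pipeline.py has already run:

import dlt

# Re-attach to the pipeline by name; dlt restores its state and destination
pipeline = dlt.pipeline(
    pipeline_name="jaffle_shop_rest_api",
    destination="duckdb",
    dataset_name="jaffle_data",
)

# Run an ad-hoc query against the loaded dataset
with pipeline.sql_client() as client:
    with client.execute_query("SELECT count(*) FROM orders") as cursor:
        print(cursor.fetchall())  # row count for the loaded 'orders' table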

Conclusion

I've built dozens of REST API integrations over the years. The pagination debugging. The schema evolution headaches. The rate limiting dance.

Every single time, I told myself "this is just how it is with APIs."

But it's not.

With dlt and Cursor, I finally escaped the integration grind. What used to take days now takes minutes. What used to require deep API knowledge now works with a simple prompt.

This isn't just about faster development; it's about getting back to the work that actually matters. Building insights, not battling boilerplate.

Want to reclaim your time?

Stop grinding. Start building!