
Supercharge your data loading: Go beyond `pandas.to_sql()`

  • Adrian Brudaru, Co-Founder & CDO
This microblog is about an upcoming workshop at https://odsc.ai/

Intro

If you work with data in Python, you know the workflow: pull data from an API, load it into a pandas DataFrame, and explore. It’s a powerful and fast way to get answers. When you’re done, a simple `df.to_sql()` seems like the logical next step to save your results. For one-off tasks, it works perfectly.
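For concreteness, here is roughly what that one-off script looks like. This is a minimal sketch: the API endpoint, table name, and local SQLite database are placeholders, not a real source.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Placeholder endpoint and local SQLite database, purely for illustration.
records = requests.get("https://api.example.com/orders").json()
df = pd.DataFrame(records)

engine = create_engine("sqlite:///analytics.db")
# Perfectly fine for a one-off export of a small result set.
df.to_sql("orders", engine, if_exists="replace", index=False)
```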

But what happens when that script needs to run reliably every hour? What happens when the data volume grows 100x, or the source API suddenly adds a new field? The simplicity of `df.to_sql()` quickly reveals its limits, and you find yourself facing the challenges of building a true production pipeline.

The production-ready checklist

Evolving a simple script into a robust, automated pipeline requires solving a new set of complex problems.

1. The memory trap: From DataFrame to data stream

The greatest strength of a pandas DataFrame, holding all data in memory for fast analysis, is its greatest weakness in production. Loading a massive JSON response or a huge file into a DataFrame is a recipe for an out-of-memory error. Production pipelines must handle data as an efficient stream, processing it in chunks without ever holding the entire dataset in memory. This is a fundamental architectural shift from a simple script.
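One way to make that shift is sketched below, assuming the source exposes newline-delimited JSON: stream the response, buffer a fixed number of rows, and append each chunk to the destination instead of materializing the whole dataset. The endpoint and chunk size are illustrative.

```python
import json

import pandas as pd
import requests
from sqlalchemy import create_engine

engine = create_engine("sqlite:///analytics.db")
CHUNK_SIZE = 10_000  # rows held in memory at any one time

def stream_records(url):
    """Yield one record at a time from a newline-delimited JSON endpoint."""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                yield json.loads(line)

batch = []
for record in stream_records("https://api.example.com/orders.ndjson"):
    batch.append(record)
    if len(batch) >= CHUNK_SIZE:
        # Append each chunk instead of holding the full dataset in memory.
        pd.DataFrame(batch).to_sql("orders", engine, if_exists="append", index=False)
        batch = []
if batch:
    pd.DataFrame(batch).to_sql("orders", engine, if_exists="append", index=False)
```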

2. The burden of state: Building incremental loads

Reprocessing an entire dataset every time is slow, costly, and can hit API rate limits. A production pipeline should be stateful, intelligently processing only new or updated records. Building this from scratch is surprisingly complex. You need to track the last loaded timestamp or ID, store this "state" somewhere persistent, and reliably retrieve it at the start of every run.
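A hand-rolled version of that state handling might look like the sketch below, which persists a cursor to a local JSON file. The endpoint, its `updated_at_gt` filter parameter, and the file-based state store are all assumptions for illustration.

```python
import json
from pathlib import Path

import requests

STATE_FILE = Path("pipeline_state.json")  # persistent store for the cursor

def load_cursor(default="1970-01-01T00:00:00Z"):
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_updated_at"]
    return default

def save_cursor(value):
    STATE_FILE.write_text(json.dumps({"last_updated_at": value}))

cursor = load_cursor()
# Hypothetical endpoint that filters records updated after the cursor.
records = requests.get(
    "https://api.example.com/orders",
    params={"updated_at_gt": cursor},
).json()

if records:
    # ... load `records` into the destination here ...
    # Only advance the cursor after the load succeeds.
    save_cursor(max(r["updated_at"] for r in records))
```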

3. The chaos of reality: Resilience and schema changes

Production is chaotic. APIs fail, so you need automatic retries. Networks drop, so you need atomic loads to prevent partially written, corrupted data. But the biggest source of chaos is the data itself. A new field added to the source JSON can break a script that expects a fixed structure. A truly resilient pipeline must be able to detect and adapt to schema changes automatically, altering the destination table without breaking.
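The sketch below illustrates two of these concerns in a do-it-yourself style: retrying transient API failures with exponential backoff, and adding newly appeared source columns to the destination table before loading. It assumes a SQLite destination and naively types new columns as TEXT; a real pipeline would also need proper type inference, quoting, and atomic staging on top of this.

```python
import time

import pandas as pd
import requests
from sqlalchemy import create_engine, inspect, text

engine = create_engine("sqlite:///analytics.db")

def fetch_with_retries(url, attempts=5):
    """Retry transient API failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)

def sync_schema(df, table_name):
    """Add any new source columns to the destination table before loading."""
    inspector = inspect(engine)
    if not inspector.has_table(table_name):
        return  # to_sql will create the table with the full schema
    existing = {col["name"] for col in inspector.get_columns(table_name)}
    with engine.begin() as conn:
        for column in df.columns:
            if column not in existing:
                # Naive: every new column becomes TEXT; real pipelines infer types.
                conn.execute(text(f'ALTER TABLE {table_name} ADD COLUMN "{column}" TEXT'))

df = pd.DataFrame(fetch_with_retries("https://api.example.com/orders"))
sync_schema(df, "orders")
df.to_sql("orders", engine, if_exists="append", index=False)
```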

A higher-level approach to data loading

After seeing these challenges, it’s clear that building robust pipelines requires a higher-level, declarative approach. Instead of manually coding the logic for streaming, state management, and schema evolution every time, you can use a tool that treats these as solved problems.

The open-source Python library dlt is one such tool. It provides a simple interface to declaratively build pipelines that handle these complexities for you. You can focus on your data's source and destination, while dlt manages the production-grade infrastructure underneath.
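As a rough sketch of what that looks like in practice (the endpoint, its `updated_at_gt` parameter, and the duckdb destination are placeholders): a resource yields records lazily, `dlt.sources.incremental` tracks the cursor between runs, and schema inference, evolution, and atomic loading happen inside `pipeline.run()`.

```python
import dlt
import requests

@dlt.resource(table_name="orders", write_disposition="merge", primary_key="id")
def orders(updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z")):
    # Records are yielded lazily, so they are processed in chunks rather than
    # held in memory as one large DataFrame.
    params = {"updated_at_gt": updated_at.last_value}
    yield from requests.get("https://api.example.com/orders", params=params).json()

pipeline = dlt.pipeline(
    pipeline_name="orders_pipeline",
    destination="duckdb",     # any supported destination works here
    dataset_name="analytics",
)
load_info = pipeline.run(orders())
print(load_info)
```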

Build it yourself at ODSC West 2025

In this hands-on session, we will tackle these challenges head-on. You will learn how to:

  • Start with a simple script that breaks on large files and refactor it to stream data efficiently.
  • Implement a stateful, incremental pipeline to process only new data from a live API.
  • Build a pipeline that automatically adapts when the source API adds new fields, preventing breakages.
  • Explore higher-level REST API sources that abstract away complexities like pagination and authentication for you (a short sketch follows this list).
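If you want a preview of that last point: recent dlt versions ship a declarative rest_api source. The configuration below is a minimal sketch with a placeholder base URL and resource names; pagination and authentication settings would also go into the client and resource config.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Declarative source definition; base URL and resource names are placeholders.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",
        # pagination and authentication settings also go here
    },
    "resources": [
        "orders",
        "customers",
    ],
})

pipeline = dlt.pipeline(
    pipeline_name="rest_api_demo",
    destination="duckdb",
    dataset_name="analytics",
)
pipeline.run(source)
```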

Stop hitting the limits of simple scripts. Level up your data loading skills and learn to build pipelines you can trust to run in production. See you at ODSC West 2025!

This session is a tutorial follow-up to the problem presented in our 2024 EuroPython talk, which you can watch below: