Breaking free from SQL: A Normie's guide to portable data pipelines

  • Adrian Brudaru,
    Co-Founder & CDO

The real-world data struggle nobody talks about

I'm not here to pitch you some revolutionary paradigm that requires a PhD to understand. I'm talking about a practical problem we all face: constantly switching mental contexts between tools is exhausting and error-prone.

The tool treadmill is killing your productivity

Here's what normal data work actually looks like in 2025:

  • Monday: "Let me analyze this data in a pandas notebook"
  • Tuesday: "Now I need to move it to Snowflake and rewrite everything in SQL"
  • Wednesday: "Oops, that data is too big for Snowflake's pricing tier, let's try Spark"
  • Thursday: "The business needs it in Power BI, which means another rewrite"
  • Friday: "Everything's broken and I hate my job"

This isn't just frustrating; it's wasteful.

What if your code just... worked everywhere?

That's the problem we're solving with dlt datasets - a consistent interface for your data regardless of where it lives or which engine processes it.

Here's what it looks like in practice:

# This same code works across environments:
dataset = pipeline.dataset()
customers = dataset.customers

# Want SQL? No problem
high_value = dataset("SELECT * FROM customers WHERE lifetime_value > 1000")

# Prefer Python? That works too (uses ibis under the hood)
high_value = customers.filter(customers.lifetime_value > 1000)

# Need pandas for visualization? Got you covered
df = high_value.df()

# Working with a data scientist who loves Arrow? Easy
arrow_table = high_value.arrow()

The same code works whether your data is in DuckDB, BigQuery, Snowflake, or anywhere else dlt supports.

Real use cases, not hypotheticals

Development to production migration

You can develop your pipeline locally using DuckDB (free, fast), test it with a subset of data, and then deploy the exact same code to production on BigQuery or Snowflake by changing just the destination parameter.

# Local development
pipeline = dlt.pipeline(destination="duckdb", dataset_name="my_data")

# Production
pipeline = dlt.pipeline(destination="bigquery", dataset_name="my_data")



The rest of your code stays exactly the same.
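In practice you can make the switch a pure configuration change. Here is a minimal sketch, assuming you pick your own environment variable name (DLT_DESTINATION is just an example for this sketch, not a dlt convention):

import os
import dlt

# Choose the destination from the environment so the pipeline code itself never changes.
# DLT_DESTINATION is an arbitrary variable name used for this sketch.
destination = os.getenv("DLT_DESTINATION", "duckdb")

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination=destination,
    dataset_name="my_data"
)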

Cost optimization without code changes

Many teams run preprocessing locally or in cheaper compute environments, then load only the results to more expensive warehouses. With dlt, the same code runs in both places. You can even shift compute to a "production local" worker and arrow-fast-sync your data to the final destination.

Here's an example of how this can be done:

# Step 1: Process data locally or in a cheaper environment
# This could be running on a Lambda, GitHub Action, or your own server
import dlt
from raw_data_source import raw_data

# Set up a local DuckDB pipeline for processing
local_pipeline = dlt.pipeline(
    pipeline_name="preprocess_data",
    destination="duckdb",
    dataset_name="processed_data"
)

# Run your source to extract and transform data
local_pipeline.run(raw_data())

# Step 2: Access processed data through the dataset interface
dataset = local_pipeline.dataset()

# Step 3: Run additional transformations if needed
transformed = dataset("SELECT * FROM raw_data WHERE value > 100")

# Step 4: Arrow-fast-sync to your expensive production warehouse
# No need to rewrite transformations or queries!
bigquery_pipeline = dlt.pipeline(
    pipeline_name="production_load",
    destination="bigquery",
    dataset_name="final_data"
)

# Load the transformed data directly using Arrow for efficiency
bigquery_pipeline.run(
    transformed.iter_arrow(chunk_size=10000),
    table_name="optimized_data"
)

This approach gives you the best of both worlds:

  • Cost-effective processing in environments you control
  • Final data available in your production warehouse
  • No code changes between environments
  • Efficient transfer using Arrow's columnar format

Testing without breaking the bank

Testing data pipelines in production environments is expensive and risky. With dlt, you can test locally with real code, then deploy to production when you're confident it works.
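For example, a test can run the pipeline against a throwaway local DuckDB and assert on the loaded data. A minimal sketch with a stand-in resource (swap in your real source):

import dlt

@dlt.resource(table_name="events")
def sample_events():
    # Stand-in for whatever source you actually want to test
    yield [{"id": 1, "value": 120}, {"id": 2, "value": 80}]

def test_pipeline_locally():
    # Throwaway pipeline against local DuckDB - no warehouse credentials or costs
    pipeline = dlt.pipeline(
        pipeline_name="test_run",
        destination="duckdb",
        dataset_name="test_data"
    )
    pipeline.run(sample_events())

    # Query through the same dataset interface you use in production
    rows = pipeline.dataset()("SELECT count(*) FROM events").fetchall()
    assert rows[0][0] == 2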

How it works (the actual mechanics)

We didn't reinvent the wheel here - dlt stands on the shoulders of some amazing open-source projects. Let me walk you through what's actually happening behind the scenes.

When you write code using our dataset interface, we're leveraging tools like ibis for the Python expression interface and sqlglot for SQL dialect translation. We're part of a composable ecosystem that's bringing Python-native data tools together.

The beauty of this approach is that we get to focus on the user experience while these specialized libraries handle the heavy lifting. Ibis gives us that consistent DataFrame-like API across 15+ backends, while sqlglot helps translate SQL between different dialects.
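To make that concrete, here is sqlglot on its own (nothing dlt-specific) rewriting a DuckDB-style query into BigQuery's dialect. A minimal sketch with a made-up customers table:

import sqlglot

# A query written with DuckDB-style double-quoted identifiers...
sql = 'SELECT "customer id", lifetime_value FROM customers WHERE lifetime_value > 1000'

# ...rendered in BigQuery's dialect, which quotes identifiers with backticks
print(sqlglot.transpile(sql, read="duckdb", write="bigquery")[0])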

What we've done is connect these pieces into a coherent experience. We handle the messy parts - figuring out how to connect to different systems, managing schema evolution, dealing with data type conversions, and making it all feel like one consistent interface.

[Diagram: the composable ecosystem]

This composability is a big deal. Instead of building yet another monolithic system, we're part of a movement toward smaller, interoperable tools that each do one thing well. The future of data tools isn't a single framework to rule them all - it's an ecosystem of specialized tools that play nicely together.

Here's another example of this philosophy in action: when you write with dlt to files, we automatically spin up an ad hoc in-memory DuckDB instance to handle dataset querying operations - something DuckDB already does brilliantly. This approach gives you fast, SQL-native querying without any extra setup or configuration on your part.
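A rough sketch of what that looks like from your side (the filesystem destination here writes Parquet files to a local folder; the path and table are made up for this example):

import dlt

@dlt.resource(table_name="orders")
def orders():
    yield [{"order_id": 1, "amount": 250}, {"order_id": 2, "amount": 75}]

# Write plain Parquet files to a local folder (could equally be S3 or GCS)
pipeline = dlt.pipeline(
    pipeline_name="files_demo",
    destination=dlt.destinations.filesystem(bucket_url="file:///tmp/dlt_demo_files"),
    dataset_name="shop"
)
pipeline.run(orders(), loader_file_format="parquet")

# Querying the dataset transparently uses an in-memory DuckDB over those files
print(pipeline.dataset()("SELECT * FROM orders WHERE amount > 100").df())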

Practical limitations (being honest here)

Like any abstraction, there are trade-offs:

  1. Engine-specific features: Some specialized features of specific engines might not be available through the common interface.
  2. Learning curve: You'll still need to understand how data works across environments to use this effectively.
  3. Best for tabular data: The current implementation works best with structured, tabular data.

Try it on something small first

I'm not asking you to rewrite your entire data stack. Start with something small:

  1. Pick a simple data task you do regularly
  2. Implement it with dlt (should take a few minutes - see the sketch below)
  3. Try running it in two different environments
  4. See if it actually saves you time and headaches
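For step 2, something like this is enough - a minimal sketch that loads a small public JSON endpoint into local DuckDB (the URL and table name are placeholders for whatever task you pick):

import dlt
import requests

@dlt.resource(table_name="todos")
def todos():
    # Any small, familiar dataset works; this public endpoint is only a placeholder
    yield requests.get("https://jsonplaceholder.typicode.com/todos", timeout=10).json()

pipeline = dlt.pipeline(pipeline_name="small_test", destination="duckdb", dataset_name="scratch")
print(pipeline.run(todos()))

# For step 3, change destination to "bigquery" or "snowflake" and run the same code again
print(pipeline.dataset()("SELECT count(*) FROM todos").fetchall())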

The bottom line

Data engineering shouldn't require rewriting the same logic multiple times for different environments. dlt's dataset interface gives you one consistent way to work with your data, regardless of where it lives.

If you're tired of the Monday-to-Friday cycle of rewriting and debugging the same logic across different environments, give it a try on your next project.

Your future self will thank you.

Read more about it in the dataset access docs.