From compute hours to data moved: a benchmark series
You pay for compute hours; what you actually want is data moved. This post measures the exchange rate across the four bottlenecks that dominate real pipelines: SQL copy, REST APIs, JSON files, and Parquet.
Aman Gupta,
Data Engineer
What does an hour of dlthub actually buy you?
You pay money for compute time. What you actually want is data movement. So how do data movement workloads with dlt map to compute hours?
The rate isn't one number. In data movement we typically observe four types of batch loads, each with its own challenges:
- Copying from SQL. Few bottlenecks, mostly network bound.
- REST APIs. The bottleneck is usually the API itself: rate limits or slow responses leave the pipeline waiting.
- JSON files. Schema inference and typing eat CPU. Same shape applies to XML, MongoDB, and similar sources.
- Parquet files. Bottleneck moves to memory and I/O.
So we want to run 4 benchmarks to investigate each of those scenarios.
| # | Scenario | What it measures | Bottleneck |
|---|---|---|---|
| 1 | SQL copy (Postgres → BigQuery, TPC-H) | dltHub engine ceiling on dense relational data | source serialization + network |
| 2 | REST API (coming soon) | throughput when rate limits dominate | source-side rate limits |
| 3 | JSON files (coming soon) | parse + schema inference cost | CPU |
| 4 | Arrow / Parquet (coming soon) | columnar in, columnar out | memory + I/O |
A note on hardware. Every benchmark in this series runs on a 2 vCPU / 4 GB worker. The numbers generalize to similar hardware wherever you run dlthub: your own infrastructure, your own laptop, or another cloud's equivalent shape. Larger workers move the ceiling up only if CPU was the actual bottleneck, which is why this series benchmarks four bottlenecks, not one.
Case 1: Starting with SQL copy
The engine ceiling. When nothing external is in the way — no rate limits, no parsing overhead, no I/O pressure — how fast does dlthub actually move rows from one database to another?
Setup
- Runner: 2 vCPU / 4 GB RAM
- Source: Postgres on GCP, 4 vCPU / 16 GB RAM / 200 GB SSD, US region
- Destination: BigQuery, US region
- Backend: `pyarrow` on the `sql_database` source
- Workers: 8 extract / 8 normalize / 8 load
- Chunk size: 150,000
- Load type: full refresh, single run per scale factor
- Dataset: TPC-H at scale factors 5, 10, 20, 50
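For orientation, here is a minimal sketch of how the settings above might translate into a dlt pipeline. It is not the benchmark script itself. The pipeline name, dataset name, and the `postgres_url` argument are hypothetical placeholders. The worker counts are passed as dlt's environment-variable configuration, and "full refresh" is assumed to correspond to the `replace` write disposition.

```python
import os

# Worker counts from the Setup list, expressed as dlt config environment variables.
WORKER_ENV = {
    "EXTRACT__WORKERS": "8",
    "NORMALIZE__WORKERS": "8",
    "LOAD__WORKERS": "8",
}

# Source options from the Setup list: Arrow backend, 150,000-row chunks.
SOURCE_KWARGS = {"backend": "pyarrow", "chunk_size": 150_000}


def run_copy(postgres_url: str) -> None:
    """Copy every table reachable through postgres_url into BigQuery (sketch)."""
    # Set the worker configuration before dlt resolves its config.
    os.environ.update(WORKER_ENV)

    # Imported lazily so this module still loads where dlt is not installed.
    import dlt
    from dlt.sources.sql_database import sql_database

    source = sql_database(credentials=postgres_url, **SOURCE_KWARGS)

    pipeline = dlt.pipeline(
        pipeline_name="tpch_copy",   # hypothetical name
        destination="bigquery",
        dataset_name="tpch",         # hypothetical name
    )

    # Assumption: "full refresh" means replacing the destination tables on each run.
    load_info = pipeline.run(source, write_disposition="replace")
    print(load_info)
```

Calling `run_copy("postgresql://user:password@host:5432/tpch")` with a real connection string and BigQuery credentials configured would start the copy; the URL shown is a placeholder.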
Results
| SF | Rows | Postgres size | Runtime | GB/hour | M rows/hour |
|---|---|---|---|---|---|
| 5 | 43.3M | 8.21 GB | 7m 16s | 67.8 | 357 |
| 10 | 86.6M | 16.41 GB | 14m 58s | 65.8 | 347 |
| 20 | 173.2M | 32.82 GB | 30m 14s | 65.1 | 344 |
| 50 | 433.0M | 82.04 GB | 1h 12m 7s | 68.3 | 360 |
Linear scale. What works at 8 GB still works at 82 GB without surprises, and the per-hour rate doesn't drift as the workload grows. That's the property you want from a runtime when you're paying by the hour.
1 hour = 65 GB or about 350 million rows of Postgres data to BigQuery.
Worth noting: this result is achieved when co-locating the source, runtime and destination in the same region (in this case US).
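The per-hour columns in the results table can be rechecked from the size, row count, and runtime columns. The short script below does that arithmetic; the function names are ours, and the inputs are the rounded values from the table, so the recomputed rates agree with the published ones only to within rounding.

```python
import re

# (scale factor, million rows, Postgres size in GB, runtime) copied from the results table.
RUNS = [
    (5, 43.3, 8.21, "7m 16s"),
    (10, 86.6, 16.41, "14m 58s"),
    (20, 173.2, 32.82, "30m 14s"),
    (50, 433.0, 82.04, "1h 12m 7s"),
]

SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}


def runtime_seconds(text: str) -> int:
    """Convert a runtime such as '1h 12m 7s' into seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", text)
    return sum(int(number) * SECONDS_PER_UNIT[unit] for number, unit in parts)


def per_hour(amount: float, runtime: str) -> float:
    """Scale an amount moved during `runtime` to a per-hour rate."""
    return amount * 3600 / runtime_seconds(runtime)


for sf, million_rows, size_gb, runtime in RUNS:
    gb_rate = per_hour(size_gb, runtime)
    row_rate = per_hour(million_rows, runtime)
    print(f"SF {sf}: {gb_rate:.1f} GB/hour, {row_rate:.0f} M rows/hour")
```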
Cases 2, 3, 4: coming soon
The SQL number is the ceiling on this hardware under the most favorable conditions. The other three benchmarks measure the same machine against different bottlenecks.
REST APIs. Most production dlthub pipelines aren't copying SQL — they're hitting REST sources. The ceiling there has little to do with our engine and almost everything to do with the source: rate limits, pagination overhead, response sizes, retry behavior under throttling. The next benchmark will isolate dlthub's overhead from the API's natural ceiling, so you can see what we add versus what the source costs you.
JSON files. This is where CPU starts mattering. Schema inference, type coercion, normalization of nested structures — all of it eats cores. We'll measure large semi-structured JSON files and show how throughput changes as record shape gets uglier.
Arrow / Parquet. When the wire format is already columnar, you skip most of the parsing cost — but now memory and I/O become the ceiling. We'll measure throughput on Parquet sources via the Arrow backend, with attention to how chunk size and worker count trade off against memory pressure on a 4 GB worker.
Each scenario gets its own post. Each post adds one more number to the exchange-rate calculation.
In practice: what a team's month looks like
The engine numbers are interesting, and an hour of compute costs only $1, but you're probably wondering what this costs your team at the end of the month.
- Mid-size company data team, daily SQL replication from production Postgres into a warehouse. ~30 GB across multiple tables, daily full refresh. About 15 hours of compute, or $15/month.
- Same team, adding incremental loads every 15 min on top of the daily refresh. Roughly 90 hours, or $90/month.
- Larger team — analytics-heavy startup, several Postgres sources, ~100 GB daily full refresh into BigQuery. About 45 hours/month, or $45, before adding any REST or file workloads.
If your data movement is mostly databases, the monthly number lives in the tens-of-dollars range, not the thousands. The complete math waits until the other three benchmarks land.
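The two full-refresh estimates above can be roughly reproduced from the measured rate. The sketch below assumes 65 GB/hour from the SQL benchmark, $1 per compute hour, and a 30-day month (the month length is our assumption). It does not cover the incremental scenario, because the volume per incremental run is not stated.

```python
GB_PER_HOUR = 65.0      # measured SQL copy rate from the benchmark above
USD_PER_HOUR = 1.0      # hourly price quoted in the post
DAYS_PER_MONTH = 30     # assumption: one full refresh per day, 30-day month


def monthly_hours(daily_gb: float) -> float:
    """Compute hours per month for a daily full refresh of `daily_gb`."""
    return daily_gb / GB_PER_HOUR * DAYS_PER_MONTH


def monthly_cost(daily_gb: float) -> float:
    """Monthly cost in USD at the quoted hourly price."""
    return monthly_hours(daily_gb) * USD_PER_HOUR


for daily_gb in (30, 100):
    hours = monthly_hours(daily_gb)
    print(f"{daily_gb} GB/day: {hours:.1f} hours/month, ${monthly_cost(daily_gb):.2f}/month")
```

The results land close to the rounded figures in the list: a little under 14 hours for 30 GB a day and a little over 46 hours for 100 GB a day.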
Try it yourself
dltHub offers a no-credit-card trial: app.dlthub.com. First two weeks (30h included) are on us. Upgrade any time.