Blog//
From compute hours to data moved: a benchmark series
You pay for compute hours; what you actually want is data moved. This post measures the exchange rate across the four bottlenecks that dominate real pipelines: SQL copy, REST APIs, JSON files, and Parquet
Aman Gupta,
Data Engineer
What an hour buys depends on what's in the way
You pay dltHub for compute time. The thing you care about is how much data that time moves. The two aren't the same, and the ratio isn't one number — it's set by whatever is slowest in the path. So we ran four benchmarks on one 2 vCPU / 4 GB worker, one per bottleneck.
| Case | Bottleneck | One hour moves |
|---|---|---|
| Parquet files | memory and I/O | ~170 GB · ~1.1B rows |
| SQL (Postgres) | network and serialization | ~65 GB · ~350M rows |
| JSON files | CPU, type inference | ~4.6 GB · ~47M rows |
| REST (HubSpot, GitHub) | the source's rate limit | whatever the API allows — ~225–275k rows/h in these runs |
Same machine, ~30x spread. The engine isn't the variable. The format and the source are.
A small team usually runs all four: a database on a schedule, a few APIs, some files. Budget about an hour a day for each and the numbers above are what land.
Tools that bill per row charge you for every nested record that fans out on the way in — and as you'll see, the fan-out is real. dltHub bills the time. The rest of this post is the methodology behind the table.
SQL: the engine ceiling
With nothing external in the way — no rate limit, no parsing, no I/O pressure — this is as fast as the engine goes. It's the ceiling the other three cases fall short of, and the reason they fall short tells you where your money goes.
Setup. 2 vCPU / 4 GB runner. Postgres on GCP (4 vCPU / 16 GB / 200 GB SSD, US) → BigQuery (US). pyarrow on the sql_database source. 8/8/8 workers, chunk size 150,000, full refresh. TPC-H at SF 5, 10, 20, 50.
| SF | Rows | Postgres size | Runtime | GB/hour | M rows/hour |
|---|---|---|---|---|---|
| 5 | 43.3M | 8.21 GB | 7m 16s | 67.8 | 357 |
| 10 | 86.6M | 16.41 GB | 14m 58s | 65.8 | 347 |
| 20 | 173.2M | 32.82 GB | 30m 14s | 65.1 | 344 |
| 50 | 433.0M | 82.04 GB | 1h 12m 7s | 68.3 | 360 |
The rate holds from 8 GB to 82 GB. It doesn't drift as the job grows, which is the property you want when you're billed by the hour: you can estimate a 10x larger job by multiplying, not by re-benchmarking.
1 hour = 65 GB, ~350M rows, Postgres to BigQuery — with source, runtime, and destination in the same region.
REST: you don't set the rate, the API does
SQL is network-bound and nothing else. A REST source moves the bottleneck off your machine entirely. Records per request, requests per window, how long a 429 makes you wait — those are the API's numbers, not yours. No worker size changes them.
So the question isn't how fast dltHub is here. It's whether dltHub adds anything on top of the wait. We ran two sources that bracket the range.
Setup. 2 vCPU / 4 GB (US) → BigQuery (US). HubSpot CRM REST API v3 (free trial), cursor pagination, ~100 records/request, 5 retries to 16s. GitHub GraphQL API, microsoft/vscode issues, 100 items/page, sleep-until-reset under 50 credits.
HubSpot — 4,960 records, total 2m 11s (1m 7s extract, 369ms normalize, 1m 3s load).
| Object | Rows |
|---|---|
| Contacts | 950 |
| Companies | 100 |
| Deals | 400 |
| Notes | 2,245 |
| Tasks | 1,265 |
| Total | 4,960 |
Normalize is 369 milliseconds. The run is request round-trips end to end; the worker is idle. That's the normal case for a CRM. Note the fan-out, because it's where per-row billing bites:
- Pulled from the API: 4,960 records.
- Landed in BigQuery: 9,177 — the 4,960 plus 4,217 association rows (companies__contacts, companies__deals, contacts__deals) that normalize into their own tables. 1.85x.
- On dltHub: 2m 11s, ~$0.04 a run, ~$1/month daily. A per-row tool meters the 9,177, not the 2 minutes.
GitHub — 124,875 rows, total 33m 8s (32m 20s extract, 14s normalize, 32s load).
| Table | Rows |
|---|---|
| issues | 30,000 |
| issues__comments | 63,228 |
| issues__comments__reactions | 15,863 |
| issues__reactions | 15,784 |
Extract is 98% of the clock. Normalize and load are 46 seconds combined, for 125k nested rows. The other 32 minutes are the API handing back 100 items at a time and making us sleep until the window resets. That time is not ours to optimize, and a bigger worker doesn't touch it — on compute-hour billing you pay for the wait the source imposes, and that's the whole bill.
Both sources landed roughly 225–275k rows an hour — HubSpot ~252k, GitHub ~226k — but read that as the rate each API allowed on the day, not a number dlt sets. Point a faster API at the same worker and it goes faster; point a stricter one and it crawls.
Files: the format decides, not the engine
SQL and Parquet hand you a fixed shape — the types come with the data. JSON and REST don't: both arrive untyped, and the engine has to work out what every field is. The difference between the slow case and the fast one is exactly that — typed or untyped at the source. We ran both ends on the same worker.
JSON is untyped, so the CPU pays. Every field gets inspected, every type inferred, every nested array split into its own table. On messy data the inference is the job. We built the input to be hostile: 6.9M deals where a field is a number in one record and a string in the next, three date formats, values that are empty or null or missing. Normalize ends up being most of the run.
5 GB JSONL, 6.9M deals, GCS → BigQuery. Total 48m 16s (extract 15m 45s, normalize 30m 19s, load 2m 11s).
| Table | Rows | Size in BigQuery |
|---|---|---|
| deals | 6,902,021 | 1.34 GB |
| deals__contacts | 17,260,564 | 1.73 GB |
| deals__tags | 13,800,852 | 656 MB |
| Total | 37,963,437 | 3.73 GB |
Normalize is 63% of the run, CPU-bound and pinned on a 2-core box. This is the one case in the series where throwing hardware at it actually moves the number — a bigger machine, or splitting normalize across more workers. The work parallelizes; the 2 cores were the limit, not the engine. Rate: ~4.6 GB/hour landed, ~8.6M source records/hour. The 6.9M deals become ~38M rows once contacts and tags split out — size the job by source records, not by what lands, or you'll be off by 5x.
Parquet is typed at the source, so the CPU does nothing. The schema travels with the file. No inference, no type coercion, the engine reads column batches and writes them. No normalisation needed.
TPC-H SF20, 6.44 GB across 8 Parquet files, GCS → BigQuery via Arrow. Total 9m 24s (extract 7m 9s, normalize 8s, load 1m 58s), 166.8M rows at ~389K rows/sec.
| Table | Rows | Size in BigQuery |
|---|---|---|
| lineitem | 119,994,608 | 19.53 GB |
| orders | 30,000,000 | 3.60 GB |
| partsupp | 16,000,000 | 2.47 GB |
| part | 4,000,000 | 535.67 MB |
| customer | 3,000,000 | 519.23 MB |
| supplier | 200,000 | 30.60 MB |
| Total | 173,194,638 | 26.69 GB |
Normalize: 8 seconds, against 30 minutes for JSON. Same engine, same worker. The only thing that changed is the source carried its schema. Rate: ~170 GB/hour, ~1B rows/hour. (6.44 GB of Parquet lands as 26.69 GB in BigQuery — that's compression unpacking on read, not duplication.)
JSON took 48 minutes, Parquet took 9, and Parquet moved 25x more rows doing it. If you control how the data gets written, writing Parquet instead of JSON is the largest single lever on this bill.
dlt runs every bottleneck at its ceiling
Four cases, same worker, ~30x between fastest and slowest.
- SQL ran at the network's ceiling.
- Parquet ran at I/O's ceiling.
- JSON ran the CPU flat out until the cores ran out.
- REST sat at exactly the API's rate and added nothing on top of the wait.
A tool that's worth its compute bill is one that pins each workload against its actual bottleneck and then disappears, so the only thing left to optimize is the bottleneck itself, not the engine in front of it.
Try it on your own data
dltHub bills time, not rows or GB, so the fan-out that runs up a per-row invoice costs you nothing here. The trial includes 30 hours, no credit card. dltHub Pro is $119 for 50 hours, then $1 an hour after that: dlthub.com.