Blog/June 1, 2026/

Engineering,
Product

From compute hours to data moved: a benchmark series

You pay for compute hours; what you actually want is data moved. This post measures the exchange rate across the four bottlenecks that dominate real pipelines: SQL copy, REST APIs, JSON files, and Parquet

Aman Gupta,
Data Engineer

On this page

What an hour buys depends on what's in the way
SQL: the engine ceiling
REST: you don't set the rate, the API does
Files: the format decides, not the engine
dlt runs every bottleneck at its ceiling
Try it on your own data

What an hour buys depends on what's in the way

You pay dltHub for compute time. The thing you care about is how much data that time moves. The two aren't the same, and the ratio isn't one number — it's set by whatever is slowest in the path. So we ran four benchmarks on one 2 vCPU / 4 GB worker, one per bottleneck.

Case	Bottleneck	One hour moves
Parquet files	memory and I/O	~170 GB · ~1.1B rows
SQL (Postgres)	network and serialization	~65 GB · ~350M rows
JSON files	CPU, type inference	~4.6 GB · ~47M rows
REST (HubSpot, GitHub)	the source's rate limit	whatever the API allows — ~225–275k rows/h in these runs

Same machine, ~30x spread. The engine isn't the variable. The format and the source are.

A small team usually runs all four: a database on a schedule, a few APIs, some files. Budget about an hour a day for each and the numbers above are what land.

Tools that bill per row charge you for every nested record that fans out on the way in — and as you'll see, the fan-out is real. dltHub bills the time. The rest of this post is the methodology behind the table.

SQL: the engine ceiling

With nothing external in the way — no rate limit, no parsing, no I/O pressure — this is as fast as the engine goes. It's the ceiling the other three cases fall short of, and the reason they fall short tells you where your money goes.

Setup. 2 vCPU / 4 GB runner. Postgres on GCP (4 vCPU / 16 GB / 200 GB SSD, US) → BigQuery (US). pyarrow on the sql_database source. 8/8/8 workers, chunk size 150,000, full refresh. TPC-H at SF 5, 10, 20, 50.

SF	Rows	Postgres size	Runtime	GB/hour	M rows/hour
5	43.3M	8.21 GB	7m 16s	67.8	357
10	86.6M	16.41 GB	14m 58s	65.8	347
20	173.2M	32.82 GB	30m 14s	65.1	344
50	433.0M	82.04 GB	1h 12m 7s	68.3	360

The rate holds from 8 GB to 82 GB. It doesn't drift as the job grows, which is the property you want when you're billed by the hour: you can estimate a 10x larger job by multiplying, not by re-benchmarking.

1 hour = 65 GB, ~350M rows, Postgres to BigQuery — with source, runtime, and destination in the same region.

REST: you don't set the rate, the API does

SQL is network-bound and nothing else. A REST source moves the bottleneck off your machine entirely. Records per request, requests per window, how long a 429 makes you wait — those are the API's numbers, not yours. No worker size changes them.

So the question isn't how fast dltHub is here. It's whether dltHub adds anything on top of the wait. We ran two sources that bracket the range.

Setup. 2 vCPU / 4 GB (US) → BigQuery (US). HubSpot CRM REST API v3 (free trial), cursor pagination, ~100 records/request, 5 retries to 16s. GitHub GraphQL API, microsoft/vscode issues, 100 items/page, sleep-until-reset under 50 credits.

HubSpot — 4,960 records, total 2m 11s (1m 7s extract, 369ms normalize, 1m 3s load).

Object	Rows
Contacts	950
Companies	100
Deals	400
Notes	2,245
Tasks	1,265
Total	4,960

Normalize is 369 milliseconds. The run is request round-trips end to end; the worker is idle. That's the normal case for a CRM. Note the fan-out, because it's where per-row billing bites:

Pulled from the API: 4,960 records.
Landed in BigQuery: 9,177 — the 4,960 plus 4,217 association rows (companies__contacts, companies__deals, contacts__deals) that normalize into their own tables. 1.85x.
On dltHub: 2m 11s, ~$0.04 a run, ~$1/month daily. A per-row tool meters the 9,177, not the 2 minutes.

GitHub — 124,875 rows, total 33m 8s (32m 20s extract, 14s normalize, 32s load).

Table	Rows
issues	30,000
issues__comments	63,228
issues__comments__reactions	15,863
issues__reactions	15,784

Extract is 98% of the clock. Normalize and load are 46 seconds combined, for 125k nested rows. The other 32 minutes are the API handing back 100 items at a time and making us sleep until the window resets. That time is not ours to optimize, and a bigger worker doesn't touch it — on compute-hour billing you pay for the wait the source imposes, and that's the whole bill.

Both sources landed roughly 225–275k rows an hour — HubSpot ~252k, GitHub ~226k — but read that as the rate each API allowed on the day, not a number dlt sets. Point a faster API at the same worker and it goes faster; point a stricter one and it crawls.

Files: the format decides, not the engine

SQL and Parquet hand you a fixed shape — the types come with the data. JSON and REST don't: both arrive untyped, and the engine has to work out what every field is. The difference between the slow case and the fast one is exactly that — typed or untyped at the source. We ran both ends on the same worker.

JSON is untyped, so the CPU pays. Every field gets inspected, every type inferred, every nested array split into its own table. On messy data the inference is the job. We built the input to be hostile: 6.9M deals where a field is a number in one record and a string in the next, three date formats, values that are empty or null or missing. Normalize ends up being most of the run.

5 GB JSONL, 6.9M deals, GCS → BigQuery. Total 48m 16s (extract 15m 45s, normalize 30m 19s, load 2m 11s).

Table	Rows	Size in BigQuery
deals	6,902,021	1.34 GB
deals__contacts	17,260,564	1.73 GB
deals__tags	13,800,852	656 MB
Total	37,963,437	3.73 GB

Normalize is 63% of the run, CPU-bound and pinned on a 2-core box. This is the one case in the series where throwing hardware at it actually moves the number — a bigger machine, or splitting normalize across more workers. The work parallelizes; the 2 cores were the limit, not the engine. Rate: ~4.6 GB/hour landed, ~8.6M source records/hour. The 6.9M deals become ~38M rows once contacts and tags split out — size the job by source records, not by what lands, or you'll be off by 5x.

Parquet is typed at the source, so the CPU does nothing. The schema travels with the file. No inference, no type coercion, the engine reads column batches and writes them. No normalisation needed.

TPC-H SF20, 6.44 GB across 8 Parquet files, GCS → BigQuery via Arrow. Total 9m 24s (extract 7m 9s, normalize 8s, load 1m 58s), 166.8M rows at ~389K rows/sec.

Table	Rows	Size in BigQuery
lineitem	119,994,608	19.53 GB
orders	30,000,000	3.60 GB
partsupp	16,000,000	2.47 GB
part	4,000,000	535.67 MB
customer	3,000,000	519.23 MB
supplier	200,000	30.60 MB
Total	173,194,638	26.69 GB

Normalize: 8 seconds, against 30 minutes for JSON. Same engine, same worker. The only thing that changed is the source carried its schema. Rate: ~170 GB/hour, ~1B rows/hour. (6.44 GB of Parquet lands as 26.69 GB in BigQuery — that's compression unpacking on read, not duplication.)

JSON took 48 minutes, Parquet took 9, and Parquet moved 25x more rows doing it. If you control how the data gets written, writing Parquet instead of JSON is the largest single lever on this bill.

dlt runs every bottleneck at its ceiling

Four cases, same worker, ~30x between fastest and slowest.

SQL ran at the network's ceiling.
Parquet ran at I/O's ceiling.
JSON ran the CPU flat out until the cores ran out.
REST sat at exactly the API's rate and added nothing on top of the wait.

A tool that's worth its compute bill is one that pins each workload against its actual bottleneck and then disappears, so the only thing left to optimize is the bottleneck itself, not the engine in front of it.

Try it on your own data

dltHub bills time, not rows or GB, so the fan-out that runs up a per-row invoice costs you nothing here. The trial includes 30 hours, no credit card. dltHub is $119 for 50 hours, then $1 an hour after that: dlthub.com.

Blog/June 1, 2026/

Engineering,
Product

From compute hours to data moved: a benchmark series

Aman Gupta,
Data Engineer

On this page

What an hour buys depends on what's in the way
SQL: the engine ceiling
REST: you don't set the rate, the API does
Files: the format decides, not the engine
dlt runs every bottleneck at its ceiling
Try it on your own data

What an hour buys depends on what's in the way

Case	Bottleneck	One hour moves
Parquet files	memory and I/O	~170 GB · ~1.1B rows
SQL (Postgres)	network and serialization	~65 GB · ~350M rows
JSON files	CPU, type inference	~4.6 GB · ~47M rows
REST (HubSpot, GitHub)	the source's rate limit	whatever the API allows — ~225–275k rows/h in these runs

Same machine, ~30x spread. The engine isn't the variable. The format and the source are.

A small team usually runs all four: a database on a schedule, a few APIs, some files. Budget about an hour a day for each and the numbers above are what land.

SQL: the engine ceiling

SF	Rows	Postgres size	Runtime	GB/hour	M rows/hour
5	43.3M	8.21 GB	7m 16s	67.8	357
10	86.6M	16.41 GB	14m 58s	65.8	347
20	173.2M	32.82 GB	30m 14s	65.1	344
50	433.0M	82.04 GB	1h 12m 7s	68.3	360

1 hour = 65 GB, ~350M rows, Postgres to BigQuery — with source, runtime, and destination in the same region.

REST: you don't set the rate, the API does

So the question isn't how fast dltHub is here. It's whether dltHub adds anything on top of the wait. We ran two sources that bracket the range.

HubSpot — 4,960 records, total 2m 11s (1m 7s extract, 369ms normalize, 1m 3s load).

Object	Rows
Contacts	950
Companies	100
Deals	400
Notes	2,245
Tasks	1,265
Total	4,960

Normalize is 369 milliseconds. The run is request round-trips end to end; the worker is idle. That's the normal case for a CRM. Note the fan-out, because it's where per-row billing bites:

Pulled from the API: 4,960 records.
Landed in BigQuery: 9,177 — the 4,960 plus 4,217 association rows (companies__contacts, companies__deals, contacts__deals) that normalize into their own tables. 1.85x.
On dltHub: 2m 11s, ~$0.04 a run, ~$1/month daily. A per-row tool meters the 9,177, not the 2 minutes.

GitHub — 124,875 rows, total 33m 8s (32m 20s extract, 14s normalize, 32s load).

Table	Rows
issues	30,000
issues__comments	63,228
issues__comments__reactions	15,863
issues__reactions	15,784

Files: the format decides, not the engine

5 GB JSONL, 6.9M deals, GCS → BigQuery. Total 48m 16s (extract 15m 45s, normalize 30m 19s, load 2m 11s).

Table	Rows	Size in BigQuery
deals	6,902,021	1.34 GB
deals__contacts	17,260,564	1.73 GB
deals__tags	13,800,852	656 MB
Total	37,963,437	3.73 GB

TPC-H SF20, 6.44 GB across 8 Parquet files, GCS → BigQuery via Arrow. Total 9m 24s (extract 7m 9s, normalize 8s, load 1m 58s), 166.8M rows at ~389K rows/sec.

Table	Rows	Size in BigQuery
lineitem	119,994,608	19.53 GB
orders	30,000,000	3.60 GB
partsupp	16,000,000	2.47 GB
part	4,000,000	535.67 MB
customer	3,000,000	519.23 MB
supplier	200,000	30.60 MB
Total	173,194,638	26.69 GB

JSON took 48 minutes, Parquet took 9, and Parquet moved 25x more rows doing it. If you control how the data gets written, writing Parquet instead of JSON is the largest single lever on this bill.

dlt runs every bottleneck at its ceiling

Four cases, same worker, ~30x between fastest and slowest.

SQL ran at the network's ceiling.
Parquet ran at I/O's ceiling.
JSON ran the CPU flat out until the cores ran out.
REST sat at exactly the API's rate and added nothing on top of the wait.

What an hour buys depends on what's in the wayLink icon

SQL: the engine ceilingLink icon

REST: you don't set the rate, the API doesLink icon

Files: the format decides, not the engineLink icon

dlt runs every bottleneck at its ceilingLink icon

Try it on your own dataLink icon

What an hour buys depends on what's in the wayLink icon

SQL: the engine ceilingLink icon

REST: you don't set the rate, the API doesLink icon

Files: the format decides, not the engineLink icon

dlt runs every bottleneck at its ceilingLink icon

Try it on your own dataLink icon

What an hour buys depends on what's in the way

SQL: the engine ceiling

REST: you don't set the rate, the API does

Files: the format decides, not the engine

dlt runs every bottleneck at its ceiling

Try it on your own data

What an hour buys depends on what's in the way

SQL: the engine ceiling

REST: you don't set the rate, the API does

Files: the format decides, not the engine

dlt runs every bottleneck at its ceiling

Try it on your own data