SQL Benchmarking: comparing data pipeline tools
- Aman Gupta, Junior Data Engineer
Transferring data from SQL databases to data warehouses like BigQuery, Redshift, and Snowflake is a common part of data platforms. With so many tools available, how do you choose the right one for your needs? We conducted a detailed benchmark to answer this question, comparing popular tools: Fivetran, Stitch, Airbyte, and the data load tool (dlt).
Our benchmark approach
We analyzed these tools across several key parameters:
- Cost
- Speed
- Ease of use
- Customizability
The dataset: TPC-H benchmark
We used the industry-standard TPC-H dataset for our benchmark, consisting of:
- Total size: ~6.42 GB in PostgreSQL
- 7 tables
- 24,604,860 rows
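For reference, these numbers can be checked directly against the source database. Below is a minimal sketch using psycopg2; the connection string is a placeholder for your environment, and row counts from pg_stat_user_tables are estimates unless you run ANALYZE first.

```python
# Sketch: verify per-table row counts and on-disk sizes in the source PostgreSQL.
# The connection string is a placeholder, not the benchmark's actual config.
import psycopg2

conn = psycopg2.connect("postgresql://loader:loader@localhost:5432/tpch")
with conn, conn.cursor() as cur:
    # n_live_tup is an estimate; run ANALYZE first for accurate counts.
    cur.execute("""
        SELECT relname, n_live_tup, pg_total_relation_size(relid)
        FROM pg_stat_user_tables
        ORDER BY relname
    """)
    total_rows, total_bytes = 0, 0
    for name, rows, size in cur.fetchall():
        total_rows += rows
        total_bytes += size
        print(f"{name}: {rows:,} rows, {size / 1e9:.2f} GB")
    print(f"Total: {total_rows:,} rows, {total_bytes / 1e9:.2f} GB")
```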
Environment used
For Airbyte, Fivetran, and Stitch, we used their native cloud environments. For dlt, we deployed the pipelines as jobs on Google Cloud Run with the following configuration:
- vCPUs: 8
- Memory: 16 GB RAM
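For context, here's a minimal sketch of the kind of pipeline script we containerize for a Cloud Run job. The connection string, pipeline name, and destination details are illustrative placeholders, not the benchmark's exact configuration.

```python
# Sketch of a dlt pipeline script suitable for running as a Cloud Run job.
import dlt
from dlt.sources.sql_database import sql_database

# Read all TPC-H tables from the source PostgreSQL database
# (placeholder connection string).
source = sql_database("postgresql://loader:password@db-host:5432/tpch")

pipeline = dlt.pipeline(
    pipeline_name="tpch_benchmark",
    destination="bigquery",  # destination credentials come from env/config
    dataset_name="tpch",
)

if __name__ == "__main__":
    print(pipeline.run(source))
```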
Key findings
Speed analysis
Here's how the tools performed in terms of pipeline execution speed:
Key Takeaways:
- dlt with the ConnectorX backend was the fastest
- ~2x slower: dlt with the PyArrow or Pandas backends, and Fivetran
- ~6x slower: Airbyte, and dlt's SQLAlchemy backend with the default performance configuration
- ~10x slower: Stitch
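Switching between these dlt backends is a one-argument change on the sql_database source. A sketch (the connection string is a placeholder; the connectorx, pyarrow, and pandas backends need the corresponding packages installed):

```python
from dlt.sources.sql_database import sql_database

# Fastest configuration in our runs; alternatives: "pyarrow", "pandas", "sqlalchemy".
source = sql_database(
    "postgresql://loader:password@db-host:5432/tpch",
    backend="connectorx",
)
```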
Cost analysis
With a sync frequency of once daily, here's how the tools compare in terms of monthly cost:
Key Takeaways:
By using dlt for once-daily syncs, you can achieve substantial cost savings, spending only a fraction of what you'd pay for Airbyte, Fivetran, or Stitch while effectively managing your data pipelines.
Performance tuning for dlt
The two approaches scale differently when tuned: backends that capture types from the source (like ConnectorX) versus backends that infer types from the data (like SQLAlchemy). Let's look at the ConnectorX and SQLAlchemy backends.
1. ConnectorX:
1.1 Parallelization impact
dlt has the option of using parallelized extraction while running the pipeline. Here's how parallelization affects the run times:
- Without parallelism:
  - Extract: 10.9 min
  - Normalize: 0.56 sec
  - Load: 1.13 min
  - Total time: 12.58 min
- With parallelism (using five workers):
  - Extract: 3.53 min
  - Normalize: 1.23 sec
  - Load: 1.13 min
  - Total time: 4.65 min
Key Takeaways:
- Time Reduction: Total pipeline time decreased significantly from 12.58 minutes to 4.65 minutes with parallelization.
- Extraction: Extraction time dropped from 10.9 minutes to 3.53 minutes due to parallel processing.
- Load Time: Load time remained constant at 1.13 minutes in both configurations; this stage is unaffected by extraction parallelism.
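To reproduce the five-worker run, the worker pools for each stage can be set through dlt's configuration, shown here via environment variables (a sketch; the same values can also live in config.toml):

```python
# Sketch: set dlt's worker pools before running the pipeline.
# Variable names follow dlt's EXTRACT__ / NORMALIZE__ / LOAD__ config sections.
import os

os.environ["EXTRACT__WORKERS"] = "5"    # parallel extraction of table chunks
os.environ["NORMALIZE__WORKERS"] = "5"  # parallel normalization processes
os.environ["LOAD__WORKERS"] = "5"       # parallel load jobs to the destination

# ...then build and run the pipeline exactly as in the earlier sketch.
```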
1.2 Worker configuration impact
Next, we experimented with how the number of workers in the extract, normalize, and load stages affected pipeline run times. Since ConnectorX does not require transforming the data, the transform time represents only the overhead of handling the chunks. A small amount of parallelism helps, but not a lot, since database limits and network throughput/transfer time cap the gains.
Key Takeaways:
- Scaling the workers from one to five provided a significant speed boost
- However, increasing the worker count beyond five offered minimal additional gains, as we hit the limits of source/destination response time and network throughput.
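A sketch of how such a sweep can be scripted, timing the same pipeline at increasing worker counts. Connection string and names are placeholders, and each iteration uses a separate pipeline and dataset to keep runs isolated:

```python
# Sketch: time the pipeline at increasing extract worker counts.
import os
import time

import dlt
from dlt.sources.sql_database import sql_database

for workers in (1, 2, 5, 10):
    os.environ["EXTRACT__WORKERS"] = str(workers)
    pipeline = dlt.pipeline(
        pipeline_name=f"tpch_w{workers}",  # separate state per run
        destination="bigquery",
        dataset_name=f"tpch_w{workers}",
    )
    start = time.monotonic()
    pipeline.run(
        sql_database(
            "postgresql://loader:password@db-host:5432/tpch",
            backend="connectorx",
        )
    )
    print(f"{workers} workers: {(time.monotonic() - start) / 60:.2f} min")
```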
2. SQLAlchemy:
2.1 Parallelization impact
- Without parallelism:
- Extract: 18.5 min
- Normalize: 36.23 min
- Load: 11.83 min
- Total time: 69.5 min
- With parallelism (Using twenty workers):
- Extract: 20.27 min
- Normalize: 5.73 min
- Load: 2.5 min
- Total time: 28.5 min
Key Takeaways:
- Normalization time dropped 84% with parallelism (36.23 to 5.73 minutes)
- Load time dropped ~79% with parallelism (11.83 to 2.5 minutes)
- Extract time was essentially unchanged, likely due to database or network bottlenecks
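The twenty-worker run above boils down to raising the worker counts on the stages that benefit, normalize and load. A sketch, with the same placeholder connection string and names as before:

```python
# Sketch: SQLAlchemy backend with normalize/load parallelized to 20 workers.
import os

import dlt
from dlt.sources.sql_database import sql_database

os.environ["NORMALIZE__WORKERS"] = "20"  # normalization is CPU-bound with this backend
os.environ["LOAD__WORKERS"] = "20"       # parallel load jobs to the destination

pipeline = dlt.pipeline(
    pipeline_name="tpch_sqlalchemy",
    destination="bigquery",
    dataset_name="tpch_sa",
)
pipeline.run(
    sql_database(
        "postgresql://loader:password@db-host:5432/tpch",
        backend="sqlalchemy",  # types inferred from the data; slower but most compatible
    )
)
```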
2.2 Worker configuration impact
Since the SQLAlchemy backend deserializes and re-serializes the data, transformation time is large and can be reduced significantly with parallelism.
Overall, leveraging parallelism, we were able to cut the total runtime to about 40% of the single-threaded baseline (69.5 to 28.5 minutes), at which point network and database limitations prevent further acceleration.
Key Takeaway:
A little parallelism goes a long way, but the normalization step is where additional compute makes the biggest difference.
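A related knob worth noting: the sql_database source also takes a chunk_size argument, which controls how many rows each extracted chunk, and hence each normalization unit, carries. A sketch with an illustrative, untuned value:

```python
from dlt.sources.sql_database import sql_database

source = sql_database(
    "postgresql://loader:password@db-host:5432/tpch",
    backend="sqlalchemy",
    chunk_size=100_000,  # rows per chunk; illustrative value, not tuned
)
```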
Conclusion
This thorough benchmark demonstrated a few things:
- SaaS vendors offer unoptimized solutions at a 100-1000x markup over cost
- Network and API bottlenecks significantly limit pipeline performance
- Python performance is not a bottleneck in EL. Once the code is optimized, faster languages or more hardware yield insignificant returns.