Self hosted tools Benchmarking
- Aman Gupta,
Jr. Data Engineer
Almost every data stack has SQL as a source. Since SQL RDBMS hasn't really changed how they work fundamentally over the years, you'd expect SQL to SQL copy to be a solved problem. Instead, here we are.
Our benchmark approach
Let’s benchmark the existing OSS solutions for this. Here are the tools we tested:
- Meltano - v3.5.4
- Airbyte - v1.1.0
- dlt - v1.2.0
- Sling - 1.2.2
We used the latest versions of the tools available at the time of testing.
The dataset: TPC-H benchmark
We used the industry-standard TPC-H dataset for our benchmark, consisting of:
- Total size: ~6.42 GB in PostgreSQL
- 7 tables
- 24,604,860 rows
Environment Used
Hosted on Google Cloud VM with the following setup:
- CPUs: 8 cores
- Memory: 16 GB RAM
- CPU Base Frequency: 2.3 GHz
Setup
Each tool's configuration is closely aligned with typical deployment practices. Here’s a brief description of setting up each tool.
1. dlt
- Installed via
pip
and initialized withdlt init
. - Configured with
config.toml
andsecrets.toml.
- The pipeline was executed using
sql_database_pipeline.py
.
2. Airbyte
- Runs locally with
abctl
on the VM. - The setup involves configuring sources and destinations through Airbyte’s web UI for interactive pipeline management.
3. Meltano
- Installed on the VM and configured for running taps and loaders.
meltano.yml.
- It includes setup for
tap-postgres
andtarget-bigquery
, with batch processing in JSONL format.
4. Sling
- Installed using
pip
and configured on the VM. - With essential paths and endpoints defined in config files to support efficient data ingestion.
Key findings
Speed Analysis
Here's how the tools performed in terms of pipeline execution speed:
Key Takeaways:
- dlt with ConnectorX and PyArrow is fast, outperforming Sling and Airbyte by 2.8x to 6x, depending on the backend.
- Sling is slightly faster than SQLAlchemy and Airbyte.
- Meltano shows reliable performance but doesn’t match the speed of dlt or other tools in this comparison.
Ease of Use
Here’s how each tool performs in terms of user experience:
1. dlt
- Straightforward setup and user-friendly.
- Compatible with serverless platforms like Google Cloud Functions or Cloud Run, which reduces costs and retains state between runs.
- This makes it an ideal choice for smaller or more extensive data processes.
2. Airbyte
- User-friendly with an interactive web UI, though it requires a constantly running VM.
- This may complicate long-term maintenance as it’s not serverless-compatible.
3. Meltano
- Relatively easy to set up and run.
- However, data syncing could be faster compared to other tools, making it better suited for larger, less time-sensitive datasets.
4. Sling
- Lightweight and simple to use, with minimal configuration required.
- Its efficiency makes it a good choice for smaller, near-real-time data processes.
Performance:
The two approaches (Infer types vs capture types) scale differently when tuned. The backends can be configured as per requirements. Take a look at ConnectorX and SQLAlchemy backends fine tuning here.
To read more about optimizing dlt please refer to our documentation here.