Blog /November 19, 2024

Self hosted tools Benchmarking

Aman Gupta,
Jr. Data Engineer

Almost every data stack has SQL as a source. Since SQL RDBMS hasn't really changed how they work fundamentally over the years, you'd expect SQL to SQL copy to be a long solved problem, but as we can see, it's still a challenge people are solving in 2024.

Our benchmark approach

Let’s benchmark the existing OSS solutions for this. Here are the tools we tested:

Meltano - v3.5.4
Airbyte - v1.1.0
dlt - v1.2.0
Sling - 1.2.2

We used the latest versions of the tools available at the time of testing.

The dataset: TPC-H benchmark

We used the industry-standard TPC-H dataset for our benchmark, consisting of:

Total size: ~6.42 GB in PostgreSQL
7 tables
24,604,860 rows

Environment Used

Hosted on Google Cloud VM with the following setup:

CPUs: 8 cores
Memory: 16 GB RAM
CPU Base Frequency: 2.3 GHz

Setup

Each tool's configuration is closely aligned with typical deployment practices. Here’s a brief description of setting up each tool.

1. dlt

Installed via pip and initialized with dlt init.
Configured with config.toml and secrets.toml.
The pipeline was executed using sql_database_pipeline.py.

2. Airbyte

Runs locally with abctl on the VM.
The setup involves configuring sources and destinations through Airbyte’s web UI for interactive pipeline management.

3. Meltano

Installed on the VM and configured for running taps and loaders. meltano.yml.
It includes setup for tap-postgres and target-bigquery, with batch processing in JSONL format.

4. Sling

Installed using pip and configured on the VM.
With essential paths and endpoints defined in config files to support efficient data ingestion.

5. PyAirbyte

Installed using pip and configured on the VM.
Encountered issues due to limited support for certain data sources like "source-postgres" and "source-mysql".

Key findings

Speed Analysis

Here's how the tools performed in terms of pipeline execution speed:

Key Takeaways:

dlt with ConnectorX and PyArrow is fast, outperforming Sling and Airbyte by 2.8x to 6x, depending on the backend.
Sling is slightly faster than SQLAlchemy and Airbyte.
Meltano shows reliable performance but doesn’t match the speed of dlt or other tools in this comparison.

Ease of Use

Here’s how each tool performs in terms of user experience:

1. dlt

Straightforward setup and user-friendly.
Compatible with serverless platforms like Google Cloud Functions or Cloud Run, which reduces costs and retains state between runs.
This makes it an ideal choice for smaller or more extensive data processes.

2. Airbyte

User-friendly with an interactive web UI, though it requires a constantly running VM.
Operationalisation is more complicated and requires extra knowledge and effort to manage serverless.

3. Meltano

Relatively easy to set up and run.
However, data syncing could be faster compared to other tools, making it better suited for larger, less time-sensitive datasets.

4. Sling

Lightweight and simple to use, with minimal configuration required.
Its efficiency makes it a good choice for smaller, near-real-time data processes.

5. PyAirbyte

Installation was straightforward using pip.
However, we encountered challenges due to limited documentation and support for certain data sources.
As a result, we were unable to fully test PyAirbyte in this benchmark.

Performance:

The two approaches (Infer types vs capture types) scale differently when tuned. The backends can be configured as per requirements. Take a look at ConnectorX and SQLAlchemy backends fine tuning here.

To read more about optimizing dlt please refer to our documentation here.

Conclusion

While SQL-to-SQL data replication is a common need, not all tools perform equally. dlt stands out with its superior speed and ease of use, especially when utilizing ConnectorX and PyArrow. Sling offers a lightweight and efficient solution.

Airbyte and Meltano provide user-friendly interfaces and reliable performance but may require more resources or are better suited for less time-sensitive data.

We look forward to revisiting PyAirbyte once source support is expanded and the documentation is improved.

Need help optimising your pipelines?

Reach out to our solutions engineers to get your team trained on dlt best practices and its ecosystem, completely customized to your data stack.

cognee - Scalable Data Layer for AI Apps - ECL Pipelines

Data mesh as a requirement in decentralised energy

Our benchmark approachLink icon

The dataset: TPC-H benchmarkLink icon

Environment UsedLink icon

SetupLink icon

Key findingsLink icon

Speed AnalysisLink icon

Ease of UseLink icon

Performance:Link icon

ConclusionLink icon

Need help optimising your pipelines?Link icon