
Benchmarking self-hosted tools

  • Aman Gupta,
    Jr. Data Engineer

Almost every data stack has SQL as a source. Since SQL RDBMSs haven't fundamentally changed how they work over the years, you'd expect SQL-to-SQL copy to be a solved problem. Instead, here we are.

Our benchmark approach

Let’s benchmark the existing OSS solutions for this. Here are the tools we tested:

  • Meltano - v3.5.4
  • Airbyte - v1.1.0
  • dlt - v1.2.0
  • Sling - v1.2.2

We used the latest versions of the tools available at the time of testing.

The dataset: TPC-H benchmark

We used the industry-standard TPC-H dataset for our benchmark, consisting of:

  • Total size: ~6.42 GB in PostgreSQL
  • 7 tables
  • 24,604,860 rows

Environment Used

Hosted on a Google Cloud VM with the following setup:

  • CPUs: 8 cores
  • Memory: 16 GB RAM
  • CPU Base Frequency: 2.3 GHz

Setup

Each tool's configuration is closely aligned with typical deployment practices. Here’s a brief description of setting up each tool.

1. dlt

  • Installed via pip and initialized with dlt init.
  • Configured with config.toml and secrets.toml.
  • The pipeline was executed using sql_database_pipeline.py.
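
    For reference, here is a minimal sketch of the kind of script we ran. The table selection, pipeline name, and BigQuery dataset are illustrative, and Postgres and BigQuery credentials are assumed to live in .dlt/secrets.toml.

    ```python
    # sql_database_pipeline.py - minimal sketch, not the exact benchmark script.
    import dlt
    from dlt.sources.sql_database import sql_database

    def load_tpch() -> None:
        # Pick the TPC-H tables to copy (illustrative selection).
        source = sql_database().with_resources("lineitem", "orders", "customer")

        pipeline = dlt.pipeline(
            pipeline_name="tpch_postgres_to_bigquery",
            destination="bigquery",
            dataset_name="tpch",
        )
        print(pipeline.run(source))

    if __name__ == "__main__":
        load_tpch()
    ```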

2. Airbyte

  • Runs locally with abctl on the VM.
  • The setup involves configuring sources and destinations through Airbyte’s web UI for interactive pipeline management.

3. Meltano

  • Installed on the VM, with taps and loaders configured in meltano.yml.
  • The project includes tap-postgres and target-bigquery, with batch processing in JSONL format.

4. Sling

  • Installed via pip and configured on the VM.
  • Essential paths and endpoints are defined in config files to support efficient data ingestion.

Key findings

Speed Analysis

Here's how the tools performed in terms of pipeline execution speed:

Key Takeaways:

  • dlt with the ConnectorX and PyArrow backends is fast, outperforming Sling and Airbyte by 2.8x to 6x, depending on the backend.
  • Sling is slightly faster than dlt's SQLAlchemy backend and Airbyte.
  • Meltano shows reliable performance but doesn't match the speed of dlt or the other tools in this comparison.

Ease of Use

Here’s how each tool performs in terms of user experience:

1. dlt

  • Straightforward to set up and user-friendly.
  • Compatible with serverless platforms like Google Cloud Functions or Cloud Run, which reduces costs and retains state between runs (see the sketch after this list).
  • This makes it a good fit for both smaller and larger data workloads.
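
    To illustrate the serverless fit, here is a minimal sketch of an HTTP-triggered Cloud Function that runs a dlt pipeline. The entry-point name and pipeline settings are hypothetical, and dlt plus the sql_database source are assumed to be installed, with credentials supplied via the function's environment or secrets.

    ```python
    # main.py: hypothetical Cloud Functions entry point running a dlt pipeline.
    # Assumes dlt and the sql_database source are in requirements.txt and that
    # source/destination credentials are provided via environment or secrets.
    import dlt
    from dlt.sources.sql_database import sql_database

    def run_pipeline(request):
        pipeline = dlt.pipeline(
            pipeline_name="tpch_postgres_to_bigquery",
            destination="bigquery",
            dataset_name="tpch",
        )
        load_info = pipeline.run(sql_database())
        return str(load_info)
    ```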

2. Airbyte

  • User-friendly with an interactive web UI, though it requires a constantly running VM.
  • This may complicate long-term maintenance as it’s not serverless-compatible.

3. Meltano

  • Relatively easy to set up and run.
  • However, data syncing is slower than with the other tools, making it better suited for larger, less time-sensitive datasets.

4. Sling

  • Lightweight and simple to use, with minimal configuration required.
  • Its efficiency makes it a good choice for smaller, near-real-time data processes.

Performance

The two approaches (inferring types vs. capturing them from the source) scale differently when tuned, and the backends can be configured to fit your requirements. Take a look at the ConnectorX and SQLAlchemy backend fine-tuning here.
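
As a rough illustration of how the backend is switched (the pipeline and dataset names are illustrative, and credentials are assumed to be in secrets.toml):

```python
import dlt
from dlt.sources.sql_database import sql_database

# backend="sqlalchemy" (the default) yields rows and lets dlt infer types;
# backend="connectorx" or "pyarrow" extract Arrow data and capture the
# source types, which is what produced the fastest runs in this benchmark.
source = sql_database(backend="connectorx")

pipeline = dlt.pipeline(
    pipeline_name="tpch_backend_test",
    destination="bigquery",
    dataset_name="tpch",
)
print(pipeline.run(source))
```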

To read more about optimizing dlt, please refer to our documentation here.

Conclusion

Need help optimizing your pipelines?

Reach out to our solutions engineers to get your team trained on dlt best practices and its ecosystem, completely customized to your data stack.