
Migrate your SQL data pipeline from Fivetran to dlt

  • Aman Gupta,
    Junior Data Engineer

Introduction

SQL exists in many flavors, including MySQL, Postgres, SQLite, Oracle, Microsoft SQL Server, and more. Organizations often need to send their SQL data to destinations like BigQuery, Redshift, Snowflake, and Databricks to enable advanced analytical capabilities. Tools like Fivetran are a good choice for migrating and syncing data from SQL to these destinations. However, there are scenarios where you need more control and customization over your data migration. This is where open-source tools like dlt come into play.

In this guide, we will explore how to migrate your SQL data pipeline from Fivetran to dlt. While we focus on MySQL in this example, a similar approach can be applied to other SQL databases.

The need for migration: Recognizing the call

We conducted a deep-dive comparative analysis of Fivetran and dlt. Fivetran excels in ease of use, built-in logging, and monitoring. On the other hand, dlt offers key advantages in cost, speed, customization, transformation capabilities, granularity, and control. These advantages make dlt the better choice for data teams that run their own pipelines.

The major factors that make dlt a winner are cost and speed. In Fivetran, syncing 4 tables of 50k rows every 6 hours costs roughly $1,000 per month. If your tables have millions of rows, the cost can be prohibitively high. dlt has no such pricing tiers based on data volume: the library is free no matter how big your data is, and running it on typical infrastructure costs somewhere in the region of $10/month, give or take, depending on the provider. In the case study below, Taktile uses dlt with AWS Lambda to build a low-code, high-volume data processing platform.

Read the full case study: How Taktile built their compliant data stack with dlt and Snowflake.

Speed is another critical factor. dlt scales through iterators, chunking, and parallelization, and yields pages of rows instead of single rows, which makes it faster than Fivetran. Some dlt users have reported a 5-10x performance boost after migrating from Fivetran to dlt. We also benchmarked loading speed across various dlt backends and Fivetran:

Dataset used: UNSW-NB15
Rows synced: 2.1 million

So, from the results above we can see that dlt improved loading speed by 3x.
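To make those performance levers concrete, here is a minimal sketch of how you might tune the sql_database verified source (which we set up later in this guide). The connection string, table name, and chosen backend are illustrative assumptions, not a benchmark recipe:

import dlt
from sql_database import sql_database  # module created by `dlt init sql_database bigquery`

# Placeholder connection string and table list, for illustration only
source = sql_database(
    "mysql+pymysql://user:password@host:3306/mydb",
    table_names=["contact"],
    backend="pyarrow",   # arrow-based extraction instead of row-by-row SQLAlchemy objects
    chunk_size=100_000,  # yield pages of rows rather than single rows
)

pipeline = dlt.pipeline(
    pipeline_name="speed_test",
    destination="bigquery",
    dataset_name="raw_data",
)
print(pipeline.run(source))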

In this other case study, Yummy.eu reduced its costs by 182x and now loads data 10x faster using dlt.

For any organization that wants to reduce its data ingestion costs and improve ingestion speed, open-source dlt seems to be a better choice.

Preparing for the migration

In this tutorial, we'll migrate a SQL-to-BigQuery pipeline running on Fivetran over to dlt, as follows:

Step 1: Pause Fivetran pipeline

Let's assume you already have a Fivetran pipeline built and running. The first step is to pause it:

Steps to follow:

  1. Navigate to the "Connectors" section on the left panel.
  2. Choose the MySQL connector used for transferring data.
  3. Click the pause button to stop data syncing from MySQL to BigQuery.
  4. Note the date and time at which you pause the pipeline; you will need it when configuring the dlt pipeline.

Let's look at the "contacts" table that Fivetran created in BigQuery, named "Fivetran_contacts". Here's the schema:

DB diagram: Fivetran_contacts.

IMPORTANT: In this guide, we will focus on migrating the 'contacts' table. The approach outlined can be similarly applied to migrating any other table from any SQL source or any other source.

Step 2: Create a new dlt Pipeline with incremental loading

With the Fivetran connector paused, any new data flowing into SQL after the migration cutoff needs to be captured. Here, dlt takes over: we create a pipeline that handles incremental loads, ensuring no new data is lost in transition.

Here are the steps:

2.1. Initialize the dlt Pipeline

To begin, initialize a dlt pipeline for the SQL database source. This involves setting up the pipeline with the desired source (a SQL database) and destination (BigQuery, in this case).

dlt init sql_database bigquery


This command creates a new directory with the necessary files and configurations.

2.2. Pipeline Setup Guide

To learn how to set up the SQL database-to-BigQuery dlt pipeline, please read our verified source setup guide here.

2.3. Set up the dlt Pipeline

Next, we'll set up a pipeline to extract data from MySQL and load it into BigQuery. We'll configure this pipeline to load the data incrementally.

How to do incremental loading with dlt?

  • Here is a walkthrough for SQL tables in our docs: Link to the new docs
  • For more information about incremental loading, please refer to the official documentation here.

An example:

dlt enables you to load data incrementally. In this context, we need to give our pipeline a point in time after which it should fetch data from MySQL, so we'll use the created column to fetch records created after the initial load date (which, in our case, is the date the Fivetran pipeline was paused).

dlt uses sources and resources to represent data sources and their endpoints/tables. In the SQL pipeline, our source is the database schema and each table is a sql_table resource. Here, we load the "Contact" table incrementally using the corresponding resource, starting from an initial value (where we left off in Fivetran). This allows us to load only data added after a specific date and time.

Here’s an example of a pipeline loading the “Contact” table incrementally.

Incremental Loading
import dlt
from datetime import datetime

# sql_table comes from the sql_database verified source created by `dlt init`
from sql_database import sql_table


def load_contact_table_incrementally() -> None:
    """Load the "Contact" table incrementally with the standalone sql_table resource."""
    pipeline = dlt.pipeline(
        pipeline_name="mysql_database",
        destination="bigquery",
        dataset_name="dlt_contacts",
    )

    # Load the table incrementally, starting at the datetime when Fivetran was paused
    contact = sql_table(
        table="Contact",
        incremental=dlt.sources.incremental(
            "Created", initial_value=datetime(2024, 4, 1, 0, 0, 0)
        ),
    )

    # Run the pipeline; "merge" upserts on the primary key reflected from MySQL
    info = pipeline.run(contact, write_disposition="merge")

    # Print the load info
    print(info)


if __name__ == "__main__":
    load_contact_table_incrementally()

In this function, the Created column is used as the incremental cursor to load only new records from the Contact table.

To learn more about adding incremental configuration to SQL resources, please refer to our documentation here.
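If you load tables through the sql_database source rather than the standalone sql_table resource, you can attach the same incremental configuration with apply_hints. A minimal sketch, assuming the same Contact table and Created column, with credentials read from .dlt/secrets.toml:

import dlt
from datetime import datetime

from sql_database import sql_database  # module created by `dlt init sql_database bigquery`

# Select only the Contact table from the reflected schema
source = sql_database().with_resources("Contact")

# Attach the incremental cursor to the existing resource
source.resources["Contact"].apply_hints(
    incremental=dlt.sources.incremental(
        "Created", initial_value=datetime(2024, 4, 1, 0, 0, 0)
    )
)

pipeline = dlt.pipeline(
    pipeline_name="mysql_database",
    destination="bigquery",
    dataset_name="dlt_contacts",
)
print(pipeline.run(source, write_disposition="merge"))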

Step 3: Running the Pipeline

Now, you can run the pipeline:

python sql_database_pipeline.py


Here’s the “Contact” table schema created by dlt named dlt_contacts:

DB diagram: dlt_contacts.
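To sanity-check the load, you can query the destination through the pipeline's SQL client. A minimal sketch, assuming the pipeline above has already run and that dlt normalized the table name to contact:

import dlt

# Re-attach to the existing pipeline by name; destination and dataset must match
pipeline = dlt.pipeline(
    pipeline_name="mysql_database",
    destination="bigquery",
    dataset_name="dlt_contacts",
)

with pipeline.sql_client() as client:
    # Count the rows dlt has loaded so far
    with client.execute_query("SELECT COUNT(*) FROM contact") as cursor:
        print(cursor.fetchall())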

Staging layer: Why do we need it?

To create a final view that includes both historical data from Fivetran and new data from dlt, we will treat the Fivetran and dlt tables as staging tables. Staging tables serve as intermediate storage areas where data can be cleaned, transformed, and validated before being loaded into the final destination. This approach facilitates merging the datasets from Fivetran and dlt, ensuring a cohesive and accurate final view.

Creating the unified views

Moving forward, we need to create a unified view where the data from Fivetran and dlt converges into a single perspective. This merging is orchestrated using SQL views, ensuring a clean integration of both data sources.

SQL
CREATE OR REPLACE VIEW `project-id.dataset.merged_contacts` AS
WITH fivetran_contacts AS (
  SELECT
    contactid AS contact_id,
    about,
    addedby AS added_by,
    assistant,
    avgrating AS avg_rating,
    credit_score,
    business,
    commpreferred AS comm_preferred,
    reference,
    created,
    CAST(NULL AS STRING) AS _dlt_load_id,  -- dlt metadata
    CAST(NULL AS STRING) AS _dlt_id,       -- dlt metadata
    _fivetran_deleted,                     -- Fivetran metadata
    _fivetran_synced                       -- Fivetran metadata
  FROM
    `project-id.fivetran_dataset.contact`
  WHERE created < '2024-07-01T12:30:02'    -- datetime when Fivetran was paused
),

dlthub_contacts AS (
  SELECT
    contact_id,
    about,
    added_by,
    assistant,
    avg_rating,
    credit_score,
    business,
    comm_preferred,
    reference,
    created,
    _dlt_load_id,                                -- dlt metadata
    _dlt_id,                                     -- dlt metadata
    CAST(NULL AS BOOLEAN) AS _fivetran_deleted,  -- Fivetran metadata
    CAST(NULL AS TIMESTAMP) AS _fivetran_synced  -- Fivetran metadata
  FROM
    `project-id.dlt_dataset.contacts`
  WHERE
    created >= '2024-07-01T12:30:02'             -- datetime when Fivetran was paused
)

SELECT * FROM fivetran_contacts
UNION ALL
SELECT * FROM dlthub_contacts;


This SQL snippet merges the contacts data from the Fivetran and dlt sources into a unified view.

To unify the data from Fivetran with dlt, careful attention must be given to naming conventions and data types. You can use the CAST function to change data types as needed, as demonstrated in the example above.

Here is the schema for the merged view:

DB diagram: merged_view.

The methodology outlined in this guide ensures a smooth transition of your data pipeline from Fivetran to dlt.

About Metadata

In Fivetran, each row includes two important fields: _fivetran_deleted and _fivetran_synced. The _fivetran_deleted field is a boolean that indicates whether a record has been deleted in the source system. The _fivetran_synced field is a timestamp that marks the last time the record was updated.

To achieve similar functionality in dlt, we can use the SCD2 (Slowly Changing Dimension Type 2) loading strategy. This approach introduces two additional columns to your dataset: _dlt_valid_from and _dlt_valid_to. These columns track the temporal validity of the data, essentially recording when a record was active in the source system. Note, however, that this method applies only to new data ingested by dlt.
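Enabling this is a matter of choosing the scd2 merge strategy on the resource. A minimal sketch, assuming the same Contact table as above; dlt then adds the _dlt_valid_from and _dlt_valid_to columns on load:

import dlt
from sql_database import sql_table  # module created by `dlt init sql_database bigquery`

contact = sql_table(table="Contact")

# Track record validity with the scd2 merge strategy
contact.apply_hints(
    write_disposition={"disposition": "merge", "strategy": "scd2"}
)

pipeline = dlt.pipeline(
    pipeline_name="mysql_database",
    destination="bigquery",
    dataset_name="dlt_contacts",
)
print(pipeline.run(contact))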

Independent of the loading strategy, dlt assigns two key fields to each row: _dlt_id and _dlt_load_id. The _dlt_id uniquely identifies each row, while the _dlt_load_id is unique to each load operation. This metadata is essential for tracking and auditing data loads, particularly in the context of incremental loading.

Conclusion

In the article above, we explored the process of migrating the "contacts" table from a MySQL database to BigQuery. This methodology can be adapted for other databases like PostgreSQL, SQLite, Oracle, and Microsoft SQL Server, for migrating to various destinations supported by dlt. If you have a new destination in mind, you can either request it or contribute to the development yourself. The principles discussed remain consistent, regardless of the type of data. By following these straightforward steps, you can transition your data ingestion from Fivetran to dlt.

Call to action

Need to move from another vendor to the open-source standard dlt and want help with code generation? Want to save effort and accelerate the migration?


Our Solutions Engineering team offers a paid drop-in replacement service, including dbt package generation, to help you move off your old solution.


Get in touch with our Solutions Engineering team!