
Migrate your SQL data pipeline from Stitch data to dlt

  • Aman Gupta,
    Junior Data Engineer

Introduction

SQL exists in many flavors, including MySQL, Postgres, SQLite, Oracle, Microsoft SQL Server, and more. Organizations often need to send their SQL data to destinations like BigQuery, Redshift, Snowflake, and Databricks to enable advanced analytical capabilities. Tools like Stitch Data are a good choice for migrating and syncing data from SQL to these destinations. However, there are scenarios where you need more control and customization over your data migration. This is where open-source tools like dlt come into play.

In this guide, we will explore how to migrate your SQL data pipeline from Stitch to dlt. While we focus on MySQL in this example, a similar approach can be applied to other SQL databases.

The need for migration: Recognizing the call

Stitch Data, a widely recognized service, offers a streamlined approach to data integration, but at a price. Imagine you have four tables, each with 100,000 rows, and you need to sync this data four times a day. This could easily run you around $500 a month. And that’s with limitations: you can only write data to one destination. Want to send your data to multiple places? That’s going to cost you more.

The financial cost only grows as your data expands or as you introduce more connectors. It’s a model where scalability becomes synonymous with expense.

In contrast, there’s dlt, an open-source alternative. With dlt, you’re free from these constraints. No matter how large your datasets grow or how many destinations you need to sync, dlt remains free. Here, scalability is a given, not a cost driver.

While Stitch Data offers a user-friendly setup and ease of use, dlt distinguishes itself in more profound ways. It is not just faster but more efficient, with a level of customization, memory optimization, and transformation capability that speaks to its depth. The granularity and control dlt offers make it an important tool for those tackling the intricate challenges of large-scale data tasks.

The other major factor that makes dlt a winner, apart from cost, is speed. We benchmarked various dlt backends against Stitch Data and measured a 9x speed increase using dlt's ConnectorX backend.

Dataset used: UNSW-NB15
Rows synced: 2.1 million

Speed test

For any organization that wants to reduce its data ingestion costs and improve ingestion speed, open-source dlt seems to be a better choice.

Preparing for the migration

In this tutorial, we’ll migrate a SQL-to-BigQuery pipeline running on Stitch Data to dlt, as follows:

Step 1: Disable the Stitch Data pipeline

Let’s assume that you are using your already-built and running Stitch Data pipeline. Our first step would be to disable the already running SQL pipeline as follows:

Steps to follow:

  1. Navigate to the "Integrations" section on the top panel.
  2. Choose the MySQL connection used for transferring data.
  3. Toggle the “Integration is Active” switch to paused to stop data syncing from MySQL to BigQuery.
  4. Note the date and time at which you pause the pipeline, as it will be used when configuring the dlt pipeline.

Let’s look at the “contacts” table created by Stitch Data in BigQuery, named “Stitch_contacts”. Here’s the schema:

DB Diagram: Stitch contacts

IMPORTANT
In this guide, we will focus on migrating the 'contacts' table. The approach outlined can be similarly applied to migrating any other table from any SQL source or any other source.

Step 2: Create a new dlt Pipeline with incremental loading

With the Stitch Data connector paused, any new data flowing into SQL after the migration cutoff needs to be captured. Here, dlt takes over: we create a pipeline to handle incremental loads, ensuring no new data is lost in transition.

Here are the steps:

2.1. Initialize the dlt Pipeline

To begin, initialize a dlt pipeline for SQL database. This involves setting up the pipeline with the desired source (SQL database) and destination (BigQuery, in this case).

dlt init sql_database bigquery

This command creates a new directory with the necessary files and configurations.

2.2. Pipeline Setup Guide

To learn to set up the SQL database-BigQuery dlt pipeline, please read our verified source setup guide here.

2.3. Setup dlt pipeline

Next, we'll set up a pipeline to extract data from MySQL and load it into BigQuery. We'll configure this pipeline to load the data incrementally.

How to do incremental loading with dlt?

  • Here is a walkthrough for SQL tables in our docs: Link to the new docs
  • For more information about incremental loading, please refer to the official documentation here.

An example:

dlt enables incremental loading. In this context, we need to give our pipeline a reference point in time after which it should fetch data from MySQL, so we’ll use the Created column to fetch records created after the initial load date (which in our case is the date we paused the Stitch Data integration).

dlt uses sources and resources to represent sources and their endpoints/tables. In the SQL pipeline, our source is the schema and each table is a sql_table resource. We will specifically discuss how to load the "Contact" table incrementally using the corresponding resource, starting from an initial value (where we left off in Stitch Data). This allows us to load data added after a specific date and time.

Here’s an example of a pipeline loading the “Contact” table incrementally.

import dlt
from datetime import datetime

from sql_database import sql_table


def load_standalone_table_resource() -> None:
    """Load the "Contact" table with the standalone sql_table resource"""
    pipeline = dlt.pipeline(
        pipeline_name="mysql_database",
        destination="bigquery",
        dataset_name="dlt_contacts",
        full_refresh=True,
    )

    # Load the table incrementally, starting at the Stitch pause date
    contact = sql_table(
        table="Contact",
        incremental=dlt.sources.incremental(
            "Created", initial_value=datetime(2024, 4, 1, 0, 0, 0)
        ),
    )

    # Run the pipeline
    info = pipeline.run(contact)

    # Print the load info
    print(info)

In this function, Created is used as the incremental cursor to load only new records from the Contact table. To learn more about adding incremental configuration to SQL resources, please refer to our documentation here.
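Conceptually, the incremental cursor works like this. The sketch below is plain Python, not dlt internals, and the Contact rows are invented for illustration: each run selects only rows whose cursor value is newer than the last stored value, then advances the cursor.

```python
from datetime import datetime

def fetch_incrementally(records, cursor_field, last_value):
    """Return records whose cursor value is newer than last_value,
    plus the advanced cursor position (mimicking dlt.sources.incremental)."""
    new_records = [r for r in records if r[cursor_field] > last_value]
    new_cursor = max((r[cursor_field] for r in new_records), default=last_value)
    return new_records, new_cursor

# Hypothetical Contact rows; the cutoff is the Stitch pause date
contacts = [
    {"contact_id": 1, "Created": datetime(2024, 3, 15)},
    {"contact_id": 2, "Created": datetime(2024, 4, 2)},
    {"contact_id": 3, "Created": datetime(2024, 4, 5)},
]

loaded, cursor = fetch_incrementally(contacts, "Created", datetime(2024, 4, 1))
print([r["contact_id"] for r in loaded])  # only rows created after the cutoff
print(cursor)  # new cursor position for the next run
```

On the next run, the stored cursor (here, the newest Created value seen) becomes the new lower bound, so already-loaded rows are skipped.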

Step 3: Running the Pipeline

Now, you can run the pipeline:

python sql_database_pipeline.py

Here’s the “Contact” table schema created by dlt named dlt_contacts:

DB Diagram: dlt contacts

Staging layer: Why do we need it?

To create a final view that includes both historical data from Stitch and new data from dlt, we will treat the Stitch data and dlt tables as staging tables. Staging tables serve as intermediate storage areas where data can be cleaned, transformed, and validated before being loaded into the final destination. This approach facilitates the merging of datasets from both Stitch and dlt, ensuring a cohesive and accurate final view.
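The core of this merging step is schema alignment: each staging table carries metadata columns the other lacks, so the missing ones are padded with NULLs before the rows are concatenated. Here is a minimal pure-Python sketch of that idea, with illustrative rows (the real work happens in a SQL view, shown in the next section):

```python
def align_and_union(stitch_rows, dlt_rows):
    """Union rows from two staging tables, padding columns that exist
    in only one source with None (the Python analogue of CAST(NULL ...))."""
    all_columns = sorted({c for row in stitch_rows + dlt_rows for c in row})
    return [{c: row.get(c) for c in all_columns} for row in stitch_rows + dlt_rows]

# Minimal illustrative rows: Stitch carries _sdc_* metadata, dlt carries _dlt_* metadata
stitch_rows = [{"id": 1, "name": "Ada", "_sdc_sequence": 100}]
dlt_rows = [{"id": 2, "name": "Grace", "_dlt_load_id": "1723622272.1"}]

unified = align_and_union(stitch_rows, dlt_rows)
for row in unified:
    print(row)
```

Every row in the result has the same set of columns, which is exactly the precondition for a UNION ALL.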

Creating the unified views

Moving forward, we need to create a unified view where data from Stitch and dlt converge into a single perspective. This merging process will be orchestrated using SQL views, ensuring an integration of both data sources.

Here's a snippet to illustrate the merging process:

WITH Stitch_contacts AS (
  SELECT 
    contactid as id,
    created,
    updated,
    fullname as name,
    type,
    about,
    phone01 as phone,
    contactref as reference_no,
    occupation,
    difficulty,
    CAST(NULL AS STRING) AS _dlt_id,
    CAST(NULL AS STRING) AS _dlt_load_id,
    _sdc_table_version,
    _sdc_received_at,
    _sdc_sequence,
    _sdc_batched_at
    
  FROM 
    `dlthub-analytics.gcpmysql.Contact`
  WHERE created < '2024-01-01'
),

dlt_contacts AS (
  SELECT 
    contact_id as id,
    created,
    updated,
    full_name as name,
    type,
    about,
    phone01 as phone,
    contact_ref as reference_no,
    occupation,
    difficulty,
    _dlt_load_id,  -- dlt metadata
    _dlt_id,
    CAST(NULL AS INT64) AS _sdc_table_version,
    CAST(NULL AS TIMESTAMP) AS _sdc_received_at,
    CAST(NULL AS INT64) AS _sdc_sequence,
    CAST(NULL AS TIMESTAMP) AS _sdc_batched_at
  FROM 
    `dlthub-analytics.mysql_data_contact_20240814081752.contact`
  WHERE created >= '2024-01-01'
)

SELECT * FROM Stitch_contacts 
UNION ALL
SELECT * FROM dlt_contacts;

This SQL snippet illustrates the merging of contacts data from both Stitch and dlt sources, ensuring a unified view.

NOTE: To unify the data from Stitch with dlt, careful attention must be given to naming conventions and data types. You can use the CAST function to change data types as needed, as demonstrated in the example above.
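If you prefer to handle the column renaming in Python rather than in the view, a simple mapping does the job. The mappings below mirror the aliases used in the SQL snippet above; the sample row is illustrative:

```python
# Map each source's column names onto the unified schema
STITCH_RENAMES = {
    "contactid": "id",
    "fullname": "name",
    "phone01": "phone",
    "contactref": "reference_no",
}
DLT_RENAMES = {
    "contact_id": "id",
    "full_name": "name",
    "phone01": "phone",
    "contact_ref": "reference_no",
}

def normalize(row, renames):
    """Rename columns onto the unified schema, leaving other columns untouched."""
    return {renames.get(col, col): val for col, val in row.items()}

stitch_row = {"contactid": 7, "fullname": "Ada Lovelace", "occupation": "engineer"}
print(normalize(stitch_row, STITCH_RENAMES))
```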

DB Diagram: Merged contacts

The methodology outlined in this guide ensures a smooth transition of your data pipeline from Stitch to dlt.

About Metadata

The Stitch metadata includes:

  • _sdc_table_version: Represents the version of the table at the time the data was extracted.
  • _sdc_received_at: Timestamp indicating when the record was received by Stitch.
  • _sdc_sequence: An integer that tracks the sequence of records, helping to maintain the order of data as it was extracted from the source system.
  • _sdc_batched_at: Timestamp marking when the record was batched for loading into the destination.
  • _sdc_primary_keys: Contains information about the primary key fields of the source table.
  • _sdc_rejected: Indicates whether a record was rejected during the loading process, often due to validation errors or schema mismatches.

dlt assigns two key fields to each row: _dlt_id and _dlt_load_id.

  • The _dlt_id uniquely identifies each row, ensuring that every record can be traced back to its specific instance in the dataset.
  • The _dlt_load_id is unique to each load operation, providing a way to track and audit data loads, especially in the context of incremental loading. This metadata is important for maintaining data integrity and understanding the history of data changes across different load operations.

To maintain the history of data, dlt can employ an SCD2 (Slowly Changing Dimension Type 2) incremental loading strategy to track changes in the data. This approach introduces additional columns such as _dlt_valid_from and _dlt_valid_to, which capture the temporal validity of the data, indicating when a record was active in the source system. However, it's essential to note that this method applies only to new data being ingested by dlt.
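As a rough illustration of what SCD2 tracking looks like, here is a hand-rolled sketch, not dlt's actual merge implementation: when a record changes, the currently-valid version is "retired" by stamping its _dlt_valid_to, and the new version is appended with an open validity window.

```python
from datetime import datetime

def scd2_upsert(history, key, new_row, now):
    """Close the currently-valid version of a record (set _dlt_valid_to)
    and append the new version with an open validity window."""
    for row in history:
        if row[key] == new_row[key] and row["_dlt_valid_to"] is None:
            row["_dlt_valid_to"] = now  # retire the old version
    history.append({**new_row, "_dlt_valid_from": now, "_dlt_valid_to": None})

# A contact's phone number changes on May 1st
history = [
    {"id": 1, "phone": "111",
     "_dlt_valid_from": datetime(2024, 4, 1), "_dlt_valid_to": None},
]
scd2_upsert(history, "id", {"id": 1, "phone": "222"}, datetime(2024, 5, 1))
```

After the upsert, the old row is bounded by its validity window and the new row is the single open (current) version, so point-in-time queries remain possible.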

Conclusion

In the article above, we explored the process of migrating a pipeline that loads the "contacts" table from a MySQL database to BigQuery, moving it from Stitch Data to dlt. This methodology can be adapted for other databases like PostgreSQL, SQLite, Oracle, and Microsoft SQL Server, and for migrating to the various destinations supported by dlt. If you have a new destination in mind, you can either request it or contribute to the development yourself. The principles discussed remain consistent, regardless of the type of data. By following these straightforward steps, you can transition your data ingestion from Stitch Data to dlt.

Call to action

Need to move from another vendor to the open source standard dlt and need help with the code generation? Do you want the boilerplate SQL to be created with high automation and precision? Then get in touch with our Solutions Engineering team for assistance!