Migrate your SQL data pipeline from Stitch Data to dlt
- Aman Gupta,
Jr. Data Engineer
Introduction
SQL exists in many flavors, including MySQL, Postgres, SQLite, Oracle, Microsoft SQL Server, and many more. Organizations often need to send their SQL data to destinations like BigQuery, Redshift, Snowflake, and Databricks to enable advanced usage or analytical capabilities. Tools like Stitch Data are a good choice for migrating and syncing data from SQL to these destinations. However, there are scenarios where you need more control and customization over your data migration. This is where open-source tools like dlt come into play.
In this guide, we will explore how to migrate your SQL data pipeline from Stitch to dlt. While we focus on MySQL in this example, a similar approach can be applied to other SQL databases.
The need for migration: Recognizing the call
Stitch Data, a widely recognized service, offers a streamlined approach to data integration, but at a price. Imagine you have four tables, each with 100,000 rows, and you need to sync this data four times a day. This could easily run you around $500 a month. And that’s with limitations: you can only write data to one destination. Want to send your data to multiple places? That’s going to cost you more.
The financial cost only grows as your data expands or as you introduce more connectors. It’s a model where scalability becomes synonymous with expense.
Enter dlt, an open-source alternative. With dlt, you’re free from these constraints. No matter how large your datasets grow or how many destinations you need to sync, dlt remains free. Here, scalability is a given, not a cost driver.

While Stitch Data offers a user-friendly setup and ease of use, dlt distinguishes itself in more profound ways. It is not just faster but more efficient, with a level of customization, memory optimization, and transformation capability that speaks to its depth. The granularity and control dlt offers make it an important tool for those tackling the intricate challenges of large-scale data tasks.
Apart from cost, the other major factor that makes dlt a winner is speed. We benchmarked various dlt backends against Stitch Data and found a 9x speed increase using dlt’s ConnectorX backend.
Dataset used: UNSW-NB15
Rows synced: 2.1 million
For any organization that wants to reduce its data ingestion costs and improve ingestion speed, open-source dlt seems to be a better choice.
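For context on how such a pipeline is expressed, here is a rough sketch of selecting the ConnectorX backend when reading a table with the sql_database verified source; the pipeline, dataset, and connection-string values below are placeholders, not the actual benchmark setup:

import dlt
from sql_database import sql_table

# A sketch: read the "Contact" table through the ConnectorX backend.
# The connection string is a placeholder; credentials normally live in
# .dlt/secrets.toml and are injected automatically.
contact = sql_table(
    credentials="mysql://user:password@localhost:3306/mydb",
    table="Contact",
    backend="connectorx",  # typically the fastest extraction backend
)

pipeline = dlt.pipeline(
    pipeline_name="mysql_speed_test",
    destination="bigquery",
    dataset_name="benchmark_contacts",
)
print(pipeline.run(contact))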
Preparing for the migration
In this tutorial, we’ll be migrating a SQL-to-BigQuery pipeline running on Stitch Data to dlt, as follows:
Step 1: Disable the Stitch Data pipeline
Let’s assume that you have an already-built and running Stitch Data pipeline. Our first step is to disable the running SQL pipeline as follows:
Steps to follow:
- Navigate to the "Integrations" section on the top panel.
- Choose the MySQL connection used for transferring data.
- Set the “Integration is Active” toggle to paused to stop data syncing from MySQL to BigQuery.
- Note the date and time at which you pause the pipeline, as it will be used when configuring the dlt pipeline.
Let’s look at the “contacts” table created by Stitch data in BigQuery named “Stitch_contacts”. Here’s the schema:
DB Diagram: Stitch contacts
Step 2: Create a new dlt pipeline with incremental loading
With the Stitch Data connector paused, any new data flowing into SQL after the migration cutoff needs to be captured. Here, dlt becomes our new vessel: we create a pipeline to handle incremental loads, ensuring no new data is lost in transition.
Here are the steps:
2.1. Initialize the dlt pipeline
To begin, initialize a dlt pipeline for your SQL database. This involves setting up the pipeline with the desired source (the SQL database) and destination (BigQuery, in this case).
dlt init sql_database bigquery
This command creates a new directory with the necessary files and configurations.
2.2. Pipeline Setup Guide
To learn how to set up the SQL database-to-BigQuery dlt pipeline, please read our verified source setup guide here.
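The setup guide covers adding your MySQL and BigQuery credentials to .dlt/secrets.toml. As a small sketch (the secret path shown follows the usual convention for this verified source), the same credentials can also be read explicitly in code if you need them elsewhere:

import dlt
from sql_database import sql_table

# A sketch: dlt normally injects credentials from .dlt/secrets.toml
# ([sources.sql_database.credentials]), but they can also be fetched explicitly.
credentials = dlt.secrets["sources.sql_database.credentials"]

# Build the Contact resource with the explicit credentials.
contact = sql_table(credentials=credentials, table="Contact")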
2.3. Set up the dlt pipeline
Next, we'll set up a pipeline to extract data from MySQL and load it into BigQuery. We'll configure this pipeline to load the data incrementally.
How to do incremental loading with dlt?
- Here is a walkthrough for SQL tables in our docs: Link to the new docs
- For more information about incremental loading, please refer to the official documentation here.
An example:
dlt enables you to do incremental loading. In this context, we need to give our pipeline a point in time after which it should fetch the data from MySQL, so we’ll use the created column to fetch records created after the initial load date (which in our case is the date we paused the Stitch Data integration).
dlt uses sources and resources to represent sources and their endpoints/tables. In the SQL pipeline, our source is the schema and the resource is a sql_table resource. We will specifically discuss how to load the “Contact” table incrementally using the corresponding resource, starting from an initial value (where we left off with Stitch). This will allow us to load data added after a specific date and time.
Here’s an example of a pipeline loading the “Contact” table incrementally.
import dlt
from datetime import datetime

from sql_database import sql_table


def load_standalone_table_resource() -> None:
    """Load the Contact table incrementally with the standalone sql_table resource."""
    pipeline = dlt.pipeline(
        pipeline_name="mysql_database",
        destination="bigquery",
        dataset_name="dlt_contacts",
    )
    # Load the table incrementally, starting at the Stitch pause date
    contact = sql_table(
        table="Contact",
        incremental=dlt.sources.incremental(
            "Created", initial_value=datetime(2024, 4, 1, 0, 0, 0)
        ),
    )
    # Run the pipeline
    info = pipeline.run(contact)
    # Print the load info
    print(info)


if __name__ == "__main__":
    load_standalone_table_resource()
In this function, the Created column is used as the incremental cursor to load only the new records from the Contact table. To learn more about adding incremental configuration to SQL resources, please refer to our documentation here.
Step 3: Running the Pipeline
Now, you can run the pipeline:
python sql_database_pipeline.py
Here’s the “Contact” table schema created by dlt, named dlt_contacts:
DB Diagram: dlt contacts
Staging layer: Why do we need it?
To create a final view that includes both historical data from Stitch and new data from dlt, we will treat the Stitch Data and dlt tables as staging tables. Staging tables serve as intermediate storage areas where data can be cleaned, transformed, and validated before being loaded into the final destination. This approach facilitates the merging of datasets from both Stitch and dlt, ensuring a cohesive and accurate final view.
Creating the unified views
Moving forward, we need to create a unified view where data from Stitch and dlt converge into a single perspective. This merging process will be orchestrated using SQL views, integrating both data sources.
Here's a snippet to illustrate the merging process:
WITH Stitch_contacts AS (
SELECT
contactid as id,
created,
updated,
fullname as name,
type,
about,
phone01 as phone,
contactref as reference_no,
occupation,
difficulty,
CAST(NULL AS STRING) AS _dlt_id,
CAST(NULL AS STRING) AS _dlt_load_id,
_sdc_table_version,
_sdc_received_at,
_sdc_sequence,
_sdc_batched_at
FROM
`dlthub-analytics.gcpmysql.Contact`
WHERE created < '2024-04-01'
),
dlt_contacts AS (
SELECT
contact_id as id,
created,
updated,
full_name as name,
type,
about,
phone01 as phone,
contact_ref as reference_no,
occupation,
difficulty,
_dlt_id, -- dlt metadata
_dlt_load_id,
CAST(NULL AS INT64) AS _sdc_table_version,
CAST(NULL AS TIMESTAMP) AS _sdc_received_at,
CAST(NULL AS INT64) AS _sdc_sequence,
CAST(NULL AS TIMESTAMP) AS _sdc_batched_at
FROM
`dlthub-analytics.mysql_data_contact_20240814081752.contact`
WHERE created >= '2024-04-01'
)
SELECT * FROM Stitch_contacts
UNION ALL
SELECT * FROM dlt_contacts;
This SQL snippet illustrates the merging of contact data from both the Stitch and dlt sources, ensuring a unified view.
When merging data from Stitch and dlt, careful attention must be given to naming conventions and data types. You can use the CAST function to change data types as needed, as demonstrated in the example above.

DB Diagram: Merged contacts
The methodology outlined in this guide ensures a smooth transition of your data pipeline from Stitch to dlt.
About Metadata
The Stitch metadata includes:
- _sdc_table_version: Represents the version of the table at the time the data was extracted.
- _sdc_received_at: Timestamp indicating when the record was received by Stitch.
- _sdc_sequence: An integer that tracks the sequence of records, helping to maintain the order of data as it was extracted from the source system.
- _sdc_batched_at: Timestamp marking when the record was batched for loading into the destination.
- _sdc_primary_keys: Contains info about the primary key fields of the source table.
- _sdc_rejected: Indicates whether a record was rejected during the loading process, often due to validation errors or schema mismatches.
dlt assigns two key fields to each row: _dlt_id and _dlt_load_id.
- The _dlt_id uniquely identifies each row, ensuring that every record can be traced back to its specific instance in the dataset.
- The _dlt_load_id is unique to each load operation, providing a way to track and audit data loads, especially in the context of incremental loading.

This metadata is important for maintaining data integrity and understanding the history of data changes across different load operations.
To maintain the history of data, dlt can employ an incremental loading strategy such as SCD2 (Slowly Changing Dimension Type 2) to track changes in the data. This approach introduces additional columns such as _dlt_valid_from and _dlt_valid_to to capture the temporal validity of the data, indicating when a record was active in the source system. However, it's essential to note that this method applies only to new data being ingested by dlt.
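As a rough sketch of what enabling SCD2 looks like (this is an assumption layered on the examples above, not the pipeline built earlier), the merge write disposition with the scd2 strategy can be applied to the same sql_table resource:

import dlt
from sql_database import sql_table

# A sketch: track the full change history of the Contact table with dlt's
# SCD2 merge strategy, which adds the _dlt_valid_from and _dlt_valid_to
# columns mentioned above.
contact = sql_table(table="Contact")
contact.apply_hints(
    write_disposition={"disposition": "merge", "strategy": "scd2"}
)

pipeline = dlt.pipeline(
    pipeline_name="mysql_database",
    destination="bigquery",
    dataset_name="dlt_contacts",
)
print(pipeline.run(contact))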
Conclusion
In this article, we explored the process of migrating a MySQL-to-BigQuery pipeline for the "contacts" table from Stitch Data to dlt. This methodology can be adapted for other databases like PostgreSQL, SQLite, Oracle, and Microsoft SQL Server, and for migrating to the various destinations supported by dlt. If you have a new destination in mind, you can either request it or contribute to its development yourself. The principles discussed remain consistent, regardless of the type of data. By following these straightforward steps, you can transition your data ingestion from Stitch Data to dlt.
Call to action
Need to move from another vendor to the open source standard dlt and need help with the code generation? Do you want the boilerplate SQL to be created with high automation and precision? Then get in touch with our Solutions Engineering team for assistance!