A guide on how to migrate your HubSpot data pipeline from Fivetran to dlt
- Aman Gupta,
Jr. Data Engineer
Introduction
As businesses grow, so does the amount of data they generate and store. There comes a time when a business must make a critical decision: how to manage the waves of information that have accumulated over the years. While SaaS tools like Fivetran are a good choice for data integration, there are scenarios where you need code-first control and customizability over your data integration, and this is where open-source tools like dlt come into play.
In this guide, we will explore how to migrate your HubSpot data pipeline from Fivetran to dlt. While we focus on HubSpot data in this example, a similar approach can be applied to any data source.
Why migrate? Code-first teams favor standardization.
While Fivetran offers ease of use, many teams eventually need a platform better aligned with their specific priorities. After a deep comparative analysis, it became clear that while Fivetran excels in UI convenience, dlt stands out for organizations that value control, customization, and cost efficiency.
The choice between Fivetran and dlt often comes down to paradigm preference. If simplicity and out-of-the-box functionality are your top priorities, Fivetran might be a solid fit. But if you're looking to consolidate and standardize your data platform with more granular control, lower costs, and faster ingestion, dlt becomes the preferred option.
A second driver behind migrations from Fivetran to dlt is cost. While Fivetran’s pricing scales with data size, dlt is open-source and places no limitations based on data volume. This can be a game-changer for teams looking to manage large datasets without escalating costs.
For organizations that prioritize a code-first approach and want full governance and control without the accompanying costs, dlt is an obvious choice.
Preparing for the migration
In this tutorial, we’ll migrate a HubSpot-to-BigQuery pipeline running on Fivetran over to dlt, as follows:
Step 1: Pause your current pipeline
Let’s assume you have an already-built and running Fivetran pipeline. Our first step is to pause the running HubSpot pipeline.
Steps to follow:
- Navigate to the "Connectors" section on the left panel.
- Choose the Fivetran connector used for transferring data.
- Click the pause button to stop data syncing from HubSpot to BigQuery.
- Note the date and time at which you pause the pipeline; it will serve as the cutoff between historical Fivetran data and new dlt data.
Step 2: Create a new dlt Pipeline with incremental loading
With the Fivetran connector paused, any new data flowing into HubSpot after the migration cutoff needs to be captured. Here, dlt becomes our new vessel: we create a pipeline that handles incremental loads, ensuring no new data is lost in transition.
Here are the steps:
Set up the dlt pipeline
1. Install dlt:
pip install "dlt[bigquery]"
2. Configure the Pipeline: Follow the dlt HubSpot pipeline documentation to set up your pipeline. Ensure you have your HubSpot API key and destination credentials configured.
3. Run the Pipeline: Execute the pipeline to extract and load data from HubSpot into your data warehouse (see the sketch below).
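Here is a minimal sketch, assuming the HubSpot verified source module scaffolded by dlt init hubspot bigquery; the pipeline and dataset names are illustrative, and credentials are expected in .dlt/secrets.toml:

import dlt
from hubspot import hubspot  # verified source module created by `dlt init hubspot bigquery`

# Pipeline and dataset names are placeholders; adjust them to your project.
pipeline = dlt.pipeline(
    pipeline_name="hubspot",
    destination="bigquery",
    dataset_name="hubspot_dataset_tst",
)

# dlt persists pipeline state between runs, so repeated executions load
# incrementally where the source supports it.
load_info = pipeline.run(hubspot())
print(load_info)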
Staging layer: Why do we need it?
To create a final view that includes both historical data from Fivetran and new data from dlt, we will treat the Fivetran and dlt tables as staging tables. Staging tables serve as intermediate storage areas where data can be cleaned, transformed, and validated before being loaded into the final destination. This approach facilitates the merging of datasets from both Fivetran and dlt, ensuring a cohesive and accurate final view.
Creating the unified views
Moving forward, we need to create a unified view where data from Fivetran and dlt converge into a single perspective. This merge is orchestrated using SQL views, combining both data sources into one final view.
Here's a snippet to illustrate the merging process for deal_stages:
WITH Fivetran_deal_stages AS (
SELECT
stage_id
, label
, display_order
, cast(closed_won as string) as closed_won
, created_at
, cast(probability as string) as probability
, pipeline_id
, updated_at
, write_permissions
, CAST(NULL AS STRING) AS _dlt_root_id -- dlt metadata
, CAST(NULL AS STRING) AS _dlt_parent_id -- dlt metadata
, CAST(NULL AS INT64) AS _dlt_list_idx -- dlt metadata
, CAST(NULL AS STRING) AS _dlt_id -- dlt metadata
, _fivetran_deleted -- Fivetran metadata
, _fivetran_synced -- Fivetran metadata
FROM
`dlthub-analytics.hubspot.deal_pipeline_stage`
WHERE
created_at <= '2024-09-10 01:00:00 UTC'
),
dlt_pipelines_deals__stages as (
SELECT
s.id as stage_id
, s.label
, s.display_order
, metadata__is_closed as closed_won
, s.created_at
, metadata__probability as probability
, p.id as pipeline_id
, s.updated_at
, write_permissions
, _dlt_root_id -- dlt metadata
, _dlt_parent_id -- dlt metadata
, _dlt_list_idx -- dlt metadata
, s._dlt_id -- dlt metadata
, CAST(NULL AS BOOLEAN) AS _fivetran_deleted -- Fivetran metadata
, CAST(NULL AS TIMESTAMP) AS _fivetran_synced -- Fivetran metadata
FROM
`dlthub-analytics.hubspot_dataset_tst.pipelines_deals__stages` s
JOIN
`dlthub-analytics.hubspot_dataset_tst.pipelines_deals` p
ON
p._dlt_id = s._dlt_parent_id
WHERE
s.created_at > '2024-09-10 01:00:00 UTC'
)
SELECT * FROM Fivetran_deal_stages
UNION ALL
SELECT * FROM dlt_pipelines_deals__stages;
This SQL snippet illustrates the merging of deal_stages data from both Fivetran and dlt sources into a unified view. Note the use of the CAST function to align data types between the two sources, as demonstrated above. Here is the schema for the merged view:
dbdiagram: link
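To make the merged result easy to consume downstream, you can persist it as a view. Below is a trimmed sketch showing only a few of the columns from the snippet above; the hubspot_unified dataset name is hypothetical:

-- Persist the merged result so downstream consumers query a single object.
-- `hubspot_unified` is an illustrative dataset name; in practice, include
-- the full column lists from the snippet above.
CREATE OR REPLACE VIEW `dlthub-analytics.hubspot_unified.deal_stages` AS
SELECT
  stage_id
  , label
  , created_at
FROM `dlthub-analytics.hubspot.deal_pipeline_stage`  -- historical Fivetran rows
WHERE created_at <= '2024-09-10 01:00:00 UTC'
UNION ALL
SELECT
  s.id AS stage_id
  , s.label
  , s.created_at
FROM `dlthub-analytics.hubspot_dataset_tst.pipelines_deals__stages` s  -- new dlt rows
WHERE s.created_at > '2024-09-10 01:00:00 UTC';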
The methodology outlined in this guide ensures a smooth transition of your data pipeline from Fivetran to dlt.
About Metadata
In Fivetran, each row includes two important fields: _fivetran_deleted and _fivetran_synced. The _fivetran_deleted field is a boolean that indicates whether a record has been deleted in the source system. The _fivetran_synced field is a timestamp that marks the last time the record was synced.
If we want to achieve similar functionality in dlt, we would need to adjust the merge strategy in dlt to use SCD2 (Slowly Changing Dimension Type 2) incremental loading. This approach changes the granularity of your tables to version your records and introduces two additional columns to your dataset: _dlt_valid_from and _dlt_valid_to. These columns track the temporal validity of the data, essentially capturing when a record was active in the source system. Do note, however, that unless you have explicit requirements for SCD2, adding it significantly increases the complexity of your data model without corresponding benefits and is best avoided by default.
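As a rough sketch of what this could look like, assuming dlt's merge write disposition with the scd2 strategy (the resource and its sample data are purely illustrative):

import dlt

# Illustrative resource: with the SCD2 strategy, dlt versions each record and
# adds _dlt_valid_from / _dlt_valid_to columns instead of overwriting rows.
@dlt.resource(
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def deal_stages():
    yield [
        {"stage_id": 1, "label": "Qualified to buy", "probability": "0.2"},
    ]

pipeline = dlt.pipeline(
    pipeline_name="hubspot_scd2_demo",  # placeholder name
    destination="bigquery",
    dataset_name="hubspot_dataset_tst",
)
pipeline.run(deal_stages())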
dlt, in contrast, assigns key fields to each row: _dlt_id, _dlt_load_id, _dlt_root_id, _dlt_parent_id, and _dlt_list_idx.
- _dlt_id: Uniquely identifies each row, generated either randomly or as a deterministic hash.
- _dlt_load_id: Unique to each load operation, crucial for tracking loads.
- _dlt_root_id: References the root table in nested structures, bypassing intermediate parents.
- _dlt_parent_id: Links to the parent row's row_key in nested tables.
- _dlt_list_idx: Indicates an item's position within a list, aiding in row_key creation.
This comprehensive set of metadata fields in dlt provides robust mechanisms for effective tracking and auditing of data loads, especially within the context of incremental loading.
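For example, here is a sketch of a simple load audit that joins a root table to dlt's _dlt_loads ledger via _dlt_load_id; the dataset and table names are illustrative:

-- Count rows per load and check each load's status via dlt's load ledger.
-- Dataset and table names are placeholders; root tables carry _dlt_load_id.
SELECT
  l.load_id
  , l.status
  , l.inserted_at
  , COUNT(*) AS row_count
FROM `dlthub-analytics.hubspot_dataset_tst.pipelines_deals` d
JOIN `dlthub-analytics.hubspot_dataset_tst._dlt_loads` l
  ON d._dlt_load_id = l.load_id
GROUP BY l.load_id, l.status, l.inserted_at
ORDER BY l.inserted_at DESC;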
Conclusion
In this article, we explored the step-by-step process of migrating your HubSpot data pipeline from Fivetran to dlt. While Fivetran provides ease of use and pre-built functionality, dlt offers a code-first, customizable, cost-effective, and high-performance solution for teams that prioritize control and scalability. By consolidating both historical and new data into unified views, you ensure a seamless transition without sacrificing data integrity or performance.
The migration process isn't limited to HubSpot. The same principles and steps can be applied to other data sources like Salesforce, Shopify, and SQL databases. With dlt, your team can standardize your data platform, avoid escalating costs, and tailor the solution to your specific needs, making it the preferred choice for code-first teams.
Want more?
If you're looking to explore more about how companies are using dlt to build scalable, compliant, and cost-efficient data platforms, check out Taktile's case study. Taktile, a fintech company, achieved 20x cost savings and empowered their teams by migrating to a lightweight platform using dlt and Snowflake. Read the full case study to learn more about their journey, and contact our sales team to discuss how dlt can support your migration needs.