
SCD2 Deep Dive with dlt: How nested data affects queries and costs

  • Aman Gupta,
    Data Engineer

Intro

You're likely familiar with Slowly Changing Dimensions (SCDs), a staple for tracking historical data in a warehouse. But things get complicated when your data isn't in neat, flat rows. How do you handle historical changes buried deep within nested JSON records?

Manually tracking these changes is a recipe for complex, error-prone SQL and versioning nightmares. This is where the dlt library comes in. It automates this entire workflow, allowing you to manage historical changes without writing custom SQL or managing versioning yourself.

For those who need a quick refresher, SCD Type 2 (SCD2) is like a time machine for your data warehouse. ⏳ Instead of overwriting a record when it changes, you insert a new version and mark the old one as historical. This lets you rewind and see exactly how your data evolved over time.
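
For example, if customer 1 moves from New York to Boston, the table ends up holding both versions (column names here are illustrative; the exact columns dlt uses appear later in this post):

customer_id | city     | valid_from | valid_to
1           | New York | 2024-01-01 | 2024-06-01
1           | Boston   | 2024-06-01 | NULL

The row with valid_to = NULL is the current version; everything else is history.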

In this article, we'll break down exactly how dlt handles this process. We'll show you a practical example, peek under the hood at the SQL it generates, and share real-world benchmarks to analyze the cost trade-offs of different strategies.

What is a nested SCD2 and how does it work?

Let’s walk through an example implementation, then discuss how to configure it to achieve different results.

Step 1: Resource definition

import dlt

@dlt.resource(
    merge_key="customer_id",  # natural key used to match records across loads
    max_table_nesting=2,      # flatten anything nested deeper than two levels
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customers():
    yield {
        "customer_id": 1,
        "name": "John Doe",
        "profile": [
            {
                "email": "john@example.com",
                "address": [
                    {"city": "New York", "zipcode": "10001"}
                ]
            }
        ]
    }
Note: We are using the strategy “scd2” above.
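
To load this resource, you run it through a pipeline as usual. A minimal sketch (the pipeline, destination, and dataset names are illustrative; any SQL destination dlt supports works the same way):

pipeline = dlt.pipeline(
    pipeline_name="scd2_demo",
    destination="duckdb",
    dataset_name="crm",
)
load_info = pipeline.run(customers())
print(load_info)

On each run, dlt merges the incoming snapshot against what is already in the destination, retiring changed rows and inserting new versions.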

Step 2: dlt flattens the nested structure

  • customers
  • customers__profile
  • customers__profile__address

To learn more about how dlt normalizes table schemas, refer to the docs.
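
You can verify the flattening by querying the child tables back together. A sketch, reusing the pipeline object from the run above (dlt links each child row to its parent through _dlt_parent_id, which we cover in the next step):

with pipeline.sql_client() as client:
    rows = client.execute_sql(
        """
        SELECT c.name, a.city, a.zipcode
        FROM customers c
        JOIN customers__profile p ON p._dlt_parent_id = c._dlt_id
        JOIN customers__profile__address a ON a._dlt_parent_id = p._dlt_id
        """
    )
    print(rows)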

Step 3: dlt generates the SQL

When implementing the SCD2 strategy, dlt adds two fields to the root table: _dlt_valid_from and _dlt_valid_to. These track the validity window of each record.

Any change, whether at the root or inside a nested field, creates new rows across all related tables. These tables stay connected through _dlt_id and _dlt_parent_id.

Want the full picture? Here’s a deep dive on how nesting works. Any change, anywhere in the nested structure, creates a new version of the entire record hierarchy. That means:

  • Root changes → new rows in all nested tables
  • Nested changes → new row in the root table
  • Every insert → full table scan for deduplication

Let’s see this in action. Here’s the SQL dlt generates (simplified):

For the root table:

-- Retire old version (root table): close the open validity window
UPDATE customers
SET _dlt_valid_to = <load timestamp>
WHERE _dlt_valid_to IS NULL
  AND hash NOT IN (SELECT hash FROM staging.customers);

-- Insert new version (root table)
INSERT INTO customers
SELECT * FROM staging.customers
WHERE hash NOT IN (SELECT hash FROM customers);

This is repeated across nested tables:

-- Insert new version (nested: profile)
INSERT INTO customers__profile
SELECT * FROM staging.customers__profile
WHERE hash NOT IN (SELECT hash FROM customers__profile);

-- Insert new version (nested: address)
INSERT INTO customers__profile__address
SELECT * FROM staging.customers__profile__address
WHERE hash NOT IN (SELECT hash FROM customers__profile__address);

"The deeper the nesting, the more tables and respective scans you'll need"

Putting SCD2 strategies to the test in BigQuery

We wanted to see how different SCD2 strategies actually perform with nested and changing data (with an original nesting depth of 4). So, we ran a few experiments with dlt.

Before jumping into performance, we built an interactive Colab to break down the four core SCD2 variations. Using simple datasets, you can see exactly how each approach filters, updates, and retires rows.

What we tested:

  • SCD2 with merge key – non-incremental extraction → reloads all the data every run.
  • SCD2 with merge key – incremental extraction → only pulls new or updated rows each run (sketched after this list).
  • SCD2 with merge key, incremental extraction, max_table_nesting=1 (original depth = 4).
  • SCD2 with merge key, incremental extraction, max_table_nesting=2 (original depth = 4).
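
For the incremental variants, extraction uses dlt's standard incremental cursor on top of the same SCD2 write disposition. A hedged sketch of that shape (the updated_at cursor column and the fetch_changed_customers helper are illustrative stand-ins for your own source):

import dlt

@dlt.resource(
    merge_key="customer_id",
    max_table_nesting=2,
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customers(updated_at=dlt.sources.incremental("updated_at")):
    # Hypothetical helper: fetches only rows changed since the last run
    yield from fetch_changed_customers(since=updated_at.last_value)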

How we stress‑tested them:

  • Initial load: 1 million records with four levels of nesting.
  • Update 1: Adding 100k new rows and modifying another 100k existing rows. A moderate update cycle.
  • Update 2: Adding 200k new rows and modifying another 200k existing rows. A heavier update cycle.
    (Real life is messier, but this gets close enough.)
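
For a sense of scale, the synthetic load can be pictured as a generator yielding records with four levels of nesting. A sketch (field names, values, and counts are illustrative, not the exact benchmark code):

def generate_customers(n=1_000_000):
    for i in range(n):
        yield {
            "customer_id": i,                      # level 1: root
            "name": f"customer_{i}",
            "profile": [{                          # level 2
                "email": f"user{i}@example.com",
                "settings": [{                     # level 3
                    "address": [{                  # level 4
                        "city": "New York",
                        "zipcode": "10001",
                    }]
                }]
            }],
        }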

Test results:

  • Incremental is cheaper. Switching from non‑incremental to incremental cut query bytes by 25–35% on updates. And in BigQuery, bytes scanned translate directly into money saved on your bill.
  • Nesting depth tweaks are minor. Reducing nesting from 4 levels to 1–2 only slightly changed query bytes processed in our runs. Your results may vary depending on schema design and data patterns.

Wrapping up

dlt already does the heavy lifting of implementing SCD2 for you. You just point it at the source, and dlt handles the rest.

We’ve shown you how dlt wrangles SCD2 and what the numbers look like. Now it’s your turn: kick the tires in this Colab notebook and see it in action.

If you need to handle more complex data scenarios, our advanced course goes deeper into these topics.

Keep your data pipelines simple and your costs low with dlt!