dltHub
Blog /

Slowly Changing Dimension Type2: Explanation and code

  • Aman Gupta,
    Junior Data Engineer
Check this Colab Notebook for a short and sweet demo.

What is a slowly changing dimension?

Slowly changing dimensions are a dimensional modeling technique created for historising changes in data.

This technique only works if the dimensions change slower than we read the data since we would not be able to track changes happening between reads. For example, if someone changes their address once in a blue moon, we will capture the changes with daily loads - but if they change their address 3x in a day, we will only see the last state and only capture 2 of the 4 versions of the address.

However, they enable you to track things you could not before such as

  • Hard deletes
  • Most of the changes and when they occurred
  • Different versions of entities valid at different historical times

What is Slowly Changing Dimension Type 2 (SCD2)? and why use it?

The Type 2 subtype of Slowly Changing Dimensions (SCD) manages changes in data over time. When data changes, a new record is added to the database, but the old record remains unchanged. Each record includes a timestamp or version number. This allows you to view both the historical data and the most current data separately.

Traditional data loading methods often involve updating existing records with new information, which results in the loss of historical data.

SCD2 not only preserves an audit trail of data changes but also allows for accurate historical analysis and reporting.

SCD2 applications

Colab demo

Use Case 1: Versioning a record that changes

In environments where maintaining a complete historical record of data changes is crucial, such as in financial services or healthcare, SCD Type 2 plays a vital role. For instance, if a customer's address changes, SCD2 ensures that the old address is preserved in historical records while the new address is available for current transactions. This ability to view the evolution of data over time supports auditing, tracking changes, and analyzing trends without losing the context of past information. It allows organizations to track the lifecycle of a data entity across different states.

Here's an example of the customer address change.

Before: Original Address

After: Address Updated

In the updated state, the previous address record is closed with a _dlt_valid_to timestamp, and a new record is created with the new address "456 Oak Ave" effective from May 2, 2024. The NULL in the _dlt_valid_to field for this new record signifies that it is the current and active address.

Use Case 2: Tracking deletions

This approach ensures that historical data is preserved for audit and compliance purposes, even though the record is no longer active in the current dataset. It allows businesses to maintain integrity and a full historical trail of their data changes.

State Before Deletion: Customer Record Active

State after deletion: Customer record marked as deleted

In this updated table, the record that was previously active is marked as deleted by updating the _dlt_valid_to field to reflect the timestamp when the deletion was recognized, in this case, June 1, 2024, at 10:00 AM. The presence of a non-NULL _dlt_valid_to date indicates that this record is no longer active as of that timestamp.

Learn how to customise your column names and validity dates in our SDC2 docs.

Surrogate keys, what are they? Why use?

Every record in the SCD2 table needs its own ID. We call this a surrogate key. We use it to identify the specific record or version of an entity, and we can use it when joining our fact tables for performance (as opposed to joining on entity ID + validity time).

Simple steps to determine data loading strategy and write disposition

This decision flowchart helps determine the most suitable data loading strategy and write disposition:

  • Is your data stateful? Stateful data is subject to change, like your age. Stateless data does not change, for example, events that happened in the past are stateless.
    • If your data is stateless, such as logs, you can just increment by appending new logs.
    • If it is stateful, do you need to track changes to it?
      • If yes, then use SCD2 to track changes.
      • If no,
        • Can you extract it incrementally (new changes only)?
          • If yes, load incrementally via merge.
          • If no, re-load fully via replace.

Below is a visual representation of the steps discussed above:

Determining Write Disposition

Conclusion

Use SCD2 where it makes sense but keep in mind the shortcomings related to the read vs update frequency. Use dlt to do it at loading and keep everything downstream clean and simple.

Want to discuss? Join the dlt slack community!