Slowly Changing Dimension Type2: Explanation and code
- Aman Gupta,
Jr. Data Engineer
What is a slowly changing dimension?
Slowly changing dimensions are a dimensional modeling technique created for historising changes in data.
This technique only works if the dimensions change slower than we read the data since we would not be able to track changes happening between reads. For example, if someone changes their address once in a blue moon, we will capture the changes with daily loads - but if they change their address 3x in a day, we will only see the last state and only capture 2 of the 4 versions of the address.
However, they enable you to track things you could not before such as
- Hard deletes
- Most of the changes and when they occurred
- Different versions of entities valid at different historical times
What is Slowly Changing Dimension Type 2 (SCD2)? and why use it?
The Type 2 subtype of Slowly Changing Dimensions (SCD) manages changes in data over time. When data changes, a new record is added to the database, but the old record remains unchanged. Each record includes a timestamp or version number. This allows you to view both the historical data and the most current data separately.
Traditional data loading methods often involve updating existing records with new information, which results in the loss of historical data.
SCD2 not only preserves an audit trail of data changes but also allows for accurate historical analysis and reporting.
SCD2 applications
Use Case 1: Versioning a record that changes
In environments where maintaining a complete historical record of data changes is crucial, such as in financial services or healthcare, SCD Type 2 plays a vital role. For instance, if a customer's address changes, SCD2 ensures that the old address is preserved in historical records while the new address is available for current transactions. This ability to view the evolution of data over time supports auditing, tracking changes, and analyzing trends without losing the context of past information. It allows organizations to track the lifecycle of a data entity across different states.
Here's an example of the customer address change.
In the updated state, the previous address record is closed with a _dlt_valid_to
timestamp, and a new record is created with the new address "456 Oak Ave" effective from May 2, 2024. The NULL in the _dlt_valid_to
field for this new record signifies that it is the current and active address.
Use Case 2: Tracking deletions
This approach ensures that historical data is preserved for audit and compliance purposes, even though the record is no longer active in the current dataset. It allows businesses to maintain integrity and a full historical trail of their data changes.
In this updated table, the record that was previously active is marked as deleted by updating the _dlt_valid_to
field to reflect the timestamp when the deletion was recognized, in this case, June 1, 2024, at 10:00 AM. The presence of a non-NULL _dlt_valid_to
date indicates that this record is no longer active as of that timestamp.
Learn how to customise your column names and validity dates in our SDC2 docs.
Surrogate keys, what are they? Why use?
Every record in the SCD2 table needs its own ID. We call this a surrogate key. We use it to identify the specific record or version of an entity, and we can use it when joining our fact tables for performance (as opposed to joining on entity ID + validity time).
Simple steps to determine data loading strategy and write disposition
This decision flowchart helps determine the most suitable data loading strategy and write disposition:
- Is your data stateful? Stateful data is subject to change, like your age. Stateless data does not change, for example, events that happened in the past are stateless.
- If your data is stateless, such as logs, you can just increment by appending new logs.
- If it is stateful, do you need to track changes to it?
- If yes, then use SCD2 to track changes.
- If no,
- Can you extract it incrementally (new changes only)?
- If yes, load incrementally via merge.
- If no, re-load fully via replace.
- Can you extract it incrementally (new changes only)?
Below is a visual representation of the steps discussed above:
Conclusion
Use SCD2 where it makes sense but keep in mind the shortcomings related to the read vs update frequency. Use dlt to do it at loading and keep everything downstream clean and simple.
Want to discuss? Join the dlt slack community!