Implementing Slowly Changing Dimension Type 2¶
This article presents an example implementation of SCD Type 2. It also explores the exceptional cases where updates occur in both driving and referential tables for a certain natural key.
Refer to Slowly changing dimensions for different types of SCDs with examples. The following table summarizes the common use case for each type.
SCD type | Use case scenarios |
---|---|
SCD type 0 | Durable data like constants, date dimensions |
SCD type 1 | Only current version of truth available, no need for historical data |
SCD type 2 | Need historical versions of data and the periods during which they were current |
SCD type 3 | Need for current data and the previous last value (alternate reality) |
SCD type 4 | Used when a group of attributes in a dimension rapidly changes and is split off to a mini–dimension (rapidly changing monster dimension.) |
SCD type 5 | Rarely used - to accurately preserve historical attribute values, plus report historical facts according to current attribute values; SCD 5 is equivalent to SCD 1 + SCD 4 |
SCD type 6 | Rarely used - Unpredictable Changes with Single-Version Overlay; SCD 6 is equivalent to SCD 1 + SCD 2 + SCD 3 |
SCD type 7 | Rarely used - Hybrid technique that supports both as-was and as-is reporting |
Business Scenario¶
In this scenario, there are three delta tables that act as the source:
The aim is to create a single unified target table called target_employee that can provide a consolidated view of these tables. This presentation layer table also holds the versioning for all the three source tables.
The SCD Type 2 logic is implemented in a pyspark notebook. The following diagram provides the high-level architecture:
Understand applied concepts and capabilities¶
Here are the key concepts and delta table capabilities that are used for this use case.
Using driving and referential tables¶
- The driving table represents the primary or key source table from which most of the final entity's information is derived. In this case,
employee
table is the driving table. - The referential tables act as additional sources, providing supplementary details to enrich the target entity. In this case,
employee_address
andemployee_information
tables are the referential tables.
Using change data feed¶
Change data feed allows Azure Databricks to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records change events for all the data written into the table.
For this use case, the change data feed has been enabled for all the three source tables.
How to use delta table properties¶
Here are some other characteristics of Delta tables that are used in this use case:
- Each operation that modifies a Delta Lake table creates a new table version.
- Delta tables offer "time travel" capabilities, allowing users to query a table's previous state based on timestamp or table version.
- The historic information can be used to audit operations, rollback a table, or query a table at a specific point in time using time travel.
Use sample data¶
The use case is explained based on the following sample data.
How employee sample data is used¶
employee_id first_name last_name email create_update_date
---------------------------------------------------------------------------------------
100 Steven King steven.king@contoso.com 2023-01-27
101 Neena Kochhar neena.kocchar@contoso.com 2023-02-09
102 Lex De Haan lex.de-haan@contoso.com 2023-04-07
103 Alexander Hunold alexander.hunold@contoso.com 2023-02-24
104 Bruce Ernst bruce.ernst@contoso.com 2023-04-24
How employee address sample data is used¶
employee_id city region street_address country create_update_date
--------------------------------------------------------------------------------------------
100 New York NY 123 Main St United States 2023-01-27
101 London England 456 Park Ave United Kingdom 2023-02-09
102 Paris ële-de-France 789 Rue de la Paix France 2023-04-07
103 Tokyo Tokyo 1-2-3 Shibuya Japan 2023-02-24
104 Sydney NSW 456 George St Australia 2023-04-24
How employee information is used¶
employee_id salary is_fte is_remote employment_date create_update_date
---------------------------------------------------------------------------
100 5000 True False 2020-01-15 2023-01-27
102 5500 True False 2021-03-22 2023-04-07
101 6000 True True 2019-05-10 2023-02-09
103 5200 True True 2018-11-30 2023-02-24
104 5800 True False 2017-09-12 2023-04-24
How target employee sample data is used¶
id employee_id first_name last_name email city region street_address country salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
d4d0569aa28e28d83a3d9a8888a30a27 101 Neena Kochhar neena.kocchar@contoso.com London England 456 Park Ave United Kingdom 6000 True True 2019-05-10 2023-02-09 2023-02-09 2023-02-09 1 1 1 1 2023-02-09 9999-12-31
be613e2a8166f377f65967bd9073188d 102 Lex De Haan lex.de-haan@contoso.com Paris ële-de-France 789 Rue de la Paix France 5500 True False 2021-03-22 2023-04-07 2023-04-07 2023-04-07 1 1 1 1 2023-04-07 9999-12-31
5a60257eabea5111d4e8aa0cf56fb109 103 Alexander Hunold alexander.hunold@contoso.com Tokyo Tokyo 1-2-3 Shibuya Japan 5200 True True 2018-11-30 2023-02-24 2023-02-24 2023-02-24 1 1 1 1 2023-02-24 9999-12-31
014830e89472a8ac527a5c6b2c1e5fe5 100 Steven King steven.king@contoso.com New York NY 123 Main St United States 5000 True False 2020-01-15 2023-01-27 2023-01-27 2023-01-27 1 1 1 1 2023-01-27 9999-12-31
3b2c89840dc66c48dedbbb8ecac15ada 104 Bruce Ernst bruce.ernst@contoso.com Sydney NSW 456 George St Australia 5800 True False 2017-09-12 2023-04-24 2023-04-24 2023-04-24 1 1 1 1 2023-04-24 9999-12-31
The target target_employee
table is created from driving employee
table and the referential employee_address
and employee_information
tables. There are following additional columns for tracking changes from each of these source tables:
employee_create_update_date
: Tracks the created/updated date of the record for the employee table.address_create_update_date
: Tracks the created/updated date of the record for the address table.info_create_update_date
: Tracks the created/updated date of the record for the information table.employee_commit_version
: Tracks the commit version of the record from the employee table.address_commit_version
: Tracks the commit version of the record from the address table.info_commit_version
: Tracks the commit version of the record from the information table.valid_flag
: Tracks if the record is the latest for that natural key or a stale record.valid_from
: Tracks from when the record is active.valid_to
: Tracks to when the record is active.
Implement handled scenarios¶
There are two scenarios that are handled in this use case:
Scenario 1: multiple updates¶
In this scenario, data for a particular natural key is updated in both the driving and referential tables.
To implement SCD Type 2 for this scenario, a two-step process is followed.
Step 1: Separate capture of updates¶
When an update occurs in the driving table, it is captured separately and marked with an appropriate version number, signifying the change. Similarly, when a change happens in any of the referential tables, it is also captured independently, retaining its own version number.
Step 2: union of updates¶
To obtain a comprehensive view of all the updates made to a particular record, a union of the updates from both the driving and referential tables is performed. This union ensures that all relevant changes are accounted for, and the historical chain of updates remains intact.
Scenario 2: simultaneous updates¶
In certain exceptional cases, both the driving and referential tables might receive updates for the same natural key simultaneously. To maintain the integrity of historical data, the proposal is to create a historical chain of updates.
In this scenario, the driving table's update is treated as the primary update, and the relevant referential table updates are linked to it in a chronological order. This approach preserves the historical context and provides a comprehensive view of all changes made to the record.
Consider a scenario with the following updates:
-- Apply some updates to "employee" table.
update employee set first_name = 'Steven2', dt = '2023-07-13' where employee_id = '100';
update employee set first_name = 'Steven3' , dt = '2023-07-17' where employee_id = '100';
-- Apply some updates to "employee_address" table.
update employee_address set city = 'Lille', dt = '2023-07-13' where employee_id = '102';
update employee_address set city = 'NewCastle', dt = '2023-07-14' where employee_id = '104';
update employee_address set street_address = '456 Baker Street', dt ='2023-07-16' where employee_id = '100';
-- Apply some updates to "employee_information" table.
update employee_information set salary = 7000, dt = '2023-07-17' where employee_id = '108';
update employee_information set salary = 8000, dt = '2023-07-16' where employee_id = '102';
update employee_information set is_fte = False, dt = '2023-07-15' where employee_id = '104';
Understand the implementation¶
Here is the high-level implementation logic for this use case:
- Gather the changes (updates and inserts) that have arrived in all three tables.
- Read the commit versions stored in
target_employee
table foremployee
,employee_address
andemployee_information
tables and get the greatest value. - Get the changes that have a commit version greater than the calculated value in the previous step. Do it for all of the three tables.
- Create a change dataframe for each source table.
- Read the commit versions stored in
- Create a dataframe capturing referential changes from
employee_address
andemployee_information
tables.- read the entire driving table
employee
to get all natural keys. - Perform a join with change dataframe that captures updates for records of employee_address and employee_information tables.
- Create a new column
changed_commit
which will hold a value, if an update has occurred for that natural key in either address or information table. - Filter out the records with natural key for which an update has happened using
changed_commit
column.
- read the entire driving table
- Create a dataframe capturing driving table changes.
- read the updates of the driving table and join it with changes from the address and information tables.
- Unite the driving and referential table changes to get all the changes.
- Apply the merge logic.
Choose methods to track SCD type 2¶
The solution describes two methods of tracking SCD type 2 for the target entity.
Method 1: Capture the latest updates only¶
In this method, if multiple updates are coming from sources for each record, only the latest updates are captured. To get all the updates, the following logic is applied:
spark.sql("""
select *
from (select *,
row_number() over (partition by employee_id
order by changed_commit desc) as rank
from changes_employee_target)
order by
employee_id,
changed_commit desc
""")
And then, the latest updates are captured using the following query:
Consider the scenario, where for a certain natural key, an update has happened in the employee_address
table but not in other tables. While capturing the changes, the columns corresponding to other tables will show as null that is not a right representation of the update. The record will look like this:
employee_id first_name last_name email city region street_address country salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to changed_commit rank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100 Steven3 King steven.king@contoso.com New York NY 456 Baker Street United States null null null null 2023-07-17 2023-07-13 null 4 null 4 1 2023-07-17 9999-12-31 4 1
To solve the above issue, the coalesce function is used to capture values of columns in the target_employee
table that have not changed. This approach represents the more accurate view of the record.
Note that the above approach does not cover the edge case where the actual value being updated for a column is Null
.
After coalescing, the record looks like this:
employee_id first_name last_name email city region street_address country salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to _commit_version
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100 Steven3 King steven.king@contoso.com New York NY 456 Baker Street United States **5000** **True** **False** 2020-01-15 2023-07-17 2023-07-13 2023-01-27 4 1 4 1 2023-07-17 9999-12-31 4
The values for columns is_fte, is_remote, and salary are retained as these column values have not changed.
Below, only the latest update has been considered. The date of job run ("2023-08-03") is used to derive valid_from
and valid_to
columns.
employee_id first_name last_name email city region street_address country salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100 Steven King steven.king@contoso.com New York NY 123 Main St United States 5000 True False 2020-01-15 2023-01-27 2023-01-27 2023-01-27 1 1 1 0 2023-01-27 2023-08-11
100 Steven3 King steven.king@contoso.com New York NY 456 Baker Street United States 5000 True False 2020-01-15 2023-07-17 2023-07-13 2023-01-27 4 1 4 1 2023-08-11 9999-12-31
Method 2: Historical chaining of records¶
In this method, all the updates that happen for a certain id
are captured and a chain of history is maintained for those updates.
The lead function takes the next start date for the same natural key, ordered by date (dt). The row_number function ranks the record in the order of modification in the source table.
lead(greatest(employee_create_update_date, address_create_update_date, info_create_update_date))
over (partition by employee_id
order by employee_create_update_date) as valid_to,
---take the greatest value of dates
row_number()
over (partition by employee_id
order by employee_create_update_date) rn
- Store all the changed records in a table called changes_history.
- Update
valid_flag=1
andvalid_to= <infinite_end_date>
for highest ranking record. - Proceed to close the
current
record in target table using update sql operation.
update target_employee as t
set valid_to = (
select first(valid_to) --- rn=1 stored in changes_history corresponds to original record in target table . Use the calculated valid_to to hence update original record
from changes_history as c
where c.employee_id = t.employee_id
and c.rn = 1),
valid_flag = 0
where employee_id IN (
select employee_id
from changes_history
where rn = 1)
---latest records now become stale records
and valid_flag = 1;
- Insert the records into target table having rank > 1.
Below is an example of how the historical chain of updates is maintained for a natural key. Here, all the prior updates are considered to maintain a chain of history. The valid _from
and valid_to
columns consider the dt
column in the source that indicates when the row was added or updated (in this case 2023-07-13, 2023-07-17).
id employee_id first_name last_name email city region street_address country salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5e64f88f256b598f6c8a801b8d560285 100 Steven2 King steven.king@contoso.com New York NY 456 Baker Street United States 5000 True False 2020-01-15 2023-07-13 2023-07-13 2023-01-27 4 1 3 0 2023-07-13 2023-07-17
08f663dec02e58679ba6e6af5cf67882 100 Steven3 King steven.king@contoso.com New York NY 456 Baker Street United States 5000 True False 2020-01-15 2023-07-17 2023-07-13 2023-01-27 4 1 4 1 2023-07-17 9999-12-31
014830e89472a8ac527a5c6b2c1e5fe5 100 Steven King steven.king@contoso.com New York NY 123 Main St United States 5000 True False 2020-01-15 2023-01-27 2023-01-27 2023-01-27 1 1 1 0 2023-01-27 2023-07-13