Skip to content

Implementing Slowly Changing Dimension Type 2

This article presents an example implementation of SCD Type 2. It also explores the exceptional cases where updates occur in both driving and referential tables for a certain natural key.

Refer to Slowly changing dimensions for different types of SCDs with examples. The following table summarizes the common use case for each type.

SCD type Use case scenarios
SCD type 0 Durable data like constants, date dimensions
SCD type 1 Only current version of truth available, no need for historical data
SCD type 2 Need historical versions of data and the periods during which they were current
SCD type 3 Need for current data and the previous last value (alternate reality)
SCD type 4 Used when a group of attributes in a dimension rapidly changes and is split off to a mini–dimension (rapidly changing monster dimension.)
SCD type 5 Rarely used - to accurately preserve historical attribute values, plus report historical facts according to current attribute values; SCD 5 is equivalent to SCD 1 + SCD 4
SCD type 6 Rarely used - Unpredictable Changes with Single-Version Overlay; SCD 6 is equivalent to SCD 1 + SCD 2 + SCD 3
SCD type 7 Rarely used - Hybrid technique that supports both as-was and as-is reporting

Business Scenario

In this scenario, there are three delta tables that act as the source:

  1. employee
  2. employee_address
  3. employee_information

The aim is to create a single unified target table called target_employee that can provide a consolidated view of these tables. This presentation layer table also holds the versioning for all the three source tables.

The SCD Type 2 logic is implemented in a pyspark notebook. The following diagram provides the high-level architecture: