Variations in Strategies for Implementing SCD2 Systems

In dimensional modeling, the choice of timestamps for Slowly Changing Dimension Type 2 (SCD2) tables affects data accuracy, historical tracking, and usability in analytics. Here's a breakdown of the three main types of timestamps (extract timestamps, source system timestamps, and business timestamps) and their implications for SCD2 dimensional modeling.

Extract Timestamps

Using extract timestamps as validity boundaries in SCD2 tables reflects when the data was pulled into your data warehouse or ETL process. This approach is simpler to implement as it tracks the data processing time. However, it may not capture the actual time the data changed in the real world, leading to less accurate historical views, especially when source data updates are delayed or backfilled. Extract timestamps are useful for operational monitoring and debugging ETL freshness but less ideal for business analytics requiring true historical events.
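As a rough illustration of the extract-timestamp approach, the sketch below builds SCD2 rows from a series of full extracts. The function name `scd2_from_extracts`, the `valid_from` and `valid_to` column names, and the sample data are all assumptions made for this example, and deletions in the source are not handled.

```python
from datetime import datetime


def scd2_from_extracts(extracts):
    """Build SCD2 rows from (extract_ts, {key: attrs}) snapshots.

    valid_from is the time of the extract that first observed a value,
    not the time the value actually changed in the source.
    """
    rows = []     # every version, closed or still open
    current = {}  # key -> index in rows of the currently open version
    for extract_ts, snapshot in sorted(extracts, key=lambda e: e[0]):
        for key, attrs in snapshot.items():
            idx = current.get(key)
            if idx is not None and rows[idx]["attrs"] == attrs:
                continue  # unchanged since the previous extract
            if idx is not None:
                rows[idx]["valid_to"] = extract_ts  # close the old version
            rows.append({
                "key": key,
                "attrs": attrs,
                "valid_from": extract_ts,
                "valid_to": None,  # None marks the open (current) version
            })
            current[key] = len(rows) - 1
    return rows


# Hypothetical nightly extracts taken at 02:00.
extracts = [
    (datetime(2024, 1, 1, 2), {"c1": {"tier": "basic"}}),
    (datetime(2024, 1, 2, 2), {"c1": {"tier": "gold"}}),
    (datetime(2024, 1, 3, 2), {"c1": {"tier": "gold"}}),
]
history = scd2_from_extracts(extracts)
```

In this sample the change to `gold` is recorded as valid from the 2 January extract, even though the source may have changed at any point in the preceding 24 hours. That gap is the accuracy trade-off described above.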

Source System Timestamps

Source system timestamps reflect when the change or event actually occurred in the originating system. Using this timestamp for SCD2 validity periods ensures your dimensional history aligns closely with the real-world timing of changes, improving analytical correctness and temporal accuracy. However, it depends on the quality and reliability of the source timestamps. Issues like late-arriving or out-of-order events must be managed carefully, often by combining with an ingest timestamp for data freshness and watermarking.
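One way to picture the handling of out-of-order events is the minimal sketch below. It orders change events by their source timestamp before deriving validity periods and keeps the ingest timestamp alongside for freshness checks. The event shape, the field names (`source_ts`, `ingest_ts`, `valid_from`, `valid_to`), and the full per-key rebuild are assumptions chosen for brevity, not a prescribed design.

```python
from datetime import datetime


def build_versions(events):
    """Derive SCD2 rows from change events, ordered by source timestamp.

    Each event is a dict with key, attrs, source_ts, and ingest_ts.
    The history for every key is rebuilt from scratch, so a late-arriving
    event is slotted into its correct position in the timeline.
    """
    by_key = {}
    for event in events:
        by_key.setdefault(event["key"], []).append(event)

    rows = []
    for key, evs in by_key.items():
        # Order by when the change happened; ingest time breaks ties.
        evs = sorted(evs, key=lambda e: (e["source_ts"], e["ingest_ts"]))
        for cur, nxt in zip(evs, evs[1:] + [None]):
            rows.append({
                "key": key,
                "attrs": cur["attrs"],
                "valid_from": cur["source_ts"],
                "valid_to": nxt["source_ts"] if nxt else None,
                "ingest_ts": cur["ingest_ts"],  # kept for freshness monitoring
            })
    return rows


# Hypothetical events: the 3 January change arrives after the 5 January one.
events = [
    {"key": "c1", "attrs": {"tier": "basic"},
     "source_ts": datetime(2024, 1, 1), "ingest_ts": datetime(2024, 1, 1)},
    {"key": "c1", "attrs": {"tier": "gold"},
     "source_ts": datetime(2024, 1, 5), "ingest_ts": datetime(2024, 1, 5)},
    {"key": "c1", "attrs": {"tier": "silver"},
     "source_ts": datetime(2024, 1, 3), "ingest_ts": datetime(2024, 1, 6)},
]
versions = build_versions(events)
```

Rebuilding the whole history for a key is the simplest way to absorb late data. An incremental pipeline would instead need to split or re-close existing versions when a late event lands.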

Business Timestamps

Business timestamps are a more abstract or derived timestamp representing when a business event is deemed effective or relevant. Using business timestamps for SCD2 enables alignment with business processes and reporting periods rather than system events. This supports meaningful trend analysis and decision-making that matches business context but requires clear business rules and consistent definitions. Misalignment between business rules and actual system timestamps can cause discrepancies in historical reporting.

In summary, the implications for dimensional modeling when choosing among these timestamps for SCD2 tables include:

| Timestamp Type | Implications for SCD2 Dimensional Modeling |
|----------------|--------------------------------------------|
| Extract | Simpler to implement; tracks data processing time; less accurate for true historical changes; may cause misleading historical validity spans. Useful for operational audits. |
| Source System | Reflects true event time; improves analytic accuracy and fidelity; requires handling of late or out-of-order data. Enables reliable historical trend analysis. |
| Business | Aligns with business perspective of when changes matter; improves interpretability and decision relevance; requires strict business rules and governance. |

Additional considerations include the use of start and end validity columns in SCD2 tables to define the period during which a dimension record version is active, combining timestamps in streaming and large-scale data pipelines, and the impact of incorrect or inconsistent timestamp selection on trend analysis, reporting accuracy, and ETL complexity.

Carefully selecting the timestamp type based on data source reliability, business needs, and analytic goals is critical for SCD2 dimensional modeling success. This approach can be useful for building other dimension (dim) and fact tables, particularly if bitemporal history is important to your analytics capabilities.

With business timestamps, a firm date boundary can be provided for records, as they become invalid on specific dates. Backdating and post-hoc corrections require more thought when using business timestamps, as decisions need to be made about updating affected records or maintaining incorrect records for audit purposes. When the end timestamp of the old record equals the start timestamp of the replacing record, query patterns need a strict inequality on one of the two bounds, so that a point-in-time lookup on the boundary matches only one version.
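The boundary case can be shown with a small sketch. It assumes a half-open validity interval (start inclusive, end exclusive) and hypothetical column names `valid_from` and `valid_to`, with `None` marking the current version. It compares that filter with one that treats both bounds as inclusive.

```python
from datetime import date

# Two versions of one record; the old one ends on the date the new one starts.
rows = [
    {"key": "c1", "tier": "basic",
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 3, 1)},
    {"key": "c1", "tier": "gold",
     "valid_from": date(2024, 3, 1), "valid_to": None},
]


def versions_half_open(rows, key, as_of):
    """valid_from <= as_of < valid_to: the end bound is strict."""
    return [
        r for r in rows
        if r["key"] == key
        and r["valid_from"] <= as_of
        and (r["valid_to"] is None or as_of < r["valid_to"])
    ]


def versions_inclusive(rows, key, as_of):
    """valid_from <= as_of <= valid_to: both bounds inclusive."""
    return [
        r for r in rows
        if r["key"] == key
        and r["valid_from"] <= as_of
        and (r["valid_to"] is None or as_of <= r["valid_to"])
    ]


boundary = date(2024, 3, 1)
half_open = versions_half_open(rows, "c1", boundary)  # only the "gold" row
inclusive = versions_inclusive(rows, "c1", boundary)  # both rows match
```

With the strict end bound, a lookup on the boundary date returns only the replacing record. With two inclusive bounds it returns both versions, which would duplicate rows in a join against a fact table.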

Using source system timestamps can be especially valuable for source system history tables that aren't true dimensions. The choice between SCD2 and dimensional snapshots depends on factors like storage costs, complexity, and user preferences. The most common pattern for creating an SCD2 table is to use a date or timestamp already present in the data. Extract timestamps treat the landing of data in the warehouse as the primary point of reference. Source system timestamps, on the other hand, treat the raw data as valid from the moment the source system created or updated it.