Listen to this article
One of the issues many coming into the data warehouse world have difficulty with is managing time variance and non volatility at scale and efficiency. One of the basic methods is versioning denormalized structures with a start date and end date in to what are commonly called "slowly changing dimensions", but this can have adverse affects. Denormalization threatens the warehouse with update anomalies and joining tables using start dates and end dates is complex, which puts more load on infrastructure and staff. Timeseries database platforms are available, but often restrict how time based relations can be queried.
One method of implementing time variance that resolves many of these issues leverages surrogation over time at the individual relation level. This accellorates queries as they become very simple, and can be used with high relational fidelity.
For the sake of simplicity, only uni-temporal workloads will be addressed here, though these methods can be expanded to multitemporal management by adding to the key.
Consider this table exists in a transactional system and we want to integrate it into the data warehouse:
The first step is to export this table from the source system to a filesystem somewhere that it can be ingested into the data warehouse. A number of methods are available, but for this document we're going to assume the whole table is snapshot to the filesystem rather than depending on a log based CDC mechanism. Given that we're snapshotting, we'll make it easier by appending a timestamp to each row in the snapshot:
select *, current_timestamp as exporttime from suppliers;
Once exported, we'll import it directly into a staging table with a very similar schema to the source in the data warehouse.
Our destination table is going to look similar, but we're going to add some metadata specific to managing temporality:
We've added:
- SupplierDWID, which will become the temporal surrogate. It will be unique to each supplier for each time that supplier changes.
- dw_starttime and dw_endtime wich tell us the time window this supplier was in a given state
- dw_row_hash which will be used to quickly compare rowstates at ingestion
First load is very straight forward, as there's no previous records to compare to. One issue on first load is you don't have any history, you only know the current state of each row, there's no way to know how long they've been in that state. So we have to make an assumption that they've ALWAYS been in this state. The other thing to remember is they're still in that state, so the dw_endtime can't be populated with a known value. You've got two options: assume it's a seed date sometime way in the future, or leave it nullable. Either works.
Once loaded, each supplier/starttime will have a new surrogate dw_id. You'll also likely want to add rows to the table to be used in child relations when you want to represent what would be a null. That way we can differentiate between the types of null (unknown, not applicable etc.)
From there, continuous ingestion is very similar to the first load with a couple of added steps:
When a new export is available, truncate and reload the staging table.
First step is to deal with those records that have been deleted or updated, we'll set the "dw_endtime" to the last export time from the source system:
--find all the rows in suppliers where there is no stage_supplier with the same row hash and set the end time
Last, insert the new and temporally changed rows:
We now have a fully temporal surrogated relation. We can repeat the exact same process with child relations, the only difference is that we add the parent temporal surrogate to the child's key AND include it in the row_hash. Once this is done, cascading changes through the relations is automatic. If a supplier changes, all of it's products will change automatically and temporal integrity is enforced throughout the entire schema.
Product would go from
To
The one catch with child relations is you'll want to add the parent DWID to the row hash. We'll continue to cascade the surrogates to every child of a child of a child.
The result is a dramatic reduction of complexity at query time. If I want to make it easy for users to see "current", we can just cut a view for each base table like:
And they can treat it like an operational data store.
If you want to get results over time in a query, just join on the DWIDs rather than a compound join on start_dates and end_dates.