Listen to this article
Delete nothing, update only metadata.
November 5, 2023
November 5, 2023

Rob's high performance data warehousing rule #4: Delete nothing, update only metadata.

Listen to this article

Yeah, I know, that's asking a lot. But one of the advantages of a data warehouse is the non volatility. It gives you and your organization powers your competitors don't.

Physically, this means some work. All entity relations must be temporal. We need to add time to the primary key. This makes these relations MUCH larger than if they were atemporal. This also means we need to ledger our transaction relations. If a sales order was $20, and it was adjusted to $10, now we need to ledger out a row at -$10. It's a pain.

But the value is in the subtle details. If Jeff is paid a bonus on his team's performance, and his team member Joe transfers to another team in the middle of the month, he doesn't lose his bonus.

More importantly, we know the state of all of our customers at any given time. If their behavior changes, we have the ability to look at their state over time and see if state changes are causal in behavior. If a customer moves from Montana to Utah, this is likely going to have an effect on their behavior.

From a physical perspective, updates and deletes are expensive. Every time you do one, you have to drag a partition/part/page from storage, uncompress it, modify it, compress it and send it back to storage. I've seen warehouse deployments where updates and deletes are over 30% of total warehouse costs. Far exceeding the storage costs necessary for temporal relations and ledgers.

And last, non volatile systems are easier to manage. If something goes wrong, we have the option (not that it should ever happen) of scuttling the entire warehouse and rebuilding it from files. We can also rest easy that the results will be predictable.

If you can manage time, you can put your competition in a really tough spot.

Read all the posts

Intrigued? Want to read some more?