It's no secret: I'm not a fan of dimensional modeling. It exists to solve physical problems, not logical problems.
I'd suggest One Big Table (OBT) modeling takes this to the extreme, and doesn't deliver.
But why do we engage in these things? Our data platforms just don't deliver. Joins are expensive, they move too much data around, and the result is that we can't query atomic data efficiently. So we sacrifice logical correctness for physical advantage. We want to see that query go fast more than we want to defend the logical correctness of the data itself.
The side effects are massive: data quality issues, lost data engineering staff hours, the list goes on. It's costly to develop the transforms necessary to deliver an OBT solution, both directly and indirectly.
I decided to put my money where my mouth is and prove out the inefficiencies of OBT modeling on currently available platforms.
In this test case, I have three tables. The subject matter is data collected from a car racing game.
We have a playstats relation, which collects statistics on individual plays: how fast the player is going, whether they crashed, and so on. It's sizeable, at over a billion rows.
Then we have players, with attributes of those playing the game. About 6000 rows.
Last, tournaments: each tournament is a collection of individual games of the same type. About 157 rows.
Players play games, their actions are collected in playstats.
Games may or may not be part of a tournament.
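To make the queries below concrete, here's a rough sketch of the shape of these relations. The column names and types are illustrative placeholders, not the exact schema from the test case.

```sql
-- Illustrative schema sketch: column names and types are placeholders.
CREATE TABLE playstats (            -- 1B+ rows: one row per recorded play event
    game_id        BIGINT,          -- the individual game the event belongs to
    player_id      BIGINT,
    tournament_id  BIGINT,          -- NULL when the game isn't part of a tournament
    stat_time      TIMESTAMP,
    current_speed  DOUBLE PRECISION,
    crashed        BOOLEAN,         -- did the player crash at this point
    error_code     TEXT             -- game error, if any
);

CREATE TABLE players (              -- ~6,000 rows
    player_id      BIGINT,
    nickname       TEXT,
    age_category   TEXT             -- e.g. 'under 18', '18-25', ...
);

CREATE TABLE tournaments (          -- ~157 rows
    tournament_id  BIGINT,
    name           TEXT,
    game_type      TEXT
);
```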
We quickly leave the star schema world and get a bit more relational.
The business asks "I want to see crashes, where there's no game error, by tournament, age category and speed/10 as a bucket." They're looking to make the game more fair by age group. I have no idea why; I just run the data.
The resultant query is:
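Using the placeholder column names from the sketch above (so a sketch of the query rather than the exact statement):

```sql
-- Crashes with no game error, by tournament, age category and speed bucket.
SELECT
    t.name                       AS tournament,
    p.age_category,
    FLOOR(ps.current_speed / 10) AS speed_bucket,
    COUNT(*)                     AS crashes
FROM playstats ps
JOIN players p     ON p.player_id = ps.player_id
JOIN tournaments t ON t.tournament_id = ps.tournament_id
WHERE ps.crashed
  AND ps.error_code IS NULL
GROUP BY 1, 2, 3;
```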
On any partition-based CDW (that's most of them), this is going to be problematic: the query is going to run for a while. So we ask data engineering to create a summary pipeline, burning a week because they have to deal with edge cases, avoid update anomalies, and so on. The root of that looks something like:
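In sketch form, with the same placeholder names (the production version also has to handle incremental loads and the edge cases above):

```sql
-- Denormalize the dimensions into one big table up front.
CREATE TABLE playstats_obt AS
SELECT
    ps.*,                             -- every atomic playstats column
    p.age_category,                   -- player attributes folded in
    p.nickname,
    t.name      AS tournament_name,   -- tournament attributes folded in
    t.game_type
FROM playstats ps
LEFT JOIN players p     ON p.player_id = ps.player_id
LEFT JOIN tournaments t ON t.tournament_id = ps.tournament_id;
```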
This increases ingestion time by 14%, but it sets up the data nicely for querying. I did this on Firebolt, but as a baseline I didn't take advantage of any of the indexing that partition-based platforms lack. Now we can run the query off one big table:
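The same question, now answered off the single wide table (again a sketch with placeholder names):

```sql
SELECT
    tournament_name,
    age_category,
    FLOOR(current_speed / 10) AS speed_bucket,
    COUNT(*)                  AS crashes
FROM playstats_obt
WHERE crashed
  AND error_code IS NULL
GROUP BY 1, 2, 3;
```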
It takes 36.40 seconds, and reads through 164.96 GB of data to generate the result.
If, instead, I'm on Firebolt and just generate the indexes necessary to support the initial query, rather than building secondary transformations, I burn an entire five minutes of staff time, I don't have to change the initial query, and it runs in 0.51 seconds, scanning only 115.96 MB of data with virtually no impact on ingestion time. That's a 66x improvement beyond the best result possible with OBT on other platforms, and I get all those staff hours back to do more productive things.
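For a sense of what those five minutes buy, this is roughly the kind of aggregating index I'm talking about. Treat the definition as illustrative: the exact columns, and whether your Firebolt version accepts expressions as grouping keys, depend on the real schema and the current docs.

```sql
-- Pre-aggregate crash counts at the grain the query asks for; the joins to
-- the tiny players and tournaments tables stay trivial on top of this.
CREATE AGGREGATING INDEX playstats_crash_agg ON playstats (
    tournament_id,
    player_id,
    FLOOR(current_speed / 10),   -- assumption: expressions allowed as keys here;
                                 -- otherwise fall back to the raw current_speed column
    crashed,
    error_code,
    COUNT(*)
);
```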
Normalization isn't the problem. Our data platforms are.