Rob's high performance data warehousing rule #1
September 10, 2023
September 10, 2023

Rob's high performance data warehousing rule #1: if you cannot constrain a thing, you cannot ingest that thing.

If you cannot constrain a thing, you cannot ingest that thing.

This harkens back to the beginnings of relational modeling. They key is the key. If I cannot identify one employee from another, for instance, I cannot know what an employee is. We have to do some work before implementation, discussing with stakeholders, what they believe this "employee" is and how we identify one from the other.

These discussions will result in business rules, which should be almost compilable. As examples:

An employee can be naturally identified by their SSN or EIN.

An employee can be naturally identified by thier first name, last name, date of birth and location of birth.

An employee can have exactly one supervisor who must also be an employee with the exception of the CEO.

We can stack up a lot of these business rules on an entity like an employee. Historically, this was fairly straightforward to constrain as declarative constraint and declarative RI existed in our platforms.

With the advent of data lakes, and cloud data warehouses, these declarative constraints aren't available anymore. This doesn't let us off the hook, because if two records for the same employee get through, all heck is going to break loose.

So we must reactively constrain this data and alert/rollback ingestion batches accordingly. At the end of the batch or microbatch, we have to go check our work:

--should return 0 rows, if not, raise an error and roll back

--should return 1 for the ceo, if not, raise an error and roll back, notice, I've surrogated the employee key, but maintained the natural keys

OR employee_id = supervisor_id

Finally, if any errors do occur, log it in an error log. This will become invaluable when discussing warehouse quality at the end of the year.

Why go through all this trouble? Well, a stitch in time still saves nine. I don't have to worry so much about data quality and data observability processes/people/software if I know exactly what's in my database. If the warehouse can catch an error before the users can notice it, I can get it resolved much more quickly. As a second benefit, if you deploy your constraint code to dev before you write your ingestion system, you'll kill bugs a LOT faster.  

This pays off in much less unnecessary work for my data team, so I can apply them to revenue generating functions.

Read all the posts

Intrigued? Want to read some more?