data modeling confessions, part 1

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Mon, 19 Oct 2009 18:29:23 -0700 (PDT)
Message-ID: <e823d71f-fba3-456f-9406-f79978698491_at_p23g2000vbl.googlegroups.com>



I'm no Catholic, but I've still sinned my part in modeling data, and need my absolution.

The easiest mistake I've bumped into thus far is to disregard the presence of nulls. They're the kind of vice you know you should watch for, but still don't mind enough. They just sort of slip through your fingers.

My vice was the basic temporal table. You have the start time and the end time, then the rest of the key attached to those, plus finally the dependent data, nicely normalized. You'd think you're okay, and off to well-normalized salvation. Except that you let those pesky nulls pass into the end date. Nothing sinister in that -- after all, the data that is current does not *have* an end date. The null is *such* a convenient marker for this very special case. Perfectly suited for the occasion; and as it happens, for ending up in data purgatory.
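As a sketch, the table described above might look like this (the names are illustrative, not from any real schema). Note that the nullable end date cannot even participate in the primary key:

```sql
-- A sketch of the basic temporal table; names are hypothetical.
CREATE TABLE item_history (
  item_id    NUMBER       NOT NULL,
  start_date DATE         NOT NULL,
  end_date   DATE,                   -- NULL marks the current row: the sin
  status     VARCHAR2(20) NOT NULL,  -- the dependent data
  CONSTRAINT item_history_pk PRIMARY KEY (item_id, start_date)
);
```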

All of a sudden your code is littered with IS NULL predicates, or their converse.

You're fighting to keep the special cases under control when summarizing your table.
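For instance, any summary over the validity periods has to special-case the open rows -- a minimal sketch, assuming the hypothetical table above and Oracle's NVL and SYSDATE:

```sql
-- Average validity period per item, in days. The open (current) rows
-- force a null check into every duration computation.
SELECT item_id,
       AVG(NVL(end_date, SYSDATE) - start_date) AS avg_days_valid
FROM item_history
GROUP BY item_id;
```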

Both entity and referential integrity are out the window, since nulls and keys don't really mingle. In the worst case, if you happen to be repentant, you're already in limbo with null-minded triggers.

As Oracle, at least, doesn't index nulls, your carefully tuned table refuses to perform on the most common online query, which is to ask for the current status of a single entity that is currently valid -- that is, one having a null end time. Try as you might, indexing only makes things worse, since you want to index a null and the DBMS just won't comply.
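The Oracle-specific wrinkle here is that a B-tree index simply contains no entry for a row whose indexed columns are all null, so `WHERE end_date IS NULL` can't use a plain index on the end date. The usual workarounds (sketched here against the hypothetical table, not taken from the original post) either pad the index with a constant so no entry is ever all-null, or index a null-free expression:

```sql
-- Workaround 1: a composite index with a constant literal; the entry
-- (NULL, 0) is not all-null, so current rows do get indexed.
CREATE INDEX item_hist_end_i ON item_history (end_date, 0);

-- Workaround 2: a function-based index on a sentinel; queries must then
-- test NVL(end_date, DATE '9999-12-31') instead of end_date IS NULL.
CREATE INDEX item_hist_end_nvl_i
    ON item_history (NVL(end_date, DATE '9999-12-31'));
```

Either way, the null has leaked out of the logical design and into your physical tuning and your query text.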

Not to mention the fact that when you optimize your indexes for online queries, you suddenly need an entirely different set of indexes for the historical data. 'Cause it surely don't get hit by OLTP stuff. It's all about temporal analytics, which translates to stabbing or intersection queries on the start and end dates -- this time on definite ones. Which of course cannot be satisfied by indexes well enough, because of the possibility of a null popping up somewhere, and the fact that just about nobody besides yourself knows how to exclude that sort of behavior in logical-level SQL. Suddenly your OLAP load starts causing full table scans, and pays no nevermind to physical-level optimization.
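The stabbing query in question looks something like this (a sketch against the hypothetical table; `:t` is a bind variable for the time of interest). The null has to be carried through the predicate, and the `OR ... IS NULL` branch is exactly what keeps a range scan on the end date from being enough:

```sql
-- "What was valid at time :t?" The open rows force the null into
-- the predicate, defeating a clean range scan on end_date.
SELECT item_id, status
FROM item_history
WHERE start_date <= :t
  AND (end_date > :t OR end_date IS NULL);
```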

There was a point, once, where you could have intervened. That was when you allowed the null in the end-of-validity column. Had you taken the hint that the null marker was giving you, you might have realized that you're actually trying to model two different kinds of data altogether: current, transactional data, which is mostly accessed by key absent time and properly indexed as such; and historical data, which is mostly accessed for analytical, statistical purposes, based on time. The presence of the null suggested this from the very start, and the special nature of nulls suggested from the start that indexing would go amiss. Not to mention the hassle in maintaining the queries, the views, the constraints, and so on. But no, you made your choice, and now you're in data hell, with the level steadily descending towards ASCII-in-an-unformatted-flatfile.
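The split advocated above might be sketched like this -- an illustrative design with hypothetical names, not a prescription. Current rows are keyed without time; history is closed on both ends; no nulls anywhere:

```sql
-- Current data: one row per entity, indexed by key alone.
CREATE TABLE item_current (
  item_id    NUMBER       NOT NULL PRIMARY KEY,
  start_date DATE         NOT NULL,
  status     VARCHAR2(20) NOT NULL
);

-- Historical data: every period closed, so both dates can carry
-- NOT NULL, participate in keys, and index cleanly for time queries.
CREATE TABLE item_closed_history (
  item_id    NUMBER       NOT NULL,
  start_date DATE         NOT NULL,
  end_date   DATE         NOT NULL,
  status     VARCHAR2(20) NOT NULL,
  CONSTRAINT item_closed_history_pk PRIMARY KEY (item_id, start_date)
);
```

Ending a period then becomes a move: delete from the current table, insert the now-closed row into the history table, typically in one transaction.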

I've committed the sin of "current data has a null end date" something like three or four times. Thus far it's only bitten me back once. But it's still one evil construct: alluring, far-reaching and just plain wicked, all at once. It'll definitely return to haunt me in the future. Don't go there, then. Separate the current and the historical data, at least for the most part, and if you don't, at least do not use null as the separating factor.

Received on Tue Oct 20 2009 - 03:29:23 CEST
