Normalization question

From: leebert <leebertarian_at_yahoo.com>
Date: 26 Jun 2002 11:51:26 -0700
Message-ID: <a386bae7.0206261051.6c87a148_at_posting.google.com>



I've remodeled an existing dataset that had a lot of repeating data. We've realized 5:1 and 9:1 reductions in data volume by changing the model, so far so good.

But the question came up whether it'd be better to go back toward the previous data model, which on modern (read: fast) machines may be good enough. Fair question, perhaps a tad premature, but I need to address it.

I could claim the efficacy of the new model is self-evident from the order-of-magnitude reductions in volume, elimination of redundancy, improved data integrity, etc. But I also need to answer from the perspective of the application model & process-flow model.

Old Data Model (key is ID + Year + Month):

ID     Year  Month   Amount  FlagA    CodeA
0001   2002   01   $1000.00    Y       ABCD
0001   2002   03   $1000.00    Y       ABCD
0001   2002   02   $1000.00    Y       ABCD
0001   2002   03   $1000.00    Y       ABCD
0001   2002   04   $1100.00    Y       DEFG
0001   2002   05   $1100.00    Y       DEFG
0001   2002   06   $1100.00    Y       ABCD
<keep adding new records>

New Data Model (key is ID + Begin_Dt + End_Dt; requires a trigger for complete enforcement):

ID     Begin_Dt     End_Dt       Amount    FlagA  CodeA  
0001   2002-01-01   2002-03-31   $1000.00     Y   ABCD
0001   2002-04-01   2002-05-31   $1100.00     Y   DEFG
0001   2002-06-01   null         $1100.00     Y   ABCD

  1. Year+Month records are collapsed into single records w/ date ranges; the current active record has a null (open) end date.
  2. To maintain functional uniqueness, I have to use a date-range trigger to prevent overlapping Begin_Dt + End_Dt ranges on the same key (see the sketch after this list).
  3. The previous record has to be 'closed out' by updating End_Dt so that it is no longer the 'current' record (also sketched below).
  4. I've seen other models similar to this, like hotel reservation systems, where there are records indicating room reservation dates, incl. open (null) end dates. (Seems like SQL's kinda weak for these purposes - I've read that none of the main vendors have implemented the SQL-92 date/interval data types that might have made this a simpler problem....)
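
To make that concrete, here's roughly what I have in mind for the table, the item-2 trigger, and the item-3 'close out' update. The names and types are made up for illustration, and the syntax is SQL Server-flavored (since getdate() is what we'd be using anyway):

CREATE TABLE BudgetAmount (            -- hypothetical name/types
    ID        char(4)   NOT NULL,
    Begin_Dt  datetime  NOT NULL,
    End_Dt    datetime  NULL,          -- null = open-ended / current record
    Amount    money     NOT NULL,
    FlagA     char(1)   NOT NULL,
    CodeA     char(4)   NOT NULL,
    -- the nullable End_Dt can't be part of a PRIMARY KEY, so key on (ID, Begin_Dt)
    CONSTRAINT PK_BudgetAmount PRIMARY KEY (ID, Begin_Dt)
)
GO

-- Item 2: reject any insert/update whose date range overlaps another
-- row for the same ID.  An open End_Dt is treated as "no upper bound".
CREATE TRIGGER trg_BudgetAmount_NoOverlap
ON BudgetAmount
FOR INSERT, UPDATE
AS
IF EXISTS (
    SELECT 1
    FROM inserted i
    JOIN BudgetAmount b
      ON  b.ID = i.ID
      AND b.Begin_Dt <> i.Begin_Dt     -- don't compare a row to itself
      AND i.Begin_Dt <= COALESCE(b.End_Dt, '99991231')
      AND b.Begin_Dt <= COALESCE(i.End_Dt, '99991231')
)
BEGIN
    RAISERROR ('Overlapping date range for this ID', 16, 1)
    ROLLBACK TRANSACTION
END
GO

-- Item 3: 'close out' the current record before a new range starts.
UPDATE BudgetAmount
SET    End_Dt = '20020531'
WHERE  ID = '0001'
  AND  End_Dt IS NULL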

The questions I have are:

  1. Is begin_dt + end_dt (incl. trigger) more complex than necessary in the new model?
  2. Are the null End_Dt values going to end up really hurting performance of the new model? Especially if, for query simplicity, someone uses the COALESCE() function to replace the End_Dt nulls with the current date via getdate()/current-date system functions? (See the query sketch after this list.)
  3. Is the elimination of duplicates worth the increase in complexity brought by the new model?
  4. Is there a better way to model this?
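
For question 2, these are the two ways I can picture the 'as of a given date' lookup being written (same made-up names as the sketch above):

DECLARE @asof datetime
SET @asof = '20020615'

-- Variant A: fold the open End_Dt into "today" with COALESCE(), per question 2.
SELECT ID, Begin_Dt, End_Dt, Amount, FlagA, CodeA
FROM   BudgetAmount
WHERE  ID = '0001'
  AND  @asof BETWEEN Begin_Dt AND COALESCE(End_Dt, getdate())

-- Variant B: the same predicate without wrapping End_Dt in a function.
SELECT ID, Begin_Dt, End_Dt, Amount, FlagA, CodeA
FROM   BudgetAmount
WHERE  ID = '0001'
  AND  Begin_Dt <= @asof
  AND  (End_Dt >= @asof OR End_Dt IS NULL)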

Those questions don't really address some of the other problems in the old data model though, like when 2 records for the same year and month needed to exist for the same "ID", functionally creating duplicate keys; the result was that the budget analyst had to do some weird gyrations, imputing the 'correct' record by reading other data flags. In essence, the domain of the key changed depending upon the context of the type of query ... Agggh! Uniqueness = the whole record!

I don't know if I'd exactly describe this as a 2nd vs. 3rd NF issue, but given that they're conceding it *could* impact performance on a lesser/smaller machine, I think they might as well concede it'd impact a heavily used machine with multiple application databases (we're talking about Dell rack-mount 4-way blades FDDI'd to a SAN).

I know that if I went back to first principles and read some weighty tome on normalization I'd find an answer, but I'm taking a shot & seeing what more experienced folks have to say, to save some time narrowing down best practice. My general research on this, however, has led me to believe that using a date range as part of a functional key would work best.

Thanks in advance,

--lee