Normalization question
Date: 26 Jun 2002 11:51:26 -0700
Message-ID: <a386bae7.0206261051.6c87a148_at_posting.google.com>
I've remodeled an existing dataset that had a lot of repeating data, and by changing the model we've realized 5:1 and 9:1 reductions in data volume. So far, so good.
But the question came up whether it'd be better to go back toward the previous data model, which on modern (read: fast) machines may be good enough. It's a fair question, perhaps a tad premature, but I need to address it.
I could claim the efficacy of the new model is self-evident from the order-of-magnitude reductions in volume, the elimination of redundancy, the improved data integrity, etc. But I also need to answer from the perspective of the application model and the process-flow model.
Old Data Model (key is ID + Year + Month):
ID Year Month Amount FlagA CodeA
0001 2002 01 $1000.00 Y ABCD
0001 2002 02 $1000.00 Y ABCD
0001 2002 03 $1000.00 Y ABCD
0001 2002 04 $1100.00 Y DEFG
0001 2002 05 $1100.00 Y DEFG
0001 2002 06 $1100.00 Y ABCD
<keep adding new records>
New Data Model:
- Year+Month records are collapsed into single records w/ date ranges; the current active record has a null (open) end date.
- To maintain functional uniqueness, I have to use a date-range trigger to prevent overlapping Begin_dt + End_dt values on the same Key.
- Previous record has to be 'closed out' by updating end_dt so that it is no longer the 'current' record.
- I've seen other models similar to this, such as hotel reservation systems, where records carry room-reservation date ranges, including open (null) end dates. (SQL seems kinda weak for these purposes - I've read that none of the major vendors have implemented the SQL-92 datetime/interval types that might have made this a simpler problem....)
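To make the trigger question concrete, here's a minimal sketch of the collapsed model using SQLite via Python. The table and column names (`charges`, `begin_dt`, `end_dt`, etc.) are my own placeholders, not the actual schema. The trigger rejects any insert whose date range overlaps an existing range for the same ID, treating a null end_dt as open-ended:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE charges (
    id       TEXT NOT NULL,
    begin_dt TEXT NOT NULL,           -- ISO-8601 dates compare correctly as text
    end_dt   TEXT,                    -- NULL = open/current record
    amount   NUMERIC NOT NULL,
    flag_a   TEXT,
    code_a   TEXT,
    PRIMARY KEY (id, begin_dt)
);

-- Reject any new row whose [begin_dt, end_dt) range overlaps an
-- existing range for the same id; an open range (end_dt IS NULL)
-- blocks everything from its begin_dt onward.
CREATE TRIGGER no_overlap BEFORE INSERT ON charges
WHEN EXISTS (
    SELECT 1 FROM charges c
    WHERE c.id = NEW.id
      AND NEW.begin_dt < COALESCE(c.end_dt, '9999-12-31')
      AND COALESCE(NEW.end_dt, '9999-12-31') > c.begin_dt
)
BEGIN
    SELECT RAISE(ABORT, 'overlapping date range');
END;
""")

conn.execute("INSERT INTO charges VALUES "
             "('0001', '2002-01-01', NULL, 1000.00, 'Y', 'ABCD')")

# Amount changes in April: close out the open record, then add the new range.
conn.execute("UPDATE charges SET end_dt = '2002-04-01' "
             "WHERE id = '0001' AND end_dt IS NULL")
conn.execute("INSERT INTO charges VALUES "
             "('0001', '2002-04-01', NULL, 1100.00, 'Y', 'DEFG')")

# An overlapping insert is rejected by the trigger:
try:
    conn.execute("INSERT INTO charges VALUES "
                 "('0001', '2002-02-15', '2002-03-01', 999.00, 'Y', 'XXXX')")
except sqlite3.IntegrityError as e:
    print(e)  # overlapping date range
```

One UPDATE + INSERT pair here replaces twelve monthly rows a year, which is where the volume reduction comes from; the trade is exactly the trigger and close-out bookkeeping described above.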
The questions I have are:
- Is begin_dt + end_dt (incl. trigger) more complex than necessary in the new model?
- Are the null end_dt values going to end up really hurting performance of the new model? Especially if, for query simplicity, someone uses COALESCE() to replace the null end_dt with the current date via getdate()/CURRENT_DATE?
- Is the elimination of duplicates worth the increase in complexity brought by the new model?
- Is there a better way to model this?
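On the COALESCE question specifically, here's a sketch of the two common query shapes against a hypothetical date-range table (SQLite via Python; names are illustrative). Wrapping end_dt in COALESCE() typically prevents the optimizer from using an index on that column, while the explicit `IS NULL OR` form leaves it with plain comparisons:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE charges (
    id TEXT, begin_dt TEXT, end_dt TEXT, amount NUMERIC,
    PRIMARY KEY (id, begin_dt))""")
conn.executemany("INSERT INTO charges VALUES (?, ?, ?, ?)", [
    ("0001", "2002-01-01", "2002-04-01", 1000.00),
    ("0001", "2002-04-01", None,         1100.00),   # open/current record
])

as_of = "2002-02-15"

# Shape 1: COALESCE the open end date to 'today'.  Simple to write, but
# the function call on end_dt usually defeats an index on that column.
coalesce_row = conn.execute("""
    SELECT amount FROM charges
    WHERE id = ? AND begin_dt <= ?
      AND ? < COALESCE(end_dt, DATE('now'))
""", ("0001", as_of, as_of)).fetchone()

# Shape 2: spell the NULL out; the optimizer sees plain comparisons.
explicit_row = conn.execute("""
    SELECT amount FROM charges
    WHERE id = ? AND begin_dt <= ?
      AND (end_dt IS NULL OR end_dt > ?)
""", ("0001", as_of, as_of)).fetchone()

print(coalesce_row == explicit_row)  # both find the record in force on the as-of date
```

A middle ground some shops use is a sentinel end date (e.g. '9999-12-31') instead of NULL, which keeps the predicate index-friendly at the cost of a magic value in the data.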
I know that if I went back to first principles and read some weighty tome on normalization I'd find an answer, but I'm taking a shot here to see what more experienced folks have to say and save some time narrowing down best practice. My general research on this, however, has led me to believe that using a date range as part of a functional key would work best.
Thanks in advance,
--lee