Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Thu, 27 May 2004 22:32:25 +0200
Message-ID: <gkhcb01258161aqgnekd5qj4qc0qj71imo_at_4ax.com>


On 26 May 2004 16:36:38 -0700, Neo wrote:

Hi Neo,

>> > > Here are Date's formal definitions for the five normal forms ...
>> >
>> > The problem ... is that they are RM specific.
>>
>> Exactly how is that a "problem"?
>
>The problem is, when I used the term normalized, I meant it in the
>more general sense which can be applied to any data model [replace
>redundant data with refs (system supported links that are unrelated to
>data ie IDs in RM) to the one and only original within db). But I can
>see how the term caused confusion.

Well, when you set a challenge, it's not what you *think* that counts, but what you *write*. And what you wrote didn't cause any confusion to me:

	"(...) using the relational model (...) from normalized and
	 NULL-less data (...)"

I don't see how such a clear statement might cause confusion.

>> You didn't set a challenge to make a report using a model that is
>> normalized according to XDb1's own definition of normalisation and then
>> kludged in an RDBMS.
>
>XDb1 doesn't define the general form of normalization. Neither does
>RM. But the general form of normalization applies to all data models.
>I want RM's solution to be normalized in the general sense of the word
>(which apparently is a superset of the 5/6 "normal forms"). Even C. J.
>Date states "the purpose of such reduction is to avoid redundancy" in
>his chapter titled "Further Normalization I: 1NF, 2NF, 3NF, BCNF".

That is the purpose indeed. And boy, lots and lots of redundancy gets removed when you take an unnormalized design and drill down through all Date's normal forms. So the method prescribed by Date seems to fit the stated purpose well.

It would be more convincing if you could produce a quote from Date where he says that "normalization is not complete until you can be assured that each character of the alphabet gets stored at most one time, but pointers may be stored as often as needed". Think you can find something like that in Date's writings?
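To make that concrete, here is a minimal sketch (hypothetical tables, nothing to do with the challenge itself; SQLite via Python standing in for a full RDBMS) of the kind of redundancy that Date's normal forms actually remove, as opposed to repeated character strings that each assert a distinct fact:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Violates 3NF: city depends on customer alone, not on the whole row,
-- so 'Amsterdam' is stored once per order instead of once per customer.
CREATE TABLE orders_unnormalized (
    customer TEXT, city TEXT, product TEXT);
INSERT INTO orders_unnormalized VALUES
    ('john', 'Amsterdam', 'apple'),
    ('john', 'Amsterdam', 'pear'),   -- 'Amsterdam' repeated: redundant
    ('mary', 'Utrecht',   'apple');

-- Decomposition per Date: the transitive dependency gets its own table.
CREATE TABLE customers (customer TEXT PRIMARY KEY, city TEXT);
CREATE TABLE orders (customer TEXT REFERENCES customers, product TEXT);
INSERT INTO customers SELECT DISTINCT customer, city FROM orders_unnormalized;
INSERT INTO orders SELECT customer, product FROM orders_unnormalized;
""")

# Each city is now stored once per customer; a join reconstructs the original.
rows = con.execute("""
    SELECT o.customer, c.city, o.product
    FROM orders o JOIN customers c USING (customer)
    ORDER BY o.customer, o.product""").fetchall()
print(rows)
```

No information is lost by the decomposition (the join gives the original rows back), which is exactly the test the normal forms apply, and one that repeated strings like 'fido' in the hierarchies table pass just fine.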

>> Table hierarchies
>> thing otherThing hierarchy
>> ------ ---------- ---------
>> 'fido' 'john' 'leader'
>> 'fido' 'mary' 'leader'
>> 'fido' 'luke' 'leader'
>> My solution had no redundancy ... but my point should already be clear:
>> there is no redundancy. You seem to be thinking that
>> storing the character strings 'fido' and 'leader' several times is bad.
>
>Redundancy is plainly obvious to me, so we must be working from
>different definitions of normalization.

That's what I've been trying to tell you all the time! I've been working from the definition of normalization that Date lays down in the well-known normal forms, a definition that is accepted by all experienced DBAs and RDBMS programmers. You've been working from a definition of normalization that you've made up for yourself and that, to put it bluntly, is accepted by nobody but you yourself.

Which wouldn't have bothered me if you had taken the trouble of telling the world, right from the off, in the message that started this thread, that the challenge was for an implementation using an RDBMS but according to Neo's personal definition of normalization. But you didn't. Your message explicitly stated that to win the challenge, I had to be "using the relational model". Date's definitions of normalization are part of the relational model. Yours aren't. QED.

I already told you before: the relational model is about storing facts (represented as relationships), not about storing names. Names are like pointers (not to a memory location, but to something in the real world).

Redundancy means that you can take something away without changing the meaning of the whole. I'll give you two facts that share one name; now you show me how you can take something away from those two facts without changing the meaning of the whole:

  1. Neo owes Hugo $1000
  2. The International Bank Account Number (IBAN) of Hugo's bank account is NL59 RABO 0118 3365 68.

> Redundancy makes a solution
>less generic and more prone to problems over a border scope.

Less generic? No. More prone to problems? Yes. But since there is no redundancy in my implementation (neither in the first, nor in the second version), this is irrelevant.

> For
>instance, the above table's design prevents it from handling two
>things with the same name or a thing with no name.

That can be accomplished by an RDBMS as well, but it was not a requirement for the challenge. Making something more generic has a price. In an RDBMS, that price is usually a performance hit. I've grown accustomed to making my implementations exactly as generic as required. Not less (of course), but neither more (since in the real world, performance DOES matter).

Now you may think that bad practice, but my customers seem to like having applications that perform well while offering all the flexibility they need, rather than an over-generic system that slowly slogs along.

>> for argument's sake assume ...
>> That would allow me to remove all 'fido's and 'leader's from the
>> hierarchies table. Instead, it would now look like this:
>>
>> Table hierarchies
>> thingID otherThingID hierarchyID
>> --------- ------------ -----------
>> (snip some rows)
>> 17 3 1
>> 17 4 1
>> 17 5 1
>>
>> And the table things would read:
>> thingID thingName
>> ------- ---------
>> 3 'john'
>> 4 'mary'
>> 5 'luke'
>> 17 'fido'
>
>> But if I follow your suggestion of "improving", I end up with the
>> same data in the existing tables and some extra relations that
>> I could do without.
>
>The above design is an improvement because it increases genericness.
>Two things can now have the same name. Also, the above design is a
>stepping stone to the next design which would allow a thing to be name
>less and not incur NULLs.
>
>The above design is an improvement because it reduces redundant data.
>The duplicate 17 in T_Hierarchies and T_Thing is not considered
>redundant (with respect to dbs) because it is unrelated to data being
>modelled. Separating data from the mechanism that links things with a
>db, allows maximum flexibility under a broader scope (ie allowing
>unamed things).

Oh no, now you've done it. That last sentence will probably set off the next Holy War between the Natural Key Group vs the Artificial Key Group.

I've seen quite enough of them Holy Wars, thank you very much. I'm a practical guy. While the Holy Followers Of The One And Only True Belief are attempting to flame the Holy Followers Of The Other One And Only True Belief to death in the newsgroups, practical people like me just pick either a natural key or an artificial key, whichever suits them best.

Oh and it might interest you that the defenders of the artificial key never claim that an artificial key is "more relational" or "less redundant" than a natural key - they generally defend their preferred choice because it's more practical and quicker. The defenders of the natural key *do* sometimes use arguments from relational theory to make clear why an artificial key should not be used in an RDBMS. Try googling one of the SQL Server newsgroups for "Joe Celko" +identity and you're bound to find something.
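To make the trade-off concrete, here is a hypothetical sketch (again SQLite via Python; neither design is "more relational" or "less redundant" than the other) of the two key styles and what each buys you:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Natural key: the name itself identifies the row.
CREATE TABLE thing_natural (
    thingName TEXT PRIMARY KEY);

-- Artificial (surrogate) key: a generated ID identifies the row,
-- so two things may share a name, and a name may be absent.
CREATE TABLE thing_surrogate (
    thingID   INTEGER PRIMARY KEY,
    thingName TEXT);               -- no UNIQUE constraint: duplicates allowed
""")

con.execute("INSERT INTO thing_surrogate (thingName) VALUES ('fido')")
con.execute("INSERT INTO thing_surrogate (thingName) VALUES ('fido')")  # accepted

con.execute("INSERT INTO thing_natural VALUES ('fido')")
duplicate_rejected = False
try:
    con.execute("INSERT INTO thing_natural VALUES ('fido')")
except sqlite3.IntegrityError:
    duplicate_rejected = True      # natural key refuses the second 'fido'
print(duplicate_rejected)
```

The surrogate design is more generic (duplicate or missing names), the natural design is simpler and self-describing; which one suits a given application is a practical call, not a theological one.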

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)