Re: NULLs

From: Hugo Kornelis <hugo_at_perFact.REMOVETHIS.info.INVALID>
Date: Fri, 28 Dec 2007 00:31:14 +0100
Message-ID: <ecb8n3918p2m7hn6ftj19i9l7d52mvso6a_at_4ax.com>


On Thu, 27 Dec 2007 01:32:45 -0800 (PST), stevedtrm wrote:

>> > If everyone is clear NULLS shouldn't be used, why the debate as to what
>> > to do about them ?
>>
>> Because SQL allows NULL and even promotes the idea that NULL solves some
>> problem instead of introducing many.
>
>So everyone is agreed that NULLs shouldn't appear anywhere, and its
>just a matter of time before NULLS become a legacy problem and a
>relational language supercedes SQL?

Hi Steve,

Frankly, I doubt it.

IMO, the real problem is not NULL, the real problem is that in reality, there will always be cases where for a single occurance of an entity, one or more attribute values are for whatever reason not on file. Since databases are an instrument to model reality, this concept of "missing information" has to be represented and handled somehow.

Most databases are built around the concept of combining predicates with the same determinator in a single table. This introduces a problem with missing information - take for instance a table that combines the predicates "<person> is <age> years old" and "<person> is of <gender>". If for a given occurance of the person entity the gender is on file but the age is not, the combined table still expects "something" to be in the age column. That's where the NULL marker comes in. It is the only safe solution - using a magic number (such as -1) to represent the fact that a persons age is not on file would work, but what if a domain allows all values representable by the data type?

The alternative table design that eliminates the use of NULL in the tables would split optional predicates to their own tables. In this case, that would mean at least two tables, possibly even three if we also need to be able to store persons for whom neither age nor gender is on file. But if we assume that gender is always on file, we can get away with two tables, one (Persons) with columns person and gender, to store the persons we have on file and their gender; the other (PersonAge) with columns person (foreign key into table Persons) and age, to store the age of persons. Persons without age on file are omitted from the table PersonAge.

This removes all NULL values from the tables. But it doesn't remove them completely. As soon as we have to create a report on all data for persons on file, we have to outer join the two tables and get the NULLs right back.

And even if there would be some way to produce such a report without having any NULL, then that would still not be a good solution. Because, as I said earlier, the problem is in the missing data, not in the NULLs used to represent the missing data. If you had the latter database design and you were asked to report on people over 40 years of age, would you include or exclude persons that are not in the PersonAge table?

If a representation of some data can be transformed into another transformation of the same data without losing or adding any information, both representations are exactly equivalent and there should not be any reason from a theoretic POV to dismiss one over the other. The representation without NULL would be like this:

Table Persons:                 Table PersonAge:
 person | gender                person | age
--------+--------              --------+-----
 Mary   | F                     Joe    | 42
 Joe | M

The representation with NULL would be like this:

Table Persons:
 person | gender | age
--------+--------+-----

 Mary   | F      | NULL
 Joe    | M      | 42

As long as you don't forget to read the NULL in the age column as "the age of Mary is not on file" and don't interpret anything else into it, these two representations are exactly equivalent.

>Are the two solutions I suggested before the widely accepted as
>resolutions to the two problems NULLs were introduced to eradicate?
>
>>Missing information? Then surely there should just be no tuple?

That requires seperating the tables. You can omit the tuple for Mary from the PersonAge table, but not from the Persons table that includes an age column.

>>To indicate that there can be no value? Why not a seperate table with
>>a boolean value in the non-key column?

Agreed. See my post to Bob.

Best, Hugo Received on Fri Dec 28 2007 - 00:31:14 CET

Original text of this message