Re: So what's null then if it's not nothing?

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Wed, 30 Nov 2005 23:11:15 +0100
Message-ID: <ji7so1l73jsh1rkjencq25ibirrh7oe8h3_at_4ax.com>


On Tue, 29 Nov 2005 17:36:59 +0100, Jon Heggland wrote:

>In article <1133277089.665394.263550_at_o13g2000cwo.googlegroups.com>,
>boston103_at_hotmail.com says...
>> Jon Heggland wrote:
>> [...]
>>
>> > A SELECT DISTINCT (it is really too kind to call this SQL
>> > construct "projection") eliminates duplicate NULLs, just like it
>> > eliminates duplicate values. To handwave this by saying "well, they're
>> > not *distinct*, but that doesn't mean they're *equal*" is just a bad
>> > excuse.
>>
>> It's simple really if you bother to read something about the 3vl.
>> According to the 3vl rules, whether two nulls are distinct or equal is
>> unknown which maps to false in the 2vl.
>
>By the exact same argument, duplicate NULLs should *not* be removed,
>because they are not equal. Choosing one over the other is an arbitrary
>decision; an artifact of how you formulate the SELECT DISTINCT
>operation.

Hi Jon,

I agree with this. Removing duplicate NULLs from the results of a DISTINCT operation is weird.

But let's place this in a broader perspective. How should NULLs be handled in a GROUP BY operation? Here you could also say that each NULL should be a group of its own - probably the mathematically most correct way to handle it. But there is also much to be said for defining that in a GROUP BY, all NULLs go in one group - like the "Other" group you typically find in the results of a poll. Though arguably less "correct", this approach is definitely more geared towards practical use. I think the ANSI committee chose the most practical approach - and I'm glad they did, otherwise we'd all be working around this all the time!

Back to DISTINCT. This operation is tightly related to GROUP BY: remove the aggregate expressions, and GROUP BY and DISTINCT even become synonyms. That equivalence would break if DISTINCT kept NULLs separate from each other. This might have been the reason for the ANSI committee to define the treatment of NULLs in DISTINCT analogous to the treatment of NULLs in GROUP BY.

But I still agree - it IS weird, and I believe that DISTINCT should actually retain each NULL instead of lumping them together.
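For the record, the ANSI behavior described above is easy to verify. Here's a minimal sketch using SQLite through Python's built-in sqlite3 module (SQLite follows the ANSI choice here; the table and column names are made up for the example):

```python
# How DISTINCT and GROUP BY treat NULLs: both lump all NULLs together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INTEGER)")
conn.executemany("INSERT INTO t (c) VALUES (?)", [(1,), (1,), (None,), (None,)])

# DISTINCT removes the duplicate NULL: only one NULL survives.
distinct_rows = conn.execute("SELECT DISTINCT c FROM t").fetchall()
print(distinct_rows)  # two rows: one for 1, one for NULL

# GROUP BY puts both NULLs in a single group, analogous to DISTINCT.
groups = conn.execute("SELECT c, COUNT(*) FROM t GROUP BY c").fetchall()
print(groups)  # the NULL group has a count of 2
```

Note that without the COUNT(*) the GROUP BY query returns exactly the same rows as the DISTINCT query - the equivalence mentioned above.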

>And to place 'A' before 'B', one has to say that 'order A before B'.

Indeed: ORDER BY Column1 ASC

>> > > The
>> > > SQL'92 standard says that the column value is a duplicate if it's
>> > > already present in the set of existing column non-nulls
>> >
>> > I'm not able to parse this.
>> >
>>
>> What's so hard about that ?
>
>I didn't understand the phrase "set of existing column non-nulls".

Take all rows in the table - that's the set of existing rows. Now strip all columns not used in the UNIQUE constraint. What's left is the set of existing column values.
Remove all rows that have NULL in any of those columns. You now have the set of existing non-nulls.

New values have to be checked against this set.
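A small sketch of this rule, again using SQLite via Python's sqlite3: a new value is rejected only if it is already in the set of existing non-nulls, and NULL is never a member of that set, so repeated NULLs are accepted (unlike in SQL Server):

```python
# A UNIQUE column rejects a value already present, but accepts
# repeated NULLs: NULL is never in the set of existing non-nulls.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INTEGER UNIQUE)")
conn.execute("INSERT INTO t (c) VALUES (1)")

try:
    conn.execute("INSERT INTO t (c) VALUES (1)")  # 1 is in the set: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("INSERT INTO t (c) VALUES (NULL)")  # not in the set: accepted
conn.execute("INSERT INTO t (c) VALUES (NULL)")  # still not in the set: accepted
null_count = conn.execute(
    "SELECT COUNT(*) FROM t WHERE c IS NULL").fetchone()[0]
print(duplicate_rejected, null_count)
```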

>> Say, you have a set {1,2,3}, then you
>> get a new value 1. Clearly, you have a duplicate. Now, let's assume
>> your new value is null. You cannot say whether it's a duplicate or not
>> -- the result of set membership operation (or comparison if you prefer)
>> is unknown. The 3vl unknown is mapped to false, therefore, null is
>> *not* a duplicate.
>
>Now insert NULL again. Still no duplicate? Not according to SQL, and it
>seems you agree that that is sensible.

Correct. (Though Microsoft SQL Server gets this wrong - it will reject the second NULL. I can't call this a bug, because it appears to be by design, but I can and will call the design awful.)

>But now take a table with a composite key (or a set of tuples, if you
>prefer). Insert (NULL, NULL). Insert (NULL, NULL) again. No duplicate.
>Insert (1, NULL). No duplicate. Insert (1, NULL) again. Duplicate! Why?

If your DB rejects the second insert, you've found a bug. (Or an awful design <grin>).

The second (1, NULL) should be allowed; it's not a duplicate according to the ANSI definitions.
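The composite-key case from the exchange above can be checked the same way; SQLite, for one, accepts all four inserts (a minimal sketch, with made-up table and column names):

```python
# Composite UNIQUE key: (NULL, NULL) twice and (1, NULL) twice.
# None of these is a duplicate under the ANSI rules, since a row
# containing a NULL is never in the set of existing non-nulls.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, UNIQUE (a, b))")
for row in [(None, None), (None, None), (1, None), (1, None)]:
    conn.execute("INSERT INTO t (a, b) VALUES (?, ?)", row)

row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(row_count)  # all four inserts succeed
```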

>> > > So for NULL, the DUPLICATE predicate would return UNKNOWN (or
>> > > NULL) which is then mapped to false
>> >
>> > Indeed? AFAIK, SQL constraints are considered satisfied if they are true
>> > or if they are unknown.
>>
>> With constraints, one wants to be sure that a value is *rejected* if
>> the constraint is *NOT* satisfied.
>
>I might want to be sure that the value is accepted only if the
>constraint is satisfied. Why is one more correct than the other? I know
>which one I think is easier to state and understand.

In most cases, that's as easy as adding a NOT NULL constraint to the column.

If that doesn't work in a specific case, then make sure that you formulate the constraint such that it will evaluate to false for each row you don't want to accept.

It's not a question of more or less correct, BTW. It's just a question of how the rules are defined. But it might help to consider that *if* a constraint rejected all rows for which it evaluates to false or unknown, most CHECK constraints would implicitly disallow NULLs in many columns. The current definition leaves the DBA in control: add an explicit NOT NULL constraint, or don't.
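To illustrate the rule: a CHECK constraint rejects a row only when it evaluates to false; unknown passes. So CHECK (c > 0) does not implicitly forbid NULL - you need a separate NOT NULL constraint for that. A quick sketch with SQLite via Python's sqlite3 (which implements the standard behavior):

```python
# CHECK constraints reject only on false; unknown (NULL) is allowed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INTEGER CHECK (c > 0))")

conn.execute("INSERT INTO t (c) VALUES (5)")     # c > 0 is true: accepted
conn.execute("INSERT INTO t (c) VALUES (NULL)")  # c > 0 is unknown: accepted

try:
    conn.execute("INSERT INTO t (c) VALUES (-1)")  # c > 0 is false: rejected
    negative_rejected = False
except sqlite3.IntegrityError:
    negative_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(row_count, negative_rejected)
```

Declaring the column as c INTEGER NOT NULL CHECK (c > 0) would make the second insert fail as well - which is exactly the explicit choice the current definition leaves to the DBA.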

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
Received on Wed Nov 30 2005 - 23:11:15 CET
