Re: So what's null then if it's not nothing?

From: Jon Heggland <heggland_at_idi.ntnu.no>
Date: Wed, 30 Nov 2005 12:20:47 +0100
Message-ID: <MPG.1df7a8861fea27d298971a_at_news.ntnu.no>


In article <1133286593.699982.59590_at_g43g2000cwa.googlegroups.com>, boston103_at_hotmail.com says...
>
> Jon Heggland wrote:
> > By the exact same argument, duplicate NULLs should *not* be removed,
> > because they are not equal. Choosing one over the other is an arbitrary
> > decision; an artifact of how you formulate the SELECT DISTINCT
> > operation.
>
> No, the argument is not exactly the same. We have stipulated two
> independent rules of the game:
>
> a. two nulls are not distinct; b. two nulls are not equal

  1. Projection removes all but one of the tuples that are not distinct from one another.
  2. Projection retains tuples so that no retained tuple is equal to any other.

Pick either one, but I can't see how you can claim that one is more correct than the other, except for the reason that Codd said so.
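
To make the difference concrete, a minimal sketch (the table, column and values are mine, not from the thread):

    CREATE TABLE t (x INTEGER);   -- x is nullable
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (NULL);
    INSERT INTO t VALUES (NULL);

    SELECT DISTINCT x FROM t;
    -- rule 1 (and SQL): two rows, 1 and a single NULL
    -- rule 2: three rows, because NULL = NULL is not true
    SELECT x FROM t WHERE x = x;
    -- one row (1), by the very same "NULL = NULL is not true"

Same data, same comparison, two different answers depending on which rule you picked.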

> In the 3VL, neither implies the other.

They don't in 2VL either; the *negation* of one implies the other. And I don't see how this in any way invalidates my argument. Codd's projection rule is based on this:

"In applying this nonduplication rule, a null value in one tuple is regarded as the same as a null value in another. This identification of one null value with another may appear to be in contradiction with our assignment of truth value to the test w = w [NULL = NULL]. However, tuple identification for duplicate removal is an operation at a lower level of detail than equality testing in the evaluation of retrieval conditions. Hence, it is possible to adopt a different rule."

I don't buy this "lower level of detail" thing. It is taken out of thin air; handwaving to resolve a glaring inconsistency.

> Whether the rules make
> *practical sense* is another matter entirely ;)

Come on. That is exactly the matter *I* am discussing, at least.

> > And this "unknown maps to false"---where do you get that from?
>
> From the same article I've mentioned multiple times. E.g. select
> produces only the tuples for which the select predicate evaluates to
> true. In the 2VL, it means effectively that unknown is treated as
> false.

Codd never uses the phrase "maps to false", and I think it is dangerous and confusing to do so. E.g. it could be argued that since unknown "maps to false", "not unknown" maps to "not false", i.e. true. You say constraints lead to tuples being rejected if false; this is equivalent to saying that constraints are always true or unknown (of course, "or" here is not the logical operator). If you want to "map" this to 2VL, it makes as much sense to say that constraints are always true, and that unknown maps to true.
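
A quick sketch of why the phrase misleads (table, column and values are mine):

    CREATE TABLE t2 (x INTEGER);
    INSERT INTO t2 VALUES (5);
    INSERT INTO t2 VALUES (NULL);

    SELECT * FROM t2 WHERE x = 5;        -- one row: 5
    SELECT * FROM t2 WHERE NOT (x = 5);  -- no rows: NOT unknown is still unknown

If unknown genuinely "mapped to false", the second query would have to return the NULL row. It does not; nothing is being mapped anywhere.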

> > > Oh yes. Ascending or descending has got nothing to do with null
> > > 'ordering'. In order to place nulls at the top/bottom of a result set,
> > > one has to say 'order by nulls first/last' with 'last' being the
> > > default thus creating a user mapping that converts nulls to the
> > > least/greatest value.
> >
> > And to place 'A' before 'B', one has to say that 'order A before B'.
>
> I do not understand what you are trying to say here. There is some
> natural ordering defined on various data types , that's what 'order
> by' uses. There is obviously no such ordering amongst nulls,
> therefore one has to specify ordering manually, using words like nulls
> last or nulls first.

And to the user, at the logical level, it looks as if the sorting algorithm compared null to 'A' and determined that null was less. Defining a natural ordering on a data type and specifying an ordering manually---what is the difference?
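
For instance (table and column are mine; the NULLS FIRST/LAST syntax is in the standard, but support and defaults vary between products):

    CREATE TABLE person (name VARCHAR(50));   -- name is nullable
    INSERT INTO person VALUES ('A');
    INSERT INTO person VALUES ('B');
    INSERT INTO person VALUES (NULL);

    SELECT name FROM person ORDER BY name ASC NULLS FIRST;
    -- result: NULL, 'A', 'B', indistinguishable from null simply collating before 'A'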

> > > Say, you have a set {1,2,3}, then you
> > > get a new value 1. Clearly, you have a duplicate. Now, let's assume
> > > your new value is null. You cannot say whether it's a duplicate or not
> > > -- the result of set membership operation (or comparison if you prefer)
> > > is unknown. The 3vl unknown is mapped to false, therefore, null is
> > > *not* a duplicate.
> >
> > Now insert NULL again. Still no duplicate?
>
> We *do not know* so the rule of the game is to be optimistic and hope
> it is not a duplicate.

It may be that we are talking past each other. You seem to be explaining what the SQL rules *are*, while I am questioning whether they are sensible---and an appeal to "the rule of the game" does not cut it for me. I'm not comfortable with a logical system where { null, null } is a set.

> >Not according to SQL,
>
> What do you mean ? The SQL unique constraint does not consider two
> nulls as duplicates.

I know! But does it make sense?  
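
Spelled out (names are mine; this is the standard's reading, and as noted below implementations differ):

    CREATE TABLE u (k INTEGER UNIQUE);
    INSERT INTO u VALUES (1);
    INSERT INTO u VALUES (1);      -- rejected: 1 = 1, a duplicate
    INSERT INTO u VALUES (NULL);
    INSERT INTO u VALUES (NULL);   -- accepted: NULL = NULL is not true,
                                   -- so no duplicate is detected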

> > But now take a table with a composite key (or a set of tuples, if you
> > prefer). Insert (NULL, NULL). Insert (NULL, NULL) again. No duplicate.
> > Insert (1, NULL). No duplicate. Insert (1, NULL) again. Duplicate! Why?
>
> Because, the implementation(s) deviates from the standard (for a lot
> of practical reasons tha may or may not make sense to everyone).

Fair enough.
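
For reference, that scenario in SQL (names are mine; by the standard's UNIQUE rule none of these inserts is a duplicate, yet the last one was rejected by the implementation(s) in question):

    CREATE TABLE c (a INTEGER, b INTEGER, UNIQUE (a, b));
    INSERT INTO c VALUES (NULL, NULL);
    INSERT INTO c VALUES (NULL, NULL);   -- no duplicate
    INSERT INTO c VALUES (1, NULL);      -- no duplicate
    INSERT INTO c VALUES (1, NULL);      -- "duplicate!" where the implementation deviates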

> > > With constraints, one wants to be sure that a value is *rejected* if
> > > the constraint is *NOT* satisfied.
> >
> > I might want to be sure that the value is accepted only if the
> > constraint is satisfied. Why is one more correct than the other? I know
> > which one I think is easier to state and understand.
>
> Then, in your pessimistic database world, you'd reject all the record
> for people with unknown phone number, unknown age, unknown middle
> name, etc.

Not at all, and I find it very strange that you draw this unfounded conclusion. Let me explain.

Say we have a 3VL database system with NULLs, with constraints required to evaluate to true. I have a table of people and their age, and age can be unknown, represented by NULL. And I have a restriction saying "age > 10". In my world, a tuple with NULL age would indeed be rejected, as I expect it to---after all, NULL is not greater than 10. However, if I want to be able to put unknown ages in my table, I would have the constraint say "age is NULL or age > 10". This explicitly says that age may be unknown; if it is not, it must be greater than 10.

These two approaches (you may call them the pessimistic and the optimistic) have exactly the same expressive power. Therefore, neither of them is more correct than the other. But I consider the pessimistic approach more intuitive and easier to understand. After all, SQL constraints are expressed as a positive predicate checking that some condition is satisfied, e.g. "CHECK age > 10". If the syntax had instead been negative, e.g. "REJECT IF age <= 10", the optimistic approach (allowing NULLs by default) would be more intuitive.
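
The contrast, spelled out (table and column names are mine):

    -- SQL as it is (optimistic: the constraint must merely not be false),
    -- so a NULL age slips through without ever being mentioned:
    CREATE TABLE person_sql (age INTEGER CHECK (age > 10));

    -- my hypothetical must-be-true world (pessimistic): unknown ages have to
    -- be allowed explicitly, which at least makes the intent visible:
    CREATE TABLE person_3vl (age INTEGER CHECK (age IS NULL OR age > 10));

    INSERT INTO person_sql VALUES (NULL);  -- accepted: the CHECK is unknown, not false
    INSERT INTO person_3vl VALUES (NULL);  -- accepted in either world: the CHECK is true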

In fact, all database textbooks I have read (e.g. Elmasri and Navathe; Garcia-Molina, Ullman and Widom) describe constraints as "assertions" or "conditions" that "must hold" or "must hold true". With regard to SQL, this is actually incorrect; they must not not (sic) hold, which is a different thing in 3VL. It breeds confusion, and I think SQL is to blame, not the textbooks.

(By the way, does Codd say anything about whether constraints must be true, or merely not be false? I can't find anything in the article.)

> > Create a table with an attribute A, and a constraint that says that A
> > must be greater than five. Insert ten rows successfully. Select the rows
> > where A is greater than five. How many rows do you get?
>
> I'll get as many rows as there are satisfying the greater than 5
> predicate. There might be some rows with nulls, but I won't not get
> those (which should be evident based on our discussion).

Yes, and I believe this comes as a big surprise to people who have not followed our discussion and have not delved into the intricacies of the SQL standard, but have instead taken at face value the obvious notion that constraints are meant to hold. Always.
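
The scenario, spelled out with three rows instead of ten (names are mine):

    CREATE TABLE r (a INTEGER CHECK (a > 5));
    INSERT INTO r VALUES (6);
    INSERT INTO r VALUES (7);
    INSERT INTO r VALUES (NULL);          -- accepted: the CHECK is merely not false

    SELECT COUNT(*) FROM r WHERE a > 5;   -- 2, not 3: the WHERE clause demands true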

Not to mention that it must bug the hell out of the query optimizer.

> > Ok. Now, with 2VL, the unique constraint behaviour does *not* depend on
> > how you define it, and there is no decision to arbitrarily make. It is
> > clear and simple, and doesn't surprise anyone. I think it is better.
>
> Then, do not use nulls and all the real or perceived problems will go
> away.

Except that SQL has a tendency to produce NULLs whether I want them or not, and the design and implementation of SQL systems are more complicated, error-prone and slow than they need to be.

So, the solution to SQL's problems is to not use SQL. Fair enough, but not a very interesting conclusion.

-- 
Jon
Received on Wed Nov 30 2005 - 12:20:47 CET
