Re: How to cope with missing values - NULLS?

From: Peter Koch Larsen <pkl_at_mailme.dk>
Date: 21 Apr 2003 11:52:46 -0700
Message-ID: <61c84197.0304211052.35551e93_at_posting.google.com>


Costin Cozianu <c_cozianu_at_hotmail.com> wrote in message news:<b7s926$42jf5$1_at_ID-152540.news.dfncis.de>...
> Peter Koch Larsen wrote:
> > Costin Cozianu <c_cozianu_at_hotmail.com> wrote in message news:<b7paca$3b0o0$1_at_ID-152540.news.dfncis.de>...
> >
> >>Peter Koch Larsen wrote:
> >>
[snip]
> >>>
> >>>My personal conclusion is that we are stuck with NULLS - if not in the
> >>>SQL sense then at least in the sense that we are to have some kind of
> >>>sum type. And if we are stuck with such a type, then the only sensible
> >>>thing is to have it standardised by the DBMS - how else are we going
> >>>to let the DBMS perform the outer join by itself?
> >>>
[snip]
> >>I think you are rushing to conclusions.
> >>
> >>If we can use sum types for individual columns, we surely can use sum
> >>types for tuple types, so here we have your OUTER JOIN without any NULL.
> >>
> >>Besides, coming from a relational theory perspective, I'm not sure if we
> >> should regard outer joins as a relation over the sum type of the
> >>corresponding tuples (no NULLs involved), or just a collection of relations.
> >>
> >> From an host language API point of view, I'd rather prefer the later.
> >>
> >>Another alternative is to regard the tuple type for the outer join as
> >>
> >> col1 * col2 * ... colk * restOfTheJoin
> >>
> >>where col1 to colk are the common columns and restOfTheJoin is the outer
> >>part of the outer join and has a sum type
> >> Nothing | col(k+1) * ... coln
> >>
> >>Best regards,
> >>Costin Cozianu
> >
> >
> > I am not sure I understand you completely. Let us say that we have
> > defined a view as an outer join: columns col1 and col2 are "common,
> > but col3 and col4 might be missing. Now let's query that view - we are
> > only interested in tuples, where col4 is larger than 10: SQL gives us
> > something like
> >
> > SELECT * FROM ojview WHERE col4 > 10
> >
> > How would you phrase such a query in your (hypothetical) relational
> > system?
> > The point I'm getting at is that it - from the users point of view -
> > would be nice to be able to formulate a query in a style where she can
> > refer to col4 directly. A sum type on the tuple as a whole would
> > possibly complicate matters? As an example I imagine a query in the
> > style of:
> >
> > SELECT * FROM ojview WHERE {is_extended} AND col4 > 10
> >
> > Where {is_extended} somehow determines that col4 exists. If it is
> > something like this You have in mind, this would have consequences not
> > only for the user but also for the optimizer: In what order should it
> > evaluate the restricting conditions? I am much in doubt if removing
> > NULLs in this situation would be feasible.
> >
> > Kind regards
> > Peter Koch Larsen
>
> First of all, I think that optimizers will work much better and more
> important, *correctly* in a 2 valued logic.

Why is this so? Is this related to the "weird" x = x being false if X is NULL? I do agree that this can cause problems for optimisers, but is it - in practical life - such a big problem?
>
> With regards to the above query, I'd rather propose a syntax similar
> with actual languages that use sum types
>
> SELECT ... WHERE
> MATCH restOfTheRecord WITH
> Empty -> false
> | Nonempty r -> (r.col4 > 10)

This looks very much like my

   SELECT * FROM ojview WHERE {is_extended} AND col4 > 10 proposal - just without any problems related to evaluation-order. A most important improvement, of course.
>
> Or even simpler synctactic forms can be envisioned, like defining an
> anonymous boolean function:
> FUNCTION (FullRecord r) -> ( r.col4 > 10 )
> | _ -> false
>
> Where _ matches any type and | si a type case or structure case
> discriminator.
>
> Yes, it will be a little bit more to write for the user , but it will
> work correctly, while queries with NULLs work on a logic of their own.

This "a bit more" is what scares me a bit. Imagine an outer join (outer) joined with another outer join. My feeling is we might easily end in a mess.
>
> Try to SELECT WHERE col4 < 10 OR col4v >= 10
In my opinion, this is an obscure query, where I would be able to accept both the SQL-way of doing things (return only tuples, where col4 is null) or the alternative (return only rows where col4 is null). I must require a firm definition of course - it would not be acceptable to have the result varyiung depending on the query optimiser.

>
> The complexity stems from the missing information itself which is not a
> trivial thing to deal with, and from the primitive mehcanisms of SQL.

So .... what is so terrible of returning - for those fields that might be absent - a marker for each field determining the presence of that field?

>
> What SQL does with NULL is that it enforces a relatively broken default
> where users really need the power and flexibility and logic to deal with
> on a case by case basis semantically.
>
> NULL is currently use to represent missing information, NOT APPLICABLE,
> or other ad-hoc tricks like INFINITY for dates (as per Joe Celko's
> recent examples). It is obvious that a broken three valued logic cannot
> match all these semantics with default rules.
>
> Let's take for example Joe Celko's representation of date time intervals:
>
> (begin_date DATE, end_date DATE)
>
> where NULL in the begin_date "means" from "beginning of time" and
> end_date NULL means "to the end of time". With sum types the user can
> easily define a proper data type for this information, and override the
> default "<" order operator to properly account for the order relation.
>
> What is even better, we can easily define begin_date INCLUSIVE/EXCLUSIVE:
>
> type TIME_POINT=
> BEGIN_OF_TIME
> | INCLUSIVE of DATE
> | EXCLUSIVE of DATE
> | END_OF_TIME
>
> function '<' (point1, point2) = match point1, point2 with
> BEGIN_OF_TIME, BEGIN_OF_TIME -> false
> | BEGIN_OF_TIME , _ -> true
> | _ , BEGIN_OF_TIME -> false
> ... equivalent rules for
> | INCLUSIVE d1, EXCLUSIVE d2 -> d1 <= d2
> | EXLCUSIVE d1, EXCLUSIVE d2 -> d1 < d2
> | INCLUSIVE d1, INCLUSIVE d2 -> d1 < d2
> | EXCLUSIVE d1, INCLUSIVE d2 -> d1 > d2
>
> And now all the conditions related to start_date, end_date using
> TIME_POINT can be easily and logically expressed in terms of the '<'
> operator.
>
> We can also imagine that for example end_date may come from a join, with
> the semantic that if no end_date is specified in the related table, then
> we mean EWN_OF_TIME. Then it is easy for the administrator to create an
> outer join where instead of NULL we'll have the corresponding
> END_OF_TIME.
I did not see much merit in Celko's infinite date either, and I am not arguing against the presence of "sum types" in the DBMS. What i would like is to investigate if a specially "systemdefined" sum-type could be a viable solution for e.g. the outer join and if so what such a sum-type would look like. In particular: how would this type differ from what SQL proposes?

>
> With NULLs in place trying to account for proper interval conditions
> (define 1 interval included in another, or 2 intervals intersect), and
> more so, using outer joins is at best very inelegant and error prone.

Why is it so much more difficult than using your solution? I see the general solution as being difficult no matter what - with SQL having (sometimes!) sensible default results. I would like to write som Haskell code to demonstrate what i mean, but I am to unfamiliar with haskell to do so. If I were to write similar code in C++ (which I know), my belief is that these templated types would behave very much the SQL way.

>
>
> What I'm trying to get at, is that missing information is a very complex
> problem in itself, first of all, and we can't sweep it under the use of
> NULL easily. At least current practice rather proves this assertion, and
> the confusion that made it even to the SQL committee is yet one more
> evidence of that.
>
> Second using special markers is a semantic problem that has to be dealt
> on a case by case basis, and no matter how many special markers we
> invent (Date claims that at one moment 6 special markers were proposed),
> they will not math the real problems of end users under one size fits
> all solution.
>
> Therefore we can easily let the DBMS engine work only in 2 valued logic,
> it is better for the optimizers, foir DBMS writers, for correctness and
> predictability of behavior, so on so forth. And then make it easy for
> users to define their own special types with markers for UNKNOWN, N/A,
> INFINITY and so on forth, using a decent type system.

I agree completely.

>
> best regards,
> Costin Cozianu

Regards
Peter Received on Mon Apr 21 2003 - 20:52:46 CEST

Original text of this message