Re: How to cope with missing values - NULLS?

From: Paul Vernon <paul.vernon_at_ukk.ibmm.comm>
Date: Wed, 23 Apr 2003 13:41:17 +0100
Message-ID: <b86221$h26$1_at_gazette.almaden.ibm.com>


"Peter Koch Larsen" <pkl_at_mailme.dk> wrote in message news:61c84197.0304221440.32113f01_at_posting.google.com... > "Paul Vernon" <paul.vernon_at_ukk.ibmm.comm> wrote in message news:<b83u65$1imu$1_at_gazette.almaden.ibm.com>...
> > "Peter Koch Larsen" <pkl_at_mailme.dk> wrote in message
> > news:61c84197.0304220818.3050fbdf_at_posting.google.com...
> > [ snip ]
> > > There is not much of a difference the way I look at it and it really
> > > bugs down to how you feel the "=" operator should work.
> >
> > Indeed it does really boil down to how equality works..
> >
> > Quite simply I take it as a axiom that
> > x = x
> > for all possiable values x
> >
> > if you want to break this by introducing something that is not a value,
then
> > be my guest, but don't expect many others to follow you.

>
> Should I take this sentence as meaning that if only SQL defined NULL =
> NULL as true everywhere (i nearly wrote "in a consistent manner", but
> this must be superfluous) then the NULL-type would not be so bad - at
> least in special places such as the outer join (I cant think of any
> other place where nulls do occur naturally)?

Yes. NULL would not be so bad - it would be at least respectable - if it was a value rather than a 'marker'.

That said, I would not encourage the use of such a value across the board in a new, non-SQL DBMS. It is not a value with any well defined meaning IMO.

On your claim that NULLs occur 'natually' in outer joins. This is not so. Outer join is not a primative operation of the relational algebra. It is but a shorthand, vis

with

T1 {col1 INTEGER, col2 STRING}
T2 {col1 INTEGER, col2 STRING, col3 BOOLEAN, col4 INTEGER}

then

    T1 LEFT JOIN T2 FILL (col3 NOTAPPLICABLE, col4 0)

is shorthand (a 'macro') for: (apologies for any non exact Tutorial D syntax - I need to brush up)

    T1 JOIN

       ( T2{col1, col2, CAST_AS_BOOLEAN_NA(col3), col4} UNION
          ( EXTEND ( T2{col1, col2} MINUS T1 ) ADD NOTAPPLICABLE as col3, 0 as
col4)
        )


In SQL the shorthand is incomplete because there is no 'fill' specification, it uses NULLs for the 'fill'.

I think your argument boils down to whether a outer join shorthand should be allowed without a 'fill' specification. I would not include such a shorthand in a DBMS, but would not disallow a user from defining their own shorthand if they so wished. Such a user defined shorthand might go as far as using information in the catalog recorded against each type that defines whether the default 'fill' should be a value from that type (e.g. 0 for integer), or from a value in the union of the type with some other type. That however starts to get us into the realm of 'meta-programming', whereby in a relational language, the result relation type of a query depends of information not part of the database schema per-se.
The problem with any default fill specification, is that it is making assumptions about the meaning of a users LEFT JOIN statement. Sometimes 0 is correct for a fill, sometimes N/A, NOTYETOCCURED, UNK, BLANK, -1, etc. It all depends on the *actual* query that the user wants, a LOJ with a default fill specification is making *a guess* at what query the user actually wants, and unless the user sees the macro expanded (or has memorised the defaults), they won't know what query they are getting the result to.

To summarise, using left outer joins without explicitly specifying the fill, (using say COALSECE in SQL), is bad practice. Users should be forced to specify a fill to get them to fully understand (i.e. fully specify) their query.

If I review SQL that has un-COALESCEd LOJ columns, I send it back to be corrected.

Regards
Paul Vernon
Business Intelligence, IBM Global Services Received on Wed Apr 23 2003 - 14:41:17 CEST

Original text of this message