Re: How to cope with missing values - NULLS?

From: Peter Koch Larsen <pkl_at_mailme.dk>
Date: Thu, 24 Apr 2003 18:51:31 +0200
Message-ID: <3ea81614$0$42636$edfadb0f_at_dread11.news.tele.dk>


"Paul Vernon" <paul.vernon_at_ukk.ibmm.comm> skrev i en meddelelse news:b86221$h26$1_at_gazette.almaden.ibm.com...
> "Peter Koch Larsen" <pkl_at_mailme.dk> wrote in message
> news:61c84197.0304221440.32113f01_at_posting.google.com...
> > "Paul Vernon" <paul.vernon_at_ukk.ibmm.comm> wrote in message
> news:<b83u65$1imu$1_at_gazette.almaden.ibm.com>...
> > > "Peter Koch Larsen" <pkl_at_mailme.dk> wrote in message
> > > news:61c84197.0304220818.3050fbdf_at_posting.google.com...
> > > [ snip ]
> > > > There is not much of a difference the way I look at it and it really
> > > > bugs down to how you feel the "=" operator should work.
> > >
> > > Indeed it does really boil down to how equality works..
> > >
> > > Quite simply I take it as a axiom that
> > > x = x
> > > for all possiable values x
> > >
> > > if you want to break this by introducing something that is not a
value,
> then
> > > be my guest, but don't expect many others to follow you.
> >
> > Should I take this sentence as meaning that if only SQL defined NULL =
> > NULL as true everywhere (i nearly wrote "in a consistent manner", but
> > this must be superfluous) then the NULL-type would not be so bad - at
> > least in special places such as the outer join (I cant think of any
> > other place where nulls do occur naturally)?
>
> Yes. NULL would not be so bad - it would be at least respectable - if it
was a
> value rather than a 'marker'.
>
> That said, I would not encourage the use of such a value across the board
in a
> new, non-SQL DBMS. It is not a value with any well defined meaning IMO.

I do not see why. To me the semantics are quite clear and it allows the DBMS a nice way to support NULLs as we know them from SQL.

>
>

> On your claim that NULLs occur 'natually' in outer joins. This is not so.
> Outer join is not a primative operation of the relational algebra. It is
but a
> shorthand, vis

Yes - I am aware that an outer join corresponds to a UNION of a join and a difference operation. Still, I do not see the importance of that distinction as what constitutes a primitive operation can be debated. As an example, join is not normally considered a primitive operation as it can be replaced by a product, a projection and a restriction, but so far as I remember Date and Darwen has taken the opposite stance in TTM and consider join primitive.

>

> with
>

> T1 {col1 INTEGER, col2 STRING}
> T2 {col1 INTEGER, col2 STRING, col3 BOOLEAN, col4 INTEGER}
>

> then
> T1 LEFT JOIN T2 FILL (col3 NOTAPPLICABLE, col4 0)
>

> is shorthand (a 'macro') for: (apologies for any non exact Tutorial D
> syntax - I need to brush up)
>

> T1 JOIN
> ( T2{col1, col2, CAST_AS_BOOLEAN_NA(col3), col4} UNION
> ( EXTEND ( T2{col1, col2} MINUS T1 ) ADD NOTAPPLICABLE as col3,
0 as
> col4)
> )
> >

> In SQL the shorthand is incomplete because there is no 'fill'
specification,
> it uses NULLs for the 'fill'.

It might be incomplete but there is some convenience involved, I think. One advantage is that there is no need to create your own "NA"-types for every base-type you create (BOOLEAN_NA is such a type, I presume?). This ensures a consistent result no matter who creates the view. Also note that the if above join was used as some part of the query, we could formulate a query such as (reverting to SQL, with the LOJ being named V1):

SELECT * from V1 WHERE col3 = FALSE OR col3 = TRUE;

would not select all tuples. This is as it should be as col3 is not a BOOLEAN but a BOOLEAN_NA which has a third value (presumably) beyond TRUE and FALSE. I do not see how that could come as a surprise to the user.

(I'm averting the alternative phrase WHERE NOT col3 or col3). >
> I think your argument boils down to whether a outer join shorthand should
be
> allowed without a 'fill' specification. I would not include such a
shorthand
> in a DBMS, but would not disallow a user from defining their own shorthand
if
> they so wished. Such a user defined shorthand might go as far as using
> information in the catalog recorded against each type that defines whether
the
> default 'fill' should be a value from that type (e.g. 0 for integer), or
from
> a value in the union of the type with some other type. That however starts
to
> get us into the realm of 'meta-programming', whereby in a relational
language,
> the result relation type of a query depends of information not part of the
> database schema per-se.
> The problem with any default fill specification, is that it is making
> assumptions about the meaning of a users LEFT JOIN statement. Sometimes 0
is
> correct for a fill, sometimes N/A, NOTYETOCCURED, UNK, BLANK, -1, etc. It
all
> depends on the *actual* query that the user wants, a LOJ with a default
fill
> specification is making *a guess* at what query the user actually wants,
and
> unless the user sees the macro expanded (or has memorised the defaults),
they
> won't know what query they are getting the result to.
>
> To summarise, using left outer joins without explicitly specifying the
fill,
> (using say COALSECE in SQL), is bad practice. Users should be forced to
> specify a fill to get them to fully understand (i.e. fully specify) their
> query.

I have to disagree. I can not see why a non-filled outer join should give any surprises to the user. To argue based on the SQL way of handling nulls is not reasonable - their way of handling nulls is not consistent in the first place.

>

> If I review SQL that has un-COALESCEd LOJ columns, I send it back to be
> corrected.
>

> Regards
> Paul Vernon
> Business Intelligence, IBM Global Services
>
> Received on Thu Apr 24 2003 - 18:51:31 CEST

Original text of this message