Re: Does Codd's view of a relational database differ from that of Date & Darwin? [M.Gittens]

From: erk <eric.kaun_at_gmail.com>
Date: 9 Jun 2005 12:36:24 -0700
Message-ID: <1118345784.039295.19650_at_g47g2000cwa.googlegroups.com>


Paul wrote:
> I think the difference is that NULL is supposed to be a DBMS-wide thing,
> whereas Date & Darwen's "missing information" suggestion described in
> "Foundations for Future Database Systems" is explicitly a per-type thing.
>
> Here's a thought though:
>
> Suppose I have a column "col" of a integer type that has an UNKNOWN
> special value using Date & Darwen's scheme.
>
> Then I do "SELECT COUNT(*) FROM rel WHERE col != 2"
>
> Now the DBMS will have to understand the semantics of this UNKNOWN
> special value to return a result.

If you define "!=" for the domain of col such that it always returns true or false, you'll get the logical answer. It's when != returns UNKNOWN that you have a problem.

If you read Codd's (controversial) "Extending the Database Relational Model", he distinguishes between MAYBE THETA-SELECT and TRUE THETA-SELECT (with different syntax for each!), as he understood the problems in coping with missing values (even if he still jumps headlong into it with little exploration of their value).

> Should it return a count ignoring any
> rows that have an UNKNOWN value in that column? I would argue it
> shouldn't, but should return some kind of "unknown" result instead.

The result of the above query depends on whether SELECT is a TRUE SELECT or a MAYBE SELECT or, as in some of SQL's extremities, TRUE-OR-MAYBE. Codd, to his credit, does what most purveyors of 3VL don't - he forces a choice between TRUE and MAYBE, since the implications of confusion are big problems.

> But what type does it use for the answer? The COUNT aggregate function
> presupposes there exists an integer domain (specifially, a non-negative
> integer domain). Maybe here is a case where a DBMS-wide "unknown" marker
> is necessary?

Codd's "solution" uses a DBMS-wide marker - but for values in tuples, there's no good reason that a "typeless" null is needed. UNKNOWN is used in at least 2 different senses: as an "unknown boolean," and as an unknown of a domain type (which could include boolean). If col is of type T, and col==T.UNKNOWN, then "col != 2" evaluates to BOOLEAN.UNKNOWN. Pardon the syntax.

> What about "SELECT SUM(col1) FROM rel where col2 != 5" ?

SUM would have to be defined for UNKNOWN as well, or the entire result is... well, of type col1.domain.UNKNOWN.

> If the column col2 (of type type2) can contain special values like
> unknown, then also the type (type1) of column col1 would need a
> "unknown" special value, *even if it was never used directly as an
> attribute*.

Right.

> Does this show that DBMS-wide NULLs or their equivalents are an
> inevitable consequence of allowing type-specific unknown values?

No, it shows the work that proper handling of nulls leads to, and the limitations that a DBMS-wide null has. I don't think Codd discussed aggregate functions in a MAYBE SELECT.

> Or just that every type must have an "unknown" special value with
> identical semantics to each other (but possibly different representations?)

Not every type needs an unknown. While COUNT is presumably of type T->INTEGER (where T is of any type), SUM is more problematic, and varies with T. But it certainly isn't out of the realm of possibility to want AVG, SUM, and COUNT to work differently for different domains which allow missing values.  

  • erk
Received on Thu Jun 09 2005 - 21:36:24 CEST

Original text of this message