Re: How to cope with missing values - NULLS?

From: Costin Cozianu <c_cozianu_at_hotmail.com>
Date: Sat, 19 Apr 2003 12:50:48 -0700
Message-ID: <b7s926$42jf5$1_at_ID-152540.news.dfncis.de>


Peter Koch Larsen wrote:

> Costin Cozianu <c_cozianu_at_hotmail.com> wrote in message news:<b7paca$3b0o0$1_at_ID-152540.news.dfncis.de>...
> 

>>Peter Koch Larsen wrote:

>>
>>>How to cope with missing values - NULLS?
>>>The fundamental problem, to me, is how to cope with missing values. I
>>>agree that you have no need for them in base tables, but what happens
>>>in the situation where you have an outer join? If outer joins are to
>>>be provided by the DBMS - and I believe they are too useful to be left
>>>out - there must be some means of denoting that a field has no value.
>>>There are three ways to go:
>>>
>>>1) Use a NULLABLE type. This could be an extension like the sum type
>>>mentioned (I have not read that thread yet so I cannot comment). I
>>>see no problems with that approach.
>>>2) Use a DEFAULT value. This could be feasible in some situations, but
>>>in many it would not. The most obvious example would be if the field
>>>in question is boolean. Which default could you possibly use?
>>>3) Use a marker that exists independently of the field. This approach
>>>has so many flaws that I believe it to be infeasible. For one thing,
>>>what value should we store in that field? If we have ADTs we must be
>>>careful, as the ADT might have constraints that, if not satisfied,
>>>invalidate the type - a CIRCLE, for example, could be required to
>>>have a non-negative radius. Even non-ADT types might have integrity
>>>constraints: think about a floating point number stored in IEEE
>>>format.
>>>
>>>My personal conclusion is that we are stuck with NULLS - if not in the
>>>SQL sense then at least in the sense that we are to have some kind of
>>>sum type. And if we are stuck with such a type, then the only sensible
>>>thing is to have it standardised by the DBMS - how else are we going
>>>to let the DBMS perform the outer join by itself?
>>>
>>>Kind regards
>>>Peter
>>
>>
>>I think you are rushing to conclusions.
>>
>>If we can use sum types for individual columns, we surely can use sum
>>types for tuple types, so here we have your OUTER JOIN without any NULL.
>>
>>Besides, coming from a relational-theory perspective, I'm not sure
>>whether we should regard an outer join as a relation over the sum type
>>of the corresponding tuples (no NULLs involved), or just as a
>>collection of relations.
>>
>>From a host language API point of view, I'd prefer the latter.
>>
>>Another alternative is to regard the tuple type for the outer join as
>>
>> col1 * col2 * ... colk * restOfTheJoin
>>
>>where col1 to colk are the common columns and restOfTheJoin is the outer
>>part of the outer join and has a sum type
>> Nothing | col(k+1) * ... coln
>>
>>Best regards,
>>Costin Cozianu
> 
> 
> I am not sure I understand you completely. Let us say that we have
> defined a view as an outer join: columns col1 and col2 are "common",
> but col3 and col4 might be missing. Now let's query that view - we are
> only interested in tuples where col4 is larger than 10. SQL gives us
> something like
> 
>    SELECT * FROM ojview WHERE col4 > 10
> 
> How would you phrase such a query in your (hypothetical) relational
> system?
> The point I'm getting at is that - from the user's point of view - it
> would be nice to be able to formulate a query in a style where she can
> refer to col4 directly. A sum type on the tuple as a whole would
> possibly complicate matters? As an example I imagine a query in the
> style of:
> 
>    SELECT * FROM ojview WHERE {is_extended} AND col4 > 10
> 
> Where {is_extended} somehow determines that col4 exists. If it is
> something like this you have in mind, it would have consequences not
> only for the user but also for the optimizer: in what order should it
> evaluate the restricting conditions? I am much in doubt whether
> removing NULLs in this situation would be feasible.
> 
> Kind regards
> Peter Koch Larsen

First of all, I think that optimizers will work much better and, more importantly, *correctly* in a two-valued logic.

With regard to the above query, I'd rather propose a syntax similar to that of existing languages that use sum types:

SELECT ... WHERE

	MATCH restOfTheRecord WITH
		Empty -> false
		| Nonempty r -> (r.col4 > 10)

Or even simpler syntactic forms can be envisioned, like defining an anonymous boolean function:

     FUNCTION (FullRecord r) -> ( r.col4  > 10 )
		| _ -> false

where _ matches any case and | is a type-case or structure-case discriminator.

Yes, it is a little more for the user to write, but it will work correctly, while queries with NULLs operate on a logic of their own.
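To make this concrete, here is a minimal sketch in Rust, whose enums are sum types; the names (OjRow, Rest) and the i64 column types are made up for illustration, not part of any actual proposal:

```rust
// An outer-join row: the common columns plus a sum type for the
// optional extension (the "restOfTheRecord" from the post).
#[derive(Debug)]
enum Rest {
    Empty,                               // no matching row on the outer side
    Nonempty { col3: i64, col4: i64 },   // the extension columns are present
}

struct OjRow {
    col1: i64,
    col2: i64,
    rest: Rest,
}

// The WHERE clause from the post, as a total two-valued predicate:
// MATCH restOfTheRecord WITH Empty -> false | Nonempty r -> r.col4 > 10
fn wanted(row: &OjRow) -> bool {
    match &row.rest {
        Rest::Empty => false,
        Rest::Nonempty { col4, .. } => *col4 > 10,
    }
}

fn main() {
    let rows = vec![
        OjRow { col1: 1, col2: 2, rest: Rest::Nonempty { col3: 0, col4: 42 } },
        OjRow { col1: 3, col2: 4, rest: Rest::Empty },
        OjRow { col1: 5, col2: 6, rest: Rest::Nonempty { col3: 0, col4: 7 } },
    ];
    // the SELECT: keep col1 of every row the predicate accepts
    let selected: Vec<i64> = rows.iter().filter(|r| wanted(r)).map(|r| r.col1).collect();
    println!("{:?}", selected);
}
```

The compiler forces the Empty case to be handled, so the decision about missing rows is visible in the query instead of hidden in NULL rules.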

Try SELECT ... WHERE col4 < 10 OR col4 >= 10: rows where col4 is NULL silently disappear, even though the condition looks like a tautology.
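Under SQL's three-valued logic both disjuncts evaluate to UNKNOWN for a NULL col4, so the row is filtered out. With a sum type the same predicate stays two-valued and the treatment of missing values is an explicit choice; a tiny Rust sketch (Option standing in for the outer part, purely for illustration):

```rust
// The "tautology" col4 < 10 OR col4 >= 10, written over an explicit
// Option: the None case is a visible decision, not a hidden NULL rule.
fn tautology(col4: Option<i64>) -> bool {
    match col4 {
        None => false,                   // excluded by choice, stated in the query
        Some(v) => v < 10 || v >= 10,    // always true for present values
    }
}

fn main() {
    assert!(tautology(Some(3)));
    assert!(tautology(Some(10)));
    assert!(!tautology(None));
    println!("every case handled explicitly, in two-valued logic");
}
```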

The complexity stems from the missing information itself, which is not a trivial thing to deal with, and from the primitive mechanisms of SQL.

What SQL does with NULL is enforce a relatively broken default, where users really need the power, flexibility and logic to deal with missing information on a case-by-case basis, semantically.

NULL is currently used to represent missing information, NOT APPLICABLE, or other ad-hoc tricks like INFINITY for dates (as per Joe Celko's recent examples). It is obvious that a broken three-valued logic cannot match all these semantics with default rules.

Let's take for example Joe Celko's representation of date time intervals:

(begin_date DATE, end_date DATE)

where NULL in begin_date "means" from the "beginning of time" and NULL in end_date means "to the end of time". With sum types the user can easily define a proper data type for this information, and override the default "<" operator to properly account for the order relation.

Even better, we can easily make begin_date and end_date INCLUSIVE or EXCLUSIVE:

type TIME_POINT=

	  BEGIN_OF_TIME
	| INCLUSIVE of DATE
	| EXCLUSIVE of DATE
	| END_OF_TIME

function '<' (point1, point2) = match point1, point2 with
	  BEGIN_OF_TIME, BEGIN_OF_TIME -> false
	| BEGIN_OF_TIME, _ -> true
	| _, BEGIN_OF_TIME -> false
	... equivalent rules for END_OF_TIME ...
	| INCLUSIVE d1, EXCLUSIVE d2 -> d1 <= d2
	| EXCLUSIVE d1, EXCLUSIVE d2 -> d1 < d2
	| INCLUSIVE d1, INCLUSIVE d2 -> d1 < d2
	| EXCLUSIVE d1, INCLUSIVE d2 -> d1 < d2

And now all the conditions on begin_date and end_date using TIME_POINT can be easily and logically expressed in terms of the '<' operator.

We can also imagine that, for example, end_date may come from a join, with the semantics that if no end_date is specified in the related table, then we mean END_OF_TIME. Then it is easy for the administrator to create an outer join where instead of NULL we'll have the corresponding END_OF_TIME. With NULLs in place, trying to account for proper interval conditions (whether one interval is included in another, or whether two intervals intersect), and more so using outer joins, is at best very inelegant and error-prone.
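Carrying the sketch through in Rust (again illustrative: integer day numbers stand in for DATE, and the helper names are made up), the total order makes those interval conditions one-liners:

```rust
// TIME_POINT from the post as a Rust enum; Exclusive(d) is read as the
// point just after day d, which is what makes Inclusive(d) < Exclusive(d).
#[derive(Clone, Copy, Debug)]
enum TimePoint {
    BeginOfTime,
    Inclusive(i64), // day number stands in for DATE
    Exclusive(i64),
    EndOfTime,
}
use TimePoint::*;

// strict total order '<' on time points
fn lt(p1: TimePoint, p2: TimePoint) -> bool {
    match (p1, p2) {
        (BeginOfTime, BeginOfTime) => false,
        (BeginOfTime, _) => true,
        (_, BeginOfTime) => false,
        (EndOfTime, _) => false,
        (_, EndOfTime) => true,
        (Inclusive(d1), Exclusive(d2)) => d1 <= d2,
        (Exclusive(d1), Exclusive(d2)) => d1 < d2,
        (Inclusive(d1), Inclusive(d2)) => d1 < d2,
        (Exclusive(d1), Inclusive(d2)) => d1 < d2,
    }
}

// '<=' derived from the total order
fn le(p1: TimePoint, p2: TimePoint) -> bool {
    !lt(p2, p1)
}

// interval `inner` is contained in interval `outer`
fn contains(outer: (TimePoint, TimePoint), inner: (TimePoint, TimePoint)) -> bool {
    le(outer.0, inner.0) && le(inner.1, outer.1)
}

// two intervals intersect
fn intersects(a: (TimePoint, TimePoint), b: (TimePoint, TimePoint)) -> bool {
    le(a.0, b.1) && le(b.0, a.1)
}

fn main() {
    // "end_date NULL" from the joined table becomes an explicit EndOfTime
    let open_ended = (Inclusive(100), EndOfTime);
    let week = (Inclusive(110), Exclusive(117));
    assert!(contains(open_ended, week));
    assert!(intersects(week, (Inclusive(116), Inclusive(200))));
    println!("interval logic with no NULLs involved");
}
```

No special NULL rules enter the picture: every comparison goes through the one user-defined order, so the outer-join case behaves exactly like any other value.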

What I'm trying to get at is that missing information is a very complex problem in itself, first of all, and we can't easily sweep it under the NULL rug. At least, current practice rather proves this assertion, and the confusion that made it all the way into the SQL committee is yet more evidence of that.

Second, using special markers is a semantic problem that has to be dealt with on a case-by-case basis, and no matter how many special markers we invent (Date claims that at one point six special markers were proposed), they will not match the real problems of end users under a one-size-fits-all solution.

Therefore we can easily let the DBMS engine work only in two-valued logic - it is better for the optimizers, for DBMS writers, for correctness and predictability of behavior, and so on - and then make it easy for users to define their own special types with markers for UNKNOWN, N/A, INFINITY and so forth, using a decent type system.

best regards,
Costin Cozianu

Received on Sat Apr 19 2003 - 21:50:48 CEST
