| Oracle FAQ | Your Portal to the Oracle Knowledge Grid | |
Home -> Community -> Usenet -> comp.databases.theory -> Re: How to cope with missing values - NULLS?
Peter Koch Larsen wrote:
> Costin Cozianu <c_cozianu_at_hotmail.com> wrote in message news:<b7paca$3b0o0$1_at_ID-152540.news.dfncis.de>... >
>
>
> I am not sure I understand you completely. Let us say that we have
> defined a view as an outer join: columns col1 and col2 are "common,
> but col3 and col4 might be missing. Now let's query that view - we are
> only interested in tuples, where col4 is larger than 10: SQL gives us
> something like
>
> SELECT * FROM ojview WHERE col4 > 10
>
> How would you phrase such a query in your (hypothetical) relational
> system?
> The point I'm getting at is that it - from the users point of view -
> would be nice to be able to formulate a query in a style where she can
> refer to col4 directly. A sum type on the tuple as a whole would
> possibly complicate matters? As an example I imagine a query in the
> style of:
>
> SELECT * FROM ojview WHERE {is_extended} AND col4 > 10
>
> Where {is_extended} somehow determines that col4 exists. If it is
> something like this You have in mind, this would have consequences not
> only for the user but also for the optimizer: In what order should it
> evaluate the restricting conditions? I am much in doubt if removing
> NULLs in this situation would be feasible.
>
> Kind regards
> Peter Koch Larsen
First of all, I think that optimizers will work much better and more important, *correctly* in a 2 valued logic.
With regards to the above query, I'd rather propose a syntax similar with actual languages that use sum types
SELECT ... WHERE
MATCH restOfTheRecord WITH Empty -> false | Nonempty r -> (r.col4 > 10)
Or even simpler synctactic forms can be envisioned, like defining an anonymous boolean function:
FUNCTION (FullRecord r) -> ( r.col4 > 10 )
| _ -> false
Where _ matches any type and | si a type case or structure case discriminator.
Yes, it will be a little bit more to write for the user , but it will work correctly, while queries with NULLs work on a logic of their own.
Try to SELECT WHERE col4 < 10 OR col4v >= 10
The complexity stems from the missing information itself which is not a trivial thing to deal with, and from the primitive mehcanisms of SQL.
What SQL does with NULL is that it enforces a relatively broken default where users really need the power and flexibility and logic to deal with on a case by case basis semantically.
NULL is currently use to represent missing information, NOT APPLICABLE, or other ad-hoc tricks like INFINITY for dates (as per Joe Celko's recent examples). It is obvious that a broken three valued logic cannot match all these semantics with default rules.
Let's take for example Joe Celko's representation of date time intervals:
(begin_date DATE, end_date DATE)
where NULL in the begin_date "means" from "beginning of time" and end_date NULL means "to the end of time". With sum types the user can easily define a proper data type for this information, and override the default "<" order operator to properly account for the order relation.
What is even better, we can easily define begin_date INCLUSIVE/EXCLUSIVE:
type TIME_POINT=
BEGIN_OF_TIME
| INCLUSIVE of DATE
| EXCLUSIVE of DATE
| END_OF_TIME
function '<' (point1, point2) = match point1, point2 with
BEGIN_OF_TIME, BEGIN_OF_TIME -> false
| BEGIN_OF_TIME , _ -> true
| _ , BEGIN_OF_TIME -> false
... equivalent rules for
| INCLUSIVE d1, EXCLUSIVE d2 -> d1 <= d2
| EXLCUSIVE d1, EXCLUSIVE d2 -> d1 < d2
| INCLUSIVE d1, INCLUSIVE d2 -> d1 < d2
| EXCLUSIVE d1, INCLUSIVE d2 -> d1 > d2
And now all the conditions related to start_date, end_date using TIME_POINT can be easily and logically expressed in terms of the '<' operator.
We can also imagine that for example end_date may come from a join, with the semantic that if no end_date is specified in the related table, then we mean EWN_OF_TIME. Then it is easy for the administrator to create an outer join where instead of NULL we'll have the corresponding END_OF_TIME. With NULLs in place trying to account for proper interval conditions (define 1 interval included in another, or 2 intervals intersect), and more so, using outer joins is at best very inelegant and error prone.
What I'm trying to get at, is that missing information is a very complex problem in itself, first of all, and we can't sweep it under the use of NULL easily. At least current practice rather proves this assertion, and the confusion that made it even to the SQL committee is yet one more evidence of that.
Second using special markers is a semantic problem that has to be dealt on a case by case basis, and no matter how many special markers we invent (Date claims that at one moment 6 special markers were proposed), they will not math the real problems of end users under one size fits all solution.
Therefore we can easily let the DBMS engine work only in 2 valued logic, it is better for the optimizers, foir DBMS writers, for correctness and predictability of behavior, so on so forth. And then make it easy for users to define their own special types with markers for UNKNOWN, N/A, INFINITY and so on forth, using a decent type system.
best regards,
Costin Cozianu
Received on Sat Apr 19 2003 - 14:50:48 CDT
![]() |
![]() |