Re: So what's null then if it's not nothing?

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Fri, 18 Nov 2005 01:23:37 +0100
Message-ID: <ro4qn1hmsgulnjfvtbp09834mh1qc546v3_at_4ax.com>


On 16 Nov 2005 22:03:14 -0800, michael_at_preece.net wrote:

>I've been accustomed to thinking of things either having a value or
>not. If something has no value then, to me, its value is null. Its
>value is an empty string (whatever "it" is). Different to having a
>value of zero. Different to anything with a value.

Hi Mike,

But not different to something with the value '' (aka empty string).

> Now, as I read up on
>SQL, I find that null is supposed to mean "unknown".

You might want to start reading other books then. Null is *not* supposed to mean "unknown". According to the ANSI standard, the "null value" (as it's called there) is "A special value, or mark, that is used to indicate the absence of any data value".

Both Null and Unknown are important issues to understand when using SQL. Allow me to explain. Warning: long post ahead! <g>

First: NULL.

Let me give an example. I'll use tables - not realtional tables, but pen and paper tables. Forget that we're all using computers nowadays, and take a look at the two tables below (use a fixed width font).

 Emp# | First Name | Sex
------+------------+-----

    1 | Joyce      | F
    2 | John       | M

 Emp# | Age
------+-----

    2 | 47

What can you tell me about Joyce's age? The only correct answer to that question can be "nothing" - there is no information about Joyce's age in the tables above, and since you don't know her personally, there's nothing you know about her age.

Now let's combine the tables. Note that the information doesn't change, only the way we preresent it does:

 Emp# | First Name | Age | Sex
------+------------+-----+-----

    1 | Joyce      |     | F
    2 | John       |  47 | M

Since we didn't change the information, only the representation, we should still know the exact same thing about Joyce's age: nothing. This is indicated by the empty cell in the Age column of the row for Emp# 1.

This works for numeric columns, such as Age - since a blank cell clearly doesn't hold a valid number, it's obvious that the cell is indeed empty. But we have a problem with alphanumerical information. Damien already gave a greata example in his reply: the middle initial. Let's add this as an extra column to our table.

 Emp# | First Name | Age | Middle Initial | Sex

------+------------+-----+----------------+-----
    1 | Joyce      |     |                | F
    2 | John       |  47 |                | M

For reasons unrelated to this post, I don't know anything about John's middle initial - he might or might not have one, and if he does, I'm not privy to it. So I left the Middle Initial cell for John (Emp# 2) empty.

On the other hand, I *know* that Joyce has no middle initial. I asked her yesterday, and she confirmed. Since I hate forgetting personal details, I immediately wrote down an empty string in Joyce's Middle Initial cell.

But how will I be able to see the difference between John's empty cell and Joyce's cell that contains the empty string? I guess I'll have to use color codes, or something like that.

In relational databases, that is solved by Null - a marker that explicitly denotes that there is no value in a column, to distinguish it from a column that holds an empty string, a numeric value of zero or a date value of "beginning of time" (or whatever).

Next: UNKNOWN.

Null is not the same as Unknown. Many people like to say that it is, but it's not true. A Null in a database is often a result of something begin unknown. I already admitted that I don't know John's middle initial. So you might say that John's middle initial is unknown (to me at least; I'm sure John does know), and that is represented by Null.

But nowhere have I written Joyce's age is unknown. It might be - but there might be other reasone for her age being absent from the tables above. Maybe it's a permissions issue (she's working for another department; I've got full access to personal details of my department's personnel, but limited acces for other employees). Maybe the company's director is an old-fashioned guy who considers it rude to ask a lady her age - maybe I do know too well how old Joce is, but am precluded by company policy fromm entering it in the table. Or maybe you were right after all, and I simply don't know Joyce's age. Who'll tell??

Null is the absence of a value. There can be many reasons for this absense. If it's important to know the reason, then this reason should be modelled as a seperate piece of information. Since there is no such data in the example tables above, it's clear that for this company, it's not important to know WHY nothing is known about Joyce's age.

But since Joyce's age is Null, her age IS unknown *within the context of the database*. Any program that uses data from these tables, and any person that looks at these tables and has no further knowledge, will know nothing about Joyce's age. For those programs and people, Joyce's age will be not known (but only for them - so this is not the same as saying that Joyce age is unknown).

I deliberately wrote "not known" above, to avoid confusion with "Unknown". As asid, Unknown is not Null. Unknown is a boolean value that extends the well-known True and False boolean values.

Given the tables above, would you say that the statement "John is older than Joyce" is true or false?

The only correct answer to that can be "I don't know". Since you know nothing about Joyce's age, you can't judge whether John is older than her or not.

So can we say that the answer to the question above should be Null? No, we can't. Null means the absence of a value. But we just discusssed the statement about John's and Joyce's age, so there now IS an answer to the question if John is older. The answer would have been Null before I wrote these paragraphs, but the question has been answered since. Not with "yes" (True) or "no" (False), but with "beats me" (Unknown).

And that's exactly how the ANSI standard defines Unknown - compare any value with Null, and the result will never be True or False, but always Unknown.

>I can't
>imagine having to write code where the "if a=b then result=true else
>result=false" construct won't work - according to what I'm reading, if
>either a or b is null then I should be setting result to unknown
>instead.

Well, you don't have to write such code. Just buy and install a decent mainstream RDBMS product. They have this logic embedded.

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
Received on Fri Nov 18 2005 - 01:23:37 CET

Original text of this message