Re: 3vl 2vl and NULL

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Fri, 09 Dec 2005 00:40:06 +0100
Message-ID: <36dhp1dv7d9r2r2ujp4qeor3cnunbsmp8m_at_4ax.com>


On 6 Dec 2005 17:59:07 -0800, michael_at_preece.net wrote:

>Look Hugo - I think we're each getting sick of this now. Please look at
>the discussion between Codd & Date, and in particular the discussion
>about "not known" and "known not".

Hi Mike,

Yes, I've read the article. In my opinion, both Codd and Date got it wrong. But Date got it more wrong than Codd.

Codd's first arguments are good. In fact, he already mentions an argument that is very close to what will be my main argument when I come to my opinion:

'1. The meaning of the fact that a value is missing from some part or column of a relational database is quite different from the meaning of a value that is legitimate within that part or column.'

In Date's reaction ("Date's Rebuttal I") he writes:

'A (Codd-style) 3VL system supports just one type of null, "value unknown"'.
He then presents an example (which is flawed, becuase the example presented calls for a subtype) and concludes with: 'Joe's commission is quite likely to be misrepresented as "value unknown" (it should of course be "value does not apply").'

Interesting here is that Date changes Codd's definition "value is missing" to something that suits his argument better ("value unknown"). He doesn't explicitly mention this change, nor does he defend this change with as much as one word.

After misrepresenting Codd's original argument and "proving" the need for a second type of NULL on top of that misrepresented first type of NULL, Date then goes on to prove that it doesn't stop here.

In Codd's reaction, he appears to have missed that Date built his argument on a misrepresentation of Codd's words. He accepts Date's point that "value unknown" and "value does not apply" are different and should be represented by seperate marks. 3VL has now become 4VL. Codd seems to think that it can stop there, but I'm afraid that Date is right: once you start to make the distinction between different kinds of NULL, there's no end to it.

In my opinion, Codd should have sticked to his original model. NULL is not "value unknown", not "value does not apply". NULL is "no value".

"The age of Aunt Marge is 47 years" is a fact. "The age of Uncle Vernon is unknown" is also a fact - but it's not the same fact type as the former. And since they are different fact types, they should be stored in differrent columns. Attempting to store them in the same column would be a violation of 1NF.

Or, to put it differently, and almost in the words of Codd:

The meaning of the fact that (and the reason why) a value is missing from some part or column of a relational database is quite different from the meaning of a value that is legitimate within that part or column. And since the meanings are different, they can never be combined in the same column.

> Don't try to tell me
(snip)

I'll tell you, or anybody, how I think about this subject. If you rather not read my opinions, skip my messages or killfile me.

> Much of the discussion between
>Codd & Date (see link to article in dbdebunk in the other thread)
>concerned the *fact* that NULL can be interpreted to mean things other
>than simply "unknown".

I won't deny that NULL *can* be interpreted to mean other things than simply "no value here". Your "unknown" is one of them.

For a lot of NULLs in my tables, I know why there's no value in that cell. I know it, because I understand the real-life entities and their properties that are represented by the values in my database. But that understanding is not part of the NULL marker in the DB itself.

The value "2" in a column with datatype INTEGER can also be interpreted in many ways. And in most cases, someone with domain knowledge will know the correct interpretation. And yet, nobody is arguing that "2 years" should be encoded in a different way from "2 children".

We agree that NULL *can* be interpreted to mean many things. I add to this that the user of the database will usually (though probably not always) know which interpretation is correct, based on domain knowledge and/or conclusions that can be drawn from other values in the database.

But the database itself can not make any of these interpretations. For the database, NULL should never mean anything more than "no value here".

> If I look at my empty mug I can truthfully say
>that there is an absence of any liquid in it - because it's empty.

And that fact can truthfully be represented in a column MugEmpty with domain {'Yes', 'No'}, or in a column MugContents with a numeric domain that stores the amount of liquid (in cc) in your mug.

> That
>does *not* mean I don't know what's in it. I don't have to look for any
>other facts. It is, very plainly and obviously, empty. If I have a rule
>that I only store data (known facts) in a database then I'll know that
>where I have a NULL then the NULL represents an absence of any value.
>It most certainly will not signify that I have an unknown value. It
>will signify that there is no value where one would normally be
>expected.

Exactly! That's what I've been saying all the time: NULL represents the absence of a value. Since the question "what is in your mug" can't be answer currently, there is no value to store in the cell that is reserved to hold this answer. It'll be NULL - no value here. This cell in the database is empty.

> It will signify that the "thing" is empty.

If "thing" refers to the cell in the database: indeed.

If "thing" refers to your mug: NO. There can be many reasons why the answer to the question "what is in your mug" is not in the database. Maybe you refused to answer. Maybe I forgot to ask. Maybe you didn't know. Maybe the mug was empty. Maybe you told me, but I forgot. Maybe you told me, but I didn't understand. Or it was top secret, and I was not allowed to store it in the database. Need I go on?

If the MugLiquid column is NULL, it signifies only that there is no value in the DB that represents the liquid currently in the mug. If the designer of the table thought I'd need to know *why* this value is missing, he/she should have added one or more extra columns for this purpose.

Unless, of course, there is already a column that can be used to find out why this cell is NULL. If MugLiquid is NULL *and* the value of MugContents is 0 cc, then any human user can conclude that the MugLiquid column is NULL because the mug is empty and the question what's in it therefore doesn't apply. And any developer can code that logic in an application or a query to make the end user believe that the database can conclude this as well. But the database doesn't - the developer does.

> I will be keeping
>my database "pure", in that I will only be recording data (known
>facts).

And so will I. The structure of a SQL database, with it's fixed rectangular tables, will sometimes force me to use a NULL marker to depict that I have no facts to populate that particular cell with. But that's representation. In the end, both your Pick DB and my SQL DB will have recorded the very same facts.

> Any questions I ask of my database will provide absolute yes/no
>known truth - or it will simply and honestly tell me that it has
>insufficient data.

And so will mine (of course - since both DB's store the exact same collection of facts).

> If, for instance, I ask for information on countries
>with or without Weapons of Mass Destruction, there will be some
>countries - like the USA - with known-non-empty values for WMD
>attributes, some - like New Zealand - with known-empty-NULL values for
>WMD attributes, and some with no data at all - because there simply is
>no reliable information.

I have no idea what "knwon-empty-NULL values" is supposed to mean. But if you ever have to work with a SQL database, then PLEASE remember to store New Zealand's amount of nuclear warheads as 0, not as NULL. Using NULL to say that there definitely are none might be valid in Pick (I'll just have to trust you on that, as I don't know PicK), but it's absolutely not valid in an SQL table. In SQL, storing NULL means removing the fact "New Zealand has 0 nuclear warheads" from the table. That would place New Zealand -incorrectly!- in the group of countries that have no data at all (either because there is no reliable information, or for whatever other reason - I don't kinow, though the users of this DB will).

And if either you, with your Pick DB, or me, with my SQL DB, have to build a report, then we should make absolutely sure that we get the requirements straigth - does Mr. President want a list of all countries that we *know* to have WMD, or does he wish to also include all countries that *might* have WMD. Your query will look different from mine. But if we don't mess up, we'll present the same list to Mr. President.

> I make the point again - there is no place in
>a database for unknown values - and pretending otherwise will
>inevitably lead to problems, sometimes major problems.

Agreed.

> MVL is
>dangerous.

MVL is like a bread knife - dangerous in unskilled hands, but invaluable in skilled hands.

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
Received on Fri Dec 09 2005 - 00:40:06 CET

Original text of this message