Re: boolean datatype ... wtf?

From: Hugo Kornelis <hugo_at_perFact.REMOVETHIS.info.INVALID>
Date: Tue, 05 Oct 2010 00:53:28 +0200
Message-ID: <1nlka65h8muegogddo9k465jb0koh3i5cv_at_4ax.com>


On Sat, 02 Oct 2010 10:42:50 -0300, Bob Badour wrote:

>Hugo Kornelis wrote:
>
>> On Fri, 1 Oct 2010 07:51:37 -0700 (PDT), Paul Mansour wrote:
>>
>>>Erwin, suppose then that we do want to explicity take "unknown" into
>>>account with two checkboxes as you say, or I think better yet, with
>>>three mutually exclusive radio buttons, to explicity specifiy Yes,
>>>No, and Unknown or Male,Female and Unkown, or what have you. How then
>>>would you model this? Is this a column with three allowable values,
>>>or a column with two values where no corresponding row indicates the
>>>third option? Alternatively, there could be just two radio buttons
>>>indicating, say, Male and Female, and neither one is pre-selected, so
>>>it is clear whether or not the question was answered. Same question,
>>>how is this then modeled? I think Tony Andrews has a good question,
>>>and it's more fundamental than the GUI.
>>
>> The most logical choices, IMO, are:
>>
>> 1) A nullable column with two possible values. Whether boolean or
>> CHAR(n) and a CHECK constraint for the legal values is immaterial; any
>> column with two possible values is basically a boolean in disguise
>> (except that the DBMS doesn't hand you the algebraic operations on a
>> platter, you have to explicitly code the ones you need).
>>
>> 2) For the NULL haters, that same column (but now not nullable) in a
>> seperate table, so you can avoid the NULL by omitting the row there.
>>
>> (And of course, the NULL haters who also don't like extra tables will
>> simply add a third value to the CHECK constraint and add the extra logic
>> in their queries to make that third value act like NULL, so they can
>> think that they have avoided the issue by simply renaming it).
>>
>> Best, Hugo
>
>Hi Hugo,
>
>What makes either of those choices logical?

Hi Bob,

Sometimes, the simplest looking questions work out to be the hardest to answer. It all made so much sense to me when I wrote this (and still does), that I find it hard to explain why this appears logical to me.

Pauls question was about how to model a situation where it is required to represent three possibilities: male, female, or gender unknown.

My first observation is that these three possibilities do not come from a single domain. Male and female are values from the domain of genders; unknown is not. So it would be wrong to define a domain with three values to represent the three possibilities.

But if unknown is not in the same domain as male and female, then what is it? Assuming that Pauls hypothetical example is about gender data, it is not a data value, but a condition where no data value is present. The absence of any data value. Which is almost a verbatim quote from the definition of "the null value" in the ANSI standard for SQL.

Now, NULL is not the same as unknown. NULL is the absence of any data value. Unknown is the absence of any data value for a specific reason. Unknown data should in an SQL database always be represented as NULL, but a NULL does not always represents unknown data. However, in the case proposed by Paul, I know that there are only three possible states: male, female, or gender unknown. So in this specific case, a NULL does indeed represent gender unknown, since there are no other scenarios that allow the data to be missing.

With these observations, the rest seems obvious to me. Unknown can not be represented in the same domain as male and female, but NULL can (since NULL is part of any domain in an SQL database). The domain now consists of male, female and NULL (representing the situation where the gender is unknown). This domain quickly maps to a data type and a CHECK constraint to allow only two values, and no NOT NULL constraint.

That's the logic of the first choice. But I know that this group has a lot of visitors who feel that NULL should not be in a relational database. Their response when challenged on how to represent missing data is always to break optional attributes into seperate tables, in a 1:(0-1) relationship with the original table, where the attribute now is mandatory. That simple step to avoid NULLs in tables is what brought me to the second choice.

Best, Hugo Received on Tue Oct 05 2010 - 00:53:28 CEST

Original text of this message