Re: boolean datatype ... wtf?

From: Hugo Kornelis <hugo_at_perFact.REMOVETHIS.info.INVALID>
Date: Wed, 06 Oct 2010 23:47:56 +0200
Message-ID: <9gnpa69fo7r8qjijurommgqdna5svo7vtg_at_4ax.com>


On Wed, 6 Oct 2010 11:14:17 -0700 (PDT), Brian wrote:

>I think you misunderstand what I'm saying: whenever there is no value,
>either there isn't supposed to be a value, or there is supposed to be
>a value but it hasn't been supplied.

No I did not misunderstand. Both "there is not supposed to be a value here" and "there should be a value here but it has not been supplied" are fine examples of reasons why a value is missing in the database population. Note that with "missing", I do not mean that there should be a value but isn't, but just there is no value. (Not being a native English speaker, I don't know which association is prevalent when English readers read "missing").

My point is that this distinction is not always relevant. I'll use an example to explain. Let's say that the Customers table has a nullable column "Gender", and a CHECK constraint that allows only the values Male and Female. A NULL in this column can have two causes - either the customer is not a natural person but a legal person (there is not supposed to be a value for its gender), or the customer is a natural person but never specified his/her gender (there is supposed to be a value but it hasn't been supplied).

Many documents I read on or linked from sites such as dbdebunk.com or thethirdmanifesto.com make this observation and then assume that, because there are multiple situations that lead to a missing, one HAS to somehow distinguish between those situations. But that need not be necessary; that depends on the business. Only the domain expert can tell us if, for a specific application, this distinction is important.

If it is, then this has to be modeled as an extra predicate. (Okay, in this specific example, the extra predicate may not be required, as it can be derived from the predicate that distinquishes natural persons from legal persons - if such a predicate even exists in the model. But in other situations, the extra predicate would be required.) That extra predicate also has to allow for missing values (as there is no value applicable for customers where the gender is recorded).

> In other words, either the
>predicate of the atomic formula that expresses the assertion does not
>have a place for an individual term that corresponds to the nullable
>column name, or there is a place but an individual variable instead of
>an individual constant occupies that place.

My background is NIAM - the Dutch counterpart of ORM (Object Role Modeling). NIAM centers around Fact Types and their "readings". These concepts are very similar to predicats, but there might be subtle differences I am unaware of.
In NIAM, a missing fact can not be read. Regardless of why the fact is missing. For the Fact Type gender, the reading form would be Customer "The customer identified by customernumber <CustomerNumber> has the <Gender> gender". For reading an individual fact, you would replace <CustomerNumber> with the identifier of a customer, and <Gender> with his or her gender, to get readings such as "The customer identified by customernumber 125 has the male gender", or "The customer identified by customernumber 8623 has the female gender". For customer 17 (a legal person), this fact can not be read, as there is no value in the domain of <Gender> that forms the reading "The customer identified by customernumber 17 has no gender". But for customer 229 (a natural person whose gender is not recorded), we can't read this fact either, as there is also no value in the domain of <Gender> that forms the reading "The customer identified by customernumber 222 has a gender but we don't know which".

Your text above suggests (to me) that predicates in the relational model accept variables as well as values, so that a predicate that is similar to the reading form I used above could be instantiated to read "The customer identified by customernumber 222 has a *variable* gender". If that is a correct interpretation, it sheds a new light on many things I have read. Please confirm if I am not misinterpretating you.

> Why there is a variable
>instead of a constant in that place doesn't change the fact that the
>predicate still has a place for that term. So whether there is
>supposed to be a value determines the atom's predicate.

Yes, I get that. This is a problem you always get when you combine several atomic fact types (falling back to the NIAM terminology) in a single table. NIAM and ORM avoid this problem by working with elementary fact types. If there is no information about the gender of a customer, there is no fact for that customer in the gender fact type.

The FTD notation that prof Nijssen (one of the founders of NIAM) developed later removes the requirement to work on elementary fact types, and then solves the predicate problem by supplying multiple predicates for each FTD (which maps to a table in the relational model). So an FTD with one optional role (column) would have two predicates, one that includes this role and one that does not.

And in relational database, the answer is to create seperate tables for optional attributes, or even to limit tables to at most one non-key column (which maps right back to the elementary fact types used in ORM and NIAM).

> It follows
>that since it can't be determined just from the absence of a value
>whether there is supposed to be a value, it must be explicitly stated
>that there is or is not supposed to be a value.

If, and only if, that distinction is relevant to the business. Otherwise you simply don't record the value and you're done.

> If you're only a null
>hater, then only two relvars are needed: one with an additional
>boolean attribute that specifies for each tuple whether there should
>be a value or not, and another relvar for the values that have been
>supplied. If you're also a boolean hater, then the relvar with the
>boolean attribute would have to be split as well.

I still don't see the leap here, nor the connection with hating or not hating booleans.
If the business is not interested in recording is a value is absent because there should not be a value or because it just happens to not be there, then you need either one nullable column, or one extra relvar, with no booleans either way.
And if the business does require you to record this distinction than you either need two nullable columns, or two extra relvars. Again, with no booleans in either situation.

You imply that it's possible to track both values and (for missing values) whether a missing value is applicable or not, in just two relvars with no nulls, but with booleans. I don't see how those relvars would be designed.

Best, Hugo Received on Wed Oct 06 2010 - 23:47:56 CEST

Original text of this message