Re: boolean datatype ... wtf?

From: Brian <brian_at_selzer-software.com>
Date: Fri, 8 Oct 2010 23:12:48 -0700 (PDT)
Message-ID: <6ed9d5a3-a7c0-4092-80a1-03708045128a_at_t20g2000yqa.googlegroups.com>


On Oct 6, 5:47 pm, Hugo Kornelis
<h..._at_perFact.REMOVETHIS.info.INVALID> wrote:
> On Wed, 6 Oct 2010 11:14:17 -0700 (PDT), Brian wrote:
> >I think you misunderstand what I'm saying: whenever there is no value,
> >either there isn't supposed to be a value, or there is supposed to be
> >a value but it hasn't been supplied.
>
> No I did not misunderstand. Both "there is not supposed to be a value
> here" and "there should be a value here but it has not been supplied"
> are fine examples of reasons why a value is missing in the database
> population. Note that with "missing", I do not mean that there should be
> a value but isn't, but just there is no value. (Not being a native
> English speaker, I don't know which association is prevalent when
> English readers read "missing").
>
> My point is that this distinction is not always relevant. I'll use an
> example to explain. Let's say that the Customers table has a nullable
> column "Gender", and a CHECK constraint that allows only the values Male
> and Female. A NULL in this column can have two causes - either the
> customer is not a natural person but a legal person (there is not
> supposed to be a value for its gender), or the customer is a natural
> person but never specified his/her gender (there is supposed to be a
> value but it hasn't been supplied).
>
> Many documents I read on or linked from sites such as dbdebunk.com or
> thethirdmanifesto.com make this observation and then assume that,
> because there are multiple situations that lead to a missing, one HAS to
> somehow distinguish between those situations. But that need not be
> necessary; that depends on the business. Only the domain expert can tell
> us if, for a specific application, this distinction is important.
>
> If it is, then this has to be modeled as an extra predicate. (Okay, in
> this specific example, the extra predicate may not be required, as it
> can be derived from the predicate that distinquishes natural persons
> from legal persons - if such a predicate even exists in the model. But
> in other situations, the extra predicate would be required.) That extra
> predicate also has to allow for missing values (as there is no value
> applicable for customers where the gender is recorded).
>
> >  In other words, either the
> >predicate of the atomic formula that expresses the assertion does not
> >have a place for an individual term that corresponds to the nullable
> >column name, or there is a place but an individual variable instead of
> >an individual constant occupies that place.
>
> My background is NIAM - the Dutch counterpart of ORM (Object Role
> Modeling). NIAM centers around Fact Types and their "readings". These
> concepts are very similar to predicats, but there might be subtle
> differences I am unaware of.
> In NIAM, a missing fact can not be read. Regardless of why the fact is
> missing. For the Fact Type gender, the reading form would be Customer
> "The customer identified by customernumber <CustomerNumber> has the
> <Gender> gender". For reading an individual fact, you would replace
> <CustomerNumber> with the identifier of a customer, and <Gender> with
> his or her gender, to get readings such as "The customer identified by
> customernumber 125 has the male gender", or "The customer identified by
> customernumber 8623 has the female gender". For customer 17 (a legal
> person), this fact can not be read, as there is no value in the domain
> of <Gender> that forms the reading "The customer identified by
> customernumber 17 has no gender". But for customer 229 (a natural person
> whose gender is not recorded), we can't read this fact either, as there
> is also no value in the domain of <Gender> that forms the reading "The
> customer identified by customernumber 222 has a gender but we don't know
> which".
>
> Your text above suggests (to me) that predicates in the relational model
> accept variables as well as values, so that a predicate that is similar
> to the reading form I used above could be instantiated to read "The
> customer identified by customernumber 222 has a *variable* gender". If
> that is a correct interpretation, it sheds a new light on many things I
> have read. Please confirm if I am not misinterpretating you.
>

Let me clarify.

The extension of a predicate is the logical disjunction of all and only ground atoms that satisfy the predicate. Under the closed world assumption, the tuples in a relation map one-to-one onto the set composed of exactly those ground atoms. A ground atom is an atomic formula that does not contain individual variables as terms. A relation, therefore does not permit information to be missing. A table that allows nulls, however, is not a relation. Codd differentiated between applicable and inapplicable null markers in his 1990 book, citing applicable and inapplicable as the main reasons for missing information. An inapplicable null indicates that there is no value because there shouldn't be a value, for whatever reason. An applicable null represents a particular but unspecified value. In first-order logic, an unbound individual variable is a term that represents a particular but unspecified individual. An atomic formula consists of an n-ary predicate symbol along with an n-tuple of terms. 0-ary predicate symbols are propositions. Terms can be individual variables or individual constants. In first-order languages that allow function symbols, terms can be individual variables or m-ary function symbols along with an m-tuple of terms. 0-ary function symbols are individual constants. In either case an atomic formula can contain individual variables as terms. The atoms in a table that has two non-nullable columns and one nullable column can have one of two distinct predicates, one that accepts tuples with three terms, and one that accepts tuples with two. Where there is supposed to be a value but it hasn't been supplied, an arbitrary but distinct individual variable is placed in a tuple containing three terms; where there is not supposed to be a value, there is no place for it because the tuple contains only two terms.

Did that clarify?

> >  Why there is a variable
> >instead of a constant in that place doesn't change the fact that the
> >predicate still has a place for that term.  So whether there is
> >supposed to be a value determines the atom's predicate.
>
> Yes, I get that. This is a problem you always get when you combine
> several atomic fact types (falling back to the NIAM terminology) in a
> single table. NIAM and ORM avoid this problem by working with elementary
> fact types. If there is no information about the gender of a customer,
> there is no fact for that customer in the gender fact type.
>
> The FTD notation that prof Nijssen (one of the founders of NIAM)
> developed later removes the requirement to work on elementary fact
> types, and then solves the predicate problem by supplying multiple
> predicates for each FTD (which maps to a table in the relational model).
> So an FTD with one optional role (column) would have two predicates, one
> that includes this role and one that does not.
>
> And in relational database, the answer is to create seperate tables for
> optional attributes, or even to limit tables to at most one non-key
> column (which maps right back to the elementary fact types used in ORM
> and NIAM).
>
> >  It follows
> >that since it can't be determined just from the absence of a value
> >whether there is supposed to be a value, it must be explicitly stated
> >that there is or is not supposed to be a value.
>
> If, and only if, that distinction is relevant to the business. Otherwise
> you simply don't record the value and you're done.
>

I would argue that the distinction is always relevant because it determines which predicate applies and therefore which fact is represented.

> >  If you're only a null
> >hater, then only two relvars are needed: one with an additional
> >boolean attribute that specifies for each tuple whether there should
> >be a value or not, and another relvar for the values that have been
> >supplied.  If you're also a boolean hater, then the relvar with the
> >boolean attribute would have to be split as well.
>
> I still don't see the leap here, nor the connection with hating or not
> hating booleans.
> If the business is not interested in recording is a value is absent
> because there should not be a value or because it just happens to not be
> there, then you need either one nullable column, or one extra relvar,
> with no booleans either way.
> And if the business does require you to record this distinction than you
> either need two nullable columns, or two extra relvars. Again, with no
> booleans in either situation.
>
> You imply that it's possible to track both values and (for missing
> values) whether a missing value is applicable or not, in just two
> relvars with no nulls, but with booleans. I don't see how those relvars
> would be designed.

Table: P ( A INT NOT NULL, B INT NOT NULL, C INT NULL, D INT NULL ),

               PRIMARY KEY ( A, B ) Relations:
VAR Q { A INT, B INT, I BOOL, J BOOL } // I,J indicate C,D apply, respectively

     KEY { A, B }; VAR R { A INT, B INT, C INT } KEY { A, B }; // for supplied C values CONSTRAINT COUNT( R ) = COUNT( R JOIN ( Q WHERE I ) ); // but only where applicable

VAR S { A INT, B INT, D INT } KEY {A, B }; // for supplied D values CONSTRAINT COUNT( S ) = COUNT( S JOIN ( Q WHERE J ) ); // but only where applicable

Now imagine the relvars split due to the elimination of the introduced boolean attributes I and J.

VAR Q { A INT, B INT }
     KEY { A, B }; VAR QI { A INT, B INT } a tuple here indicates that C is applicable

     KEY { A, B };
CONSTRAINT COUNT( QI ) = COUNT( QI JOIN Q ); VAR R { A INT, B INT, C INT } KEY { A, B }; // for supplied C values CONSTRAINT COUNT( R ) = COUNT( R JOIN QI WHERE ); // but only where applicable

VAR QJ { A INT, B INT } a tuple here indicates that D is applicable

     KEY { A, B };
CONSTRAINT COUNT( QJ ) = COUNT( QJ JOIN Q ); VAR S { A INT, B INT, D INT } KEY {A, B }; // for supplied D values CONSTRAINT COUNT( S ) = COUNT( S JOIN QJ ); // but only where applicable

>
> Best, Hugo
Received on Sat Oct 09 2010 - 08:12:48 CEST

Original text of this message