Re: 3vl 2vl and NULL

From: Hugo Kornelis <hugo_at_perFact.REMOVETHIS.info>
Date: Mon, 16 Jan 2006 23:55:16 +0100
Message-ID: <ou5os1t9snurdam1roaq7gj4na2u88r41o_at_4ax.com>


Hi Jon,

Dang! Over an hour of typing has just fallen victim to a computer glitch.

Apologies in advance if I sound curt - it's not because of you, but because I hate having to write the same message a second time.

On Fri, 13 Jan 2006 10:02:57 +0100, Jon Heggland wrote:

>In article <34mds15ukofp0jgpd6nbsbqm1u2l18vbuf_at_4ax.com>,
>hugo_at_perFact.REMOVETHIS.info says...
>>
>> Codd's interpretation of NULL appears to have been changed over time. If
>> you read the "Much ado about nothing" article carefully, you'll see that
>> Codd's original commentary only uses the term "missing data" when
>> discussing NULL; the term "Unknown" is only used as the third truth
>> value.
>>
>> But if you then skip to "Rebutting the rebuttals", you'll see that Codd
>> has adopted Date's interpretation of Null as either "Unknown" or
>> "Inapplicable" - and everything starts to go downhill from there.
>
>To suggest that Codd got his I-marks and A-marks from Date is, to put it
>politely, very strange.

I didn't mean to suggest that. What I meant is:

  1. Codd's original arguments in "Much ado about nothing" are geared towards one kind of NULL: "no data here".
  2. In Codd's rebuttal, he appears to have changed position to "Two types of missing information - A-mark and I-mark".

Who or what cuased Codd's position shift, I don't know. But I do know that Codd's new position is (IMO) closer to some of Date's arguments against NULL than his first position.

>
>> In this thread, I have attepmted to defend Codd's original POV that a
>> NULL means nothing at all - it just indicates that the data that you
>> might expect to find in a column is missing.
>
>Now I think you are moving the goalposts. You said "I guess that our
>disagreements are caused by discussing two similar but not equal
>subjects. I was talking about the behaviour of NULL in SQL DBMSs [...]".
>SQL's NULL behaviour is practically identical to Codd's approach in
>"Extending the Database Relational Model to Capture More Meaning"
>(1979), where he explicitly says he is talking about null as "value at
>present unknown". Can you provide a source for "Codd's original POV",
>and show that that is the foundation for SQL?

Point taken. I should not have tried to find backing for my arguments from the big names. At least not without first checking that they really share my position.

My position in this discussion has been that relational databases need only one marker for missing data, that this marker should carry no more meaning than just that ("no value here"), that the way NULL is currently handled in SQL is very close to how I feel that this "missing data" marker should be handled, and that it would therefore be a step back, rather than a step forward, to attempt to read any meaning into NULL, other than just "no value here".

If, in any of my messages in this discussion, I appear to be defending a different position, you can blame it on my incomplete mastery of the English language.

>
>> >
>> >That's where we disagree. You can sort of salvage the predicate by
>> >treating the NULLs as unknown (but existing).
>>
>> Indeed, you can. And that would force lots of redesigns.
>
>...because people have been using NULL for "inapplicable" and other
>things, you mean? It's really not different from "forcing redesign"
>because someone has used -1 or 9999 for unknown, inapplicable or
>infinity. It may be worth it; it may not.

I have no problems with redesigns (though this sure is a huge practical objection to a shift from NULL that means "no data here" to a NULL that means "unknown but applicable"). I do have a problem with redesigns that yield no benefit.

>
>> Let's look at address data, for instance. In many Dutch databases, I
>> have seen seperate columns for the elements of a Dutch address:
>> "Address" (street, housenumber and optional suffix to the housenumber),
>> "PostalCode" and "City". And there are also columns for contacts living
>> abroad: "AddressLine1", "AddressLine2", and "AddressLine3". This
>> satisfies the many different formats used to print address lines on
>> envelopes (e.g. postal code before or after city).
>>
>> Of course, you should have two tables. One for Dutch addresses and one
>> for foreign addresses. That's the theory. But what is the added value of
>> the two-table design in a real database that is used for a real
>> application. Remember that I'm not discussing the database that the
>> postal service uses to pre-sort the mail; this database will probably
>> use the address information ONLY to print on outgoing mail. And, maybe,
>> some reports will be aggregated on City (using a COALESCE to get all
>> foreign customers in the "city" "Foreign".
>
>I don't think this is a good example, because you are mixing several
>different issues here; none of which are very relevant.
>
>* You don't mention what this has to do with NULLs. I can guess what you
>mean, but you don't need to use NULLs here to indicate that (e.g.)
>'there is no AddressLine3' (whatever that means). You can use the empty
>string.

Yes, for AddressLine3 you can. But how about HouseNumber and PostalCode?

  HouseNumber int CHECK (HouseNumber > 0),   PostalCode char(6)

              CHECK (PostalCode LIKE '[1-9][0-9][0-9][0-9] [A-Z][A-Z]',

Or StreetName?

  StreetName varchar(40),
  FOREIGN KEY (PostalCode, City, StreetName)    REFERENCES PurchasedDatabaseOfValidStreetPostalCodeCombinations

>
>* How you should model something depends greatly on how you are going to
>use it. If you are just going to print it on an envelope, a multi-line
>text field suffices. If you are going to compare addresses, or make
>detailed reports on them, or check integrity/consistency, you will
>benefit from more structure. This is completely orthogonal to NULLs. The
>model you present is not necessarily the most correct for all purposes,
>nor is it necessarily "the theory" that "you should have two tables".

Let's give the example some more beef.

  • Main use of the various address-related columns: to be printed on outgoing mail. Foreign customers are a small minority; it's essential that their address are stored and printed correct, so that the mail arrives. Less integrity checking for these addresses is acceptable.
  • The Sales department wants to do some datamining. They will relate sales and other figures to city, street, and/or postal code of our customers. Foreign customers will be excluded from this analysis.

With these requirements, I could go for one, two, or even three tables (Customers, DutchAddress, ForeignAddress). Redefining NULL to mean "unknown but applicable" would force me to dismiss at least two of my options beforehand.

>
>* Usability (for end user, not DBAs/programmers) have little to do with
>the data model---at least not in the way you seem to suggest. The
>address entry form the customer sees doesn't have to change even if you
>have two tables instead of one. In fact, I'd suggest an even better idea
>might be to model all kinds of national address schemes properly, and
>present a different form to each customers depending on the selected
>country. I for one is annoyed by having to enter "N/A" in required
>fields that are inapplicable for Norwegian addresses.

I was talking about the usability for DBAs/programmers. Which translates directly to development cost and maintenance cost.

Not a big deal in this simple example. But what about an entity with several attributes that are sometimes inapplicable - for unrelated reasons. For instance, some columns are only for females, others only for underage persons, yet others only for sports fans. And some are only for bearded male football lovers in their fourties.

NULL as "unknown but applicable" would force an abundance of tables, all related in complex ways. NULL as "no value" would not reclude that design, but leaves the door open for other designs as well.

>
>> I won't bother my customers with the extra complexity and the extra
>> performance overhead of this two-table design, even though I am aware
>> that I'm violating relational theory.
>
>Hardly. Nulls are considered by some to be part of relational database
>theory. You *might* be violating some more or less formal database
>design guidelines, but I wouldn't call that "relational theory". Or do
>you mean normalisation?

Strange. ISTR reading (in this group) that a purely relational database shouldn't have NULLs. But maybe I misread. Or paid too much attention to the words of a minority.

>
>Besides, as mentioned above, the customers don't need to notice at all.
>And performance may be better or worse, depending on how the database is
>used.

Agreed. But if performance is worse, I like to have an option. NULL as "no value here", without any other meaning, gives me an option. NULL as "unknown but applicable" doesn't.

>
>> >> >Age
>> >> >----
>> >> >NULL
>> >> >
>> >> >What does that mean?
>> >>
>> >> That row should not be in the projection on Age. NULL doesn't signal a
>> >> "NULL age" (whatever that might be), but "no data in the Age column for
>> >> this row". The project operation projects data that is there, not data
>> >> that isn't there.
>> >
>> >Yet SQL keeps that NULL in the projection, and we are discussing how SQL
>> >NULLs should be interpreted. If you interpret it as 'unknown', it makes
>> >some sort of sense: There is a family member with an unknown age.
>>
>> There is a family member for whom the age attribute is missing.
>
>So you are changing your mind? Does the row have an interpretation after
>all?

No, I'm not changing my mind. I still think that the NULL row should not be in the projection. This is one of the few cases where SQL's handling of NULL does not match to how I feel that the "missing data" marker should be handled.

But since I can't deny that SQL indeed does include this row and since you coerced me into giving an interpretation, I gave it my best shot. Which is, indeed, incompatible with my position on how NULLs *should* be treated by SQL. Like I said (above): the way NULL is currently handled in SQL is *VERY CLOSE* to how I feel that this "missing data" marker should be handled. Very close - not exactly equal. Unfortunately.

>
>(By the way, I'd complete your statement above thus: "There is a family
>member for whom the age attribute is missing---in other words, we don't
>know what it is.")

"But we do. We all know that Aunt Marge is 36. But she doesn't want it stored in the database, becuase she doesn't want her new 29-year old lover to find out."

Or, changing to a different column.

"There is a customer for whom the postal code is missing--in other words, we don't know what it is."
"But we do. It's 19Y3W4 - just look at the far right of the second foreign address line."

I think my statement was complete. Even though in an ideal world, this statement would not be made, since the NULL row would not be included in the projection. NULL is, after all, not a value. And a projection should project values.

>
>Anyway, you don't address my point: SQL keeps the row in the result, and
>we are (as you agree) discussing SQL NULLs. To rephrase, the question
>I'm discussing is "Given how NULLs work in SQL, how should they be
>interpreted?". Above, you seem to start from the other end: "Given that
>NULLs don't mean anything, how should SQL work?".

I'm doing a bit of both. See above.

I feel that NULLS in SQL are close enough to how "missing data" in relational databases should be. Sure, there are some gripes, but I can live with them. They are not severe enough to warrant a change.

>
>> >> (FYI, in NIAM, there would be one predicate like I used in my reply for
>> >> the model, and all ages used anywhere in that model would reference this
>> >> predicate.)
>> >
>> >I find this a bit strange. Wouldn't that predicate be pointless, since
>> >all natural numbers are ages?
>>
>> All natural numbers are order number too. But comparing two ages might
>> make sense, comparing an age to an order number doesn't. That's what the
>> NIAM model shows very clearly.
>
>How so? Surely NIAM doesn't prevent you from comparing values from
>different predicates?

NIAM is a modeling technique. It doesn't prevent you from anything like that. But it does offer you a model that clearly shows that the attributes you are comparing are incompatible, even though they might share the same datatype.

> I think such considerations are better handled by
>data types / domains. Doesn't NIAM separate between domains and
>predicates?

NIAM handles such things differently. I'll try to explain without changing this to an online NIAM course :-)

NIAM centers on predicates. Or, to use a literal translation of the Dutch term: FactTypes. FactTypes should be "atomic" - i.e., they can't be split without losing information.

FactTypes have fixed parts and variable parts. An example (where the notation clearly shows the fixed and variable parts):

  1. The person named George has an age of 47 years.
  2. " " " Mildred " " " " 34 " . (use fixed font to view)

This is a FactType with two roles (let's call them R1 and R2 for now) and one "FactTypeFormulation" (predicate): "The person named <R1> has an age of <R2> years."

Each role is played by either a "ThingType" or a "NameType". Think of ThingTypes as actual things (or notions) from the real world, and NameTypes as the labels we stick on these things so that we can say "George" instead of "The bald person with the brown jacket over there". In this case, both roles are played by a ThingType, as we're relating an age (which is a thing) to a person.

A ThingType is always the nominalization of a FactType - the FactType where the existance of the corresponding Things is established. For instance, the ThingType Person is the nominalization of the FactType "There exists a person named <R3>" - where R3 is a role that is played by the NameType firstname.

A Nametype is close to datatype and domain. I've explained it to my students as the collection of labels that might be stuck on a specific thing in order to be able to reference it. If we reference people by first name (as in this example), the NameType Firstname is the collection of all valid first names (which is of course endless). And if we reference ages by number of whole years, the collection of labels that we might stick on ages is a list of nonnegative integers up to whatever is a valid maximum age in the Universe of Discourse.

Clear as mud?

Best, Hugo Received on Mon Jan 16 2006 - 23:55:16 CET

Original text of this message