Re: NULLs: theoretical problems?

From: Hugo Kornelis <hugo_at_perFact.REMOVETHIS.info.INVALID>
Date: Fri, 17 Aug 2007 00:03:48 +0200
Message-ID: <3vf9c351l07qa5jf0mui4lmccp8oetkonr_at_4ax.com>


On Sun, 12 Aug 2007 19:21:30 -0700, JOG wrote:

>On Aug 8, 11:43 am, "sinister" <sinis..._at_nospam.invalid> wrote:
>> Many discussions point out one deficiency of NULLs: that they collapse
>> multiple, distinct concepts into one ("no value possible," "value missing,"
>> "value not available at this time", etc).
>>
>> What are the other theoretical problems? My impression from skimming some
>> threads in this ng is that some anomalies might occur, maybe having to do
>> with NULLs and joins, or NULLs and keys composed of more than one field, but
>> I'm not sure.
>
>A database stores true propositions. A statement with a hole in it (an
>SQL null) is not a proposition, and hence is a theoretical abhorrence.

Hi JOG,

I agree with the first statement, and disagree with the latter. Let's look at an example. Here is a database without NULLs:

(Represntation #1)

  SSN | Name
 -------+---------------

  12345 | John Brown
  90163 | Kate Wilson
  55202 | Geoff T. Hurt
 

  SSN | DOB

 -------+------------
  12345 | 1960-08-31
  55202 | 1973-01-12

This is a short representation of these true propositions:

(Representation #2)

 "The person identified by SSN 12345 goes by the name John Brown."
 "The person identified by SSN 90163 goes by the name Kate Wilson."
 "The person identified by SSN 55202 goes by the name Geoff T. Hurt."

 "The person identified by SSN 12345 was born on 1960-08-31."  "The person identified by SSN 55202 was born on 1973-01-12."

In everyday speech, propositions like these are often combined into sentences that contain multiple propositions, like this:

(Representation #3)

 "The person identified by SSN 12345 goes by the name John Brown and was born on 1960-08-31."
 "The person identified by SSN 90163 goes by the name Kate Wilson."  "The person identified by SSN 55202 goes by the name Geoff T. Hurt and was born on 1973-01-12."

In the tabular representation (note that I'm talking about paper and ink tables here, not about tables in a DB), a similar combination can be made, to save space, provide a better overview, and to match the way that real people exchange these proposition. Ths combined table would look as follows:

(Representation #4)

  SSN | Name | DOB
 -------+---------------+------------

  12345 | John Brown    | 1960-08-31
  90163 | Kate Wilson   |      -
  55202 | Geoff T. Hurt | 1973-01-12

Note the dash in the DOB column for Kate. This does not represent a birthday of "-"; it is just a symbol to signify the reader that the writer didn't accidentallly forget to fill out this cell, but rather that he or she left it empty on purpose. Since this is a very common practice, it is often not even mentioned in the legend (though it can be confusing, e.g. in a table that also uses ++, +, -, and -- symbols to signify a test rating).

I hope that we can agree that the semantics of the propositions don't change when switching between tabular and verbalised representation - in other words, representation #1 is equivalent to #2, and #3 is equivalent to #4.

I also hope that we can agree that combining five short verbalised single-proposition-sentences into three longer multi-proposition- sentences doesn't change the meaning of the propositions themselves, and that therefore representations #2 and #3 are also equivalent.

Since #1 == #2, #2 == #3 and #3 == #4, I can only conclude that #1 == #4 and that there is thus no functional difference between using a single table (with either empty space, a dash symbol, or some other symbol to represent a missing proposition) or using multiple tables where missing propositions can be represented by omitting a complete line.

Back to databases. A DB can't store a dash symbol for a missing proposition, since (a) a dash symbol is not part of each domain, and (b) the dash symbol might have a special meaning in a column (as in the rating column mentioned above). Instead, a DB uses NULL as a symbol with the same meaning aas the dash in ink-and-paper tables: "this space left empty intentionally".

The DB table storing the above information would look like this:

  SSN | Name | DOB
 -------+---------------+------------

  12345 | John Brown    | 1960-08-31
  90163 | Kate Wilson   | NULL
  55202 | Geoff T. Hurt | 1973-01-12

and this is a faithful representation of these five true propositions:

 "The person identified by SSN 12345 goes by the name John Brown."
 "The person identified by SSN 90163 goes by the name Kate Wilson."
 "The person identified by SSN 55202 goes by the name Geoff T. Hurt."

 "The person identified by SSN 12345 was born on 1960-08-31."  "The person identified by SSN 55202 was born on 1973-01-12."

The NULL in the DOB column does not represent a sixth proposition; it just marks the absence of a proposition involving the SSN 90163 and the verb "was born on".

I really fail to see how you can call this concept a "theoretical abhorrence".

>Or from a different angle you might want to consider that a relation
>is a set of tuples. A tuple must contain a value in every position, or
>it is not a valid tuple (not being a subset of the cartesian product
>of the domains being considered). Hence an SQL-null is a theoretical
>abhorrence.

Since NULL is part of any domain in an SQL database, a tuple with NULLs *is* a subset of the cartesian product of the domains being considered.

>The solution is of course to decompose a relation around its key so
>that no nullable columns are required. This results in a schema with
>more relations, and tends to produce queries with more joins, but
>without any theoretical or logical flaws.

Except of course that many such joins need to be outer joins in order to prevent data from being lost, bringiing the NULLs you just removed right back in. You can eliminate NULLs from the stored data, but not from the product.

My main probllem with this "solution", however, is that it sits rather uncomfortably between two stools. I have no problem with the choice for a DB product that propositions that share a common key can be combined into a single table. I also have no problem with the choice that each table should store just a single proposition. The former minimizes the amount of tables in the model, and the amount of joins in the applications. The latter eliminates many pesky normalisation issues, makes for a very stable data model, but comes at the price of very many tables and joins - a price that, as you already point out, is an implementation issue that we can conveniently close our eyes to in this theory group :-)

Allowing mandatory propositions to be combined but forcing optional propositions into seperate tables is halfway between these two options, with an (in my eyes) rather arbitratily chosen rule for when proposions may or may not be combined.

And the model is extremely unstable, since it's not uncommon in real businesses to see an optional proposition become mandatory (e.g. ebcause of legislation change) or a mandatory proposition become optional (e.g. because of competition). If all propositions are combined into a single table, this is a simple change from NULL to NOT NULL or vice versa. If each proposition has its own table, this is an equally simple change, the addition or removal of a constraint that SQL does not currently support but that is known as "equality constraint" in Object Role Modeling. However, if only optional propositions are moved out of the combined table, these changes result in adding or removing a table, removing or adding a column to another table, and moving lots of data to prevent data loss. This argument might be a bit too pragmatic for a theory group, but I like my data models to be a bit more stable than that!

>Some view such decomposition as being computationally inefficient,
>others reply that this is a physical implementation issue and not a
>concern of the logical model.
>
>Given the fact that it can generate much longer queries, I sometimes
>find myself allowing nulls in personal home-brew projects out of sheer
>laziness. However I do suffer from internal pangs of guilt during the
>process ;)

I hope that my explanations (if you've even managed to read this far - I did try to keep it short, honest!) have managed to easy your feeling of pain and guilt, as they are not actualy required.

Best, Hugo Received on Fri Aug 17 2007 - 00:03:48 CEST

Original text of this message