Re: NULLs: theoretical problems?

From: JOG <jog_at_cs.nott.ac.uk>
Date: Fri, 17 Aug 2007 11:49:42 -0000
Message-ID: <1187351382.874948.51000_at_w3g2000hsg.googlegroups.com>


On Aug 16, 11:03 pm, Hugo Kornelis
<h..._at_perFact.REMOVETHIS.info.INVALID> wrote:
> [sni[]
> >A database stores true propositions. A statement with a hole in it (an
> >SQL null) is not a proposition, and hence is a theoretical abhorrence.
>
> Hi JOG,
>
> I agree with the first statement, and disagree with the latter. Let's
> look at an example. Here is a database without NULLs:
>
> (Represntation #1)
>
> SSN | Name
> -------+---------------
> 12345 | John Brown
> 90163 | Kate Wilson
> 55202 | Geoff T. Hurt
>
> SSN | DOB
> -------+------------
> 12345 | 1960-08-31
> 55202 | 1973-01-12
>
> This is a short representation of these true propositions:
>
> (Representation #2)
>
> "The person identified by SSN 12345 goes by the name John Brown."
> "The person identified by SSN 90163 goes by the name Kate Wilson."
> "The person identified by SSN 55202 goes by the name Geoff T. Hurt."
>
> "The person identified by SSN 12345 was born on 1960-08-31."
> "The person identified by SSN 55202 was born on 1973-01-12."
>
> In everyday speech, propositions like these are often combined into
> sentences that contain multiple propositions, like this:
>
> (Representation #3)
>
> "The person identified by SSN 12345 goes by the name John Brown and was
> born on 1960-08-31."
> "The person identified by SSN 90163 goes by the name Kate Wilson."
> "The person identified by SSN 55202 goes by the name Geoff T. Hurt and
> was born on 1973-01-12."
>
> In the tabular representation (note that I'm talking about paper and ink
> tables here, not about tables in a DB), a similar combination can be
> made, to save space, provide a better overview, and to match the way
> that real people exchange these proposition. Ths combined table would
> look as follows:
>
> (Representation #4)
>
> SSN | Name | DOB
> -------+---------------+------------
> 12345 | John Brown | 1960-08-31
> 90163 | Kate Wilson | -
> 55202 | Geoff T. Hurt | 1973-01-12
>
> Note the dash in the DOB column for Kate. This does not represent a
> birthday of "-"; it is just a symbol to signify the reader that the
> writer didn't accidentallly forget to fill out this cell, but rather
> that he or she left it empty on purpose.

....Ok, well then it would no longer be a relation. (and of course we have the extra proposition "The person with SSN 90163 did not want to supply here birthdate", which we did not have before). Either way:

  1. A table in a RDBMS is a visualization of a relation.
  2. A relation must contain a series of tuples each with the same arity.
  3. A binary tuple such as (90163, Kate Wilson) is not valid in a ternary relation. It is a theoretical impossibility.
  4. Equally values in tuples must come from the specified domains. "-" is not a date, so again (90163, Kate Wilson, -) is not a valid tuple.

Of course I understand what the "-" represents as metadata, but for the reasons above I'd stand that in RM nulls are a _theoretical_ abhorrence. Would you contest any of the above statements?

> Since this is a very common
> practice, it is often not even mentioned in the legend (though it can be
> confusing, e.g. in a table that also uses ++, +, -, and -- symbols to
> signify a test rating).
>
> I hope that we can agree that the semantics of the propositions don't
> change when switching between tabular and verbalised representation - in
> other words, representation #1 is equivalent to #2, and #3 is equivalent
> to #4.

I don't think we do agree there (if we did the logic after would make sense). Propositions are not represented in tables - they are encoded in relations, and then these relations are visualized as tables.

>
> I also hope that we can agree that combining five short verbalised
> single-proposition-sentences into three longer multi-proposition-
> sentences doesn't change the meaning of the propositions themselves, and
> that therefore representations #2 and #3 are also equivalent.
>
> Since #1 == #2, #2 == #3 and #3 == #4, I can only conclude that #1 == #4
> and that there is thus no functional difference between using a single
> table (with either empty space, a dash symbol, or some other symbol to
> represent a missing proposition) or using multiple tables where missing
> propositions can be represented by omitting a complete line.
>
> Back to databases. A DB can't store a dash symbol for a missing
> proposition, since (a) a dash symbol is not part of each domain, and (b)
> the dash symbol might have a special meaning in a column (as in the
> rating column mentioned above). Instead, a DB uses NULL as a symbol with
> the same meaning aas the dash in ink-and-paper tables: "this space left
> empty intentionally".
>
> The DB table storing the above information would look like this:
>
> SSN | Name | DOB
> -------+---------------+------------
> 12345 | John Brown | 1960-08-31
> 90163 | Kate Wilson | NULL
> 55202 | Geoff T. Hurt | 1973-01-12
>
> and this is a faithful representation of these five true propositions:
>
> "The person identified by SSN 12345 goes by the name John Brown."
> "The person identified by SSN 90163 goes by the name Kate Wilson."
> "The person identified by SSN 55202 goes by the name Geoff T. Hurt."
>
> "The person identified by SSN 12345 was born on 1960-08-31."
> "The person identified by SSN 55202 was born on 1973-01-12."
>
> The NULL in the DOB column does not represent a sixth proposition; it
> just marks the absence of a proposition involving the SSN 90163 and the
> verb "was born on".
>
> I really fail to see how you can call this concept a "theoretical
> abhorrence".
>
> >Or from a different angle you might want to consider that a relation
> >is a set of tuples. A tuple must contain a value in every position, or
> >it is not a valid tuple (not being a subset of the cartesian product
> >of the domains being considered). Hence an SQL-null is a theoretical
> >abhorrence.
>
> Since NULL is part of any domain in an SQL database, a tuple with NULLs
> *is* a subset of the cartesian product of the domains being considered.
>
> >The solution is of course to decompose a relation around its key so
> >that no nullable columns are required. This results in a schema with
> >more relations, and tends to produce queries with more joins, but
> >without any theoretical or logical flaws.
>
> Except of course that many such joins need to be outer joins in order to
> prevent data from being lost, bringiing the NULLs you just removed right
> back in. You can eliminate NULLs from the stored data, but not from the
> product.

I'm unclear why one would need an outer join? All the data is still there. Why would one ever need the kludge relation with the holes in it?

>
> My main probllem with this "solution", however, is that it sits rather
> uncomfortably between two stools. I have no problem with the choice for
> a DB product that propositions that share a common key can be combined
> into a single table. I also have no problem with the choice that each
> table should store just a single proposition. The former minimizes the
> amount of tables in the model, and the amount of joins in the
> applications. The latter eliminates many pesky normalisation issues,
> makes for a very stable data model, but comes at the price of very many
> tables and joins - a price that, as you already point out, is an
> implementation issue that we can conveniently close our eyes to in this
> theory group :-)

Heh, yup in a theory group. Outside, I would also be extremely worried about the consequences of so many joins. However I am not knowledgeable enough about the practical impacts - perhaps someone else here is? I no of no scientific studies which have really addressed the impact of 6NF.

>
> Allowing mandatory propositions to be combined but forcing optional
> propositions into seperate tables is halfway between these two options,
> with an (in my eyes) rather arbitratily chosen rule for when proposions
> may or may not be combined.
>
> And the model is extremely unstable, since it's not uncommon in real
> businesses to see an optional proposition become mandatory (e.g. ebcause
> of legislation change) or a mandatory proposition become optional (e.g.
> because of competition). If all propositions are combined into a single
> table, this is a simple change from NULL to NOT NULL or vice versa. If
> each proposition has its own table, this is an equally simple change,
> the addition or removal of a constraint that SQL does not currently
> support but that is known as "equality constraint" in Object Role
> Modeling. However, if only optional propositions are moved out of the
> combined table, these changes result in adding or removing a table,
> removing or adding a column to another table, and moving lots of data to
> prevent data loss. This argument might be a bit too pragmatic for a
> theory group, but I like my data models to be a bit more stable than
> that!

Interesting points.

>
> >Some view such decomposition as being computationally inefficient,
> >others reply that this is a physical implementation issue and not a
> >concern of the logical model.
>
> >Given the fact that it can generate much longer queries, I sometimes
> >find myself allowing nulls in personal home-brew projects out of sheer
> >laziness. However I do suffer from internal pangs of guilt during the
> >process ;)
>
> I hope that my explanations (if you've even managed to read this far - I
> did try to keep it short, honest!) have managed to easy your feeling of
> pain and guilt, as they are not actualy required.

Lol, not quite! I still have to say a coupla hail mary's.

>
> Best, Hugo
Received on Fri Aug 17 2007 - 13:49:42 CEST

Original text of this message