Re: 1NF, missing/unknown data, and structured values

From: J M Davitt <jdavitt_at_aeneas.net>
Date: Tue, 02 May 2006 05:17:16 GMT
Message-ID: <wXB5g.19291$YI5.15241_at_tornado.ohiordc.rr.com>


ralphbecket_at_gmail.com wrote:
> My understanding is that a table is in 1NF if
> - it is a set of rows (i.e., without duplicates)
> - the rows in the table all have the same type
> signature
> - a primary key (i.e., a set of column names)
> exists for the table such that there is a (partial)
> functional relationship from keys to rows (i.e.,
> each key value maps to at most one row in
> the table).

Actually, key value constraints aren't part of 1NF. Proper sets - of attributes (columns) and tuples (rows) - are what 1NF requires.

> Can someone confirm for me that there are no
> other constraints on column values under 1NF?
> In particular, is there any reason why column
> values should not themselves have structure?
> For example, dates seem like a perfectly good
> type for a column, but dates are structured
> values from which I can project day, month, or
> year.

Generally, that is correct. But read on...

> If this is so, surely other structures such as lists
> and various user-defined data types should
> also be supported, provided they support tests
> for (in)equality as required by the RM (so
> higher order types are probably out). Admittedly
> you'd probably need a richer language than SQL
> to manipulate these structure values effectively.

Yes -- and no. Interpolating user-defined types require user-defined operators. All the relational operators can operate on user-defined types as long as there exists a user-defined operator which yields a boolean when asked, "Are these two values the same value?"

> Taking this idea further, we could go on to define
> a collection of "maybe" types (to borrow from
> functional programming parlance) to handle
> situations where we have missing or inapplicable
> data, without the need to break the RM logic with
> nonsense like NULLs.

As long as user-defined operators can figure things out, anything's possible. In other words, it's perfectly acceptable for a user-defined type to use a value which it considers null to indicate something meaningful about that value. (And, of course, one null must be indistinguishable from another null of the same type.) Beyond that, there are some "common sense" requirements about getting the same (and sane) results when the same question is asked -- but those aren't RM concerns.

> That is, I could define a polymorphic type
> maybe_known(T) ranging over values of any type
> T, with two data constructors, known(X) and
> unknown, where X is a value of type T. For
> example, I could use known(age) as a column
> type to distinguish between people whose ages
> are known to the DB versus those that are not. If
> I want to, say, calculate the mean age, I must first
> filter out rows whose ages are unknown, then
> project the remaining ages from the known(age)
> values, then calculate their mean. In this way I am
> forced to ask "what is the mean age of those that
> are known?" rather than just "what is the mean
> age?" which my DB cannot answer.

Now you're treading close to the edge: not everyone agrees on the meaning of "polymorphic." But your general idea is sound: as long as the type values, operators, and results are knowable the RM can handle the data.

> The same idea can be extended further: how about
> a maybe_applicable(T) type with data constructors
> applicable(X) and not_applicable. I could even
> nest these types: the type
> maybe_applicable(maybe_known(spouse_name))
> covers
> - whether spouses are applicable (e.g., children
> can't be married)
> - spouse names where applicable, but unknown
> - spouse names where applicable and known.

"Data constructors," eh? Again, there is no general understanding of what that term means.

> I believe such an approach avoids both the
> problems with NULLs (they break everything) and
> the problem of entirely forbidding structured data
> in columns (normalisation can lead to a large
> number of tables).

As noted above, "structured" data are not forbidden. The RM specifically supports tuple-valued and relationvalued  attributes.

> Question: is this already supported in SQL and
> the various products out there? If so, why would
> anyone still use NULLs? If it is not supported,
> what is the reason?

There are a number of features in various SQL products that "sort of" support "some" of these concepts: "multisets" and "nested tables" in some products, "arrays" and "inheritance" in others. The "standard" talks about some of these, but the SQL approach is hopeless.

You're ready for Date & Darwen's "Databases, Types, and the Relational Model." Their proposed type system is inspired.

>
> Ta,
> -- Ralph
>
Received on Tue May 02 2006 - 07:17:16 CEST

Original text of this message