More on identifiers

From: David BL <davidbl_at_iinet.net.au>
Date: Thu, 4 Jun 2009 21:52:27 -0700 (PDT)
Message-ID: <03132046-3fd2-4bb1-9e00-cd81ece451c2_at_h2g2000yqg.googlegroups.com>



Informally I think of abstract identifiers as "internal glue" within a relational database. A bit more formally, they are characterised as identifiers that could be mapped bijectively to different values throughout the database without changing the recorded information. One would therefore hope that they aren't visible to end users.

I have wondered for some time whether abstract identifiers are only needed within the confines of a flat relational model. The following hypothetical example is meant to cast some light on this question. I'm hoping you'll see the underlying matters of principle and see how it raises some interesting questions and ideas:

Let a cardboard box of items land on your desk and your task is to record information about the items in a database. In theory each item is uniquely identified by its (x,y,z) position at any given epoch. However, the idea is that the items are mixed up in the box and their positions are irrelevant to the information that needs to be recorded.

The items are not labelled. The idea is to uniquely identify them (only) by their observable properties. This is indeed assumed to be an important integrity constraint to be enforced by the DBMS. Note as well that it would be upsetting (and potentially very costly or impractical) if the database system forces items to be labelled when there shouldn't be a need to.

However the problem is that's it's very difficult to shoehorn all these properties into a single relation in order to form a natural primary key. For example only the items that are spherical have a radius attribute. Only the items that are uniform in colour have a single colour attribute. There may not even be a single attribute to be recorded that is shared by all the items! However, the items can be classified in many different (and overlapping) ways.

This is exactly what the RM is really good at, because information about a single item can be spread across many different relations, providing enormous flexibility. For example, the following relations could be used:

    wood(I) :- item I is made of wood
    key = {I}

    sphere(I,R) :- item I is a sphere with radius R     key = {I}

    alloy(I,M,F) :- item I is made of an alloy with

                    fraction F of metal M
    key = {I,M}

In these relations the attribute named 'I' is an abstract identifier.

It seems that abstract identifiers are needed in order to utilise multiple relations that avoid NULLs. They serve as the glue that ties things together. In fact every such relation will use an abstract identifier within its primary key.

In a complicated scenario there could be dozens of these relations. In theory one could write a very complex integrity constraint to verify that each item (identified by an abstract identifier) is in fact uniquely identified by all its observable properties. However it's interesting to consider what happens when we change the requirements and allow for indistinguishable items in the cardboard box. Of course the database using abstract identifiers will happily allow for duplicates. The fact that two items can't be distinguished by any of their visible properties is implicit in the knowledge that abstract identifiers do not in themselves represent visible properties. In other words the abstract identifiers can actually be seen as providing a curious means to encode the number of indistinguishable items in the cardboard box.

If the schema identifies those domains involving abstract identifiers the DBMS could be designed to ensure that no query makes them visible to end users. E.g. one can safely ask for the number of spherical items made of at least 10% lead because that result doesn't mention any abstract identifiers. It would presumably be useful to project away abstract identifiers from query results. However this is only chipping away at the edges of a fundamental problem.

It seems desirable to hide the abstract identifiers from the schema in the first place to allow end users to submit queries. Also we would like to allow users to update the database without depending on abstract identifiers, or else they are going to be forced to treat them as real labels on the items that need to be tracked externally to the database. That would mean they're not abstract identifiers anymore.

Surely if we believe abstract identifiers aren't implicit in the information to be recorded we should be able to avoid them *at all times* when users interact with the database. This forces us to look long and hard at the RM. This leads me to the central idea I want to describe below:

For a long time I've thought the solution is to define a language based on a specific grammar for how we can appropriately represent items with vastly different properties. For example

( sphere(1.0), alloy( gold(0.01), copper(0.99) ) )

or

( bolt(metric,10,50), zinc-plated )

The grammar would reflect the types of items that we know we need to be able to identify in the problem domain. The advantage of nested expressions is the ability to compose a detailed description without the need for abstract identifiers to tie it all together.

It seems to me that this is largely the basis behind so called "semistructured  data", or one of the reasons for the success of XML.

Evidently parts of the textual description map rather directly to facts that we were previously recording in relations with the help of abstract identifiers. It seems paradoxical that on the one hand we say that the RM is very flexible and perfectly suited to overlapping classifications and on the other hand it isn't and we need to use nested expressions instead. It just doesn't make sense!

This led me recently to think of a different approach all together...

Firstly, rather than think of a RM database as a set of named relation variables, consider that it is instead regarded as a single variable that holds a set of named relation values.

For a given abstract identifier, consider that for each every named relation in the database a restriction is applied in order to select those tuples with the matching abstract identifier, and then the abstract identifier attribute is projected away. What remains is a set of named relations that all together provides all the information about a single item.

It is worth considering what happens to the previous three relations, after we have restricted + projected the database to the context of a particular item:

    wood'() :- item is made of wood
    key = {}

    sphere'(R) :- item is a sphere with radius R     key = {}

    alloy'(M,F) :- item is made of an alloy with

                   fraction F of metal M
    key = {M}

The derived relation wood' has no attributes, and can signify true or false for the associated property according to whether or not a single tuple is present (DEE and DUM).

The derived relation sphere' has one attribute, but it is not the primary key. It follows that this relation may contain at most one tuple. It is similar to DEE/DUM except that the DEE case is augmented with additional information.

Due to the projection, all the abstract identifiers have disappeared from every relation. In a way, it's like seeing a database within a database! The value of the "inner" database records all the facts in the /context/ of just one of the items, and therefore has no need for abstract identifiers to glue things together.

We can think of this "inner" database value as a very flexible item descriptor. The "outer" database value can now be seen as recording a bag of item descriptors - associated with a predicate stating that those items exist in the cardboard box. Of course more generally the outer database may want to think of the item descriptors as useful identifiers in more interesting relations. In fact, it could be useful to specify in the schema that an item descriptor is the primary key of a relation in the outer database, because then the DBMS would impose an integrity constraint that the item descriptors are unique.

This item descriptor is exactly the RM counterpart to a textual descriptor over some grammar as discussed previously. The flexibility available in a grammar is equally available in the RM by virtue of utilising multiple named relations. Putting it another way, we have properly unleashed the full potential of the RM! The benefits of course are enormous. For example, the set theoretic nature of the RM means we know when two item descriptors are equivalent (whereas a grammar creates order when it is not intended, and makes it necessary to flag the distinction between ordered and unordered recursive productions in the grammar).

This suggests the idea of a DVA (Database-Valued-Attribute), where 'database' by definition is taken to mean a value type consisting of a set of named relations. Note that we can distinguish between database variables, database values and database types.

This reminds me of an observation I made some time ago on this newsgroup, that to represent a trisurface as a nested value, an RVA doesn't appear suitable (because a trisurface is described by /two/ relations - i.e. a set of vertices and a set of triangles).

A given database value is intended to relate to a very specific / context/, which could be as narrow as a single item within the cardboard box. Speaking more generally, within a sufficiently narrow context, it can be straightforward to develop an appropriate relational schema without any need to go on a naming spree. This database /value/ can then serve as an excellent descriptor of the item in a wider context.

Flattening of DVAs requires introduction of abstract identifiers that are internal to the database, and can perhaps be seen as an implementation technique to map existing DBMS implementations of the flat relational model to a nested version that supports DVAs. Received on Fri Jun 05 2009 - 06:52:27 CEST

Original text of this message