Re: More on identifiers

From: JOG <jog_at_cs.nott.ac.uk>
Date: Sun, 7 Jun 2009 07:04:44 -0700 (PDT)
Message-ID: <3f0907eb-8de1-44d0-a3a2-59bcd5e09373_at_h2g2000yqg.googlegroups.com>


On Jun 5, 5:52 am, David BL <davi..._at_iinet.net.au> wrote:
> Informally I think of abstract identifiers as "internal glue" within a
> relational database.  A bit more formally, they are characterised as
> identifiers that could be mapped bijectively to different values
> throughout the database without changing the recorded information.
> One would therefore hope that they aren't visible to end users.
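
The bijection criterion quoted above can be made concrete. What follows is a minimal Python sketch, assuming relations are modelled as sets of tuples whose first component is the identifier. The names rename_ids and wooden_sphere_radii and the sample data are hypothetical, not taken from the post.

```python
# Relations are modelled as sets of tuples; the first component is the
# abstract identifier I (an assumption made for this sketch).
db = {
    "wood":   {(1,), (2,)},
    "sphere": {(1, 1.0), (3, 2.5)},
    "alloy":  {(3, "gold", 0.01), (3, "copper", 0.99)},
}

def rename_ids(db, mapping):
    """Apply a bijection on identifiers consistently across every relation."""
    assert len(set(mapping.values())) == len(mapping), "mapping must be injective"
    return {
        name: {(mapping[t[0]],) + t[1:] for t in rel}
        for name, rel in db.items()
    }

def wooden_sphere_radii(db):
    """An identifier-free query: radii of the spheres that are made of wood."""
    wooden = {i for (i,) in db["wood"]}
    return sorted(r for (i, r) in db["sphere"] if i in wooden)

renamed = rename_ids(db, {1: "a", 2: "b", 3: "c"})

# The identifiers changed everywhere, yet the identifier-free answer did not.
print(wooden_sphere_radii(db) == wooden_sphere_radii(renamed))
```

Any query whose result mentions no identifier should be unaffected by such a renaming, which is one way to read "without changing the recorded information".
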
>
> I have wondered for some time whether abstract identifiers are only
> needed within the confines of a flat relational model.  The following
> hypothetical example is meant to cast some light on this question.
> I'm hoping you'll see the underlying matters of principle and see how
> it raises some interesting questions and ideas:
>
> Let a cardboard box of items land on your desk and your task is to
> record information about the items in a database.  In theory each item
> is uniquely identified by its (x,y,z) position at any given epoch.
> However, the idea is that the items are mixed up in the box and their
> positions are irrelevant to the information that needs to be recorded.
>
> The items are not labelled.  The idea is to uniquely identify them
> (only) by their observable properties.  This is indeed assumed to be
> an important integrity constraint to be enforced by the DBMS.  Note as
> well that it would be upsetting (and potentially very costly or
> impractical) if the database system forces items to be labelled when
> there shouldn't be a need to.
>
> However, the problem is that it's very difficult to shoehorn all
> these properties into a single relation in order to form a natural
> primary key.  For example only the items that are spherical have a
> radius attribute.  Only the items that are uniform in colour have a
> single colour attribute.  There may not even be a single attribute to
> be recorded that is shared by all the items!  However, the items can
> be classified in many different (and overlapping) ways.
>
> This is exactly what the RM is really good at, because information
> about a single item can be spread across many different relations,
> providing enormous flexibility.  For example, the following relations
> could be used:
>
>     wood(I) :- item I is made of wood
>     key = {I}
>
>     sphere(I,R) :- item I is a sphere with radius R
>     key = {I}
>
>     alloy(I,M,F) :- item I is made of an alloy with
>                     fraction F of metal M
>     key = {I,M}
>
> In these relations the attribute named 'I' is an abstract identifier.
>
> It seems that abstract identifiers are needed in order to utilise
> multiple relations that avoid NULLs.  They serve as the glue that ties
> things together.  In fact every such relation will use an abstract
> identifier within its primary key.

My first comment would be that the scenario you have described doesn't (strictly) necessitate abstract identifiers. Instead you'd have N relations, one for each of the N different object types, with each of those relations also having an "in_box" attribute. You don't necessarily need a "boxes" relation to describe the information in full. This would mean, however, that you would need to query all N relations to determine what was contained in any given box.
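
A rough sketch of that layout, in Python with relations as sets of tuples. The relation names, attributes and box labels are made up for illustration, and in_box is assumed to be the last attribute of every relation.

```python
# One relation per object type, each carrying an in_box attribute.
# Relation names, attributes and box labels are hypothetical.
wooden_spheres = {(1.0, "box1"), (2.5, "box2")}   # (radius, in_box)
bolts = {("metric", 10, 50, "box1")}              # (standard, diameter, length, in_box)

relations = {"wooden_sphere": wooden_spheres, "bolt": bolts}

def contents_of(box, relations):
    """Query every relation in turn; in_box is the last attribute of each tuple."""
    return sorted(
        (name, t[:-1])
        for name, rel in relations.items()
        for t in rel
        if t[-1] == box
    )

print(contents_of("box1", relations))
```

No identifier appears anywhere, but the cost is visible in contents_of: it has to visit all N relations.
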

>
> In a complicated scenario there could be dozens of these relations.
> In theory one could write a very complex integrity constraint to
> verify that each item (identified by an abstract identifier) is in
> fact uniquely identified by all its observable properties.  However
> it's interesting to consider what happens when we change the
> requirements and allow for indistinguishable items in the cardboard
> box.  Of course the database using abstract identifiers will happily
> allow for duplicates.  The fact that two items can't be distinguished
> by any of their visible properties is implicit in the knowledge that
> abstract identifiers do not in themselves represent visible
> properties.  In other words the abstract identifiers can actually be
> seen as providing a curious means to encode the number of
> indistinguishable items in the cardboard box.
>
> If the schema identifies those domains involving abstract identifiers
> the DBMS could be designed to ensure that no query makes them visible
> to end users.  E.g. one can safely ask for the number of spherical
> items made of at least 10% lead because that result doesn't mention
> any abstract identifiers.  It would presumably be useful to project
> away abstract identifiers from query results.  However this is only
> chipping away at the edges of a fundamental problem.
>
> It seems desirable to hide the abstract identifiers from the schema in
> the first place to allow end users to submit queries. Also we would
> like to allow users to update the database without depending on
> abstract identifiers, or else they are going to be forced to treat
> them as real labels on the items that need to be tracked externally to
> the database.  That would mean they're not abstract identifiers
> anymore.
>
> Surely if we believe abstract identifiers aren't implicit in the
> information to be recorded we should be able to avoid them *at all
> times* when users interact with the database.  This forces us to look
> long and hard at the RM.  This leads me to the central idea I want to
> describe below:
>
> For a long time I've thought the solution is to define a language
> based on a specific grammar for how we can appropriately represent
> items with vastly different properties.  For example
>
>     ( sphere(1.0), alloy( gold(0.01), copper(0.99) ) )
>
> or
>
>     ( bolt(metric,10,50), zinc-plated )
>
> The grammar would reflect the types of items that we know we need to
> be able to identify in the problem domain.  The advantage of nested
> expressions is the ability to compose a detailed description without
> the need for abstract identifiers to tie it all together.
>
> It seems to me that this is largely the basis behind so-called
> "semi-structured data", or one of the reasons for the success of XML.
>
> Evidently parts of the textual description map rather directly to
> facts that we were previously recording in relations with the help of
> abstract identifiers. It seems paradoxical that on the one hand we say
> that the RM is very flexible and perfectly suited to overlapping
> classifications and on the other hand it isn't and we need to use
> nested expressions instead.  It just doesn't make sense!
>
> This led me recently to think of a different approach altogether...
>
> Firstly, rather than think of a RM database as a set of named relation
> variables, consider that it is instead regarded as a single variable
> that holds a set of named relation values.
>
> For a given abstract identifier, consider that for every named
> relation in the database a restriction is applied in order to select
> those tuples with the matching abstract identifier, and then the
> abstract identifier attribute is projected away. What remains is a set
> of named relations that together provide all the information
> about a single item.
>
> It is worth considering what happens to the previous three relations,
> after we have restricted + projected the database to the context of a
> particular item:
>
>     wood'() :- item is made of wood
>     key = {}
>
>     sphere'(R) :- item is a sphere with radius R
>     key = {}
>
>     alloy'(M,F) :- item is made of an alloy with
>                    fraction F of metal M
>     key = {M}
>
> The derived relation wood' has no attributes, and can signify true or
> false for the associated property according to whether or not a single
> tuple is present (DEE and DUM).
>
> The derived relation sphere' has one attribute, but it is not the
> primary key. It follows that this relation may contain at most one
> tuple.  It is similar to DEE/DUM except that the DEE case is augmented
> with additional information.
>
> Due to the projection, all the abstract identifiers have disappeared
> from every relation.  In a way, it's like seeing a database within a
> database!  The value of the "inner" database records all the facts in
> the /context/ of just one of the items, and therefore has no need for
> abstract identifiers to glue things together.
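
The restrict-then-project step quoted above can be sketched in a few lines of Python. This assumes relations are sets of tuples with the abstract identifier as the first component; the function name inner_database and the sample data are hypothetical.

```python
# The "outer" flat database: the first component of every tuple is the
# abstract identifier (an assumption made for this sketch).
db = {
    "wood":   {(1,), (2,)},
    "sphere": {(1, 1.0), (3, 2.5)},
    "alloy":  {(3, "gold", 0.01), (3, "copper", 0.99)},
}

def inner_database(db, item):
    """Restrict every relation to the given identifier, then project it away."""
    return {
        name: frozenset(t[1:] for t in rel if t[0] == item)
        for name, rel in db.items()
    }

inner = inner_database(db, 1)

# wood' holds just the empty tuple (DEE, i.e. "true"), sphere' holds a
# single tuple carrying the radius, and alloy' is empty for this item.
print(inner["wood"] == frozenset({()}))
print(inner["sphere"] == frozenset({(1.0,)}))
print(inner["alloy"] == frozenset())
```
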
>
> We can think of this "inner" database value as a very flexible item
> descriptor.  The "outer" database value can now be seen as recording a
> bag of item descriptors - associated with a predicate stating that
> those items exist in the cardboard box. Of course more generally the
> outer database may want to think of the item descriptors as useful
> identifiers in more interesting relations.  In fact, it could be
> useful to specify in the schema that an item descriptor is the primary
> key of a relation in the outer database, because then the DBMS would
> impose an integrity constraint that the item descriptors are unique.
>
> This item descriptor is exactly the RM counterpart to a textual
> descriptor over some grammar as discussed previously.  The flexibility
> available in a grammar is equally available in the RM by virtue of
> utilising multiple named relations.  Putting it another way, we have
> properly unleashed the full potential of the RM!  The benefits of
> course are enormous.  For example, the set theoretic nature of the RM
> means we know when two item descriptors are equivalent (whereas a
> grammar creates order when it is not intended, and makes it necessary
> to flag the distinction between ordered and unordered recursive
> productions in the grammar).
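
The point about equivalence can be illustrated with a small Python sketch. It assumes an item descriptor is a set of (relation name, relation value) pairs and that empty relations are simply omitted; the helper descriptor and the sample items are hypothetical.

```python
from collections import Counter

def descriptor(**relations):
    """Build a hashable item descriptor: a set of (name, relation value) pairs."""
    return frozenset((name, frozenset(rows)) for name, rows in relations.items())

a = descriptor(sphere=[(1.0,)], alloy=[("gold", 0.01), ("copper", 0.99)])
b = descriptor(alloy=[("copper", 0.99), ("gold", 0.01)], sphere=[(1.0,)])

# The order of relations and of tuples plays no part in equality.
print(a == b)

# The outer database as a bag of descriptors: each count records how many
# indistinguishable items share a descriptor.
box = Counter([a, b, descriptor(wood=[()])])
print(box[a])

# Declaring the descriptor a primary key amounts to requiring every count to be 1.
print(all(n == 1 for n in box.values()))
```

Here the bag holds two indistinguishable alloy spheres, so the uniqueness check fails, which is what a primary key constraint on the descriptor would report.
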
>
> This suggests the idea of a DVA (Database-Valued-Attribute), where
> 'database' by definition is taken to mean a value type consisting of a
> set of named relations.  Note that we can distinguish between database
> variables, database values and database types.
>
> This reminds me of an observation I made some time ago on this
> newsgroup, that to represent a trisurface as a nested value, an RVA
> doesn't appear suitable (because a trisurface is described by /two/
> relations - i.e. a set of vertices and a set of triangles).
>
> A given database value is intended to relate to a very specific
> /context/, which could be as narrow as a single item within the
> cardboard box.  Speaking more generally, within a sufficiently narrow
> context, it can be straightforward to develop an appropriate
> relational schema without any need to go on a naming spree.  This
> database /value/ can then serve as an excellent descriptor of the item
> in a wider context.
>
> Flattening of DVAs requires introduction of abstract identifiers that
> are internal to the database, and can perhaps be seen as an
> implementation technique to map existing DBMS implementations of the
> flat relational model to a nested version that supports DVAs.
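
Flattening in that sense might look like the following Python sketch. It assumes a bag of item descriptors, each a set of (relation name, relation value) pairs, and generates identifiers from a counter; the function name flatten and the sample data are hypothetical.

```python
from collections import Counter
from itertools import count

def flatten(bag):
    """Turn a bag of item descriptors into flat relations keyed by fresh identifiers."""
    fresh = count(1)              # generated identifiers carry no meaning of their own
    flat = {}
    for desc in bag.elements():   # each copy of a duplicate gets its own identifier
        item = next(fresh)
        for name, rows in desc:
            flat.setdefault(name, set()).update((item,) + row for row in rows)
    return flat

sphere = frozenset({("sphere", frozenset({(1.0,)}))})
wood = frozenset({("wood", frozenset({()}))})

flat = flatten(Counter({sphere: 2, wood: 1}))
print(flat)
```

The two indistinguishable spheres end up as two tuples that differ only in the generated identifier, so in the flat form the identifiers are what encode the count.
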
Received on Sun Jun 07 2009 - 16:04:44 CEST
