Re: More on identifiers
Date: Sun, 7 Jun 2009 07:04:44 -0700 (PDT)
Message-ID: <3f0907eb-8de1-44d0-a3a2-59bcd5e09373_at_h2g2000yqg.googlegroups.com>
On Jun 5, 5:52 am, David BL <davi..._at_iinet.net.au> wrote:
> Informally I think of abstract identifiers as "internal glue" within a
> relational database. A bit more formally, they are characterised as
> identifiers that could be mapped bijectively to different values
> throughout the database without changing the recorded information.
> One would therefore hope that they aren't visible to end users.
>
> I have wondered for some time whether abstract identifiers are only
> needed within the confines of a flat relational model. The following
> hypothetical example is meant to cast some light on this question.
> I'm hoping you'll see the underlying matters of principle and see how
> it raises some interesting questions and ideas:
>
> Let a cardboard box of items land on your desk and your task is to
> record information about the items in a database. In theory each item
> is uniquely identified by its (x,y,z) position at any given epoch.
> However, the idea is that the items are mixed up in the box and their
> positions are irrelevant to the information that needs to be recorded.
>
> The items are not labelled. The idea is to uniquely identify them
> (only) by their observable properties. This is indeed assumed to be
> an important integrity constraint to be enforced by the DBMS. Note as
> well that it would be upsetting (and potentially very costly or
> impractical) if the database system forces items to be labelled when
> there shouldn't be a need to.
>
> However the problem is that's it's very difficult to shoehorn all
> these properties into a single relation in order to form a natural
> primary key. For example only the items that are spherical have a
> radius attribute. Only the items that are uniform in colour have a
> single colour attribute. There may not even be a single attribute to
> be recorded that is shared by all the items! However, the items can
> be classified in many different (and overlapping) ways.
>
> This is exactly what the RM is really good at, because information
> about a single item can be spread across many different relations,
> providing enormous flexibility. For example, the following relations
> could be used:
>
> wood(I) :- item I is made of wood
> key = {I}
>
> sphere(I,R) :- item I is a sphere with radius R
> key = {I}
>
> alloy(I,M,F) :- item I is made of an alloy with
> fraction F of metal M
> key = {I,M}
>
> In these relations the attribute named 'I' is an abstract identifier.
>
> It seems that abstract identifiers are needed in order to utilise
> multiple relations that avoid NULLs. They serve as the glue that ties
> things together. In fact every such relation will use an abstract
> identifier within its primary key.
>
> In a complicated scenario there could be dozens of these relations.
> In theory one could write a very complex integrity constraint to
> verify that each item (identified by an abstract identifier) is in
> fact uniquely identified by all its observable properties. However
> it's interesting to consider what happens when we change the
> requirements and allow for indistinguishable items in the cardboard
> box. Of course the database using abstract identifiers will happily
> allow for duplicates. The fact that two items can't be distinguished
> by any of their visible properties is implicit in the knowledge that
> abstract identifiers do not in themselves represent visible
> properties. In other words the abstract identifiers can actually be
> seen as providing a curious means to encode the number of
> indistinguishable items in the cardboard box.
>
> If the schema identifies those domains involving abstract identifiers
> the DBMS could be designed to ensure that no query makes them visible
> to end users. E.g. one can safely ask for the number of spherical
> items made of at least 10% lead because that result doesn't mention
> any abstract identifiers. It would presumably be useful to project
> away abstract identifiers from query results. However this is only
> chipping away at the edges of a fundamental problem.
>
> It seems desirable to hide the abstract identifiers from the schema in
> the first place to allow end users to submit queries. Also we would
> like to allow users to update the database without depending on
> abstract identifiers, or else they are going to be forced to treat
> them as real labels on the items that need to be tracked externally to
> the database. That would mean they're not abstract identifiers
> anymore.
>
> Surely if we believe abstract identifiers aren't implicit in the
> information to be recorded we should be able to avoid them *at all
> times* when users interact with the database. This forces us to look
> long and hard at the RM. This leads me to the central idea I want to
> describe below:
>
> For a long time I've thought the solution is to define a language
> based on a specific grammar for how we can appropriately represent
> items with vastly different properties. For example
>
> ( sphere(1.0), alloy( gold(0.01), copper(0.99) ) )
>
> or
>
> ( bolt(metric,10,50), zinc-plated )
>
> The grammar would reflect the types of items that we know we need to
> be able to identify in the problem domain. The advantage of nested
> expressions is the ability to compose a detailed description without
> the need for abstract identifiers to tie it all together.
>
> It seems to me that this is largely the basis behind so called "semi-
> structured data", or one of the reasons for the success of XML.
>
> Evidently parts of the textual description map rather directly to
> facts that we were previously recording in relations with the help of
> abstract identifiers. It seems paradoxical that on the one hand we say
> that the RM is very flexible and perfectly suited to overlapping
> classifications and on the other hand it isn't and we need to use
> nested expressions instead. It just doesn't make sense!
>
> This led me recently to think of a different approach all together...
>
> Firstly, rather than think of a RM database as a set of named relation
> variables, consider that it is instead regarded as a single variable
> that holds a set of named relation values.
>
> For a given abstract identifier, consider that for each every named
> relation in the database a restriction is applied in order to select
> those tuples with the matching abstract identifier, and then the
> abstract identifier attribute is projected away. What remains is a set
> of named relations that all together provides all the information
> about a single item.
>
> It is worth considering what happens to the previous three relations,
> after we have restricted + projected the database to the context of a
> particular item:
>
> wood'() :- item is made of wood
> key = {}
>
> sphere'(R) :- item is a sphere with radius R
> key = {}
>
> alloy'(M,F) :- item is made of an alloy with
> fraction F of metal M
> key = {M}
>
> The derived relation wood' has no attributes, and can signify true or
> false for the associated property according to whether or not a single
> tuple is present (DEE and DUM).
>
> The derived relation sphere' has one attribute, but it is not the
> primary key. It follows that this relation may contain at most one
> tuple. It is similar to DEE/DUM except that the DEE case is augmented
> with additional information.
>
> Due to the projection, all the abstract identifiers have disappeared
> from every relation. In a way, it's like seeing a database within a
> database! The value of the "inner" database records all the facts in
> the /context/ of just one of the items, and therefore has no need for
> abstract identifiers to glue things together.
>
> We can think of this "inner" database value as a very flexible item
> descriptor. The "outer" database value can now be seen as recording a
> bag of item descriptors - associated with a predicate stating that
> those items exist in the cardboard box. Of course more generally the
> outer database may want to think of the item descriptors as useful
> identifiers in more interesting relations. In fact, it could be
> useful to specify in the schema that an item descriptor is the primary
> key of a relation in the outer database, because then the DBMS would
> impose an integrity constraint that the item descriptors are unique.
>
> This item descriptor is exactly the RM counterpart to a textual
> descriptor over some grammar as discussed previously. The flexibility
> available in a grammar is equally available in the RM by virtue of
> utilising multiple named relations. Putting it another way, we have
> properly unleashed the full potential of the RM! The benefits of
> course are enormous. For example, the set theoretic nature of the RM
> means we know when two item descriptors are equivalent (whereas a
> grammar creates order when it is not intended, and makes it necessary
> to flag the distinction between ordered and unordered recursive
> productions in the grammar).
>
> This suggests the idea of a DVA (Database-Valued-Attribute), where
> 'database' by definition is taken to mean a value type consisting of a
> set of named relations. Note that we can distinguish between database
> variables, database values and database types.
>
> This reminds me of an observation I made some time ago on this
> newsgroup, that to represent a trisurface as a nested value, an RVA
> doesn't appear suitable (because a trisurface is described by /two/
> relations - i.e. a set of vertices and a set of triangles).
>
> A given database value is intended to relate to a very specific /
> context/, which could be as narrow as a single item within the
> cardboard box. Speaking more generally, within a sufficiently narrow
> context, it can be straightforward to develop an appropriate
> relational schema without any need to go on a naming spree. This
> database /value/ can then serve as an excellent descriptor of the item
> in a wider context.
>
> Flattening of DVAs requires introduction of abstract identifiers that
> are internal to the database, and can perhaps be seen as an
> implementation technique to map existing DBMS implementations of the
> flat relational model to a nested version that supports DVAs.
Received on Sun Jun 07 2009 - 16:04:44 CEST