Re: Towards a definition of atomic

From: David BL <davidbl_at_iinet.net.au>
Date: Sat, 2 Feb 2008 01:28:56 -0800 (PST)
Message-ID: <f90ee1eb-8b20-47b0-9b83-0969f187c697_at_d4g2000prg.googlegroups.com>


On Feb 2, 3:55 am, Marshall <marshall.spi..._at_gmail.com> wrote:
> On Feb 1, 5:30 am, David BL <davi..._at_iinet.net.au> wrote:
>
>
>
>
>
> > AFAIK the conventional wisdom is that no absolute definition of atomic
> > exists for domain types. Throwing caution to the wind, in this post I
> > wish to conjecture a definition of atomic that, perhaps with some more
> > effort at its formalisation, can provide some absolute meaning for a
> > given attribute within a given RDB schema.
>
> > The examples are a little contrived, but are only meant to be
> > illustrative.
>
> > Example 1:
> > "Einstein discovered the formula E = mc^2"
> > "Newton discovered the formula F = ma"
>
> > Example 2:
> > "Bill is a parent of { Mary, John }"
> > "Mary is a parent of { Don, Alex, Sue }"
>
> > In example 1, in Prolog we can define a predicate 'discovered' to
> > represent the two facts as follows
>
> > discovered(einstein, eq(var("E"), prod(var("m"),
> > pow(var("c"),num(2))))).
> > discovered(newton, eq(var("F),prod(var("m"),var("a")))).
>
> > In previous threads I have discussed how it is not possible to
> > decompose the information in nested expressions into a set of
> > propositions about the nodes without the introduction of node
> > identifiers.
>
> > By contrast, in example 2 it is straightforward to map the two facts
> > into five (by decomposing the sets of children) as follows
>
> > parent(bill,mary).
> > parent(bill,john).
> > parent(mary,don).
> > parent(mary,alex).
> > parent(mary,sue).
>
> I think there is a hidden assumption here at work which is affecting
> your results. We know that people are all distinct; by convention
> you are using first name as a proxy for a person, and treating the
> first name as unique; this allows you to make sets of propositions
> without concern for losing information in the process.
>
> In contrast, the nodes in an expression tree are *not* distinct
> and hence cannot be uniquely determined simply by their
> intrinsic value. (Which node am I referring to when I specify
> node "c" in the expression "E = m * c * c"?)

It's a minor point but I'm struggling a little with your terminology even though I think I know what you're getting at! ISTM that by definition of a tree, the nodes *are distinct* and this imposes the need for unique node identifiers. However, of course I agree with your point that a node cannot be identified by its intrinsic value.

> If we did not consider first name to be unique, we would have
> to introduce PersonId in your example 2 for it to work.
> Likewise, if we restrict ourselves to expression trees in which
> each number and each variable may appear only once, we
> would not need node ids in your example 1.

In view of Jan's response I've given up on any idea of defining atomic. The point I (now) want to make is that in particular examples, one can ask whether a proposed decomposition complies with information equivalence, defined in terms of a reasonably defined bijection from an original DB state to a new DB state.

I don't know whether your above observation has any bearing on the question of whether it's reasonable to classify decompositions as good or bad in this way.

It's rather vague, but my intuition tells me it's bad to have propositions that have values whose only purpose is to identify other values recorded in the same DB. This is kind of an Occam's razor argument: The sense in which node ids add complexity seems quantifiable in the increased entropy in the DB. Of course simplicity is in the eye of the beholder - witness the disagreement over the use of autonumber primary keys in another thread. My own feeling is that entropy in the DB is a very important consideration for measuring (lack of) simplicity in the logical layer, and therefore node ids are as bad as unnecessary use of autonumber attributes.

> Here is a very interesting and subtle point: node ids (and ids
> in every other such situation) are not what they might appear
> to be at first blush. They are not an artifice of set theory or
> relation theory. Rather, they are precisely and exactly the
> *identity information* of *structures.*
>
> Consider these two strings:
>
> 1*2+3
> 3*2+1
>
> Are they identical? Certainly not! The first evaluates to 5 and the
> second evaluates to 7. And yet they have exactly the same set
> of operands, the same set of operators. The characters in the
> two strings are identical, and even have the same frequency.
>
> But the structure is different.
>
> I say "structure" and not "order" because we could as well
> consider the strings as expression trees instead, and leave
> order out entirely. (Especially when considering expression
> trees of commutative operators.)
>
> If we do think of them as strings, then they are indeed structures,
> specifically ordered structures or sequences. In relational terms,
> we might write:
>
> { (4, +), (2, *), (5, 3), (1, 1), (3, 2) }
>
> (Here the order of appearance of the parenthesized pairs is
> of course immaterial.)
>
> So what are those things on the left of each pair? They
> are the structure of the expression in question! Are they
> "meaningless identifiers?" Certainly not! They are essential,
> and to remove them is to lose meaning. (Likewise, if we
> reorder the symbols in the original string we lose meaning.)

Yes I agree in that sense they are not meaningless.

However, I still find propositions that use them offensive! One of the nice features of RM is normally that each proposition is an *independent truth* that can be understood by the domain expert. That breaks down when you introduce identifiers that merely represent mathematical values, forcing you to look at other propositions in the DB to find out what it actually being stated.

> Now, in this case, the value of the identifiers of the symbols
> are used by the expression evaluator/parser, but this is
> just a distraction. Consider the evaluator for an expression
> tree expressed as a relation, which does not consider the
> values of the identifiers, and the fact that there is an isomorphism
> between them.

> > Secondly (and this is where the examples are relevant), a valid
> > decomposition must coincide with a defined bijection that maps a DB
> > state in the original schema to a DB state in the new schema. This
> > is where those node identifiers in the first example come to play,
> > because they seem to be at odds with defining such a bijection.
> > Putting it more simply, it seems that the node identifiers aren't
> > functionally dependent on the original DB state. It is for this
> > reason that one may claim that such a decomposition is unreasonable -
> > in the sense of not achieving information equivalence as a set of
> > propositions.
>
> > Unfortunately, it seems that one could be tricky and come up with a
> > bijection that makes the node identifiers functionally dependent on
> > the underlying DB state; by defining some unique ordering on the
> > identifiers (one could then use integers according to ordinal
> > position). I say this is unfortunate because it upsets the proposal
> > for a simple meaning of atomic. However, I wonder whether things can
> > be salvaged at the expense of a complicated definition of atomic, by
> > introducing a constraint on the bijection that it not be pathological,
> > in the sense that addition of information shouldn't be able to cause
> > widespread reassignment of identifiers. The very fact that this can
> > happen points to their arbitrary or meaningless nature, and more
> > specifically to the fact that they identify a sub-value rather than an
> > entity in the UoD.
>
> > Continuing with example 2, note that no further decomposition allowing
> > information equivalence is possible. For example, a person's name is
> > represented as a string domain type, and this is atomic because any
> > attempt at decomposing the string into its individual characters
> > forces the introduction of additional identifiers.
>
> The concept we are running into here is the same as "alpha
> equivalence"
> from lambda calculus.
>
> Consider these two programs:
>
> Program 1:
> let x=5;
> let y=3;
> x+y;
>
> Program 2:
> let a=5;
> let b=3;
> a+b;
>
> Are these programs identical? It depends on what we mean by
> "identical." Certainly they calculate the same value, but certainly
> that's not enough to consider them identical.
>
> If we consider the two programs as strings of symbols, then
> obviously they are different. However clearly they are very closely
> related, and we can formalize this perception, and this formalization
> is called alpha equivalence. Roughly, the two programs are alpha
> equivalent because they have identical structure if we ignore the
> specific choices of variable names. From this perspective, there is
> no significance to the specific choice of "x" or "a" or whatever as
> variable names, but that doesn't mean the names are meaningless,
> because the distinct names encode identity within the structure
> they are embedded in.

Yes. Received on Sat Feb 02 2008 - 10:28:56 CET

Original text of this message