Re: Towards a definition of atomic

From: Jan Hidders <hidders_at_gmail.com>
Date: Mon, 4 Feb 2008 06:29:27 -0800 (PST)
Message-ID: <87568004-1626-4613-8cc0-aed4a06abf1d_at_i72g2000hsd.googlegroups.com>


On 4 feb, 03:09, David BL <davi..._at_iinet.net.au> wrote:
> On Feb 3, 11:41 pm, Jan Hidders <hidd..._at_gmail.com> wrote:
>
>
>
> > On 2 feb, 02:45, David BL <davi..._at_iinet.net.au> wrote:
> > > On Feb 2, 3:42 am, Jan Hidders <hidd..._at_gmail.com> wrote:
> > > > On 1 feb, 14:30, David BL <davi..._at_iinet.net.au> wrote:
>
> > > > > AFAIK the conventional wisdom is that no absolute definition of atomic
> > > > > exists for domain types.
>
> > > > On the contrary, we have too many of them. :-) But I don't think it is
> > > > wise to go looking for a definition if it is not clear to you what you
> > > > want to do with that definition. If you don't know where you are
> > > > going, any road will take you there.
> > > > >  Throwing caution to the wind, in this post I
> > > > > wish to conjecture a definition of atomic that, perhaps with some more
> > > > > effort at its formalisation, can provide some absolute meaning for a
> > > > > given attribute within a given RDB schema.
>
> > > > In view of such blatant carelessness I can only respond by not heeding
> > > > my own warning. :-)
>
> > > > The usual intuition is that something is not atomic if it is
> > > > decomposable into smaller parts. In this case that would mean parts
> > > > with less information. For the case of decomposition into two
> > > > components this leads to something like:
>
> > > > DEFINITION: A domain D is said to be decomposable if there are domains
> > > > D1 and D2 and functions f1 : D -> D1 and f2 : D -> D2 such that (1) f1
> > > > and f2 are not injective and (2) <f1,f2> is injective where <f1,f2> :
> > > > D -> (D1 x D2) is defined such that <f1,f2>(x) = (f1(x), f2(x)).
>
> > > > Note that (1) says that each individual decomposition function loses
> > > > information and (2) says that together they don't. However, we can
> > > > then make the following observation:
>
> > > > THEOREM: A domain is decomposable iff it contains more than 2 values.
>
> > > I like your approach to defining a *non-trivial* decomposition, by
> > > requiring that f1,f2 not be injective.
>
> > > However, I don't think your definition captures the right idea.  For
> > > instance, my example 2 decomposes a domain expressed as a set of
> > > person names to a single name.  In that case what is D1,D2?
>
> > The definition is aimed at only establishing that something is
> > decomposable, not exactly how you should decompose it.
>
> Yes, that makes sense.

Thanks.
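
Just to make the earlier definition concrete: below is a small sketch (Python, with a made-up four-value domain, so the element names are arbitrary) of one way to witness decomposability, in the spirit of the "more than 2 values" theorem quoted above. One component merges a with b, the other merges a with c; each loses information on its own, but together they do not.

# A toy domain with more than 2 values; the element names are made up.
D = {"alice", "bob", "carol", "dave"}
a, b, c = "alice", "bob", "carol"

def f1(x):
    # merges a into b, so f1 is not injective; think of D1 as D - {a}
    return b if x == a else x

def f2(x):
    # merges a into c, so f2 is not injective; think of D2 as D - {a}
    return c if x == a else x

def pair(x):
    # <f1,f2> : D -> D1 x D2
    return (f1(x), f2(x))

def injective(f, dom):
    return len({f(x) for x in dom}) == len(dom)

assert not injective(f1, D)   # (1) f1 alone loses information
assert not injective(f2, D)   # (1) f2 alone loses information
assert injective(pair, D)     # (2) together they do not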

> > For a set of
> > names it clearly holds that it is decomposable and you can pick many
> > different D1s and D2s that show this.  For example, you can split the
> > set into a set of strings before "Laura" and a set after "Laura".
> > Whether that makes sense or not, is another question because you are
> > looking for an absolute definition. It might, and that is enough. If
> > you want to say more about how you want to decompose, you could define
> > something like the following:
>
> > DEFINITION: A domain D is said to be set-decomposable if there is a
> > domain D' and an injective function f : D -> P(D'), where P is the
> > power-set operator, such that the number of values in D' is smaller
> > than the number of values of D minus 1.
>
> > I think you will understand why D' has to be smaller than D. The
> > "minus 1" may be a bit surprising, but otherwise the notion becomes
> > trivial in the sense that every domain will be set-decomposable: you
> > can map the domain {v_1, v_2, ... } to P({v_2, ...}) by mapping v_1 to
> > the empty set and for i > 1 mapping v_i to { v_i }.
>
> > Guess what?
>
> > THEOREM A domain is set-decomposable iff it contains more than 3
> > values.
>
> > So set-decomposability implies decomposability, but not the other way
> > around.
>
> What about when D is an infinite domain?  Talking about number of
> values in D minus 1 doesn't seem applicable.

Correct, I was a bit sloppy there. It should be more like "the cardinality of D' should be smaller than the cardinality of D minus one of its elements".
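
For the finite case you can even check the "more than 3 values" claim mechanically. The sketch below (Python) relies on the observation that, for finite sets, an injection f : D -> P(D') exists exactly when 2**|D'| >= |D|, so it suffices to look for a size k = |D'| with k < |D| - 1 and 2**k >= |D|.

def set_decomposable(n):
    """n = number of values in a finite domain D.  Look for a size
    k = |D'| with 1 <= k < n - 1 such that an injection D -> P(D')
    exists, i.e. 2**k >= n."""
    return any(2 ** k >= n for k in range(1, n - 1))

for n in range(1, 9):
    print(n, set_decomposable(n))
# prints False for n <= 3 and True from n = 4 on, matching the theorem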

> In example 2, the values to be decomposed were sets of person names.
> So I presume the domain D to be decomposed is the power set over the
> set of strings, D' is the set of strings and f can simply be the
> identity.  Is that right?  Now D and D' both have infinite
> cardinality.  However it seems clear that D' should be deemed smaller
> in some sense.  In fact Cantor proved there is no surjection from a
> set to its power set, which establishes the result that D' has a
> smaller cardinality than D.  In fact only D' is countably infinite.

Exactly.
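
As an aside, Cantor's argument can even be verified by brute force for a tiny finite set: the "diagonal" set {x | x not in f(x)} is missed by every f : S -> P(S). A small Python sketch (the 3-element S is arbitrary):

from itertools import combinations, product

S = [0, 1, 2]                                   # any small finite set will do
subsets = [frozenset(c) for r in range(len(S) + 1)
           for c in combinations(S, r)]         # P(S), all 8 subsets

for image in product(subsets, repeat=len(S)):   # every function f : S -> P(S)
    f = dict(zip(S, image))
    diagonal = frozenset(x for x in S if x not in f[x])
    assert diagonal not in set(f.values())      # the diagonal set is missed

print("no f : S -> P(S) is onto for |S| =", len(S))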

> What do you think of the idea to use existence of a bijection as a
> definition of information equivalence between databases (expressed
> with alternative DB schema)?

It depends a bit on what intuitive concept of equivalence you want to formalize. Under this definition, for example, all schemas that have countably infinitely many instances are equivalent (there is a small sketch of this below, after the references). That is probably too crude. This question has been addressed a number of times in the literature, and it is by no means easy or uncontroversial. Some literature I can recommend:

Hull, R. and Yap, C.K. The Format Model: A Theory of Database Organization. Proceedings of the 1st ACM Symposium on Principles of Database Systems, ACM, 1982, pp. 205-211.

R. Hull. Relative Information Capacity of Simple Relational Database Schemata. SIAM Journal on Computing, 15(3), 1986.

Also interesting:

R. J. Miller, Y. E. Ioannidis, and R. Ramakrishnan. The Use of Information Capacity in Schema Integration and Translation. In Proceedings of the International Conference on Very Large Data Bases, pp. 120-133, 1993. http://citeseer.ist.psu.edu/miller93use.html

Miller, R. J., Ioannidis, Y. E., and Ramakrishnan, R. Schema Equivalence in Heterogeneous Systems: Bridging Theory and Practice. Information Systems 19(1), Jan. 1994, pp. 3-31. http://dx.doi.org/10.1016/0306-4379(94)90024-8

ter Hofstede, A. H. M., Proper, H. A., and van der Weide, Th. P. A Note on Schema Equivalence. Technical report, 1992. citeseer.ist.psu.edu/407910.html

Proper, H. A. and van der Weide, Th. P. Schema Equivalence as a Counting Problem. In OTM Workshops, 2005, pp. 730-739. http://dx.doi.org/10.1007/11575863_92
No, I don't expect you to read all these; I just want to give an idea of the amount of thought that has already been spent on this. :-) The work by Richard Hull et al. is warmly recommended, though. My own position would be roughly what you find in that work, so the function should in addition be computable and generic. Hull and Yap's work is IMO the best in this area. The work by Arthur ter Hofstede, Erik Proper and Theo van der Weide is nice because it shows that the problem is essentially the same for ER-like data models such as ORM. Erik also wrote an earlier paper on this with Terry Halpin himself.
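
To see concretely why the bare existence of a bijection is too crude, here is a small sketch (Python, with a toy schema of my own choosing whose instances are the finite sets of naturals). Its instance set is countably infinite, so it is in bijection with N; composing such encodings yields a bijection between the instance sets of any two schemas with countably infinitely many instances, no matter how different they are, yet the map respects no structure at all:

def encode(instance):
    """finite set of naturals -> natural (sum of 2**i over the set)"""
    return sum(2 ** i for i in instance)

def decode(n):
    """natural -> finite set of naturals (positions of the set bits)"""
    return frozenset(i for i in range(n.bit_length()) if n >> i & 1)

assert decode(encode({0, 3, 5})) == {0, 3, 5}
assert encode(decode(41)) == 41

# Such a bijection is not generic (it does not commute with permutations
# of the domain values), which is one reason the work cited above asks
# for more than mere existence of a bijection, e.g. that the mapping be
# computable and generic.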

> Do you think the following is a valid principle?:   Identifiers that
> stand for mathematical values increase the entropy of the database and
> should be regarded as representing "noise", and indicate that
> decomposition of domain types has been taken too far.

I don't think it's that simple. What definition of entropy did you have in mind? The ones I know of that were introduced for normalization theory do not imply that identifiers increase entropy. Whether a model is good or not depends on many factors. It might depend on the users and what they want to do with the data. If it is intuitive for them, who are we to say otherwise? It might also depend on the DBMS and how it provides physical data independence. In theory this should not matter, but in practice it probably does.

  • Jan Hidders
Received on Mon Feb 04 2008 - 15:29:27 CET
