Re: Towards a definition of atomic

From: David BL <davidbl_at_iinet.net.au>
Date: Mon, 4 Feb 2008 17:38:49 -0800 (PST)
Message-ID: <4a2c40f7-b8e7-45ec-b588-be05a5e75e35_at_m34g2000hsf.googlegroups.com>


On Feb 4, 11:29 pm, Jan Hidders <hidd..._at_gmail.com> wrote:
> On 4 feb, 03:09, David BL <davi..._at_iinet.net.au> wrote:
>
> > On Feb 3, 11:41 pm, Jan Hidders <hidd..._at_gmail.com> wrote:
>
> > > On 2 feb, 02:45, David BL <davi..._at_iinet.net.au> wrote:
> > > > On Feb 2, 3:42 am, Jan Hidders <hidd..._at_gmail.com> wrote:
> > > > > On 1 feb, 14:30, David BL <davi..._at_iinet.net.au> wrote:
>
> > > > > > AFAIK the conventional wisdom is that no absolute definition of atomic
> > > > > > exists for domain types.
>
> > > > > On the contrary, we have too many of them. :-) But I don't think it is
> > > > > wise to go looking for a definition if it is not clear to you what you
> > > > > want to do with that definition. If you don't know where you are
> > > > > going, any road will take you there.
> > > > > > Throwing caution to the wind, in this post I
> > > > > > wish to conjecture a definition of atomic that, perhaps with some more
> > > > > > effort at its formalisation, can provide some absolute meaning for a
> > > > > > given attribute within a given RDB schema.
>
> > > > > In view of such blatant carelessness I can only respond by not heeding
> > > > > my own warning. :-)
>
> > > > > The usual intuition is that something is not atomic if it is
> > > > > decomposable into smaller parts. In this case that would mean parts
> > > > > with less information. For the case of decomposition into two
> > > > > components this leads to something like:
>
> > > > > DEFINITION: A domain D is said to be decomposable if there are domains
> > > > > D1 and D2 and functions f1 : D -> D1 and f2 : D -> D2 such that (1) f1
> > > > > and f2 are not injective and (2) <f1,f2> is injective where <f1,f2> :
> > > > > D -> (D1 x D2) is defined such that <f1,f2>(x) = (f1(x), f2(x)).
>
> > > > > Note that (1) says that each individual decomposition function loses
> > > > > information and (2) says that together they don't. However, we can
> > > > > then make the following observation:
>
> > > > > THEOREM: A domain is decomposable iff it contains more than 2 values.
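
[Jumping in here with a small sketch of my own: the construction behind
the "if" direction of that theorem is easy to spell out for a
three-value domain. The code below is Python and all the names are
mine, purely for illustration; it builds such an f1/f2 pair and checks
conditions (1) and (2).]

    # Domain with three values; take D1 = D2 = {0, 1}.
    D = ['a', 'b', 'c']
    f1 = {'a': 0, 'b': 0, 'c': 1}   # not injective: 'a' and 'b' collide
    f2 = {'a': 0, 'b': 1, 'c': 1}   # not injective: 'b' and 'c' collide

    def injective(f, dom):
        # A map is injective iff no two domain values share an image.
        return len({f[x] for x in dom}) == len(dom)

    pair = {x: (f1[x], f2[x]) for x in D}   # <f1,f2> : D -> D1 x D2

    assert not injective(f1, D)   # condition (1): f1 loses information
    assert not injective(f2, D)   # condition (1): f2 loses information
    assert injective(pair, D)     # condition (2): together they don't

[For a two-value domain the three assertions can't all hold: any
non-injective f1 and f2 must each collapse the two values, and then so
does the pair. That's the "only if" direction.]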
>
> > > > I like your approach to defining a *non-trivial* decomposition, by
> > > > requiring that f1,f2 not be injective.
>
> > > > However, I don't think your definition captures the right idea. For
> > > > instance, my example 2 decomposes a domain expressed as a set of
> > > > person names to a single name. In that case what is D1,D2?
>
> > > The definition is aimed at only establishing that something is
> > > decomposable, not exactly how you should decompose it.
>
> > Yes, that makes sense.
>
> Thanks.
>
> > > For a set of
> > > names it clearly holds that it is decomposable and you can pick many
> > > different D1s and D2s that show this. For example, you can split the
> > > set into a set of strings before "Laura" and a set after "Laura".
> > > Whether that makes sense or not, is another question because you are
> > > looking for an absolute definition. It might, and that is enough. If
> > > you want to say more about how you want to decompose, you could define
> > > something like the following:
>
> > > DEFINITION: A domain D is said to be set-decomposable if there is a
> > > domain D' and an injective function f : D -> P(D'), where P is the
> > > power-set operator, such that the number of values in D' is smaller
> > > than the number of values of D minus 1.
>
> > > I think you will understand why D' has to be smaller than D. The
> > > "minus 1" may be a bit surprising, but otherwise the notion becomes
> > > trivial in the sense that every domain will be set-decomposable: you
> > > can map the domain {v_1, v_2, ... } to P({v_2, ...}) by mapping v_1 to
> > > the empty set and for i > 1 mapping v_i to { v_i }.
>
> > > Guess what?
>
> > > THEOREM A domain is set-decomposable iff it contains more than 3
> > > values.
>
> > > So set-decomposability implies decomposability, but not the other way
> > > around.
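
[Again a little sketch of my own, in Python with names of my choosing:
a set-decomposition of a four-value domain, the smallest case the
theorem allows. Here |D'| = 2 < |D| - 1 = 3, and the injection into
P(D') is just the obvious "binary" encoding.]

    from itertools import chain, combinations

    D = ['a', 'b', 'c', 'd']   # |D| = 4
    Dp = ['x', 'y']            # |D'| = 2 < |D| - 1, as the definition requires

    # f : D -> P(D'), injective: each value of D gets a distinct subset of D'.
    f = {'a': frozenset(),
         'b': frozenset({'x'}),
         'c': frozenset({'y'}),
         'd': frozenset({'x', 'y'})}

    # Enumerate P(D') and check that f really lands in it and is injective.
    powerset = set(chain.from_iterable(
        (frozenset(c) for c in combinations(Dp, r)) for r in range(len(Dp) + 1)))
    assert all(f[v] in powerset for v in D)
    assert len({f[v] for v in D}) == len(D)

[With only three values there is no room: D' could have at most one
element, so P(D') has at most two elements and no injection from a
three-value domain exists, which matches the theorem.]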
>
> > What about when D is an infinite domain? Talking about the number of
> > values in D minus 1 doesn't seem applicable.
>
> Correct, I was a bit sloppy there. Should be more like "the
> cardinality of D' should be smaller than the cardinality of D minus
> one of its elements".
>
> > In example 2, the values to be decomposed were sets of person names.
> > So I presume the domain D to be decomposed is the power set over the
> > set of strings, D' is the set of strings and f can simply be the
> > identity. Is that right? Now D and D' both have infinite
> > cardinality. However, it seems clear that D' should be deemed smaller
> > in some sense. In fact Cantor proved there is no surjection from a
> > set onto its power set, which establishes that D' has a smaller
> > cardinality than D; indeed, only D' is countably infinite.
>
> Exactly.
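
The diagonal construction behind Cantor's proof is fun to play with, by
the way. Here is a toy Python version (my own names, just an
illustration): given any candidate map g : S -> P(S), it produces a
subset of S that g provably misses, so g cannot be surjective.

    def diagonal_miss(S, g):
        # Cantor's diagonal set: the elements that are not members of
        # their own image. For every x in S, x is in the result iff
        # x is not in g(x), so the result differs from g(x) at x.
        return frozenset(x for x in S if x not in g(x))

    S = {1, 2, 3}
    g = {1: frozenset({1, 2}), 2: frozenset(), 3: frozenset({3})}
    diag = diagonal_miss(S, lambda x: g[x])
    assert all(diag != g[x] for x in S)   # g misses diag, so it is not onto P(S)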
>
> > What do you think of the idea to use existence of a bijection as a
> > definition of information equivalence between databases (expressed
> > with alternative DB schemas)?
>
> It depends a bit on what intuitive concept of equivalence you want to
> formalize. Under this definition, for example, all schemas that have
> countably infinitely many instances are equivalent. That is probably too
> crude.

Way too crude! Yes, I can see the bijection needs to be constrained somehow.

Presumably such a bijection exists if and only if the sets of instances for the two schemas have the same cardinality. I gather that is in fact the definition of equal cardinality.
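
To see just how crude, here is a quick sketch (Python, my own function
names, just for illustration) of the classic Cantor pairing bijection
between N and N x N. If one schema's instances can be coded as naturals
and another's as pairs of naturals, the raw-bijection definition calls
the two schemas "equivalent" even though they plainly say different
things.

    def pair(i, j):
        # Cantor pairing: a bijection from N x N onto N.
        return (i + j) * (i + j + 1) // 2 + j

    def unpair(n):
        # Inverse of the Cantor pairing.
        w = int(((8 * n + 1) ** 0.5 - 1) // 2)
        t = w * (w + 1) // 2
        j = n - t
        return w - j, j

    # Round-trips on a small grid: every pair-coded instance corresponds to
    # exactly one natural-coded instance and vice versa.
    assert all(unpair(pair(i, j)) == (i, j)
               for i in range(50) for j in range(50))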

> This question has been addressed a number of times in the
> literature, and is by no means easy or uncontroversial. Some
> literature I can recommend:
>
> Hull, R. and Yap, C.K. The Format Model: A Theory of Database
> Organization. Proceedings of the 1st ACM Symposium on Principles of
> Database Systems, ACM, 1982, pp. 205-211.
>
> R. Hull. Relative Information Capacity of Simple Relational Database
> Schemata. SIAM Journal on Computing, 15(3), 1986.
>
> Also interesting:
>
> R. J. Miller, Y. E. Ioannidis, and R. Ramakrishnan. The Use of
> Information Capacity in Schema Integration and Translation. In Int.
> Conference on Very Large Data Bases, pages 120-133, 1993.
> http://citeseer.ist.psu.edu/miller93use.html
>
> Miller, R. J., Ioannidis, Y. E., and Ramakrishnan, R. 1994. Schema
> equivalence in heterogeneous systems: bridging theory and practice.
> Inf. Syst. 19, 1 (Jan. 1994), 3-31.
> http://dx.doi.org/10.1016/0306-4379(94)90024-8
>
> A. H. M. ter Hofstede, H. A. Proper, and Th. P. van der Weide. A Note
> on Schema Equivalence. Technical report, 1992.
> citeseer.ist.psu.edu/407910.html
>
> Henderik Alex Proper and Theo P. van der Weide. Schema Equivalence as
> a Counting Problem. In OTM Workshops, pages 730-739, 2005.
> http://dx.doi.org/10.1007/11575863_92
>
> No, I don't expect you to read all these, I just want to give an idea
> of the amount of thought that has already been spent on this. :-) The
> work by Richard Hull et al. is warmly recommended though. My own
> position would be roughly what you find in that work, so the function
> should in addition be computable and generic. Hull and Yap's work is
> IMO the best in this area. The work by Arthur ter Hofstede, Erik
> Proper and Theo van der Weide is nice because it shows that the
> problem is essentially the same for ER-like data models such as ORM.
> Erik also wrote an earlier paper on this with Terry Halpin himself.

Thanks for the references.

> > Do you think the following is a valid principle? Identifiers that
> > stand for mathematical values increase the entropy of the database and
> > should be regarded as representing "noise", and indicate that
> > decomposition of domain types has been taken too far.
>
> I don't think it's that simple. What definition of entropy did you
> have in mind? The ones I know that were introduced for normalization
> theory do not imply that identifiers increase entropy. What is a good
> model or not depends on many factors. It might depend on the users and
> what they want to do with the data. If it is intuitive for them, who
> are we to say otherwise? It might also depend on the DBMS and how it
> provides physical / data independence. In theory this should not
> matter, but in practice it probably does.
Received on Tue Feb 05 2008 - 02:38:49 CET
