Re: native xml processing vs what Postgres and Oracle offer

From: JOG <jog_at_cs.nott.ac.uk>
Date: Wed, 26 Nov 2008 06:42:43 -0800 (PST)
Message-ID: <62e66b8c-26ed-415a-9a83-49cb22f76d0f_at_r40g2000yqj.googlegroups.com>


On Nov 26, 9:30 am, rp..._at_pcwin518.campus.tue.nl (rpost) wrote:
> patrick..._at_yahoo.com wrote:
> >I remember an interesting read a while ago by a threaded newsreader
> >author and the bottom line is the author first worked with the "in
> >reference to" part of the headers (References:) and then the subject
> >line and posting time, just due to the fact that there were so many
> >newsreaders that just one method wasn't going to cut it. While in
> >theory, using the references header you could rebuild the tree (as the
> >references would accumulate using the replied to articles list of
> >references), in practice usenet is subjected to any number of news
> >clients some being better than others.
>
> Good point; however, there is a specification (NNTP, RFC 977 and 1036)
> of the protocol, which implicitly contains a 'physical design'
> of the data structures used, in the form of requirements
> on message headers.  The misbehaving newsreaders are *broken*.
>
> From a relational perspective, the protocol spec should indeed have
> been preceded by a separate logical design.  In the NNTP design, the
> decision was made to postulate the unique identifiability of messages
> regardless of their contents or other attributes; the alternative
> (which most here generally advocate) is to identify entities based on
> their attributes.

While I am one of those advocates, it would be silly to ignore the efficiency of using surrogate IDs in practical situations (and with the tools we currently have).

However, to the OP: in terms of /the theory/ (this is a theory newsgroup after all) a good design takes a "message entity" and asks what it is exactly that defines its identity (what is the ID a surrogate for?). Note that there is no single "true" answer to this - that's important - because what a "message" actually is can be a whole host of things:

  1. {author, timestamp}: a message as a submission from an author at a certain time. If the content is edited later on it is still viewed as the same message as the original.
  2. {author, timestamp, content}: a message as a piece of text, submitted by someone at a certain time. If the content is edited it is then viewed as a different message to the original.
  3. {author, timestamp, parent}: a message is a position in a thread tree. If it is moved it is viewed as a different message. If this is not desired, you can of course still keep a separate positioning table; position then becomes a normal attribute rather than part of the message's identity.
  4. {author, timestamp, content, parent}: a message is a piece of content at some position in a thread.

While we may call all of the entities that these identities produce "messages", they are subtly different things (perhaps 1 might be specialized as a "post", while 3 is a "response", etc.). It is vital to pick, at design time, the one that suits the task in hand. If you pick the wrong one it will bite you on the ass later on.
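
To make the difference concrete, here is a rough Python sketch of the four candidate identities. Everything in it (the Message class, the IDENTITIES table, the identity() and same_message() helpers, and the sample values) is made up for illustration; it only shows how the choice of key decides whether an edited or moved message still counts as "the same" message.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Message:
    # Hypothetical attribute set, taken from the four identities listed above.
    author: str
    timestamp: datetime
    content: str
    parent: Optional[Tuple] = None  # key of the parent message, None for a thread root


# Candidate identities 1-4, expressed as tuples of attribute names.
IDENTITIES = {
    1: ("author", "timestamp"),
    2: ("author", "timestamp", "content"),
    3: ("author", "timestamp", "parent"),
    4: ("author", "timestamp", "content", "parent"),
}


def identity(msg: Message, choice: int) -> Tuple:
    """Project a message onto the attributes that define its identity."""
    return tuple(getattr(msg, name) for name in IDENTITIES[choice])


def same_message(a: Message, b: Message, choice: int) -> bool:
    """Two messages are 'the same' iff their identity projections match."""
    return identity(a, choice) == identity(b, choice)


# Sample values are illustrative only.
original = Message(
    author="jim",
    timestamp=datetime(2008, 11, 26, 14, 42, 43),
    content="first draft",
    parent=("reinier", datetime(2008, 11, 26, 9, 30)),
)
edited = replace(original, content="second draft")  # content changed later
moved = replace(original, parent=None)               # repositioned in the thread

for choice in IDENTITIES:
    print(choice,
          "edited is same:", same_message(original, edited, choice),
          "moved is same:", same_message(original, moved, choice))
```

Under identities 1 and 3 the edited copy is still the original message, under 2 and 4 it is not; likewise the moved copy survives under 1 and 2 but not under 3 and 4. Same data, four different notions of "message".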

Regards, Jim.

>
> I never analysed large sets of USENET messages with this in mind, but
> it seems pretty clear to me that this alternative would indeed have
> been superior.  E.g. assuming we can only post one message to an NNTP
> server at a time (as RFC 977 assumes), a message can be identified by
> a server identification (e.g. hostname) plus a timestamp.  Requiring
> the presence and correctness of these two attributes on each message
> would have been a better decision, as far as I can see now,
> than requiring the presence and uniqueness of a message ID.
>
> It would have created the problems of having to specify the permissible
> format and exact meaning of these attributes.  E.g. may the server use
> its own local clock and its own date/time format?  If it may, may it
> also reset its clock at any point in time?  I suppose IDs are so popular
> because they allow this kind of detail to be avoided.
>
> --
> Reinier
Received on Wed Nov 26 2008 - 15:42:43 CET
