Re: native xml processing vs what Postgres and Oracle offer

From: rpost <rpost_at_pcwin518.campus.tue.nl>
Date: Tue, 9 Dec 2008 13:53:05 +0000 (UTC)
Message-ID: <ghlt81$1pdi$1_at_mud.stack.nl>


JOG wrote:

>On Nov 26, 9:30 am, rp..._at_pcwin518.campus.tue.nl (rpost) wrote:
[...]
>> From a relational perspective, the protocol spec should indeed have
>> been preceded by a separate logical design.  In the NNTP design, the
>> decision was made to postulate the unique identifiability of messages
>> regardless of their contents or other attributes; the alternative
>> (which most here generally advocate) is to identify entities based on
>> their attributes.
>
>While I am one of those advocates, it would be silly to ignore the
>efficiency of using surrogate IDs in practical situations (and with
>the tools we currently have).
>
>However, to the OP: in terms of /the theory/ (this is a theory
>newsgroup after all) a good design takes a "message entity" and asks
>what it is exactly that defines its identity (what is the ID a
>surrogate for?). Note that there is no single "true" answer to this -
>that's important - because what a "message" actually is can be a whole
>host of things:
>
>1) {author, timestamp}: a message as a submission from an author at a
>certain time. If the content is edited later on it is still viewed as
>the same message as the original.

For USENET this would suffice: it does not allow messages to be edited. It does allow them to be superseded, but that doesn't work well in practice. For a web-based forum it would also work. The issue for USENET is to what extent the <author, timestamp> pair that is actually used can be guaranteed to be accurate, or at least unique.

>2) {author, timestamp, content}: a message as a piece of text,
>submitted by someone at a certain time. If the content is edited it is
>then viewed as a different message to the original.

This is only necessary if the same author can post multiple messages at the same time, which as far as I can see NNTP doesn't allow, and neither does web forum software. So even if messages can be edited, option 1 is the better choice.

>3) {author, timestamp, parent}: a message is a position in a thread
>tree. If it is moved it is viewed as a different message. If this is
>not desired, you can still have a separate positioning table of
>course. Position just becomes a normal attribute however, and not part
>of the message's identity.

The same remark applies: even if messages can be moved (which some web forum software supports), author and timestamp suffice for identification, if they are reliable in the first place. If they aren't, then adding the parent won't fix it unless the posting protocol has some really unusual properties.

There is also a fundamental objection: a parent is itself a message.
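
One way to read this objection, as a rough sketch only: if the parent is part of the key and the parent is itself identified by its own key, the key is defined recursively and grows with thread depth. The `Message` class, the helper functions, and the sample values below are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Message:
    author: str
    timestamp: str
    parent: Optional["Message"] = None  # None for a thread root


def key_with_parent(m: Message) -> Tuple:
    """Option 3 key: (author, timestamp, parent), where the parent
    component is itself the parent's key."""
    parent_key = key_with_parent(m.parent) if m.parent else None
    return (m.author, m.timestamp, parent_key)


def key_flat(m: Message) -> Tuple[str, str]:
    """Option 1 key: (author, timestamp) only."""
    return (m.author, m.timestamp)


def key_depth(key: Optional[Tuple]) -> int:
    """Count the nesting levels of a key built by key_with_parent."""
    depth = 0
    while key is not None:
        depth += 1
        key = key[2]
    return depth


# Hypothetical three-message thread.
root = Message("jog", "2008-11-26T09:30:00Z")
reply = Message("rpost", "2008-12-09T13:53:05Z", parent=root)
reply2 = Message("jog", "2008-12-09T15:00:00Z", parent=reply)
```

Under this reading, the option 3 key of `reply2` embeds the keys of every ancestor, while the option 1 key stays at two components no matter how deep the thread is.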

>4) {author, timestamp, content, parent}: a message is a piece of
>content at some position in a thread.

The same objections apply.

In short, I think the main issue in picking attributes and keys here is not in determining how the data is to be used, but in determining realistic commitments from the supporting software on the accuracy of the attribute values supplied. For a web forum, <author, timestamp> seems a good choice of key, even if they aren't always accurate, as long as the server never accepts multiple messages with the same author and timestamp.
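
As an illustration of that last condition, here is a minimal sketch of a forum store that uses <author, timestamp> as its key and refuses a second message with the same pair. It uses Python's standard sqlite3 module; the table layout, column names, and the `open_forum` and `post` helpers are assumptions made for the example, not taken from any real forum software.

```python
import sqlite3


def open_forum() -> sqlite3.Connection:
    """Create an in-memory store keyed on (author, timestamp)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE message (
            author           TEXT NOT NULL,
            timestamp        TEXT NOT NULL,
            content          TEXT NOT NULL,
            parent_author    TEXT,
            parent_timestamp TEXT,
            PRIMARY KEY (author, timestamp),
            FOREIGN KEY (parent_author, parent_timestamp)
                REFERENCES message (author, timestamp)
        )
        """
    )
    return conn


def post(conn, author, timestamp, content, parent=None) -> bool:
    """Insert a message; return False if the key is already taken
    (or if the given parent does not exist)."""
    p_author, p_timestamp = parent if parent else (None, None)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO message VALUES (?, ?, ?, ?, ?)",
                (author, timestamp, content, p_author, p_timestamp),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Content and parent are ordinary attributes here, so editing or moving a message would be an UPDATE that leaves its identity untouched. Whether the timestamps are accurate is outside what the schema can guarantee; it only enforces that the server never stores two messages with the same author and timestamp.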

-- 
Reinier
Received on Tue Dec 09 2008 - 14:53:05 CET
