Re: Modeling Data for XML instead of SQL-DBMS

From: mAsterdam <mAsterdam_at_vrijdag.org>
Date: Sun, 29 Oct 2006 13:29:26 +0100
Message-ID: <45449ea6$0$324$e4fe514c_at_news.xs4all.nl>


dawn wrote:
> mAsterdam wrote:

>> dawn wrote:
>>> mAsterdam wrote:
>>>> dawn wrote:
>>>>> mAsterdam wrote:
>>>>>> Hierarchy (trees of nodes),
>>>>>> is what IMS, documents, file systems, XML, Lotus Domino
>>>>>> and LDAP as data containers have in common.

>>>>> I think of the XML model as a di-graph where there are trees on the
>>>>> nodes, rather than a tree of nodes (it is only an individual document
>>>>> without xlink or external code joining them).  Look at each web page
>>>>> that is written with xhtml.  It is a tree, but there are links (<a
>>>>> href...>) from one page to another that form it into a di-graph.
>>>>> MV/Pick files are somewhat similar.

>>>> A unix filesystem is - at shell level - a tree.

>>> Yes, while the DBMS tools of which I am aware do not employ a tree at
>>> the "top level" in their model.  A set of XML documents may include
>>> links (one way or another) within or between them (think of XHTML).
>>> Given a set of XML documents, there is no single "node" that is "the
>>> root", even though each document has a root and can be seen as a tree.

>> You are still/again mixing data and implementation :-(

> I'll believe you could be right on that point and add that if that is
> the case, then it doesn't seem to be a huge problem in practice to do
> so. I will add that I still think I can work with designing any
> portion of a web without caring about whether it is implemented using
> an RDBMS or a distributed network with http...
>

>> I'll try to spell it, though we already did go into this in
>> quite some detail a year or so ago.

>
> Yes, I recall, so we might be on a thread where we need to agree to
> disagree on some aspect of this.

Agree to disagree on what, exactly?

I re-quoted the part you took apart with interjections:

>> The links are not (logical/user/real) data.
>> They are not. They are really really really not data. 
>> They do not even reference data. They are locators, 
>> pointing to a location, also called pointers.
>> Because they are not data, they are not part 
>> of the whole of the data.

>> Aren't they structural elements then?
>> Yes, they are part of the web (no tree) of documents and parts of
>> documents. But they are not data-elements.

<With Interjections>

>> The links are not (logical/user/real) data

>
> They are identifiers.

Links identify web locations, (specific spots in) documents. Not (logical/user/real) data. They are part of something other than the universe of discourse of the data-system. IOW they are not in user-space. Links are part of the implementation. Do you agree?

> In DBMS's that have a similar data model, the links are in the schema
> (more like the XML link) rather than associated with data within an
> individual tuple (as with <a href...> )

What do you mean?
DBMS's having a similar data model to what? It is not clear to me what you are saying here.

> I think of these other specified links (e.g. in xml schema, although I
> might be too ignorant in that area) as metadata, whether specified in
> some separate metadata repository or specified in a metadata portion of
> the database itself (as rules are).

That may be the problem.
Metadata is about data.
Links are (*) about locations, not about data.

(*) You may notice that I leave out the word data here. Insofar as they are, they are not real/user/logical data, but implementation (data). It could all be rephrased in terms of two spaces with data: user space, and implementation-space but my guess is for now that would make it harder to discuss it, not easier. When really implementing I think it is unavoidable.

> For the purpose of identifying the
> mathematical form of the "data model," I don't care how these links are
> specified.

How these links are specified is not an issue.

The issue is which system they are part of. Links are not part of the "data model"
(quoted or not, and whether the mathematical form or some other form, it does not matter).

The links are not (logical/user/real) data.

> In SQL, similar metadata would be in the JOIN clauses of
> SELECTS in CREATE VIEW statements.

No. What is there is about data content, not about locations.

>> (except in the meta case
>> of-course, e.g. content-management systems - but let's keep
>> it simple).

>
> Yes, we can ignore content mgmt for this discussion.
>
>> The links are not (logical/user/real) data.
>> They are not. They are really really really not data.

>
> I look at the abstract, logical model and you seem to be looking at the
> implementation.

I believe you think you are looking at an abstract, logical model.

I think you are making a mistake, right there.

What you /are/ looking at (I can tell from your remarks) is a mix of the logical data model and implementation, to the point where you mention one and draw conclusions about the other. I hope you do agree that that is wrong. That's why I am emphasizing the implementation part of what you are looking at.

> Because the XML-web (XHTML, XML accessed via http)
> forms a highly-distributed data repository, I'll grant you your
> definitions regarding it.

What is that supposed to mean?

> But a very analogous concept is in standard
> DBMS tools (non-SQL-based) where the link specifications are metadata.

I can't tell or check.
Your remarks up to now make me take
this one with a grain of salt.

>> They do not even
>> reference data.

>
> I'm OK with you saying it this way related to html pages and a href's.
> When it comes to DBMS tools that have a somewhat similar data model,
> the link definitions specify foreign keys. I don't know if that fits
> your definition of referencing data or not.

You keep getting back to definitions.
Do links reference data (in your definitions)? If so, please show that.

>> They are locators, pointing to a location,

>
> I recall discussions about pointers and the general opinion I was left
> with is that foreign key specifications are not typically called
> pointers.

Is there a different opinion? Which one?

> The term "pointer" is more often used to store memory
> locations and "stuff" at a lower level than the metadata.

Lower level - again: on which stairway?

> Working strictly with HTML pages and the <a href ...> links, the link
> value is the foreign key value for the "tuple" (HTML document) where
> you can get more information about the linked from value.

This is about the content management system you agreed not to talk about, not about real/user/logical data.

> If that were
> a "pointer" using what I was understanding as the term, then it would
> surely at least specify a particular piece of hardware,

Great, at least /that/ implementation dependency is out with this type of locator.

Imagine discussing a weekly report:
Q: How many candybars did we sell?
A1: It is under sales, candybars: 504
A2: It is on page 4, line 6: 504

Both feel like navigating and pointing,
both give the answer, but there is a difference. A1 is valid whatever media we are using. Not A2. A2 is implementation dependent.
Not hardware dependent, but implementation dependent nevertheless.
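
To make the difference between A1 and A2 concrete, here is a minimal sketch in Python. The report contents (other than the 504 candybars), the lollipop figure, the header lines and all function names are made up for illustration; the only point is that the same facts can be laid out in more than one way.

```python
# Hypothetical weekly report: the facts themselves, independent of any layout.
report_facts = {("sales", "candybars"): 504, ("sales", "lollipops"): 120}


def render(facts, header_lines):
    """Lay the facts out as lines of text; header_lines is purely a layout choice."""
    lines = ["(header)"] * header_lines
    for (section, item), value in sorted(facts.items()):
        lines.append(f"{section}, {item}: {value}")
    return lines


def answer_a1(facts):
    # A1: address the fact by what it is about.
    return facts[("sales", "candybars")]


def answer_a2(lines, line_no):
    # A2: address the fact by where it happens to be printed (1-based line number).
    return lines[line_no - 1]


layout_old = render(report_facts, header_lines=5)
layout_new = render(report_facts, header_lines=4)  # same facts, one header line fewer

print(answer_a1(report_facts))   # 504, whatever the layout
print(answer_a2(layout_old, 6))  # sales, candybars: 504
print(answer_a2(layout_new, 6))  # sales, lollipops: 120
```

The A1-style lookup never sees the layout, so it cannot be broken by it. The A2-style lookup gives a different answer as soon as one header line is dropped, although nothing about the sales changed.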

The hardware answer A3 would be: At .254 millimeters from the start of the report, 5 cm from the top, 12 to the right on the flip side there are some ink-spots in the shape of 504. But first the question would require some serious translation.

Good to have A3 gone, no?
Now let's get back to getting A2 out.

> where a URL

Uniform Resource Locator
(http://www.officeport.com/wwwintro/urldefined.htm)

> does not -- it is just the ID value for that node.

A node in the web of documents, not in a structure of user data. It is similar to the page/line answer.
Not to the sales/candybars answer.

> It is used by the
> logical system to eventually find the right memory location, just as a
> join specification is used.
>

>> also called
>> pointers.

>
> If you wish to call a URL value a pointer, you may do so.

I do not 'wish to call' anything.
More specifically I do not 'wish to call' a URL a pointer. I did provide that as an alternate name in order to explain the similarity in working.
What does a URL do?
A URL points, just like any (other) pointer. Define as you wish, it doesn't make the similarities go away.

> You may also certainly define "pointer" to include
> foreign key specifications if you wish.

How so?

> I have been trying to avoid the term pointer when talking about
> the logical data model.

Good. It has no place there.

> One could implement a "web data model" in a
> variety of ways, including with a relational data store.

No such thing.

> The data model itself is at a higher level

I asked you before: higher level on which stairway?

> than the implementation and need not
> care about how it is implemented under the covers (other than for
> various tweaking for performance and such).

If you'd really keep it under covers, maybe - but do you? Exposure of implementation elements in places where they only confuse doesn't help.

>> Because they are not data, they are not part of the whole of the data.

>
> Data, metadata, pointer, whatever. They are part of the mix.
> Hopefully we can agree on that.

Why? Is there a point you are trying to make that depends on this mix?

Just so you know: I don't agree to mix them - but my agreement or not here is irrelevant.

>> Aren't they structural elements then?
>> Yes, they are part of the web (no tree) of documents and parts of
>> documents. But they are not data-elements.

>
> Fine. They are identifiers, the identity value for a document.
> Call it what you will.

</With Interjections>

You appear reluctant to accept:

>> The links are not (logical/user/real) data.
>> They are not. They are really really really not data. 
>> They do not even reference data. They are locators, 
>> pointing to a location, also called pointers.
>> Because they are not data, they are not part 
>> of the whole of the data.

You spent a lot of lines against this.
My guess is that something in your argumentation depends (or seems to depend) on links being part of the (logical/user/real) data. If so, what is it? Maybe that will give other clues.

Finally, back to the OP-question.

>> ... the (logical/user/real) data
>> still has to be placed somewhere in the hierarchical
>> (tree, not web) document structure.

>
> If we were to create a data repository using XML documents for a
> book-author system, we could put books in one or more XML documents and
> authors similarly. When abstracting it to the data model, I would
> include these two top-level "entities" (they each get a UML rectangle
> on a class diagram, for example). The name space could be seen as a
> top level to the metadata, but I don't think that is the hierarchy you
> are talking about. I do not have to decide whether Books are higher or
> lower than Authors in any hierarchy, nor does the Book data need to
> have any data "above it" in some hierarchy (even if the documents have
> root nodes).

One inconvenience of this implementation is the need to keep Authors.Books in sync with Books.Authors. If the data is important, and we take into account that not every program run ends because the program itself decides to end (it can crash or be killed between the two updates), it becomes more than just an inconvenience.
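
A rough sketch of what can go wrong, in Python with the standard xml.etree.ElementTree module. The two documents, their element names and the id/ref attributes are assumptions made up for this example, not anything from dawn's description. They stand for a repository in which a program added author a2 to book b2 in the books document and then stopped before touching the authors document.

```python
import xml.etree.ElementTree as ET

# Hypothetical books document: b2 already lists a2 as an author.
BOOKS_XML = """<books>
  <book id="b1"><author ref="a1"/></book>
  <book id="b2"><author ref="a1"/><author ref="a2"/></book>
</books>"""

# Hypothetical authors document: a2 does not list b2 yet.
AUTHORS_XML = """<authors>
  <author id="a1"><book ref="b1"/><book ref="b2"/></author>
  <author id="a2"/>
</authors>"""


def pairs(xml_text, outer_tag, inner_tag, swap=False):
    """Collect (outer id, inner ref) pairs; swap=True returns them reversed."""
    root = ET.fromstring(xml_text)
    found = set()
    for outer in root.iter(outer_tag):
        for inner in outer.findall(inner_tag):
            pair = (outer.get("id"), inner.get("ref"))
            found.add(pair[::-1] if swap else pair)
    return found


# Normalise both directions to (book, author) so they can be compared.
from_books = pairs(BOOKS_XML, "book", "author")
from_authors = pairs(AUTHORS_XML, "author", "book", swap=True)

only_in_books = from_books - from_authors
only_in_authors = from_authors - from_books
print(only_in_books)    # {('b2', 'a2')}
print(only_in_authors)  # set()
```

Nothing in either document by itself is malformed; the disagreement only shows up when the two are compared, which is exactly the check that somebody has to remember to write and run.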

> Are you suggesting that these data must be seen only in
> terms of a strict tree structure? I'm not catching on yet.

If the implementation is going to be hierarchical, it is a given thing that we have to see the data in one or several hierarchies, right? But no, not only.
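
As an illustration of "one or several hierarchies", here is a small Python sketch that places the same facts in two different trees. The (author, book) pairs, the tag names and the nest() helper are hypothetical and only serve the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical (author, book) facts; no hierarchy is implied by the facts themselves.
WROTE = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]


def nest(pairs, root_tag, parent_tag, child_tag, parent_index):
    """Build a tree in which the pair member at parent_index becomes the parent."""
    root = ET.Element(root_tag)
    parents = {}
    for pair in pairs:
        parent_id = pair[parent_index]
        child_id = pair[1 - parent_index]
        if parent_id not in parents:
            parents[parent_id] = ET.SubElement(root, parent_tag, id=parent_id)
        ET.SubElement(parents[parent_id], child_tag, ref=child_id)
    return root


by_author = nest(WROTE, "authors", "author", "book", parent_index=0)
by_book = nest(WROTE, "books", "book", "author", parent_index=1)

print(ET.tostring(by_author, encoding="unicode"))
print(ET.tostring(by_book, encoding="unicode"))
```

Both trees carry the same three facts. Which one (or both) ends up in the documents is a design decision imposed by the hierarchical container, not something the facts dictate.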

>> This is a consequential choice you have to make,
>> because of the hierarchical nature of the implementation.
>> Characteristics of the (logical/user/real) data
>> should/could provide guidance for this designing of the hierarchy.

>
> I have never "felt" that I was designing a hierarchy when designing web
> sites or non-SQL-DBMS schema (nor IMS schema, for that matter). An ERD
> or UML diagram would not show a hierarchy to the entities. A specific
> tuple (corresponding to an "item/record" in a "file" or a document on
> the web, for example) or class of tuples is specified as a hierarchy,
> however. That is why I say that I work with trees (typically shallow
> ones) within digraphs. The UML classes form a digraph, while each UML
> rectangle specifies a tree.
>
>> I am not aware of a systematic treatment of this and similar choices,
>> though it is made - probably mostly implicitly - every day.
>>
>> [snip]

>
> Yes, there are a ton of such decisions that are made repeatedly by both
> seasoned and new developers doing logical design that is implemented in
> non-SQL-DBMS tools. I think recording some of the design trade-offs
> for maintaining such databases over time would make sense. I'm really
> surprised I have not found any such modeling discussions and suspect
> I'm just not looking in the right places.
Received on Sun Oct 29 2006 - 13:29:26 CET
