Re: Modeling Data for XML instead of SQL-DBMS

From: dawn <dawnwolthuis_at_gmail.com>
Date: 28 Oct 2006 17:32:02 -0700
Message-ID: <1162081922.016958.251670_at_b28g2000cwb.googlegroups.com>


mAsterdam wrote:
> dawn wrote:
> > mAsterdam wrote:
> >> dawn wrote:
> >>> mAsterdam wrote:
> >>>> Hierarchy (trees of nodes),
> >>>> is what IMS, documents, file systems, XML, Lotus Domino
> >>>> and LDAP as data containers have in common.
>
> >>> I think of the XML model as a di-graph where there are trees on the
> >>> nodes, rather than a tree of nodes (it is only an individual document
> >>> without xlink or external code joining them). Look at each web page
> >>> that is written with xhtml. It is a tree, but there are links (<a
> >>> href...>) from one page to another that form it into a di-graph.
> >>> MV/Pick files are somewhat similar.
> >> A unix filesystem is - at shell level - a tree.
> >
> > Yes, while the DBMS tools of which I am aware do not employ a tree at
> > the "top level" in their model. A set of XML documents may include
> > links (one way or another) within or between them (think of XHTML).
> > Given a set of XML documents, there is no single "node" that is "the
> > root", even though each document has a root and can be seen as a tree.
>
> You are still/again mixing data and implementation :-(

You could be right on that point. If so, mixing the two doesn't seem to be a huge problem in practice. I still think I can design any portion of a web without caring whether it is implemented using an RDBMS or a distributed network with http...

> I'll try to spell it, though we already did go into this in
> quite some detail a year or so ago.

Yes, I recall, so we might be on a thread where we need to agree to disagree on some aspect of this.

> The links are not (logical/user/real) data

They are identifiers.

In DBMSs that have a similar data model, the links are in the schema (more like the XML link) rather than associated with data within an individual tuple (as with <a href...>).

I think of these other specified links (e.g. in xml schema, although I might be too ignorant in that area) as metadata, whether specified in some separate metadata repository or specified in a metadata portion of the database itself (as rules are). For the purpose of identifying the mathematical form of the "data model," I don't care how these links are specified. In SQL, similar metadata would be in the JOIN clauses of SELECTS in CREATE VIEW statements.
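To make the SQL comparison concrete, here is a minimal sketch using Python's sqlite3 module. The table, column, and view names are made up for illustration; the point is only that the link between the two tables is written down once, in the view definition, and lives in the catalog rather than in any one row.

```python
import sqlite3

# Hypothetical two-table schema; all names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER REFERENCES author(author_id)
    );
    -- The link between book and author is specified here, as metadata.
    CREATE VIEW book_with_author AS
        SELECT b.title, a.name
        FROM book AS b
        JOIN author AS a ON a.author_id = b.author_id;
""")
conn.execute("INSERT INTO author VALUES (1, 'Author One')")
conn.execute("INSERT INTO book VALUES (10, 'Book Ten', 1)")

# Reading through the view follows the link without the row "knowing" how.
rows = conn.execute("SELECT title, name FROM book_with_author").fetchall()

# The link specification itself is stored in the catalog, not in the tuples.
view_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'book_with_author'"
).fetchone()[0]
print(rows)
print(view_sql)
```

The rows carry only key values; the JOIN text retrieved from the catalog is the part I would call metadata.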

> (except in the meta case
> of-course, e.g. content-management systems - but let's keep
> it simple).

Yes, we can ignore content mgmt for this discussion.

> The links are not (logical/user/real) data.
> They are not. They are really really really not data.

I look at the abstract, logical model and you seem to be looking at the implementation. Because the XML-web (XHTML, XML accessed via http) forms a highly-distributed data repository, I'll grant you your definitions regarding it. But a very analogous concept is in standard DBMS tools (non-SQL-based) where the link specifications are metadata.

> They do not even
> reference data.

I'm OK with you saying it this way related to html pages and a href's. When it comes to DBMS tools that have a somewhat similar data model, the link definitions specify foreign keys. I don't know if that fits your definition of referencing data or not.

> They are locators, pointing to a location,

I recall discussions about pointers, and the general opinion I was left with is that foreign key specifications are not typically called pointers. The term "pointer" is more often used for values that store memory locations and "stuff" at a lower level than the metadata.

Working strictly with HTML pages and the <a href ...> links, the link value is the foreign key value for the "tuple" (HTML document) where you can get more information about the linked-from value. If that were a "pointer" as I understand the term, then it would surely at least specify a particular piece of hardware, which a URL does not -- it is just the ID value for that node. It is used by the logical system to eventually find the right memory location, just as a join specification is used.
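A small sketch of the "URL as ID value" reading, in Python. The `documents` store, the example.org URLs, and the `follow` helper are all hypothetical; any store keyed by URL would do, which is rather the point. The href is resolved to an identity value, and the lookup by that value is left to whatever sits underneath.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "web": each key is a document's identity value (its URL).
# Nothing in a key says which machine or memory location holds the document.
documents = {
    "http://example.org/books/1": {"title": "Book One", "links": ["/authors/7"]},
    "http://example.org/authors/7": {"name": "Author Seven", "links": []},
}

def follow(source_url, href):
    """Resolve an href against the linking document, then look the target up by ID."""
    target_id = urljoin(source_url, href)      # logical identifier of the target
    return target_id, documents.get(target_id)  # the store decides where it lives

source = "http://example.org/books/1"
target_id, target_doc = follow(source, documents[source]["links"][0])
print(target_id, target_doc)
```

Swapping the dict for a relational table keyed on the URL would not change `follow` at all, which is why I treat the href value as a foreign key value rather than a pointer.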

> also called
> pointers.

If you wish to call a URL value a pointer, you may do so. You may also certainly define "pointer" to include foreign key specifications if you wish. I have been trying to avoid the term pointer when talking about the logical data model. One could implement a "web data model" in a variety of ways, including with a relational data store. The data model itself is at a higher level than the implementation and need not care about how it is implemented under the covers (other than for various tweaking for performance and such).

> Because they are not data, they are not part of the whole of the data.

Data, metadata, pointer, whatever. They are part of the mix. Hopefully we can agree on that.

> Aren't they structural elements then?
> Yes, they are part of the web (no tree) of documents and parts of
> documents. But they are not data-elements.

Fine. They are identifiers, the identity value for a document. Call it what you will.

> To get back to the OP-question, the (logical/user/real) data
> still has to be placed somewhere in the hierarchical
> (tree, not web) document structure.

If we were to create a data repository using XML documents for a book-author system, we could put books in one or more XML documents and authors similarly. When abstracting it to the data model, I would include these two top-level "entities" (they each get a UML rectangle on a class diagram, for example). The name space could be seen as a top level to the metadata, but I don't think that is the hierarchy you are talking about. I do not have to decide whether Books are higher or lower than Authors in any hierarchy, nor does the Book data need to have any data "above it" in some hierarchy (even if the documents have root nodes). Are you suggesting that these data must be seen only in terms of a strict tree structure? I'm not catching on yet.
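Here is roughly what I have in mind for the book-author case, sketched with Python's xml.etree.ElementTree. The element names, the `id` and `author` attributes, and the sample values are assumptions for illustration, not a proposed schema. Each document has its own root, yet neither Books nor Authors sits above the other.

```python
import xml.etree.ElementTree as ET

# Two separate documents; element and attribute names are illustrative only.
BOOKS_XML = """<books>
  <book id="b1" author="a1"><title>First Book</title></book>
  <book id="b2" author="a1"><title>Second Book</title></book>
</books>"""

AUTHORS_XML = """<authors>
  <author id="a1"><name>Some Author</name></author>
</authors>"""

books = ET.fromstring(BOOKS_XML)      # one tree, rooted at <books>
authors = ET.fromstring(AUTHORS_XML)  # another tree, rooted at <authors>

# Index authors by their identifier so a book's reference can be followed.
authors_by_id = {a.get("id"): a for a in authors.findall("author")}

# Follow each book's author reference across documents.
titles_by_author = {}
for book in books.findall("book"):
    author = authors_by_id[book.get("author")]
    titles_by_author.setdefault(author.findtext("name"), []).append(
        book.findtext("title")
    )
print(titles_by_author)
```

The join runs from book to author here, but indexing the books by author instead would serve equally well; no parent-child decision between the two entities was needed.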

> This is a consequential choice you have to make,
> because of the hierarchical nature of the implementation.
> Characteristics of the (logical/user/real) data
> should/could provide guidance for this designing of the hierarchy.

I have never "felt" that I was designing a hierarchy when designing web sites or non-SQL-DBMS schema (nor IMS schema, for that matter). An ERD or UML diagram would not show a hierarchy to the entities. A specific tuple (corresponding to an "item/record" in a "file" or a document on the web, for example) or class of tuples is specified as a hierarchy, however. That is why I say that I work with trees (typically shallow ones) within digraphs. The UML classes form a digraph, while each UML rectangle specifies a tree.
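The "trees within digraphs" picture can be sketched in a few lines of Python. The entity names, field layouts, and the two helper functions are hypothetical; this is only meant to show the shape, with each entity specified as a shallow tree while the references among entities form a directed graph that need not be a tree.

```python
# Each entity (UML rectangle) is specified as a shallow tree of fields.
# Entity and field names are illustrative assumptions.
field_trees = {
    "Book": {"title": None, "editions": {"year": None, "isbn": None}},
    "Author": {"name": None},
    "Publisher": {"name": None},
}

# References between entities are directed edges, giving a digraph.
references = {
    "Book": ["Author", "Publisher"],
    "Author": ["Book"],
    "Publisher": [],
}

def depth(tree):
    """Number of levels in a nested-dict tree; a leaf (None) has depth 0."""
    if not isinstance(tree, dict) or not tree:
        return 0
    return 1 + max(depth(child) for child in tree.values())

def has_cycle(graph):
    """Return True if the directed graph contains a cycle (depth-first search)."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge: we returned to an open node
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

tree_depths = {name: depth(tree) for name, tree in field_trees.items()}
entity_level_is_cyclic = has_cycle(references)
print(tree_depths, entity_level_is_cyclic)
```

With Book referring to Author and Author referring back to Book, the entity level has a cycle, so it cannot be drawn as a hierarchy, even though every individual entity is a tree.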

> I am not aware of a systematic treatment of this and similar choices,
> though it is made - probably mostly implicitly - every day.
>
> [snip]

Yes, there are a ton of such decisions that are made repeatedly by both seasoned and new developers doing logical design that is implemented in non-SQL-DBMS tools. I think recording some of the design trade-offs for maintaining such databases over time would make sense. I'm really surprised I have not found any such modeling discussions and suspect I'm just not looking in the right places. Cheers! --dawn

Thanks. --dawn

Received on Sun Oct 29 2006 - 02:32:02 CEST
