Re: Modeling Data for XML instead of SQL-DBMS

From: dawn <dawnwolthuis_at_gmail.com>
Date: 25 Oct 2006 13:42:35 -0700
Message-ID: <1161808955.757381.127650_at_e3g2000cwe.googlegroups.com>


mAsterdam wrote:
> dawn wrote:
> > mAsterdam wrote:
> >> <Annotations>
> >>
> >> dawn wrote:
> >>> If working on a software project where all data are persisted
> >> /persisted/
> >>
> >> Ah, we are talking software development on an island, not
> >> about shared data.
> >
> > Sure, we could assume that if it helps.

>

> I am not sure you are aware of the implications.
>

> There are layers of abstraction here. Talking
> about persistence as such is possible when we
> abstract away from the specifics of what it is
> that is persisted.
> Talking about data as such is possible when we
> abstract away from the specifics of how it is
> persisted.

The interface to a persistence mechanism (software) is relevant to the design of a data model. When I use the term "logical data model" am using it to refer to the implementation data model. If you want to have "conceptual data model" then "logical data model" and then another "implementation data model" so be it. It is common to let "logical data model" refer to this implementation data model -- the model of the data as specified to the API used for retaining data beyond the run-time of a particular software application, for example.

> Surely there is no need for persistence when we have no data to
> persist,

Correct. I am referring to data that is persisted (stored, if preferred).

> and from the other perspective: our data is valuable so
> we need persistence for it - but that's about it; mixing the two
> topics frustrates both discussions.

The requirement to retain data beyond a particular application run-time is a requirement that mixes the two, it seems. But if it helps, I will state that I do not need to care about any physical storage device, only the interface/language/api used to pass data that is to be stored and retrive it.

> You say "Sure, we could assume that if it helps.",
> what do you mean by that - what are, to you, the
> consequences of assuming no need to share data?

The Codd folks get all hung up on the "large shared data bank" concept, so I wanted to avoid those terms if they are a stumbling block. Let's assume it need not be large. Further, let's assume that it must be shared by different software components, written in different computer languages, and that all of these are under our control to build, maintain, or license, whichever the case may be for any particular component.

> >>> in XML
> >> Ah, we are talking about documents, not about data.
> >
> > Yes, data, but data stored in documents, aka files, not employing a
> > DBMS.

>

> I'll sing it:
> Storage is irrelevant when talking about the logical data model.

Again, the requirement is that the data be available outside of one particular software component run-time, so you may call that something other than storage if you like. Logical data models are surely relevant throughout the software development process, which is why we have design patterns and best practices for data models within OOP, for example. Logical data models are also relevant when working with the interface between any particular software component and the inteface to whatever software component will handle saving and retriving such data.

> A DBMS is irrelevant when talking about the logical data model.
> Files are irrelevant when talking about the logical data model.
> Documents are irrelevant when talking about the logical data model.
> Why?
> Talking about the logical data model means talking about the
> logic of the data /itself/ , not about how, where and when
> it is stored, read or processed.

This might be a terminology issue. See above. The conceptual model is about the data itself, and if you want the logical data model to also be, then this would be the next one in line -- the implementation data model used when working between the software with data to pass to a software component that will "save" and "retrieve" the data, perhaps for some other software written in some other language.

> >>> documents and not in an SQL-DBMS, the tools would not require that the
> >>> data model be in 1NF or the use of the SQL NULL.
> >> /data model/
> >> ?? document model!
> >
> > No, it is the data model that is of interest to me,

>

> Why, then, bring the storage (files, persistence) and other
> containers (XML, SQL-DBMS) into the scope?

Because it is relevant.

> How can that possibly help?

The implementation data model for data that a software component passes to an SQL-DBMS is often quite different from the implementation data model for that same conceptual data model when software other than a SQL-DBMS is used to store and retrieve said data.

> > while the use of
> > the XML document for storage happens to be an implementation where
> > perhaps I can get the question across.

>

> What question?
> I'm trying to establish whether there really is one.

There is one in my mess of gray matter, but I have not successfully phrased the question yet in 3 years time, so thanks for your patience and help.

> Up to now it sounds somewhat like "how do the black keys sound
> different on a trumpet?".

If a company selects MySQL as a tool for interfacing with a database, putting and getting data from a secondary storage device, or a not-primarily-a-SQL-DBMS such as Berkeley-DB or Cache' as a DBMS tool, they are unlikely to have (almost certain not to have) the same implementation (aka logical) data model even starting with the same conceptual data model.

Industry best practices, as well as mathematics, have been identified related to MySQL (even if it falls short of being a pure implementation fo the relational model). So much work has been done with persistence outside of any SQL-DBMS toolset that there surely must be industry best practices for how to design/model such data repositories. Where there is some theory out there for various di-graph models, for example, I have not found anything resembling "best practices" or "design patterns" in this area.

> > My interest here is to figure
> > out what information is out there in the way of best practices for data
> > modeling outside of the SQL-DBMS environment.

>

> There is no data model /in/side the SQL-DBMS environment
> (except of course the model of such environments itself, the catalog).

Yes, perhaps it helps to think of this as a question about what the best practices are in designing the catalog or schema for a non-SQL-DBMS, one that does not enforce the (now apparently old-fashioned) notion of 1NF nor employ a 3VL, for example.

> >> - ok, no 1NF.
> >> /NULL/ (sigh of relief) - not about NULL.
> >>
> >>> How would an excellent logical data model designed for this XML
> >> /logical data model for XML/
> >> No such thing.
> >
> > Perhaps I need different terminology.

>

> Please try. Using your terms up to now above, does it sound wrong
> to you what I am trying to tell you?

Yes, but I'm trying to hear it through your ears so I can revise my terminology appropriately.

> > There must still be a design for
> > the data in these documents, perhaps specified with a dtd or xsd. I'm
> > pretty sure that some designs are better than others, so what would be
> > a good way to approach the design for this (these) data?

>

> Just like any other hierarchical implementation: you'll have to
> make decisions on what comes first and how to manage update
> anomalies - but that is /outside/ the logic of the data.

Some designs are more maintainable than others -- those would be ones that might fit into a "best practices" category.

> >>> implementation differ from the corresponding data model developed for
> >>> an SQL-DBMS?
> >> /corresponding/
> >> No real correspondence.

OK, the correspondence is this. I'm sitting here with XML documents (or choose a non-SQL-DBMS vehicle for specifying, putting, and getting data that must remain after the program writing the data is shut down) and you have Oracle in front of you. If we start with the exact same conceptual data model (UML, ORM, ERD), we are not going to end up with the exact same data model that we implement. We have some guidelines, best practices, theory, etc related to Oracle that line up somewhat closely with those for DB2, SQL Server, and Sybase. All other tools for databases (spoken loosely) are a free-for-all, without the wealth of a half-century of best practices (typically found by employing practices that are not) put in writing anywhere that I can find.

> > I will try to be more precise, but there will still be room for
> > miscommunication, I'm sure. Given a conceptual data model (not built
> > to take any particular implementation into account), how would an
> > implementation data model (aka logical data model) for XML differ from
> > an implementation data model for a SQL-DBMS?
>
> The logical model does not differ. The implementation does.

Yes, I think many writers use the term "logical data model" or LDM to refer to the model that is specified as schema, but I am fine with referring to it as the "implementation data model" if you prefer.

> > Take a simple conceptual data model for pizza orders, for example ;-)
> > perhaps specified using an old-fashioned ERD or UML (only because I'm a
> > novice with ORM).

>

> For data modeling UML is a bad fit. Though the class-diagrams have a
> sufficient graphical syntax, it's designed for software design, not
> database design.

I mentioned two other possibilities, so hopefully one meets with your approval?

> > Pizza(pizzaId, pizzaName, toppings*)
> > Customer(customerId, name, phone)
> > Order(orderId, customerId, pizzaId, addToppings*, removeToppings*)
> >
> > * zero to many values possible (maybe 1 to many in the first case, but
> > simplifying here)
> > Simplying in several ways, including not permitting 1/2 pizza toppings
> > for our purposes here
> >
> > Would there be a difference in the data model if such data were
> > designed for storage in XML documents compared to if designed for
> > storage using an SQL-DBMS?

>

> Well, the toppings, as we discussed previously, may have a meaningful
> (by itself) order, so you'd have to take special measures when
> implementing in SQL.
> With all the other stuff the order by itself does not have a meaning,
> so you'd have to take special measures for all of that when
> implementing in XML.

And with XML, it is likely that the toppings would be taken as elements under the pizza, rather than having a separate PizzaToppings table.

> >> The next (XML, but it doesn't matter)
> >> document created by dumping a database
> >
> > This would not be a database dump,

>

> Yes it would, in theory - it is one way to discuss the differences.
>

> > but the creation of an application
> > using XML documents (yes, I realize this isn't a highly scalable
> > approach to building an app, so we can assume the data volume will be
> > small if that helps set the scene -- perhaps these are pizza orders for
> > a school sale)
> >
> >> may differ from the previous, even if the (SQL, but it doesn't matter)
> >> database content stayed the same.
> >> Arbitrary order would have to be added to prevent this.
>

> This is the point of the database dump approach - I
> wasn't suggesting to actually start with a dump.
> Do you see why this is problematic?
>

> >>> What would be some best practices for modeling data in
> >>> this environment?
> >> /this environment/
> >> A marketing environment? Fantasy island?
> >
> > Yes, mAsterdam, let's say that it is a pizza sale for a school on
> > Fantasy Island.
>
> Let's eat pizza, then, and not discuss pyrotechnics.

lol

> >> </Annotations>
> >>
> >>> I'm guessing some will think that the exact same logical data model
> >>> would be appropriate for both targets, but hopefully many will agree
> >>> that it is unlikely that the best implemented data model would be
> >>> identical in each environment. In that case, what would the
> >>> differences be? What best practices would apply to data modeling for
> >>> XML documents compared to data modeling for a SQL-DBMS?
> >> Can't really answer that except "You don't".
> >> The question by it self shows to much wrongs.
> >
> > There is a question here that is legitimate,

>

> Ok, then please try to rephrase - from your blogs
> and previous discussions I know you are able to reposition
> (though I have to admit that some things are really tough to get
> through to you: constraint handling, for instance).

OK, but it is also hard for you to get your constraint handling all the way through to the end-users to maintain, right ;-)

> > even if I have not yet hit
> > the nail on the head. I've tried to ask this question a number of
> > times in various ways, as you know, but each time the question is
> > considered abberant. If we paint the scenario with what might be
> > termed an "embedded database" that is not an SQL-DBMS, perhaps
> > Berkeley-DB or Cache' or Pick so that we have the same ability to
> > remove 1NF and 3VL-style of NULL-handling, then ask the same question,
> > would it be a legitimate question? If so, then same question.
> >
> > I'm certain that over the past half-century of software development,
> > some data design patterns have shown themselves to be better than
> > others for such target environments. I'm wondering what both theorists
> > and practitioners think would be best practices when data modeling for
> > these environments.

>

> Separate the logical model design from the implementation design
> (implementation including SQL-DBMS).

I have read and worked with these definitions trying to use the most common forms, but I am OK with splitting out what I am calling a logical data model into an LDM and an implementation data model if that clears anything up. Thanks. --dawn Received on Wed Oct 25 2006 - 22:42:35 CEST

Original text of this message