Re: Modeling Data for XML instead of SQL-DBMS

From: JOG <jog_at_cs.nott.ac.uk>
Date: 26 Oct 2006 02:59:44 -0700
Message-ID: <1161856784.802650.149880_at_m7g2000cwm.googlegroups.com>


dawn wrote:
> mAsterdam wrote:
> > dawn wrote:
> > ...
> > > The interface to a persistence mechanism (software) is relevant to the
> > > design of a data model. When I use the term "logical data model" am
> > > using it to refer to the implementation data model.
> >
> > Why? The way I learned it (early 80's) is this: the logical
> > model is the most complete, detailed level you can get to /without/
> > specifying the implementation plan. I don't think I should unlearn that.
>
> It took me a bit to find this since it was in a paper, rather than a
> book. I previouly defined the logical data model to be independent of
> any particular implementation, but decided to change my own def to
> align with what I read from Pascal.
>
> Pascal, Unmuddling Modeling, Feb 2005 (there might be an updated
> version, but this is the latest one that I have). He says things like
> this form p.12.
>
> "Database representations of the formal versions of business
> rules/conceptual models as mapped via the data model are called logical
> models."
>
>
> > At that time the idea was to completely design the logic
> > before even thinking about implementation - nobody really did it that
> > way, but it was the accepted 'how it /should/ be done'. Later
> > method(ologie)s addressed more technical issues up front, in the form
> > of architecture templates. That is the good and natural
> > evolution - theory closer to practice - but it does not give an
> > excuse to blur the distinction logical versus implementation.
> >
> > > If you want to have "conceptual data model" then
> > > "logical data model" and then another
> > > "implementation data model" so be it.
> >
> > The further away from cut-over-day, the more freedom there is
> > in deciding what deliverables are really necessary - it also
> > depends on culture, the size and nature of the project, the user base,
> > the project team size, management style.
> > Any one-set-of-deliverables-fits-all-projects theory is
> > doomed to stay just that: theory.
> >
> > > It is common to let "logical
> > > data model" refer to this implementation data model -- the model of the
> > > data as specified to the API used for retaining data beyond the
> > > run-time of a particular software application, for example.
> >
> > In which circles? Can you provide a reference?
>
> I think it comes from the Date/Darwin/Pascal side of the house, but I
> at this point I'm just looking at Pascal's paper to verify that (so
> Date and Darwin might suggest otherwise).
>
> > >> Surely there is no need for persistence when we have no data to
> > >> persist,
> > >
> > > Correct. I am referring to data that is persisted (stored, if
> > > preferred).
> > >
> > >> and from the other perspective: our data is valuable so
> > >> we need persistence for it - but that's about it; mixing the two
> > >> topics frustrates both discussions.
> > >
> > > The requirement to retain data beyond a particular application run-time
> > > is a requirement that mixes the two, it seems.
> >
> > I don't think so. I think it is where /sharing/ starts - as soon
> > as the next run-time incarnation may differ from an earlier one.
>
> OK, if that is how you define "sharing" then yes, the data are to be
> shared.
>
> > > But if it helps, I will
> > > state that I do not need to care about any physical storage device,
> > > only the interface/language/api used to pass data that is to be stored
> > > and retrive it.
> >
> > Ok, that does narrow it down and makes more sense to me.
> > Though I have seen quite a lot of
> > those interface/language/api -s I have yet to see a foundational
> > approach to them outside the relational school, the Codd/Date folks.
>
> I have read some of the papers that Jan pointed me to related to
> functional databases as well as di-graphs. I don't recall taking away
> anything from those in the form of "best practices" in data modeling
> derived from theory, although there might be some. As best I know,
> nothing has made its way to those who are actually designing data
> models for such environments. They typically rely on learning from
> their own mistakes and those of people around them.
>
> I think we could gain a bit by learning more broadly in this area as a
> profession. There are some roadblocks, however, including the fact
> that those doing data modeling for XML or any non-SQL-DBMS
> "persistence" typically know they are not doing what is taught in
> colleges, so they do it in back rooms with the lights dim.
> Additionally, most of those who do such data design are also doing OOP
> and other application programming design -- so they hold discussions
> with other app developers, not in "data modeling" or "database"
> circles.
>
> > >> You say "Sure, we could assume that if it helps.",
> > >> what do you mean by that - what are, to you, the
> > >> consequences of assuming no need to share data?
> > >
> > > The Codd folks get all hung up on the "large shared data bank" concept,
> > > so I wanted to avoid those terms if they are a stumbling block. Let's
> > > assume it need not be large. Further, let's assume that it must be
> > > shared by different software components, written in different computer
> > > languages, and that all of these are under our control to build,
> > > maintain, or license, whichever the case may be for any particular
> > > component.
> >
> > One approach I've seen is 'IO-modules' (about 100 man years early
> > 80's, COBOL and Assembler, ISAM for storage): one team building only
> > IO-stuff, no IO is done without the use of their modules.
>
> That is typical from what I have seen too.
>
> > I am positive they 'd love to have had a DBMS.
>
> The ones I know think they do have a DBMS. I didn't think it was a
> DBMS when I first saw it, but more than a decade into it, it is clear
> that it provides database management, even if not the same set of
> features as a SQL-DBMS.
>
> > They had a detailed
> > logical model for reference. Actually the system has since been
> > replaced by one built, with a case-tool, around an SQL-database
> > - one key starting point was the detailed logical model of
> > the old system.
> >
> > >>>>> in XML
> > >>>> Ah, we are talking about documents, not about data.
> > >>> Yes, data, but data stored in documents, aka files, not employing a
> > >>> DBMS.
> > >> I'll sing it:
> > >> Storage is irrelevant when talking about the logical data model.
> > >
> > > Again, the requirement is that the data be available outside of one
> > > particular software component run-time, so you may call that something
> > > other than storage if you like. Logical data models are surely
> > > relevant throughout the software development process, which is why we
> > > have design patterns and best practices for data models within OOP, for
> > > example. Logical data models are also relevant when working with the
> > > interface between any particular software component and the inteface to
> > > whatever software component will handle saving and retriving such data.
> > >
> > >> A DBMS is irrelevant when talking about the logical data model.
> > >> Files are irrelevant when talking about the logical data model.
> > >> Documents are irrelevant when talking about the logical data model.
> > >> Why?
> > >> Talking about the logical data model means talking about the
> > >> logic of the data /itself/ , not about how, where and when
> > >> it is stored, read or processed.
> > >
> > > This might be a terminology issue. See above. The conceptual model is
> > > about the data itself, and if you want the logical data model to also
> > > be, then this would be the next one in line -- the implementation data
> > > model used when working between the software with data to pass to a
> > > software component that will "save" and "retrieve" the data, perhaps
> > > for some other software written in some other language.
> >
> > Ok. I think I commented on that, earlier. Accepting the 'change of
> > terminology' - could you attempt a rephrase?
>
> I'm missing what you want rephrased. My question is about the
> implementation data model that I was calling the logical data model,
> only because I wanted to align with relational theorists terminology.
> The tools/languages/apis/interfaces used to put and get data from
> application software are relevant to this implementation data model.
>
> > >>>>> documents and not in an SQL-DBMS, the tools would not require that the
> > >>>>> data model be in 1NF or the use of the SQL NULL.
> > >>>> /data model/
> > >>>> ?? document model!
> > >>> No, it is the data model that is of interest to me,
> > >> Why, then, bring the storage (files, persistence) and other
> > >> containers (XML, SQL-DBMS) into the scope?
> > >
> > > Because it is relevant.
> >
> > To the implementation. But relevant, definitely.
> > (Just to remind you that I do /not/ accept the terminology of
> > the original question :-)
>
> I tried to at least roughly align my terminology in my The Naked Model
> blog entry with Pascal's from his Unmuddling modeling paper. I can
> adapt to another approach. I have heard some call the implementation
> model the physical model, but that is lower in the defs I'm working
> with. I don't think we have cdm, ldm, pdm or implementation data model
> in our glossary, but I'm not looking at it to verify that. Is there
> someone who has laid out defs that you like so I can start there in
> forming the question?
>
> > >> How can that possibly help?
> >
> > > The implementation data model for data that a software component passes
> > > to an SQL-DBMS is often quite different from the implementation data
> > > model for that same conceptual data model when software other than a
> > > SQL-DBMS is used to store and retrieve said data.
> >
> > That implementation is relevant is no reason for mixing
> > it with logical model issues. You say you accept the terminology
> > change, but you seem reluctant to do away with the old one.
>
> I tried to change to say "implementation data model" instead of
> "logical data model."
>
> > >>> while the use of
> > >>> the XML document for storage happens to be an implementation where
> > >>> perhaps I can get the question across.
> > >> What question?
> > >> I'm trying to establish whether there really is one.
> > >
> > > There is one in my mess of gray matter, but I have not successfully
> > > phrased the question yet in 3 years time, so thanks for your patience
> > > and help.
> >
> > No problem. Maybe this helps, maybe not - I hope it does:
> > I really think you have a valid question there somewhere
>
> Yipee -- thanks!
>
> > - A very unlucky combination of bad experience with
> > big RDBMS-projects and unnecessary antagonizing
> > (especially against relational zealots) choices of words
> > and approaches does not help you to get it out.
> >
> > >> Up to now it sounds somewhat like "how do the black keys sound
> > >> different on a trumpet?".
> > >
> > > If a company selects MySQL as a tool for interfacing with a database,
> > > putting and getting data from a secondary storage device, or a
> > > not-primarily-a-SQL-DBMS such as Berkeley-DB or Cache' as a DBMS tool,
> > > they are unlikely to have (almost certain not to have) the same
> > > implementation (aka logical) data model even starting with the same
> > > conceptual data model.
> >
> > Reluctant ...
> >
> > > Industry best practices, as well as mathematics, have been identified
> > > related to MySQL (even if it falls short of being a pure implementation
> > > fo the relational model). So much work has been done with persistence
> > > outside of any SQL-DBMS toolset that there surely must be industry best
> > > practices for how to design/model such data repositories. Where there
> > > is some theory out there for various di-graph models, for example, I
> > > have not found anything resembling "best practices" or "design
> > > patterns" in this area.
> >
> > Neither did I.
> >
> > >>> My interest here is to figure
> > >>> out what information is out there in the way of best practices for data
> > >>> modeling outside of the SQL-DBMS environment.
> > >> There is no data model /in/side the SQL-DBMS environment
> > >> (except of course the model of such environments itself, the catalog).
> > >
> > > Yes, perhaps it helps to think of this as a question about what the
> > > best practices are in designing the catalog or schema for a
> > > non-SQL-DBMS, one that does not enforce the (now apparently
> > > old-fashioned) notion of 1NF nor employ a 3VL, for example.
> > > ...
> > >>>>> How would an excellent logical data model designed for this XML
> > >>>> /logical data model for XML/
> > >>>> No such thing.
> > >>> Perhaps I need different terminology.
> > >> Please try. Using your terms up to now above, does it sound wrong
> > >> to you what I am trying to tell you?
> > >
> > > Yes,
> >
> > What does sound wrong to you?
>
> I had adjusted to using the term "logical data model" for the
> "implementation model" for a reason -- it took me a bit to accept this
> revision of the terms, so now I'm reverting back, so that is one thing
> that "sounds wrong." The other is what I have heard from you and
> others before, where there is some huge reservation in discussing
> storage/persistence/saving data to a secondary storage device for
> retrieval at a later time in the same sentence as mentioning a logical
> data model. Maybe these issues are tied together and a simple return
> to my prior understanding of a logical data model, where it is
> independent of any specific tools (such as an SQL-DBMS) would clear it
> up.
>
> In your def of a logical data model, are you careful not to bring in
> implementation-specific matters such as the way any target environment
> might handle nulls, for example?
>
> > > but I'm trying to hear it through your ears so I can revise my
> > > terminology appropriately.
> > >
> > >>> There must still be a design for
> > >>> the data in these documents, perhaps specified with a dtd or xsd. I'm
> > >>> pretty sure that some designs are better than others, so what would be
> > >>> a good way to approach the design for this (these) data?
> > >> Just like any other hierarchical implementation: you'll have to
> > >> make decisions on what comes first and how to manage update
> > >> anomalies - but that is /outside/ the logic of the data.
> > >
> > > Some designs are more maintainable than others -- those would be ones
> > > that might fit into a "best practices" category.
> > >
> > >>>>> implementation differ from the corresponding data model developed for
> > >>>>> an SQL-DBMS?
> > >>>> /corresponding/
> > >>>> No real correspondence.
> > >
> > > OK, the correspondence is this. I'm sitting here with XML documents
> > > (or choose a non-SQL-DBMS vehicle for specifying, putting, and getting
> > > data that must remain after the program writing the data is shut down)
> > > and you have Oracle in front of you. If we start with the exact same
> > > conceptual data model (UML, ORM, ERD), we are not going to end up with
> > > the exact same data model that we implement. We have some guidelines,
> > > best practices, theory, etc related to Oracle that line up somewhat
> > > closely with those for DB2, SQL Server, and Sybase. All other tools
> > > for databases (spoken loosely) are a free-for-all, without the wealth
> > > of a half-century of best practices (typically found by employing
> > > practices that are not) put in writing anywhere that I can find.
> > >
> > >>> I will try to be more precise, but there will still be room for
> > >>> miscommunication, I'm sure. Given a conceptual data model (not built
> > >>> to take any particular implementation into account), how would an
> > >>> implementation data model (aka logical data model) for XML differ from
> > >>> an implementation data model for a SQL-DBMS?
> > >> The logical model does not differ. The implementation does.
> > >
> > > Yes, I think many writers use the term "logical data model" or LDM to
> > > refer to the model that is specified as schema, but I am fine with
> > > referring to it as the "implementation data model" if you prefer.
> >
> > Rel ...
>
> ?
>
> > >>> Take a simple conceptual data model for pizza orders, for example ;-)
> > >>> perhaps specified using an old-fashioned ERD or UML (only because I'm a
> > >>> novice with ORM).
> > >> For data modeling UML is a bad fit. Though the class-diagrams have a
> > >> sufficient graphical syntax, it's designed for software design, not
> > >> database design.
> > >
> > > I mentioned two other possibilities, so hopefully one meets with your
> > > approval?
> >
> > Heh. I'll grab this opportunity to ramble. The graphical syntaxes, as
> > long as they are rich enough, do not really matter.
> >
> > I think there are two basic approaches to conceptual and logical
> > modeling: /thing-/ and /fact-/ thinking. ERM is closer to
> > /thing/-think, ORM is closer to /fact/-think (and more detailed).
> > Though I personally (non-rationally, it just feels closer to how I
> > think I think) prefer thinking about facts,
> > I have seen some quite good systems around ERM (things).
>
> I like the ORM approach, but am not well-versed in it. ERM (using
> ERD's) and UML are both comfortable to me for conceptual modeling,
> preferably on the back of napkins or on a white board and definitely in
> combination with a glossary.
>
> > >>> Pizza(pizzaId, pizzaName, toppings*)
> > >>> Customer(customerId, name, phone)
> > >>> Order(orderId, customerId, pizzaId, addToppings*, removeToppings*)
> > >>>
> > >>> * zero to many values possible (maybe 1 to many in the first case, but
> > >>> simplifying here)
> > >>> Simplying in several ways, including not permitting 1/2 pizza toppings
> > >>> for our purposes here
> > >>>
> > >>> Would there be a difference in the data model if such data were
> > >>> designed for storage in XML documents compared to if designed for
> > >>> storage using an SQL-DBMS?
> > >> Well, the toppings, as we discussed previously, may have a meaningful
> > >> (by itself) order, so you'd have to take special measures when
> > >> implementing in SQL.
> > >> With all the other stuff the order by itself does not have a meaning,
> > >> so you'd have to take special measures for all of that when
> > >> implementing in XML.
> > >
> > > And with XML, it is likely that the toppings would be taken as elements
> > > under the pizza, rather than having a separate PizzaToppings table.
> >
> > Yep. About query-bias: Now we want to actualize our ingredient storage
> > based on the sales.
>
> Not a problem, but here's how I justify that -- I have been in mgmt
> positions where I requested reports from different database
> environments. Pretty much across the board when I reqeusted
> information and the data came from a non-SQL-DBMS, I got my reports
> faster, and they were accurate when I got them. Maybe a student
> employee wrote a calculated field (derived data, computed column,
> virtual attribute, user-defined field) to get it out for me, but as a
> mgmr, I got what I requested in a timely fashion. This is a gross
> over-generalization, but I'll still say it -- reports from SQL-DBMS's
> were more likely to be incorrect (the person using a reporting tool or
> writing code messed up related to joins or nulls or whatever selection
> criteria with their first shot at the report). They also seemed to
> take more people hours to get them from requirements to production
> software.
>
> > >>>> The next (XML, but it doesn't matter)
> > >>>> document created by dumping a database
> > >>> This would not be a database dump,
> > >> Yes it would, in theory - it is one way to discuss the differences.
> > >>
> > >>> but the creation of an application
> > >>> using XML documents (yes, I realize this isn't a highly scalable
> > >>> approach to building an app, so we can assume the data volume will be
> > >>> small if that helps set the scene -- perhaps these are pizza orders for
> > >>> a school sale)
> > >>>
> > >>>> may differ from the previous, even if the (SQL, but it doesn't matter)
> > >>>> database content stayed the same.
> > >>>> Arbitrary order would have to be added to prevent this.
> > >> This is the point of the database dump approach - I
> > >> wasn't suggesting to actually start with a dump.
> > >> Do you see why this is problematic?
> >
> > Please comment on this.
>
> When you ask "Do you see why this is problematic" I'm not sure what the
> "this" is, which is why I skipped by it. Is it the arbitrary order
> that you find problematic? The only way that data can be communicate
> is in an order, whether specified or arbitrary. I'm pretty sure I'm
> missing your question here, sorry -- please restate it.
>
> > >>>>> What would be some best practices for modeling data in
> > >>>>> this environment?
> > >>>> /this environment/
> > >>>> A marketing environment? Fantasy island?
> > >>> Yes, mAsterdam, let's say that it is a pizza sale for a school on
> > >>> Fantasy Island.
> > >> Let's eat pizza, then, and not discuss pyrotechnics.
> > >
> > > lol
> > >
> > >>>> </Annotations>
> > >>>>
> > >>>>> I'm guessing some will think that the exact same logical data model
> > >>>>> would be appropriate for both targets, but hopefully many will agree
> > >>>>> that it is unlikely that the best implemented data model would be
> > >>>>> identical in each environment. In that case, what would the
> > >>>>> differences be? What best practices would apply to data modeling for
> > >>>>> XML documents compared to data modeling for a SQL-DBMS?
> > >>>> Can't really answer that except "You don't".
> > >>>> The question by it self shows to much wrongs.
> > >>> There is a question here that is legitimate,
> >
> > >> Ok, then please try to rephrase - from your blogs
> > >> and previous discussions I know you are able to reposition
> > >> (though I have to admit that some things are really tough to get
> > >> through to you: constraint handling, for instance).
> > >
> > > OK, but it is also hard for you to get your constraint handling all the
> > > way through to the end-users to maintain, right ;-)
> >
> > ? I don't get this, sorry.
>
> I want to push as many constraints as feasible into the data, so it can
> be maintained either by configuration of implementors or by end-users.
> If constraints are coded to SQL, they are not typically available for
> end-user modification.
>
> > >>> even if I have not yet hit
> > >>> the nail on the head. I've tried to ask this question a number of
> > >>> times in various ways, as you know, but each time the question is
> > >>> considered abberant. If we paint the scenario with what might be
> > >>> termed an "embedded database" that is not an SQL-DBMS, perhaps
> > >>> Berkeley-DB or Cache' or Pick so that we have the same ability to
> > >>> remove 1NF and 3VL-style of NULL-handling, then ask the same question,
> > >>> would it be a legitimate question? If so, then same question.
> > >>>
> > >>> I'm certain that over the past half-century of software development,
> > >>> some data design patterns have shown themselves to be better than
> > >>> others for such target environments. I'm wondering what both theorists
> > >>> and practitioners think would be best practices when data modeling for
> > >>> these environments.
> >
> > >> Separate the logical model design from the implementation design
> > >> (implementation including SQL-DBMS).
> >
> > > I have read and worked with these definitions trying to use the most
> > > common forms, but I am OK with splitting out what I am calling a
> > > logical data model into an LDM and an implementation data model if that
> > > clears anything up. Thanks. --dawn
> >
> > Rephrasing the question does not seem trivially easy or
> > straightforward to me. As you may see I am trying to guess
> > what it is and contribute some to what it might be (and ramble a bit).
>
> OK, I think if we can have common defs for logical and implementation
> models, that will help me rephrase in those terms. Cheers! --dawn

Already into a quagmire of definitions then. Look dawn, I'm personally not comfortable with the definition of 'data' in the term 'database', much to Bob's dismay at those who share my 'new-fangled' view, but what I did learn from him is that when we communicate it is just obstinate not to share established derivations. So now I happily use data to mean a statement of fact. Its hardly going to kill me, and it means people can understand each other. Equally we all use a (the!) common understanding of Logical Data Model here. There's just no point in being polemic - just use the consensus definition and we can all happily comprehend each other without 30+ posts on definitions. There is simply no need for renaming.

Now to address your original post again, with something that may perhaps be useful - it occurred to me that given XML forces everything, naturally or otherwise, into a hierarchy, it might be worth you investigating something like the ACM portal for data structuring standards in hierarchical databases. Obviously the references will be old because of hierarchical databases being superceded by the RM, but you might find something of value for you. J. Received on Thu Oct 26 2006 - 11:59:44 CEST

Original text of this message