Re: Modeling Data for XML instead of SQL-DBMS

From: mAsterdam <mAsterdam_at_vrijdag.org>
Date: Fri, 27 Oct 2006 22:48:26 +0200
Message-ID: <45426ff3$0$322$e4fe514c_at_news.xs4all.nl>


dawn wrote:
> mAsterdam wrote:

>> dawn wrote:
>>> mAsterdam wrote:
>>>> dawn wrote:
>>>>> mAsterdam wrote:
>>>>>> dawn wrote:
>>>>>>> mAsterdam wrote:

> <snip>
>> In the logical model there is strictly no
>> more need for implementation specifics than in the conceptual
>> model. However, the team is closer to cut-over day, so
>> implementation is more on their mind. So in real logical model
>> documents you will see some implementation stuff, but that
>> is noise, not signal.

>
> OK. Restating this, I think that here you are saying that a designer
> could create a logical data model without knowing whether the
> implementation will use a SQL-DBMS, a MUMPs product, a Nelson-Pick
> product, or XML. Additionally I think you are saying that when the
> design does know the target, this often influences the logical data

          ^team<
> model, but that is not a good thing.

Yep.

>>> there we go to "design."  I think of the conceptual model as analysis
>>> and the logical model as design.  I don't really think about any other
>>> models, with iterations of the logical model until it becomes the
>>> version of the logical model that gets implemented.  I used to think of
>>> the logical model as the one at the start of that process, but after
>>> reading a bit, I started to think of it as the one at the end of the
>>> process (the implementation model).  It is the last one that includes
>>> design related to optimizing for the target environment.

>> If these different models are defined, they are
>> defined as deliverables in some 'formal' (handbook/official/case-tool)
>> software development process. There is no generally accepted
>> reference process, AFAIK, so I have to rely on those I dealt with.
>> In one official process the requirements for these deliverables
>> (or artifacts if you prefer) were such that the conceptual models
>> allowed for many to many relationships, while in the (integrated)
>> logical model they were forbidden.

>
> Why should they be forbidden when they are quite viable logically and
> can be implemented in some DBMS tools?

[m:n]
No 'should', really. Whatever the implementation will be, at some point in the process the many to many relationships will have to be transformed into named things (SQL: tables with at least two foreign keys, Codasyl: records in at least two sets, etcetera).

This particular process had the naming of those in the steps which led to the logical model - quite workable.

BTW from this one can read that the conceptual model was ERM-based, not ORM - with ORM many to many relationships come out named anyway already during the conceptual modeling.

>>> The question I have now is whether the logical data model presupposes a
>>> family of target DBMS's, such as SQL-DBMS's or Pick DBMS's or whether
>>> the logical data model is data-model independent.  Pascal's definition
>>> makes it very data-model dependent.  That is how I was using the term
>>> in my original question.

>> You can't let go, can you.

>
> Yes, I'm very eager to let go. You might be aware that I am not
> Pascal's number one disciple. I just don't want to leap from one
> definition to not-one. So, I'm trying to nail down your definition
> since you are providing clues, but not a def I can work with as yet.

Hey! I would provide a definition if I could.

However, I already stated that a definition (if it proves to difficult to find one) may not be needed for your scoping and disambiguation needs. Careful description of the boundaries may be sufficient, in this case logic of data vs. implementation.

> On the one hand, it sounds like it is completely independent of the
> target environment, and then on the other hand you suggest that M-M
> relationships should be out of the logical model.

This particular logical model deliverable had as its requirement that there should be no many-to-many relationship. I am not saying logical models in general shouldn't have them. I could imagine another formal process were the (or a) logical model deliverable does have many to many relationships. You appear to be looking for the universally accepted software development process. There is none. Handbook/official/case-tool software development processes /do/ differ.

I do care that it (the naming of the many-to-many relationships) has to be done at some point, I do not care when exactly. There is no /the/ logical data model. However, as soon as implementation is the topic, it is not a logical model.

>> That definition is not
>> "common", AFAIK (and it is not clear).

>
> I understand that, so which of the myriad of other ones available comes
> close from your perspective?

I do not need one.
A software development process needs one, and I am happy to adapt.

> I'm just trying to jump from one that I
> thought was accepted (even if not my original understanding) to another
> one.

Not by me, but hey - maybe by everybody else? There are many Pascal-fans around, here. You can recognize them as such when they bark about VI's.

> Given that I did not grow up on the RM, I cannot rely on my own
> impressions for these definitions. I would like working definitions or
> even just descriptions that are workable. Obviously these defs are
> only helpful if they are target DBMS independent. I don't doubt we can
> find some workable definitions, but I'm not sure where to find one that
> works.
>

>> Do you have another reference?

>
> No, but at one point I collected about 50 descriptions or defnitions
> before I revised my own perspective on the meaning of the term in light
> of what I read.

Forget this one. It does not help.

>>>> When the implementation is in SQL the schema can be
>>>> very close to the logical data model so the distinction isn't
>>>> important most of the time.
>>> That is also how I think of it for other environments, which would make
>>> sense if the logical data model is related to the target data model.
>> No. Let's not get into this loop. Build on signal, not on noise.

>
> Aye, Aye.
>
>>>> In other environments
>>>> you can have a physical model, elements of which will have
>>>> to be thoroughly associated with elements from the logical
>>>> data model - but I would not call this physical model a
>>>> data model -
>>> I have no interest in the physical model the way I think of that term
>>> (other than to have knowledge of it for performace tuning for the final
>>> implementation model).
>> When you plan to implement hierarchically, you simply have to decide
>> what data is more important than other data,

>
>>From what I have collected to date regarding best practices for data

> modeling when the target is shallow tress on nodes of a di-graph
>
> [I suspect you are using the somewhat inflammatory term "hierarchical"
> both for its emotional impact and due to the trees on the nodes? It is
> both inaccurate and maybe even offensive, so I'm hoping that in the
> future you can avoid that term in this discussion ;-) ]

No inflammation intended. Hierarchy (trees of nodes), is what IMS, documents, file systems, XML, Lotus Domino and LDAP as data containers have in common.

ISTM that hierarchy is a key concept in finding out what deliberations concerning the organization of data are necessary without the need to get into the characteristics of the specific products.

What is inaccurate about that?
What is offensive about that?
How does it impact emotionally?

> nothing suggests that there be decisions about what data are more
> important than others.

What is going to be your top level?

> There might be selections about what portals or
> entry points users require into the data so those can be beefed up with
> virtual fields for reporting purposes (for example). It is fair to say
> that it is a less democratic approach than the RM in that different
> propositions and different aspects of propositions are modeled
> differently.

?

> Constructing the model is more like constructing the
> propositions themselves.

I don't understand this, either.

> Some data are entities of interest to users,
> others are entities more important to "the system," while others are
> properties of entities (that's at least one way of looking at it).

I am lost.

>> and you have to decide in
>> which order all data is organized.

>
> In the case of XML that is correct, but not so for all non-SQL-DBMS
> models.

That /is/ the topic, no?

> In most the ordering among tuples of data is known only to the
> DBMS unless sorted for a representation (just as with the RM). In
> many, the ordering of a tuple in a relation is more like the
> mathematical definitions of a tuple with the values ordered (a1, a2,
> ... , an) and names only descriptive of a place in the ordering.

>> This is a not yet tool/product
>> specific set of decisions you will have to make - say 'implementation
>> class' specific considerations.

>
> Yes, and I have been using the term "data model" for that. In other
> words, you can design for all tools that employ a particular data
> model, whether Relational, XML, Nelson-Pick, MUMPS, ... Then you tweak
> for a particular implementation of that model after your more generic
> design. What is the name of this aspect of design: the model that
> plays to a particular family of DBMS-products (aka "data model") but
> not to a specific implementation of that model (aka DBMS).

Ah, its not - you want to talk about Pick. That is not the topic you advertised.

>>>> I'd call it a storage model if I'd have to classify it.
>>> Yes, agreed.  I let the DBMS developers care about the storage model.
>> You can't if you are using XML.

>
> I can treat any model as logical. One could model for XML and
> implement that "schema" in a tool that persists the data using a
> relational model, right?

Yes, but your development team (including db-guys) will have to decide how exactly. DBMS's provide features, but the design decides how they are used.

>>> The logical model is the one I specify to the DBMS.
>> No. That is the schema.

>
> OK, that works for me. It appears your models are conceptual (plural)
> to conceptual (single) to logical to schema. Is that accurate?

Yes.

>>> I'll grant that
>>> with some tools the physical model more closely resembles the logical
>>> model than in others.
>> Yes. IOW the schema should reflect the logical model,
>> but how close it is to the logical model varies greatly.
>>
>>>> If somebody else would call it a physical data-model,
>>>> I would not interrupt, the message is clear. Calling it a
>>>> logical model /would/ make me object; it's wrong:
>>>> the physical model is not /about/ the logic, it presupposes
>>>> the existence of a logical data model.
>>> Agreed.
>> Then please stop to redefine the logical data model as
>> implementation specific.

>
> ? I'm begging for new definitions that work.

For scoping and disambiguation, right?

> I could maybe drum up
> another one, but I "feel" that it would get the same response. So, I'm
> happy to try to learn your definition from your descriptions, but I'm
> having to ask questions to nail it down.
>
> Even if you have to repeat yourself, I would be very appreciative if
> you could be clear on this question:
>
> Does the logical data model assume any family-of-DBMS-tools as a target
> (what I have referred to as a "data model"), such as Relational, MUMPS,
> Nelson-Pick, XML, OR, alternatively, is it independent of the target?

I would appreciate if others would chime in here; this is just my opinion:

In my book (as a manner of speaking - I never wrote one) logical data models are independent of the implementation target, even family-of-DBMS-tools. As far as a delivered 'logical model' document contains implementation considerations, they are out of scope addenda, however useful they may be.

Clear enough?

> If the latter, which is what I think you are saying, then why are you
> removing M-M or designing to avoid nulls (see example later).

The m:n removal was simply a formal requirement in that particular process. I commented on that above.

[snip more logical model definition talk]

>>>> Sharing data: Use of the same data from more than one point of view.
>>> Well, I am working with "shared data" by this definition -- many points
>>> of view, but they are all "managed" collectively.  There need be no
>>> assumption that each software entity that is sharing this database does
>>> so in its own silo, nor that you must permit this sharing by code
>>> written by developers who do not talk to each other
>> This limits the team size.

>
> That might be. I have not worked on a team larger than 200 (with a
> business org structure, of course -- not one big unpartitioned team)
> all developing for the same DBMS and/or integrating 3rd party tools
> into it (not all co-located, by the way). But my impression is that
> there are teams that are much larger (in larger organizations) doing
> similarly.

Impressive.

>>> do shared QA, for example.
>>> With Codd's "large shared data bank" I think there is
>>> some assumption that we need to be able to permit people who don't know
>>> each other to each write code that shares only the database and nothing
>>> else.
>> I'd say 'Codd's approach allows for people who ...etc'.

>
> works for me.
>
>>> That would (typically) not be acceptable for the database
>>> products I'm talking about.
>> Why not?

>
> Do you know the answer and are just testing me? I'll play --

Heh. Pleading not guilty - the 'not be acceptable' just sounded strange to me.

> Because the constraints are specified in data (functioning as
> metadata), rather than being specified in schema. Proprietary
> libraries are written against these so that there are cross-app
> components on top of the DBMS, with vertical apps on top of those,
> instead of vertical silos sitting right on top of the DBMS. Bottom
> line: Apps share libraries related to puts, gets, and related
> constraint logic. Some might not like it, but a lot of software has
> been built and continues to be built with this architecture.

Yes. The relevance to database theory seems to be: life is possible without it :-)

[snip]

>> AAAAaaaRghgrrrmpf. NULLs. Side note: Lots of of people here appear to
>> like to talk about that. If you do, please take over :-)

>
> I only bring it up as an example of where I see differences in models
> somewhere after the conceptual model (afraid of using the "logical"
> term), starting in "high level design" between the project-specific
> models when the target is an SQL-DBMS than when it is something other.
> I don't need to get into another bru-ha-ha over that.

I know, I know, it's relevant.
Does not make me like the topic, though.

>> How did, working from real examples, the NULLs get in?
>> Where did they come from? When did they get in?

>
> One model related to propositions such as:
>
> Amy is married.
> Hal is divorced.
> Sylvia is single.
> Hope is married.
> We do not have enough information to know whether Lily is married or
> not.

The last one is strange mix of non-fact and meta-information. Is it stating that we need to know Lily's marital status? If so, why? What are the requirements?

Here is another breakdown.

Amy is a person.
Amy is married.
Hal is a person.
Hal is divorced.
Sylvia is a person.
Sylvia is single.
Hope is a person.
Hope is married.
Lily is a person.

We still do not have enough information to know whether Lily is married or not, but no strange proposition stating that as a non-fact.

> could be a Person entity with attributes that include firstName and
> maritalStatus.

... allowing some kind of NULL in maritalStatus - so, implementation dependent.

> Another model related to this same set of propositions could be a
> Person entity that includes the attribute firstName and a
> PersonMaritalStatus entity that includes the attribute personId
> (foreign key to Person) and maritalStatus.

'foreign key' is SQL. It seems that at least this NULL got introduced when implementing - or modeling with implementation in mind, not just the facts.

[snip]

>> There is no difference between conceptual and logical modeling as far
>> as the need to include implementation details is concerned.
>> As a deliverable, logical models cover more detail, and will in
>> practice, mistakenly, contain more implementation specifics.

The NULL marital status above is an example of such a mistake.

> OK, confused again. You indicated that M-M relationships between
> relations were not in logical data models.

No. Please re-read that part (universal development process etc..) and the second comment

> There are Books-Authors
> types of M-M relationships implemented in DBMS's using the Nelson-Pick
> model. I am really, really trying not to be stubborn on this, just
> trying to understand.

Ok. Did re-reading and the comment above (search for 'm:n') help?

>> I think that in order to investigate your question,
>> <rephrase> (correct me if I'm wrong, please)
>> If we have a logical model and we decide to implement using
>> hierarchical tools /without/ specifying which one
>> (IMS, Lotus Domino, XML, LDAP, just to name a few alternatives)
>> now what? Which choices do we have to make?
>> </rephrase>
>> you will first need to get this distinction right.
>>
>> While on the one hand you say that you are willing
>> to adopt the terminology I use on this distinction,
>> up to now I only see you trying to blur it.

>
> My questions are asked in order to understand it, not blur it. Sorry
> so dense and thanks for your patience. I'll get there, but I'm not
> there yet. There are still seemingly contractions in what you are
> telling me.

Point them out, I hope my wordings improve because of this and similar exchanges. That's what I am getting out of this. So, no need to worry. Unfortunately normally I just don't have the time I happened to have the last few days.

> I need to understand where in the list of models the
> designer will need to have an understanding of the target DBMS
> environment.

Schema design.

> attempting to use your terminology
>
> conceptual models --> conceptual model --> HERE --> logical model -->
> schema
>
> OR
>
> conceptual models -->conceptual model --> logical model --> HERE -->
> schema

I'd say at the second spot.

> If the first of these, then the marital status logical model would be
> the one with firstName and maritalStatus as attributes of Person, if
> the latter, then if deploying in a SQL-DBMS (assuming a no-NULLs policy
> or best practice), the second would be the model.

Why is that SQL specific? I don't think so. Received on Fri Oct 27 2006 - 22:48:26 CEST

Original text of this message