Re: O'Reilly interview with Date

From: dawn <dawnwolthuis_at_gmail.com>
Date: 9 Aug 2005 23:07:04 -0700
Message-ID: <1123654024.708066.117820_at_o13g2000cwo.googlegroups.com>


erk wrote:
> dawn wrote:
> > While I don't really know WHAT it means, the term "semi-structured"
> > indicates that the data are not "structured".
>
> How can "data" not be structured in some way?

I really, really dislike the term "semi-structured" and I indicated that I really don't know what it means. I do have my own picture, however. I think of it (not the same as defining it) as structured data where there is at least one subtype of String that has some agreed-upon parsing functions available for values of that type. Queries can take place based on data as usual and also based on the marked up data values associated with such subtypes. So, from my perspective, it is still structured data and the "semi" part has to do with more complex types.

Instead of taking all "true propositions" in the problem space and turning them into relational predicates, for example, some are modeled as Strings and some as marked-up Strings. This is one way that I can sort-of-kinda understand why it is not entirely structured -- we didn't take every true proposition and do the same structuring activity with it. Of course, this is nothing new. Any database modeling free-form comments has done this. It is also not uncommon for end-users to create standards for how to "mark up" these otherwise free-form comments to make for easier queries.

And then the simplest way I think of this "semi-structured" stuff is that it is structured data that includes stuffing documents into String values. Big deal. That doesn't make it less structured. If we put cos(0) into a value and have functions associated with the type that equate this with the number 1, that doesn't make the data less structured simply because the values are more complex and have more complex functions associated with them. Semi-structured is working with more functions on strings.

I could be way off base on this, however, and if so, please correct me.  I still find the term a mystery.
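
A minimal Python sketch of this picture, assuming a hypothetical course-catalog record whose `description` field is a marked-up string, and a hypothetical `parse_tags` function standing in for the "agreed-upon parsing functions" of that String subtype. None of these names come from a real DBMS; the sketch only shows one query on ordinary values and one on marked-up values inside a string.

```python
import re
from dataclasses import dataclass

# Hypothetical "marked-up String" subtype: plain text with <tag>value</tag> spans.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")

def parse_tags(text: str) -> dict[str, list[str]]:
    """Agreed-upon parsing function: map each tag name to the values it wraps."""
    tags: dict[str, list[str]] = {}
    for name, value in TAG_PATTERN.findall(text):
        tags.setdefault(name, []).append(value)
    return tags

@dataclass
class Course:
    code: str          # ordinary structured value
    credits: int       # ordinary structured value
    description: str   # marked-up string value

# Made-up sample data, for illustration only.
courses = [
    Course("CS101", 3, "Intro course. Requires <prereq>MATH100</prereq>."),
    Course("CS201", 4, "Needs <prereq>CS101</prereq> and <prereq>MATH100</prereq>."),
    Course("ART110", 3, "Free-form text with no markup at all."),
]

def courses_requiring(prereq: str) -> list[str]:
    """Query on a marked-up value inside the description string."""
    return [c.code for c in courses
            if prereq in parse_tags(c.description).get("prereq", [])]

def courses_with_credits(n: int) -> list[str]:
    """Query on an ordinary value, as usual."""
    return [c.code for c in courses if c.credits == n]
```

Both queries run over the same records; the only difference is that one of them goes through the parsing function attached to the string type first.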

> Given your comments, I
> have no clue what "semi-structured" actually means - although I
> probably didn't before either.

I'm sure I clarified it really well above ;-)

> It seems to mean nothing more when used
> in XML than defining a CDATA node with arbitrary content, something
> relational would support as well

yup

> (though like driving a truck with your
> feet, it's still not a good idea).

CDATA likely isn't, but a long character string with no more definition than that is likely just fine for some purposes -- course catalog descriptions for a college, for example. It could even include markup -- fine with me. Maybe an RDBMS that includes an XML data type is semi-structured even though neither an RDBMS nor an XML document would be. Just thinking out loud. Is it too late to kill that "semi-structured" term?

> > I'm willing to grant that they are, or could be seen as, the same
> > logical model. But from what I've learned on this forum and what I
> > have read, the legitimate issues with databases based on this model way
> > back when were implementation issues -- physical pointers, for example.
> > I have seen nothing that explains why modeling propositions as
> > relations is better than modeling propositions as trees.
>
> Trees of what? Propositions? Or is the tree itself a proposition? Is
> every non-leaf node then a "composite proposition"?

My opening statement on this, before I really answered the question, would be to turn over the paper napkin (so hopefully this is not a nice restaurant with cloth), write a proposition, and then do a "sentence diagram" like we did in grade school way back when. Proposition --> Tree. Then after you pushed me, I'd answer the question better than that.

> The difference is that independent propositions, and the ability to
> derive conclusions based solely on values, is simpler. Relations can be
> used to define "links," should you need them. Using such a "link-based
> data structure" for all data, though, adds complexity and ambiguity.
>
> > > Can you give a mathematical argument for tossing aside any approach at
> > > all in software?
> >
> > No, but I don't claim that there are mathematical reasons for the RM
> > being superior to any other data model, where I have read statements
> > that either state or imply such.
>
> Then forget the assertion of "mathematical reasons." Clearly none of
> the other data models can claim more of such than RM anyway. RM is
> closer to set theory and predicate logic than the others, and those
> things are, at least as far as I can determine, more tractable than the
> more complex logics of trees and graphs.

ok

> > > Perhaps you could define what you mean by
> > > "mathematical" in this context. The primary "soft" argument I'd use is
> > > that relations reduce complexity
> >
> > over trees? That might be true for mapping some parts of a problem
> > domain to a data model, but surely it is easier (for a human being) to
> > conceptualize of a family tree as a tree, right?
>
> Developers regularly use and conceive of things that boggle the mind of
> users,

The users in this case are developers. It is easier for them to model a tree as a tree, and I see no good reason for them not to, even if that adds some behind-the-scenes complexity, imo.

> so that's a non-argument - even a tree-based system will use
> other data structures and algorithms which fall beyond the pale of
> "common sense." A family tree might indeed be better expressed as a
> tree, sure (having never had to manage such data, I don't know). File
> systems, org charts, bills of materials, and so on are all more
> manageable as relations, at least if you're doing more than just
> editing data items (e.g. relations are more amenable to different
> "human views" of the data).
>
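
As a rough illustration of the two modelings being argued over here, the following Python sketch stores the same made-up family both as a nested tree and as a parent/child relation (a set of tuples). The names, data, and helper functions are all hypothetical; the sketch shows only that each form answers some questions more directly, not that either is better.

```python
# The same hypothetical family, modeled two ways.

# 1. As a tree: each person maps to a nested dict of their children.
family_tree = {
    "Ada": {
        "Ben": {"Dee": {}},
        "Cy": {},
    }
}

# 2. As a relation: a set of (parent, child) tuples.
parent_of = {("Ada", "Ben"), ("Ada", "Cy"), ("Ben", "Dee")}

def descendants_tree(tree: dict, person: str) -> set[str]:
    """Find `person` in the nested dicts, then collect everyone below them."""
    def collect(children: dict) -> set[str]:
        found = set()
        for name, grandchildren in children.items():
            found.add(name)
            found |= collect(grandchildren)
        return found

    def find(node: dict):
        for name, children in node.items():
            if name == person:
                return children
            hit = find(children)
            if hit is not None:
                return hit
        return None

    subtree = find(tree)
    return collect(subtree) if subtree is not None else set()

def descendants_relation(pairs: set, person: str) -> set[str]:
    """Repeatedly follow (parent, child) tuples until nothing new appears."""
    found: set[str] = set()
    frontier = {person}
    while frontier:
        frontier = {c for (p, c) in pairs if p in frontier} - found
        found |= frontier
    return found

def parents_relation(pairs: set, person: str) -> set[str]:
    """Asking the question 'upward' is the same kind of lookup on the relation."""
    return {p for (p, c) in pairs if c == person}
```

The tree reads the way a person would draw it, while the relation makes the "upward" question (`parents_relation`) no harder than the "downward" one; in the tree form that question would need a search from the root.
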
> > For whom does a relation reduce complexity?
>
> Everyone. The end-user is irrelevant

I'm trying to make things easy for some of the primary users of the dbms -- developers

> - developers will provide
> (hopefully!) appropriate interfaces for them anyway.

But why have developers continue to write inserts & deletes for ordered lists (just one example I pick on a lot)? Why not tell the dbms that this is a list and then pass this metadata information to the ui as well, with each of these (dbms & ui) "understanding" what that means in their realm and acting accordingly with inserts & deletes? Sure a programmer did this in the 60's and then the 70's and we can keep doing it in language after language, but if we make legos with faces on them, then we don't have to build faces out of legos every time we want one.
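
To make the ordered-list complaint concrete, here is a small Python sketch under stated assumptions: a hypothetical `todo` table with a hand-maintained `pos` column, held in an in-memory SQLite database only because SQLite ships with Python. It contrasts the insert and delete logic a developer writes by hand with the same edits on a value whose type already "knows" it is a list. It is not a claim about what any particular DBMS does or does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE todo (pos INTEGER NOT NULL, item TEXT NOT NULL)")
conn.executemany("INSERT INTO todo VALUES (?, ?)",
                 [(1, "wake"), (2, "work"), (3, "sleep")])

def insert_at(conn, pos, item):
    """Hand-written list insert: shift later rows down, then add the new row."""
    conn.execute("UPDATE todo SET pos = pos + 1 WHERE pos >= ?", (pos,))
    conn.execute("INSERT INTO todo VALUES (?, ?)", (pos, item))

def delete_at(conn, pos):
    """Hand-written list delete: remove the row, then close the gap."""
    conn.execute("DELETE FROM todo WHERE pos = ?", (pos,))
    conn.execute("UPDATE todo SET pos = pos - 1 WHERE pos > ?", (pos,))

def items(conn):
    """Read the list back in order."""
    return [row[0] for row in conn.execute("SELECT item FROM todo ORDER BY pos")]

insert_at(conn, 2, "coffee")
delete_at(conn, 4)
relational_result = items(conn)

# With a value whose type is already "list", the same edits are built in.
todo_list = ["wake", "work", "sleep"]
todo_list.insert(1, "coffee")   # 0-based index 1 corresponds to position 2
del todo_list[3]                # 0-based index 3 corresponds to position 4
```

Both versions end with the same ordering; the difference is who has to write and maintain the renumbering logic.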

> Having seen user
> experiences with both XML (even in context-sensitive XML editors) and
> SQL DBMSs (using queries in MS Access to write and modify reports),
> I'll choose something relation-scented any day.

I suspect you might, perhaps, agree that it is the maturity that is appealing.

> And having seen my own face trying to grok the "data model" of XML (in
> the absence of a more sensible "standard") and its manipulation (e.g.
> in Java), I'll again choose sorta-relational.

Having done the same, I agree. It ain't there the way I'd like to see it.

> > > and enhance flexibility,
> >
> > Is there proof of this?
>
> No. I'd wager none is possible, and even "evidence" is scant or
> nonexistent.

If someone can come up with a reasonably predictive flexibility benchmark ...

> > > I'd call it something of a cousin of Occam's Razor, though perhaps
> > > there's a more formal argument a logician might use.
> >
> > Because simplicity is not measurable in how to model data, it is not
> > useful to say "my way is mathematically simpler, therefore better".
>
> That's not true at all. Striving for simplicity is always useful, for
> even when we can't define simplicity, we typically notice its absence
> and can at least triangulate as we spiral in a "Heisenbergerish
> fashion" around the target.
>
> > When I try to explain to my mom that it would be simpler to model the
> > family tree as relations, ...
>
> With all due respect, I don't care what your mom thinks of it (or mine,
> for that matter). The end user, in this choice (as in that of
> algorithm), is irrelevant.

But what if my mom wants to be a software developer? I might just be beating my head against a wall after attempting to teach business majors Java (don't ask). But software developers don't leap from the womb that way and they have a ton to learn. Why teach them something harder than necessary?

> > If we are looking for the simplest
> > mathematics that meets all of the requirements, then we need to lay out
> > the requirements, not just look for a simple theory.
>
> The choice of data structure goes beyond the immediate needs of a
> single project; again, relational was originally designed for "shared
> data banks," and that is indeed where it shines.

I would suggest a SQL-DBMS for any implementation where multiple companies must directly use the dbms apis. I haven't seen such an implementation, however, but I suspect there might be a couple out there -- anyone?

> If I have a single
> application that only presents data in a single way, I need do nothing
> more than 'new ObjectOutputStream( new FileOutputStream("blah")).
> writeObject(myWholeApp);'
>
> > In the absence of an ability to use only axioms and logic to
> > prove a point, some strong empirical data would be helpful. I know of
> > no experiments in this area, but if you do, please let me know
>
> Nor do I. Where do we go now? "Common sense" routes?

Intuition? Examples? Unfortunately, even when it doesn't sound like it, I'm inclined toward accepting only logical proofs or empirical data to back up claims, while I'm happy to let intuition and stories prompt wild conjectures. The RM has some logical proof within it and some empirical data (market share, installed base), but in the category of "this is the best way for a dbms for many/most business data processing applications" it lacks both, right?
Cheers! --dawn

> - Eric
Received on Wed Aug 10 2005 - 08:07:04 CEST
