Re: In an RDBMS, what does "Data" mean?

From: Dawn M. Wolthuis <dwolt_at_tincat-group.com>
Date: Fri, 18 Jun 2004 15:32:48 -0500
Message-ID: <cavjhq$e75$1_at_news.netins.net>


"Anthony W. Youngman" <wol_at_thewolery.demon.co.uk> wrote in message news:L4l1l0HM7y0AFwZd_at_thewolery.demon.co.uk...
> In message <NSpyc.2728$pK3.2253_at_newssvr15.news.prodigy.com>, Eric Kaun
> <ekaun_at_yahoo.com> writes
> >"Anthony W. Youngman" <wol_at_thewolery.demon.co.uk> wrote in message
> >news:rDtNrmL7f6xAFw2a_at_thewolery.demon.co.uk...
> >> In message <qJJxc.385$Pt.272_at_newssvr19.news.prodigy.com>, Eric Kaun
> >> <ekaun_at_yahoo.com> writes
> >> >"Anthony W. Youngman" <wol_at_thewolery.demon.co.uk> wrote in message
> >> >news:9EJ4q$GofOxAFw1i_at_thewolery.demon.co.uk...
> >>
> >> In theory, no. In practice, you might choose to split the invoice data
> >> across two FILES, where you've promoted sub-attributes of INVOICE to be
> >> primary attributes of the secondary file.
> >
> >Why would you do the split, when the intent seems to be to keep things
> >whole? Is this purely for performance optimization? I have a hard time
> >keeping up with shifts between logical and physical, and the reasons for
the
> >splits. I understand you CAN do these things, but why and when? What are
> >your heuristics?

>

> The heuristics are probably when it gets too complicated for the brain
> to comprehend easily.

I suspect there are some papers about this as it relates to XML documents and Jan Hidders did point to some papers re normalization for XML at one point IIRC. The rule of thumb is to try to match up to the way people think when designing the logical structure of the data in a "nested relational" structure (I don't like that description of XML, but it gives some hint that we are still working with relations)

> Okay, I'm getting physical, and messy, but that's the real world. Where
> do you draw the line between biology and organic chemistry? Between
> organic and inorganic chemistry? Between chemistry and the physics of
> atoms?

>

> Okay, there is a pretty clear line between the physics of atoms and
> atomic physics, but that's an anomaly!
>

> I can understand you want things nice and clear cut, but the real world
> isn't like that.

so, then, you are NOT a Calvinist? ;-)

> I use relational theory to help me understand the data
> down to one or two levels deeper than I need, then I draw the line at
> whatever level seems appropriate.

>

> Don't forget, I'm a chemist by training. If I'm doing bio-chemistry it's
> incredibly useful to understand electron orbital theory but I can't WORK
> at that level. It's just *too* abstract to be meaningful.
>

> By abstracting data down to (and focussing on) the tuple, relational
> theory has just gone into TOO MUCH detail and lost sight of (indeed, to
> some extent DESTROYED) any view of the big picture...
>

> What you want to do is present the user with a view of the data at their
> level, and then analyse it deeper.
>

> As a chemist, I think in molecules. As a businessman, I think people
> tend to think in terms of customers, invoices, things like that. THAT is
> the level at which the database should interface with users.

The analogy to molecules is an excellent one. Looking for, and defining, the molecules among the data in our problem domain is really what we do with our logical data models.

> Relational interfaces at the chemical equivalent of atoms - with the
> tuple. The poor programmer has to think UP to the "business object"
> level, and then UP AGAIN to the reality equivalent.

>

> With Pick, I can stand at the "business object" interface, and reach
> DOWN into the data, and UP into reality. It's far easier to stand on the
> interface reaching in both directions, than to be mired down in the
> detail, struggling to get out.
>

> I am sorry I can't give you a better answer than that. But the real
> world is messy. Deal with it!
> >
> >> >> >4. sub-sub-attributes
> >> >>
> >> >> Simply nest sub-attributes one level deeper. :-)
> >> >
> >> >Ah, hierarchical induction. I'll just have one File in my app. :-)
> >> >
> >> Nah! FILE = noun :-)
> >
> >Okay; I name my file "MyApplication." :-)

That's where experience and best practices come into play, where you would play to the strengths of whatever tools you were using and would, likely, not name your file that.

> >> You completely missed the point here. Where and why would you use a
> >> cascading delete? THINK! Be *practical*. What *works* in *reality*
> >> (rather than theory, which can think up a thousand impossible scenarios
> >> before breakfast (with apologies to "Alice in Wonderland")).
> >
> >Seldom, due to business desires, but to answer the question you're
getting
> >at: when there's a foreign-key dependency, and there's one relation that
is
> >deemed "important" enough to trigger the cascade. There could be
multiple,
> >though that's rare...

>

> So I would I would seriously consider pulling all the tables into which
> the cascade went into a single FILE. "Rules are for the guidance of wise
> men, and obedience of fools" - I would use my intelligence as to whether
> this made sense.

and it is definitely the case that for many of the non-DBMS's that are really enhanced file systems (which is where PICK really falls) that the developer is given enough rope to hang themselves.

> >> >> And then you use common sense to say "I use these fields all the
time,
> >> >> and these fields only rarely" so you split A into two physical
FILEs,
> >> >> and make all the colums of A-rarely into virtual columns of
> >> >> A-all-the-time, and vice versa. So for retrievals the user notices
> >> >> nothing (apart from the speed-up), although it does cost a bit extra
> >> >> logic when updating.
> >> >
> >> >A logical cost is a big cost. Every updating app needs to know that,
> >right?
> >>
> >> Yep ... but relational theory, which imposes mandatory separation of
the
> >> logical from the physical, imposes that cost on EVERY app, not just
> >> those that update the data.
> >
> >And it enables EVERY app with EVERY optimization. (not close, but you get
my
> >drift)

>

> But if the small cost of the "bad" code (which by definition is rarely
> used) makes a big difference to the cost of the "good" code, then I'm
> laughing all the way to the bank. Who cares if I add an hour to a job
> that runs once a month, if by doing so I can shave a second off a job
> that 50 users use several hundred times a day?

And you do have the option of having "services" that test business rules as well as others that perform updates to ensure the same degree of consistencey and decoupling of app and database as an RDBMS. These same services can be used to determine the appropriate GUI components. Excellent software can be written, but the database does not require anything of the developer -- there is considerable freedom (to shoot yourself in the foot) and this is what also provides the ease of maintenance.

> >You've just successfully argued against code sharing, by the way, since
if
> >something is coded badly (either slowly, or laden with defects), then
every
> >app has to suffer, so you're better off recoding it in each app, right?

No -- then fix it.

> >> Furthermore, by actively hindering the programmer from providing hints
> >> to the database, relational forces the programmer to rely on the
> >> database's artificial intelligence, which is quite likely to guess
wrong
> >
> >And you've also just argued against compilers, since they're so likely to
> >guess wrong about the intention of your code, and therefore will produce
> >badly-optimized machine code.
> >
> >> Yup. Let's assume that the Pick database has been designed properly,
and
> >> that within the FILEs the data has been normalised. I can now present
my
> >> apps with a *closed* relational view!
> >
> >What do you mean by that?

>

> I mean it's like a relational view, but if I've got a "one to many"
> relationship, the "one" data only appears once, not replicated for every
> instance of the "many".

The replication in a SQL-DBMS is only upon viewing granular data. I think I know what you mean by this, but perhaps it makes more sense to state it differently. A FILE would include the "one" and the many manys. We would not haves separate relations defined for each many, with multiple rows for each of the "one" in order to link such relations back to the "one" (master) relation. One FILE of PEOPLE in a non-1NF structure (such as XML docs or PICK) could easily turn into 20 relations in a SQL-DBMS. The 1 million records in that one PICK file could turn into those 1 million rows plus multiple rows for each of these records in each of the other 19 relations that were split out when putting into 1NF. Since each of those 19 files needs to have a candidate key, a lot of generated keys get built to accomplish this 1NF. So, there is a lot of extra data stored in the relational structure (in the form of lots and lots of keys).

I have the feeling I didn't actually CLARIFY your statement, Wol, sorry, but the way it is stated, I would object when my relational hat is on.

> >> My Pick application has also FORCED, by DEFAULT, my database to store
> >> related data close to itself (what relational calls clustering, I
> >> believe). It's fairly easy to prove, statistically, that this will
> >> optimise data retrieval from disk.
> >
> >For that one access path.

>

> Here you go again - crippling the race horse so we can have a "fair"
> race against the crippled old nag ...

Now, there's no reason to call me names ;-)

> >> Sod AI optimisation, Pick doesn't
> >> have a choice and it works, which is why in any system lacking
> >> sufficient ram a Pick app will kick the equivalent relational app's
> >> butt!
> >
> >I can think of several faster alternatives. Using ROWID and stashing
> >hierarchies in Oracle tables would at least close some of the gap.
> >Performance isn't the only point, but oh well...
> >
> >> Basically, by not hiding the physical implementation from the user,
Pick
> >> makes it easy to prove there just IS NO room for improvement.
> >
> >hahahahaha
> >
> >Oh - you were serious. My bad.

>
> Yes I was :-)

Yes, he was. I don't know about the PROVE part, but I do know that there is a HUGE difference between letting queries fly on the MV side of the house compared to tuning SQL statements ad infinitum on the SQL side. After trying to migrate people from the old (PICK) to the newer (SQL), I have become completely convinced that is a significant step backwards. If there were something comparable to ODBC for non-SQL structures, there would be no reason at all to consider SQL in those environments.

> >> By hiding
> >> the physical from the user, relational forces you to rely on the AI and
> >> you have no way of knowing whether it is efficient or not.
> >
> >AI? Yes, I'd hate to rely on something like a "computer" or some other
> >fancy "automaton" that does "logic" or some such liberal nonsense... :-)
> >
> I don't. I rely on statistics to tell me the Pick model does a better
> job than AI.

>

> That crack about the race horse was deliberate. Relational seeks to make
> all access paths equal. Fair enough. Rather like the UK educational
> system that sees competition as "unfair" and wants all schoolkids to
> leave Uni with a first class degree, not caring whether they are a dunce
> or a genius (sadly, I'm serious about our education :-(
>
A "no child left behind" jab on state of US education would be in order, but it is so upsetting that I can't muster one right now.

> You said "for that one access path". But that access path IS THE MOST
> COMMON PATH! So. I can prove that it's the most common path. I can prove
> it's the most efficient path.

>

> Can you prove, that by crippling the most common path, you can improve
> the "worst path" cases enough to make it worth-while? Was it Knuth that
> said "premature optimisation is the worst evil"? I couldn't give a damn
> if the nag trails in last by a racecourse. I want the thoroughbred to
> win.

Another good analogy. By trying to treat all relations as if they are of the same weight -- as if none is a more important entry-point into the data, we are breaking down the molecules and focussing only on the atoms, without giving any hints as to where the molecules are to be found. Those "one" relations in the one to many you mentioned above are often like named molecules, but invisible to the database user -- a case of missing the forest for the trees. Views can be built to add these molecules back in, but there is nothing it relational modeling that makes it clear, or even suggests, how to handle this.

Cheers! --dawn Received on Fri Jun 18 2004 - 22:32:48 CEST

Original text of this message