Re: In an RDBMS, what does "Data" mean?

From: Eric Kaun <ekaun_at_prodigy.net>
Date: Mon, 19 Jul 2004 17:48:37 GMT
Message-ID: <VPTKc.1179$vN1.1166_at_newssvr19.news.prodigy.com>


Anthony W. Youngman wrote:
> In message <NSpyc.2728$pK3.2253_at_newssvr15.news.prodigy.com>, Eric Kaun
> <ekaun_at_yahoo.com> writes
> I can understand you want things nice and clear cut, but the real world
> isn't like that.

I do understand that, but machines are remarkably literal; they sort of need things spelled out to them, so at some point we need to decide what's important and then how to represent it. Software is description. [Michael Jackson's quote, not mine, though I agree with it.]

> I use relational theory to help me understand the data
> down to one or two levels deeper than I need, then I draw the line at
> whatever level seems appropriate.

This is interesting:
1. If you're just understanding the data, how do you know what you need, much less what "one or two levels" beyond that is?
2. What's a level?
3. Appropriate with respect to... what? Some function of requirements, I imagine, but it would be nice to formalize the rules-of-thumb.

> Don't forget, I'm a chemist by training. If I'm doing bio-chemistry it's
> incredibly useful to understand electron orbital theory but I can't WORK
> at that level. It's just *too* abstract to be meaningful.

Couldn't agree more.

> By abstracting data down to (and focussing on) the tuple, relational
> theory has just gone into TOO MUCH detail and lost sight of (indeed, to
> some extent DESTROYED) any view of the big picture...

The big picture as presented to users is a different thing, outside the domain of relational. The notion that relational, by losing sight of the big picture, somehow destroys it is amusing, but I don't buy it.

And the tuple isn't the focus - the relation is, including constraints on relations. As logical predicates, they bear an uncanny resemblance to many "classes" of rules and requirements.

Of course, you can look at just the "raw data" - typed attributes, values in tuples. Or you can look at individual predicates. Or you can look at database constraints (constraints over the set of relations). Wow! That's three... Three... THREE levels of abstraction in one!
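To make that concrete, here's a minimal SQL sketch (the table and column names are mine, purely for illustration) with all three levels visible - typed attributes, each table as a predicate, and a constraint spanning two relations:

    -- Level 1: typed attributes (custno is an integer, name a string...).
    -- Level 2: each table is a predicate, e.g. "the customer numbered
    -- CUSTNO is named NAME and operates in STATE".
    CREATE TABLE customer (
      custno INTEGER     PRIMARY KEY,
      name   VARCHAR(50) NOT NULL,
      state  CHAR(2)     NOT NULL
    );

    CREATE TABLE orders (
      orderno INTEGER PRIMARY KEY,
      custno  INTEGER NOT NULL REFERENCES customer,  -- Level 3: a constraint
      total   DECIMAL(10,2) CHECK (total >= 0)       -- spanning two relations
    );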

Is that the final word? No, of course not. Date draws a distinction between physical (DBMS storage), logical (predicates), and conceptual (mapping to the user). Views, screens, reports, etc. all help paint the "big picture" to the user. But the big picture won't fully develop if the components aren't there, or the logical underpinnings are suspect.

> What you want to do is present the user with a view of the data at their
> level, and then analyse it deeper.

So you're talking about a RAD / prototyping / extreme programming approach to data design? This seems more like process than a logical definition of the data.

> As a chemist, I think in molecules. As a businessman, I think people
> tend to think in terms of customers, invoices, things like that. THAT is
> the level at which the database should interface with users.

Not a bad idea, though the users of a database tend to be developers. Some users ("power users") can handle SQL and reporting and such, but not that many. Still, given that relational's domain is "shared data banks," a logical representation which supports multiple users and multiple applications (and user views) is likely to be lower level - you need to be concerned with those issues to make decisions that help the big picture look real purty.

> Relational interfaces at the chemical equivalent of atoms - with the
> tuple.

There are no tuple-level operators in relational; although each tuple is a fact, operations are defined over relations (predicates), not individual tuples (which are not to be singled out).
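A trivial SQL illustration (reusing the hypothetical customer table sketched above): a restriction maps a relation to a relation; there is no "fetch next tuple" operator in the algebra, and even a query that happens to match one row still yields a relation, of cardinality 1.

    -- Relation in, relation out; no tuple is singled out.
    SELECT custno, name
    FROM   customer
    WHERE  state = 'NJ';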

> The poor programmer has to think UP to the "business object"
> level, and then UP AGAIN to the reality equivalent.

Again, sometimes the poor programmer is thinking the other way - not of the order, but of which customers in certain states have placed more than 3 orders which contain both condoms and ice cream cones (or some such combination). From that standpoint, the "business object" is... what? In a hierarchical data definition, the most common operations are obvious to data-entry clerks; everything else becomes convoluted procedural logic. (generalization noted)
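For what it's worth, that question is a fairly direct query against normalized relations - a sketch, assuming hypothetical customer, orders, and order_item tables:

    -- Customers in certain states with more than 3 orders, where each
    -- qualifying order contains both products.
    SELECT c.custno, c.name
    FROM   customer c
    JOIN   orders o ON o.custno = c.custno
    WHERE  c.state IN ('NJ', 'PA')
      AND  EXISTS (SELECT 1 FROM order_item i
                   WHERE i.orderno = o.orderno
                     AND i.product = 'condoms')
      AND  EXISTS (SELECT 1 FROM order_item i
                   WHERE i.orderno = o.orderno
                     AND i.product = 'ice cream cone')
    GROUP  BY c.custno, c.name
    HAVING COUNT(DISTINCT o.orderno) > 3;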

> With Pick, I can stand at the "business object" interface, and reach
> DOWN into the data, and UP into reality. It's far easier to stand on the
> interface reaching in both directions, than to be mired down in the
> detail, struggling to get out.

Hey, the interface to the real world is messy - deal with it. That interface is an ever-shifting beast in any "real business" I've ever dealt with. With a shifting interface (one which changes radically as you follow data from department to department), a lower-level (sic) definition of data is a better support system than a "big picture."

By "standing on" the interface, you depend on its stability. I believe it to be far, far less stable than the logical definition of the data, which perhaps might be "small picture." But I'll deal with it there, where I have some power.

> I am sorry I can't give you a better answer than that. But the real
> world is messy. Deal with it!

That's exactly what we're all trying to do; fuzzy definitions like "messy", "real world", and "big picture" aren't going to give us much purchase in the attempt.

>> Seldom, due to business desires, but to answer the question you're
>> getting at: when there's a foreign-key dependency, and there's one
>> relation that is deemed "important" enough to trigger the cascade.
>> There could be multiple, though that's rare...

>
> So I would seriously consider pulling all the tables into which
> the cascade went into a single FILE. "Rules are for the guidance of wise
> men, and obedience of fools" - I would use my intelligence as to whether
> this made sense.

I agree - if it makes sense, it's not a bad idea. I have just found very few cases where it has made sense, but not every database I've designed has been fully normalized (and that was by design).

One case in point: an issue-reporting database that I did during a meeting as a prototype, which quickly went into production. Before it did, I normalized what had been denormalized structures, and was glad I did, because subsequent report, query, data export, and even screen view requirements would have been tricky without it. Not impossible by any means, but far less intuitive.
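As an aside, the cascade case above is, in SQL terms, just a referential action on the foreign key - a sketch, with hypothetical names borrowed from that issue-reporting database:

    -- The "important" relation (issue) triggers the cascade: deleting
    -- an issue automatically removes its dependent comments.
    CREATE TABLE issue (
      issueno INTEGER      PRIMARY KEY,
      summary VARCHAR(200) NOT NULL
    );

    CREATE TABLE issue_comment (
      issueno   INTEGER NOT NULL
                REFERENCES issue ON DELETE CASCADE,
      commentno INTEGER NOT NULL,
      body      VARCHAR(2000),
      PRIMARY KEY (issueno, commentno)
    );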

>>> Yep ... but relational theory, which imposes mandatory separation of the
>>> logical from the physical, imposes that cost on EVERY app, not just
>>> those that update the data.
>>
>> And it enables EVERY app with EVERY optimization. (not close, but
>> you get my drift)

>
> But if the small cost of the "bad" code (which by definition is rarely
> used) makes a big difference to the cost of the "good" code, then I'm
> laughing all the way to the bank. Who cares if I add an hour to a job
> that runs once a month, if by doing so I can shave a second off a job
> that 50 users use several hundred times a day?

True (with reservations based on the context of those processes) - but I've been talking more about new requirements for reports, views, data imports/exports, etc. A normalized relational structure supports new requirements better; a denormalized one adds some initial overhead. As far as performance, I have no doubt that Pick performs well, but haven't been so constrained in terms of hardware and design that I would denormalize to save... something.

>>> Yup. Let's assume that the Pick database has been designed properly, and
>>> that within the FILEs the data has been normalised. I can now present my
>>> apps with a *closed* relational view!
>>
>> What do you mean by that?

>
> I mean it's like a relational view, but if I've got a "one to many"
> relationship, the "one" data only appears once, not replicated for every
> instance of the "many".

Well, a hierarchical database is likely to deal natively with hierarchies (which is what you're talking about). From a relational viewpoint, you don't have "one thing" - you're talking about two predicates (the parent and the child).
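And the "one" data only appears once in the relational version too - it's a separate relation; the replication only shows up if you flatten the two predicates into a single join result. A sketch, against the same hypothetical tables as before:

    -- The orders header is stored once; order_item rows reference it.
    -- Header values repeat only in this derived, flattened view.
    CREATE VIEW order_flat AS
    SELECT o.orderno, o.custno, i.lineno, i.product, i.qty
    FROM   orders o
    JOIN   order_item i ON i.orderno = o.orderno;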

>>> My Pick application has also FORCED, by DEFAULT, my database to store
>>> related data close to itself (what relational calls clustering, I
>>> believe). It's fairly easy to prove, statistically, that this will
>>> optimise data retrieval from disk.
>>
>> For that one access path.

>
> Here you go again -

Uh oh... flashbacks of the Jimmy Carter - Ronald Reagan debate... (in which Reagan repeatedly used the phrase "there you go again" to avoid actually having to rebut a point).

> crippling the race horse so we can have a "fair"
> race against the crippled old nag ...

A poor analogy, though I can't think of a better one. I'm talking about being able to support new and changed requirements with a minimum of change, as well as being able to firewall the integrity of my data from programmer error (including my own!), and declaring the meaning of my data (for humans as well as for enforcement).

For the record, yes, I'm sure Pick can load that record real darn fast. But I have yet to see an application that, based on a primary key, couldn't load the vast majority of its associated hierarchy in an unnoticeable amount of time. It just isn't that hard. And in terms of code burden, mapping tools make it a no-brainer.
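To put that in concrete (if hypothetical) terms, loading the Order "business object" by key is a couple of indexed reads:

    -- One keyed read for the header, one indexed read for its lines.
    SELECT * FROM orders     WHERE orderno = 1042;
    SELECT * FROM order_item WHERE orderno = 1042 ORDER BY lineno;

    -- And if that access path really is the hot one, many SQL DBMSs can
    -- physically cluster storage on it (e.g. PostgreSQL's CLUSTER
    -- command) without touching the logical design.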

So I'll form my own analogy: you're talking about having that racehorse cross the finish line 1 second faster (when it already was beating the other horses anyway), albeit dropping the jockey on his arse en route. :-)

> That crack about the race horse was deliberate. Relational seeks to make
> all access paths equal. Fair enough. Rather like the UK educational
> system that sees competition as "unfair" and wants all schoolkids to
> leave Uni with a first class degree, not caring whether they are a dunce
> or a genius (sadly, I'm serious about our education :-(
>
> You said "for that one access path". But that access path IS THE MOST
> COMMON PATH! So. I can prove that it's the most common path. I can prove
> it's the most efficient path.

Sure - but alternative paths to that same endpoint aren't much slower. Yeah, they're slower... but so what? It doesn't always make a difference, and in my experience, it's the reports and ad hoc queries and dataloads and such that demand performance optimizations. I don't give a damn whether loading my Order is done in one read in 0.09 seconds, or in several reads in 1 second, since my UI is probably going to take its sweet time painting anyway...

Obviously I'm not that naive, but I think your optimizations to the most-common path, while certainly an improvement, may have a less-than-noticeable impact on most users. But I could certainly be wrong.

> Can you prove, that by crippling the most common path, you can improve
> the "worst path" cases enough to make it worth-while?

> Was it Knuth that
> said "premature optimisation is the worst evil"?

"Premature optimization is the root of all evil", and while Knuth gets the credit, he says Tony (C.A.R.) Hoare said it first.

> I couldn't give a damn
> if the nag trails in last by a racecourse. I want the thoroughbred to win.

Well, we differ. If the horses are functions / applications, I want as many horses as possible to finish before the jockeys drop dead of old age. :-)

Thanks, Anthony, for the lively exchange(s).

- erk