Re: Navigation question

From: Marshall <marshall.spight_at_gmail.com>
Date: 15 Feb 2007 08:58:53 -0800
Message-ID: <1171558733.374384.197810_at_v45g2000cwv.googlegroups.com>


On Feb 15, 6:17 am, "dawn" <dawnwolth..._at_gmail.com> wrote:
> On Feb 14, 7:24 pm, "Marshall" <marshall.spi..._at_gmail.com> wrote:
>
> > > What does the application code then do with this cartesian cross-
> > > product of data?
>
> > I don't think I've ever written a cartesian product into client
> > code. I've done a few at the command line, though.
>
> I am understanding this to mean that you do not create or use views
> used by your code that introduce cross-products, correct?

Embarrassing admission: I've never used views of any kind. My professional career has not included using higher end dbms products. :-(

However a view is not a lot more than a named query, as I understand it. And I can't think of when I'd have used full cross products. Not that there's any particular reason not to; there's no reason why they have to be any less efficient than doing the two subqueries yourself. In fact a guy at work was telling me the other day about a dbms optimization he'd seen where the system would transmit two relations instead of one and stitch them together in the dbms client code (not the application client code.)

> > Let's even do what you describe the hard way, and say that we
> > have some unique description of the company, TIN say, that is
> > *not* a key of the other tables, nor the primary key of the company
> > table.
>
> > select name from companies where tin = XXX;
>
> OK, I'll buy that you would not use the term "navigation" to find this
> node. I still think of this select statement as "going to" the
> company where tin=XXX and extracting the name, where I suspect you do
> not. I do not see it as navigating within the database, however, so
> we are good so far.

A guy Fred at work thinks of this statement as x86 machine code, and Joe thinks of it all as electrons whirling around silicon pathways.
Engelbart says they are both wrong and that it's all subatomic particles,
while Desmond claims they are all thinking at the wrong level; the truth
we can perceive is just shadows on Plato's cave.

However the statement itself doesn't have any "going to" in it.

> > select c.emailaddress from contacts c, company co
> > where co.tin = XXX and
> > co.companyid = c.companyid;
>
> So instead of retrieving the companyid in the first statement and then
> narrowing this query down to that precise company, we again "find" the
> right company by using the same query as before. We include two
> tables in our query although if we had retrieved the companyid in the
> previous query we could have used just one. I do not understand how
> that could yield the same or better performance than
>
> select name, companyid from companies where tin = XXX;
> mycompany = companyid;
> select emailaddress from contacts where companyid = mycompany;

The canonical case to consider is client-server.

Database performance is not my strong area, however it's easy enough to see that in my case, there will be an indexed lookup on tin that gets companyid, and then whatever mechanism (table scan?) will select all the email contacts based on that company id, in a single network round trip. Whereas your approach will do the indexed lookup in one network packet, then the client will issue a second request that is the same as the second part of my query. So the cost to the dbms will be identical, and the cost in network traffic will be two packets vs. one, which may well double the overall runtime.

I have observed much, much worse differences between the one query form and the split up query form in the field. A factor of 100 worse is commonplace; a factor of 10,000 worse is not unheard of.

> What, precisely, makes your approach better? [Please note that I am
> not suggesting it is not better, just that I do not see the advantage,
> oh, and don't say that what is better is that you are not "navigating"
> since I'm trying to figure out what is bad about navigating]

My first thought is overall application performance, because I often have to care about that, and about scalability of large systems. I'm sure others can supply other reasons. But I am hard pressed to think of a good argument for doing multiple queries where one will suffice.

> Would it be accurate to suggest that in order to avoid navigating, the
> developer never uses any data retrieved from any prior SQL statement
> as qualifiers for subsequent statements? The "game" would be to use
> only the data that comes from clients (whether people or other
> software) to seed the queries. So, you are not suggesting using only
> one query to populate data on a page, for example (as I was previously
> understanding from your response), but as many queries as one might
> have otherwise, but never using any data output from a prior SQL
> statement to seed another.

I don't really want to get bogged down into terminological discussions,
but I will say that pursuing the fewest-queries approach makes for the fastest, most scalable software.

> When the data goes to the page, including the company id, let's say,
> then if we read the company id back from the page or keep the company
> id in memory, it has gone through some sort of purification, perhaps,
> so then we could use that companyid in a subsequent query and we would
> not think of that as navigating, is that correct?

The *user's* navigation through the pages or screens or dialogs of the client application are at another level entirely than what we are talking
about.

> Or is the fact that
> we have retrieved the companyid from the database enough to make it
> off-limits for qualifying subsequent SQL statements?

Suddenly I notice that you phrase many things in terms of what's allowed or forbidden or off-limits or what have you. I wonder why that is?

> This is very enlightening, thanks. The two lingering questions for me
> that I've included above are 1) When, if ever, are you are
> "permitted" (with your approach) to use any data from the DBMS to
> qualify subsequent SQL statements? Is such data "purified" once it
> has gone to the user? 2) How is this as efficient as using data, such
> as foreign key information, to narrow down subsequent queries when it
> would seem to me that it must re-read that which it has already read?

I can't really make sense out of question 1. I hope that for question 2
I've given some indication above. (And I wouldn't say that minimizing the query count is "as efficient"; I'd say it's "way more efficient.")

Marshall Received on Thu Feb 15 2007 - 17:58:53 CET

Original text of this message