Re: Navigation question

From: Marshall <marshall.spight_at_gmail.com>
Date: 16 Feb 2007 07:48:29 -0800
Message-ID: <1171640909.586058.155010_at_l53g2000cwa.googlegroups.com>


On Feb 16, 5:53 am, "dawn" <dawnwolth..._at_gmail.com> wrote:
> On Feb 15, 7:04 pm, "Marshall" <marshall.spi..._at_gmail.com> wrote:
>
> > You might think you're thinking at the conceptual level, but
> > you're not. You're thinking at the implementation level,
> > aka physical. Neither the conceptual level nor
> > the logical level has any concept of a current
> > location, for example.
>
> I figured it out -- I'm designing for the USER and when designing at
> the logical level, I'm also designing workflow. After pulling up
> information about a company, for example, the user goes to one of the
> orders listed. From there, they might go to a product that is part of
> a line item on the order. This is similar to users who go to a web
> site and from there navigate to other web sites. Users navigate.
> Their workflow includes a navigation path. So, at the conceptual
> level we can navigate, users navigate, and at the physical level we
> navigate. But there is one thread of theory within software
> development that keeps pushing against navigation, as if it were a bad
> thing.

I'm sorry but this is totally beside the point. Neither users' mental model nor application architecture has any determining role in how a query is written.

> > > > So, you could make an argument about there being a difference; in
> > > > practice I would be astonished if there was one.
>
> > > Hmm. Ok. But is there something wrong with doing the "navigation"
> > > that I did with my statements?
>
> > In this particular simplified case it doesn't matter one way or
> > the other, from a performance standpoint. In a real-world
> > scenario, it may matter a lot.
>
> Agreed. It might then be better one way or the other. I asked if
> there was something wrong with doing the "navigation" that I did and
> you said that in this case it didn't matter one way or the other, but
> then added "from a performance standpoint." So, does it matter one
> way or another from some other standpoint?

There are complex cognitive and productivity differences that are subtle and hard to think about. However the performance argument is clear-cut, so I'm focusing on that.

> > > > Index scans are ridiculously fast. Any difference is speed has
> > > > been less than the difference from one query run to the next;
> > > > it's in the noise. And the more complicated the query is, the
> > > > harder your technique will be and the proprtionally smaller any
> > > > already small effect the removal of the index scan will be.
>
> > > But my technique was not difficult at all and I wasn't in a big,
> > > conscious way attempting to save on performance, it was a by-product
> > > of the approach instead.
>
> > There was no by-product; you didn't save on performance. Furthermore
> > as Tony mentioned you have now stamped your query plan into
> > concrete,
>
> No more than either of your SQL statements did. I provided
> information I had to the SQL statements. The statement decided how to
> execute.

Apparently you missed Tony's point completely. I'm going to snip out a bunch and focus on just one case.

> > > Whether working with VSAM files, Pick, or SQL-DBMS's,
> > > this approach has worked and has never been contested by anyone as
> > > being either difficult to maintain or to perform poorly.
>
> > If your tables are a hundred rows it makes no noticable perfomance
> > differences if you fetch them one at a time or all at once. There's
> > no perceptible difference between an O(n log n), O(n^2), or O(n^3)
> > algorithm
> > if n is small enough. As n grows, so does the difference.
>
> Please zero in on the specific topic of navigation. I do understand
> the set processing concerns and iteration. That is NOT the topic, nor
> does the topic need to navigate there.

I *am* talking about navigation. In particular I'm pointing out how your way of doing it doesn't scale. Splitting queries = bad, for performance reasons, not religious ones.

> > If you have a three level deep static hierarchy, with 10^4 rows in
> > the outer table and 10^4 fanout at each level, and you want to
> > aggregate across the lowest level grouped by the highest level,
> > the difference between doing a sum in a single query and a nested
> > lookup execution is a factor of 10^8 network bandwidth and
> > 10^12 packet count. It can mean the differerence between
> > getting an answer in 5 seconds vs. days.
>
> > I have encountered much larger problems in the field; this
> > is not even extreme.
>
> This is not about network bandwidth. I've been doing client-server
> development for quite some time and am fully aware of round-trip
> costs.

Network bandwidth is one strong reason why splitting up queries is a terrible idea.

> Please note that I did no more round trips with my example
> than you did with yours. I navigated, however, where one could
> suggest that you did not.

Okay. Let's make the example really simple and concrete. Please try to focus on the specific example I'm pointing out.

We have a company that's been around for a while. They have a lot of long-standing customers. Customer orders tend to be fairly detailed, with lots of little parts.

The customers table has 100 rows.
The invoices table has 10,000 rows; customers average   100 invoices.
The invoice line items table has 1,000,000; invoices   average 100 line items.

The head sales guy says give me a report with every customer id and their total spend, and the one part number they bought the most of.

We can take a navigational approach: go to each customer; get each of their invoices; get each invoice line item. If the count is greater than the max for this customer, set it. Add the cost.

In pseudocode:

ccount = count of customers
declare spend[ccount] = all zeroes
declare max[ccount] = all zeroes
declare item[ccount] = all zeroes
fetch all customer ids
for each customer c
  fetch all invoices ids for customer c
  for each invoice i
    fetch all lineitem count, cost, item for i     for each lineitem l

      spend[c] = spend[c] + l.count * l.cost
      if (l.count > max[c])
        item[c] = l.item

    end
  end
end

Or, in sql:

select c.customerid, sum(l.count*l.cost), max(l.count)   from customers c natural join
    invoices i natural join
    lineitems l

You say you're fully aware of network cost issues. It should take about a minute to estimate the number of network round-trips, and the total byte count transferred. Let's just assume every quantity is 32 bits. Compare the round-trip count and byte count for the two approaches. Let us know what you conclude.

Marshall Received on Fri Feb 16 2007 - 16:48:29 CET

Original text of this message