Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Mon, 24 May 2004 02:09:47 +0200
Message-ID: <2od2b0hdbbe1mot4ic3qvvn72s54uiei7n_at_4ax.com>


On Sun, 23 May 2004 00:13:06 GMT, Nick Landsberg wrote:

>
>[ Everything Snipped ]
>
>OK folks, I have found this discussion fascinating
>and educational. However I would like to add
>one more item into the equation, and it is not
>theory (although I notice that c.d.theory is
>in the distribution list) ---
>
>Given the possible implementations, how long
>would this query run on a 1 million record
>database? (That's actually small by today's standards)
>And then for a 10 million record database.
>
>Pick your own hardware, e.g. raid arrays,
>2.5 GHZ CPU's etc., but tell me how long
>this query would run, please.
>
>Theory is nice, but, in my experience,
>performance is what sells systems to
>customers.

Hi Nick,

Fair question. In between answering Neo's latest messages, I was constantly switching to my Query Analyzer window. I've been busy writing a script that will generate millions of rows in the things table and add random dependencies to the "leader" hierarchy, so I could test some things.
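
To give an idea of the kind of generator I mean, here's a rough sketch (not the actual script I'm running - the column names thing_id and leader_id are just placeholders):

    -- Build a random "leader" tree: every new thing picks a random
    -- existing thing as its leader, so the result is always one tree.
    SET NOCOUNT ON
    DECLARE @i int
    SET @i = 1
    INSERT INTO things (thing_id, leader_id) VALUES (1, NULL)  -- the root
    WHILE @i < 1000000
    BEGIN
        SET @i = @i + 1
        INSERT INTO things (thing_id, leader_id)
        VALUES (@i, CAST(RAND() * (@i - 1) AS int) + 1)
    END

(Row-by-row inserts like this are slow, of course - for the really big test sets you'd want to batch them up - but it shows the idea.)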

Some disclaimers to start with:
1. I think I'm pretty good at designing tables and writing SQL, but I have next to no experience in tuning heavy-duty systems. I can (and probably will) tweak some things here and there and gain some performance, but it's not what I do best.
2. I don't have the money to spare for the kind of hardware you'd find in a shop that runs this kind of query on 1 million record databases. I'll do some tests with large numbers, but all I can offer is a desktop PC, powered by a 1.3GHz AMD Athlon with 256 MB of RAM and equipped with two 40GB 7200rpm IDE hard disks (I have the log on one of the HDs and the data on the other - for this kind of query, I'd sure like to put tempdb on a third HD).

With only this computer at my disposal, I'll have to trim down the number of rows considerably. It's not that my hardware can't handle a table with 1 million (or even 10 million) rows - but while waiting for the results of my preliminary tests (using "only" roughly 0.5 million things), it suddenly occurred to me that the report would be based on the Cartesian product of the things table with itself - almost 250,000,000,000 rows for my preliminary test, and more than 121,000,000,000,000 rows if I were to run the complete test set I prepared (sporting over 11 million things). Just sending the report to the screen would probably take days!! And I'd need to add some really beefy hardware to store it.
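
Just to show the arithmetic (the table name is a stand-in, obviously):

    -- The report pairs every thing with every other thing, so its size
    -- grows with the square of the number of things:
    DECLARE @n bigint
    SELECT @n = COUNT(*) FROM things
    SELECT @n * @n AS report_rows
    -- ~500,000 things    ->     250,000,000,000 rows
    -- ~11,000,000 things -> 121,000,000,000,000 rows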

I just saw that it's past 2AM already, so I'll have to get back to you later with test results from a "slightly" smaller test set.

Two final notes:
1. I did already catch Neo's reply to you. I'll still try to run the tests I intended, even though Neo probably won't. Just out of curiosity.
2. I expect my implementation to be quite slow when dealing with a large number of rows. In fact, I was quite surprised when my first attempt already managed to beat XDb1's execution time. Some time ago, I saw some messages by Joe Celko in another newsgroup about an alternate way to store trees. I don't recall the details, but from what I still know, I expect it to be lots (and lots and lots) faster (a rough sketch of the idea follows below). After finishing the tests on my own implementation, I might go try and find his method and see what performance that one yields. Just out of curiosity.
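
If I remember the gist correctly, the trick is to store each node with a pair of interval boundaries instead of (or next to) a pointer to its leader, so that ancestry becomes a simple range comparison. A rough sketch from memory - certainly not Celko's exact design, and the names are made up:

    -- Each node gets a left/right pair; an ancestor's interval
    -- completely encloses the intervals of all its subordinates.
    CREATE TABLE tree
        (node_name varchar(30) NOT NULL PRIMARY KEY,
         lft       int         NOT NULL,
         rgt       int         NOT NULL)

    -- All ancestors of one node, nearest first, without walking
    -- the hierarchy level by level:
    SELECT a.node_name
    FROM tree AS a, tree AS d
    WHERE d.node_name = 'some thing'
      AND a.lft < d.lft
      AND a.rgt > d.rgt
    ORDER BY a.lft DESC

The nearest common ancestor of two nodes would then be the enclosing node with the highest lft value - one query, no loops, which is why I expect it to leave my current attempt in the dust.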

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
