Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Tue, 25 May 2004 01:25:49 +0200
Message-ID: <dmv4b05cafdulv5rv11681ui91r259tn3l_at_4ax.com>


On Mon, 24 May 2004 02:33:10 GMT, Nick Landsberg wrote:

Hi Nick,

(snip)
>
>Thanks Hugo,
>
>The reason why I posed the query is not so much to
>make work for you, but to get a flavor of how the
>two technolgies compare /as they are now/.

Your remark made me curious to find out how my model would hold up against more data. I created a script to generate lots of random data, ensuring that the hierarchy would remain connected and strictly top-down, yet distributed as randomly as possible. I think I succeeded (the generation script is available, in case anyone is interested).
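In case it helps to judge the test data: the core of the generation script works roughly like the sketch below. The table and column names (things.thing_id, things.thing_name, hierarchies.child_id, hierarchies.parent_id) are simplified stand-ins, not the actual challenge schema. Each new thing is linked to a randomly chosen, already existing thing, which is what keeps the hierarchy connected and strictly top-down.

-- Rough sketch of the generator loop (simplified schema, see above).
-- Because a new thing can only be linked to things that already exist,
-- the hierarchy stays connected and no cycles can be created.
DECLARE @i int, @new_id int
SET @i = 1
WHILE @i <= 1000                         -- number of extra things to add
BEGIN
    SELECT @new_id = MAX(thing_id) + 1 FROM things

    INSERT INTO things (thing_id, thing_name)
    VALUES (@new_id, 'thing' + CAST(@new_id AS varchar(10)))

    -- pick one random parent among the things created so far;
    -- the real script repeats this step to give some things extra parents
    INSERT INTO hierarchies (child_id, parent_id)
    SELECT TOP 1 @new_id, thing_id
    FROM   things
    WHERE  thing_id < @new_id
    ORDER  BY NEWID()

    SET @i = @i + 1
END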

The setup I used for these tests is:
* SQL Server 2000 SP3a, running as a service on my desktop computer, with only one active application (Query Analyzer).
* 1.3 GHz Athlon processor / 256 MB RAM.
* Two 40 GB 7200 rpm IDE hard disks. The LOG file for my test DB is on one of them. The DATA file for my test DB, plus the DATA and LOG files for tempdb (used by SQL Server to store intermediate result sets), are on the other hard disk. My Windows page file is on that hard disk as well (though on another partition).
* I made sure that both the test database (DATA and LOG!!) and tempdb were large enough to hold all intermediate data and the result. SQL Server will automatically allocate extra disk space if a database needs more room, but repeatedly allocating extra disk space is a performance killer (for the first execution only!!), so I decided to preallocate the space (a quick sketch follows this list).
* For all tests, I first recreated the tables and procedure, filled the tables with the original data and then added extra rows to the things and hierarchies tables.
* The results below show: the number of rows in the things and hierarchies tables (indicating input volume), the number of rows in the ancestors and NCancestor tables (indicating the size of the intermediate and output tables), and the time elapsed for one execution of the stored procedure.
* I did some minor tweaking to the procedure after carrying out these tests but before posting Neo the code (see other message). I think the improvement is only marginal - at least not enough to warrant repeating all the tests.
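Preallocating boils down to one ALTER DATABASE ... MODIFY FILE per file before the first test run. The logical file names and sizes below are purely illustrative, not the values I actually used:

-- Grow data and log files up front, so the timed runs never pay for autogrow.
-- File names and sizes are examples only.
ALTER DATABASE testdb MODIFY FILE (NAME = testdb_Data, SIZE = 2000MB)
ALTER DATABASE testdb MODIFY FILE (NAME = testdb_Log,  SIZE = 1000MB)
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev,     SIZE = 2000MB)
ALTER DATABASE tempdb MODIFY FILE (NAME = templog,     SIZE = 1000MB)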

Here are the results I gathered:

things      hierarchies ancestors   NCancestor  elapsed
----------- ----------- ----------- ----------- --------------
18          31          81          153         00:00:00:017  
108         210         1307        5778        00:00:00:513  
208         505         4566        21528       00:00:02:407  
408         1101        16938       83028       00:00:14:377  
608         1700        36619       184528      00:00:51:877  
1008        2900        94318       507528      00:05:03:577  


And two larger tests in a very atypical setup (running SQL Server in single-user mode - normally only used for maintenance, but apparently also quicker):

things      hierarchies ancestors   NCancestor  elapsed
----------- ----------- ----------- ----------- --------------
1000        2883        94255       499500      00:04:00:250  
2500        7381        464258      3123750     01:02:13:580  

As you can see, elapsed time grows exponentially as the size of the input increases.
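Part of that growth is built into the problem itself: the NCancestor counts above are exactly n * (n - 1) / 2 for n things (one row per unordered pair), so the output alone grows quadratically before the procedure does any real work. A quick check against the 1008-thing run:

-- One NCancestor row per unordered pair of things: n * (n - 1) / 2
SELECT 1008 * 1007 / 2 AS expected_rows   -- 507528, matching the measured count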

Can the above performance be improved upon? Yes, of course. If I had unlimited money to spare, I'd start with the following setup:

  • Have SQL Server run on a dedicated server. Enable only the services that SQL Server needs, shut down everything else.
  • Fastest processor currently available. Though SQL Server can use multiple processors, I don't think they would help for this particular query (but if money is not an issue, I'd still try, just to be sure).
  • As much RAM as the server will hold. Memory over 2 GB can only be used by SQL Server Enterprise Edition, so get a license for that edition as well.
  • At least five separate I/O devices: two stripe sets (for test DB Data and tempdb Data) (the more physical disks in the stripe set, the better), and three non-striped disks (one for test DB Log, one for tempdb Log and one for the operating system, including the page file).
  • I'd hire a tuning expert. Though I do know a few tricks, I'm definitely not an expert on fine-tuning SQL Server.
  • And I'd buy Joe Celko's latest book on trees and hierarchies in SQL. The alternative approach of his that I hinted at in an earlier message unfortunately won't do - I expect it to be blindingly fast, but it can only handle hierarchies where each thing has at most one parent. But maybe he's got some other neat tricks in his book that I might want to steal.

Another thing I'd consider is moving some work from the stored procedure to an insert trigger. I didn't do this for the challenge, of course, but if this were to become a real production system, I'd definitely try to find out if insert/update/delete performance for typical daily data entry tasks would be acceptable if I move some preparations for this report to the trigger.
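As a rough illustration of what I mean (same simplified schema as before, and assuming the ancestors table is the intermediate table the procedure currently rebuilds from scratch), an insert trigger on hierarchies could keep ancestors up to date incrementally, leaving only the nearest-common-ancestor part for the report itself:

-- Sketch only: maintain the ancestors table as rows are added to hierarchies.
-- Assumes data is entered top-down (a newly linked child has no descendants
-- yet); a production version would also need update/delete handling and a
-- cycle check.
CREATE TRIGGER trg_hierarchies_ins ON hierarchies
AFTER INSERT
AS
BEGIN
    -- the new parent is an ancestor of the new child ...
    INSERT INTO ancestors (thing_id, ancestor_id)
    SELECT i.child_id, i.parent_id
    FROM   inserted i
    WHERE  NOT EXISTS (SELECT * FROM ancestors a
                       WHERE  a.thing_id    = i.child_id
                       AND    a.ancestor_id = i.parent_id)

    -- ... and so is every ancestor of the new parent
    INSERT INTO ancestors (thing_id, ancestor_id)
    SELECT DISTINCT i.child_id, a.ancestor_id
    FROM   inserted i
    JOIN   ancestors a ON a.thing_id = i.parent_id
    WHERE  NOT EXISTS (SELECT * FROM ancestors a2
                       WHERE  a2.thing_id    = i.child_id
                       AND    a2.ancestor_id = a.ancestor_id)
END

Whether that pays off depends entirely on how the daily insert/update/delete load compares to how often the report is run - which is exactly what I'd want to measure first.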

But I'm also sure that there's no way I would ever get this procedure to perform adequately (unless I'd find a completely different approach in Celko's book). All performance gains that better hardware can buy are linear; the performance hit for extra input data is exponential. That means there's no way to keep the performance acceptable as the data continues to grow.

Is this a surprise to me? No, it isn't. Frankly, I've never considered a relational database as a good tool for storing and traversing trees. There are many other tools out there more suited for that task.

I never entered this discussion to prove that the relational model is well suited for tree structures, because frankly, I think it isn't. I only entered this discussion because Neo set a challenge and I decided to take it on - and succeeded.

>
>If Neo's techniques prove out, they may
>be interesting for commercial work in several
>years. One never knows. As I alluded to upthread,
>these discussion remind me a lot about the
>CODASYL vs. RDBMS discussions of roughly 20-25
>years ago. Assuming that Neo's ideas pan out,
>I would not presume that they would supplant
>traditional RDBMS's, just as RDBMS's have
>not completely supplanted hierarchical and
>network models. There is more than one way
>to skin a cat, and you use the best tools
>available depending on the size of the cat :)

Agreed. There is no "one size fits all" in our industry.

>
>You and Neo have had an interesting discussion.
>And, what is rare on usenet, haven't let the
>discussion degenerate into a flame war.
>I congratulate you both!

Thanks, Nick!

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
