Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Neo <neo55592_at_hotmail.com>
Date: 23 May 2004 13:49:30 -0700
Message-ID: <4b45d3ad.0405231249.3a2a28f6_at_posting.google.com>


> OK folks, I have found this discussion fascinating and educational.
> Given the possible implementations, how long would this query run
> on a 1 million record database?

It could take hours, days, months, maybe even years, depending on the number of interconnections between things (I believe the human brain performs somewhat similar computations but in a massively parallel manner)!

Sorry, but XDb1 is not suitable for commerical applications yet. It is currently mostly an experimental db meant to test out an alternate data model. IMO, an apple-to-apple comparision hasn't been made yet because of the following differences:

The provided RM db's schema is not as generic as XDb1's in the following respects:
1. It does not model a thing's name as an attribute. 2. It does not model attributes with multiple values in a normalized manner.
3. It requires a separate table for values of each each data type!!! 4. It does not model the class hierarchy in the appropriate table thus making it inaccessible for reporting.
5. It does not model a way to normalize names of things. (ie person/john and dog/john)
6. It does not model a way to normalize parts of a name (ie 'simth' in 'john smith' and 'mary smith')
7. It does not model a way to normalize symbols in names. (ie 'o' in
'john' and 'god')

Also the provided RM db's data is not as normalized as XDb1's. RM's has duplicate data which needs to be replaced with proper refs (keys) to the one and original data within the db. A proper ref must be independent of data being represented, otherwise problems arise. For example, in RM's, two things can't have the same name and can't properly represent an unnamed thing.

Some differences between XDb1's and RM's reporting algorithms: 1. RM's report is more complete. It reports all of the nearest common ancestors, where as XDb1's only reports any one the nearest common ancestors.

2. RM's report algorithm doesn't access the atttributes of any thing, where as XDb1's can. For instance, if a thing is unnamed, XDb1 prints the thing's classes and attributes!!!

3. XDb1's algorithm is extremely memory efficient. The major variable in calculating memory requirements are 3 integer arrays that must be larger than the depth of the hierarchy, regardless of the number of things in the hierarchy. Thus for a hierarchy 100 deep, XDb would need 300 integers for these arrays. RM's creates an intermediate table whose size can be much much larger since there can be nearly an infinite number of things in a 100-depth hierarchy!

With a very small dataset, on a off-line, 500 Mhz, Dell PowerEdge, 512 MB, Ultra SCSC II, 10,000 RPM HDs, NT 4 sp6a, SQL Server 7: RM's solution takes 65 ms on avg, but as low as 30ms and as high as 1186 ms. XDb1's simplified solution (only printing thing's ID) which very roughly approximates RM's solution (prints char-based keys which happen to encode thing's name also), takes as high as 2.23 ms. XDb1's generic solution (which is normalized down to atomic symbols and accesses the same symbol 'o' in the db when printing 'john' and 'god' on the report and will print an unnamed thing's classes and properties), takes 2 or 3 ms.

Summary of processing time for a small data set (8 things): RM's simple solution: 65 ms (typical) XDb1's simpler solution: 2.23 ms (max)
XDb1's generic solution: 2 or 3 ms (typical)

On a larger hierarchy consisting of 200 goats (5 generations x 40 goats/generation, each goat having two parents): RM's simple solution: 11 sec (lowest of 11,16,15,15,14,11,13,14,14,13) XDb1's generic solution: 5.438 sec (only 1 run thus far with beta v4.5.0)

Please note: the above are apple-to-orange comparisions. Received on Sun May 23 2004 - 22:49:30 CEST

Original text of this message