Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Tue, 25 May 2004 00:10:28 +0200
Message-ID: <nds4b0p9lcmi06l67mdfloel7gflibbrnk_at_4ax.com>


On 23 May 2004 13:49:30 -0700, Neo wrote:

Hi Neo,

(snip)

>The provided RM db's schema is not as generic as XDb1's in the
>following respects:
>1. It does not model a thing's name as an attribute.

It does, and then goes on to make that attribute the primary key.

>2. It does not model attributes with multiple values in a normalized
>manner.

Your original post did not contain that requirement. I already showed several way to enhance my model to allow this, all (unlike XDb1) without the need to contact the vendor.

>3. It requires a separate table for values of each each data type!!!

If that bothers you, replace them with one table and use the (SQL Server proprietary) datatype "sql_variant". I chose to use two tables, because I like to be sure when I can or can't subtract values from a columns and because I prefer portable syntax over proprietary syntax when feasible. Using two tables doesn't make my model less generic.

>4. It does not model the class hierarchy in the appropriate table thus
>making it inaccessible for reporting.

Your original post didn't point out that classes may form a hierarchy as well. I already supplied an extra table to model that hierarchy, thereby making it available for reporting.

>5. It does not model a way to normalize names of things. (ie
>person/john and dog/john)

First, I don't know why you use the term normalization here.

Second, it's XDb1 that can't display the names this way. I know that you have plans to adapt XDb1, to include this functionality. But as it stands, this output is not an option for the XDb1 model.

Third, my implementation does model a way to show the class of a thing in front of it's name. Simply join in the classes table and use SQL Server's '+' operator to concatenate the strings together. If the report you wanted me to produce for the challenge had already contained the class, I would have made my query this way right from the start.

>6. It does not model a way to normalize parts of a name (ie 'simth' in
>'john smith' and 'mary smith')
>7. It does not model a way to normalize symbols in names. (ie 'o' in
>'john' and 'god')

I'm aware that XDb1's philosophy as to what "normalizing" means is quite different from the relation model's philosophy. But since you asked, in your original message, for an implementation "using the relational model", there's no point complaining that I normalized data according to the relational normalization rules. It's what you asked for, isn't it?

>Also the provided RM db's data is not as normalized as XDb1's. RM's
>has duplicate data which needs to be replaced with proper refs (keys)
>to the one and original data within the db. A proper ref must be
>independent of data being represented, otherwise problems arise.

The relational model allows for the natural key to be used as a table's primary key and for foreign keys (references) to use that (natural) primary key. Therefor, my choice to use the name as the primary key in the things table and as foreign key in the other tables is proper within the context of the relational model.

The debate as to wether using a natural key is a Good Thing or slapping an autoincrementing integer key onto every table is a Better Thing may still cause Holy Wars in some newsgroups. I won't thread there now, though my choice of words (and my provided model) might be taken as a hint to my preference. But I'm not anal about it and I will (and do) use artificial keys when I think that there's a good reason to do so.

> For
>example, in RM's, two things can't have the same name and can't
>properly represent an unnamed thing.

That was not a requirement for the original challenge. Also, I have already hinted at possible ways to solve this issue. I'll post another message later today with more on this issue.

>Some differences between XDb1's and RM's reporting algorithms:
>1. RM's report is more complete. It reports all of the nearest common
>ancestors, where as XDb1's only reports any one the nearest common
>ancestors.
>
>2. RM's report algorithm doesn't access the atttributes of any thing,
>where as XDb1's can. For instance, if a thing is unnamed, XDb1 prints
>the thing's classes and attributes!!!

Good point. I have to concede that it would take rather nasty SQL to mimic this behaviour. I'm glad you didn't include this function of XDb1 in the challenge, otherwise I'd have had a hard time matching your performance on the sample data!

>3. XDb1's algorithm is extremely memory efficient. The major variable
>in calculating memory requirements are 3 integer arrays that must be
>larger than the depth of the hierarchy, regardless of the number of
>things in the hierarchy. Thus for a hierarchy 100 deep, XDb would need
>300 integers for these arrays. RM's creates an intermediate table
>whose size can be much much larger since there can be nearly an
>infinite number of things in a 100-depth hierarchy!

The tests I ran with large amounts of data show that my procedure's running time rises exponentially when more data is added. My model is definitely not a good implementation for generating this report out of any serious amount of data.

The description of your algorithm is intriguing to say the least. Even while knowing that there apparently is a way to generate this report with no more intermediate storage that 300 integers for a 100-depth hierarchy, I still can't figure out how you do it! Care to elaborate?

>With a very small dataset, on a off-line, 500 Mhz, Dell PowerEdge, 512
>MB, Ultra SCSC II, 10,000 RPM HDs, NT 4 sp6a, SQL Server 7: RM's
>solution takes 65 ms on avg, but as low as 30ms and as high as 1186
>ms.

You never replied to my message where I gave some possible reasons why your server gave much slower response than mine. Did you check the collation? This will definitely have a SEVERE impact on the execution time of my original entry for the challenge.

> XDb1's simplified solution (only printing thing's ID)

is mildly interesting, but of no relevance for the challenge. Plus, it doesn't produce human-readable output, so it's of no relevance to anybody else either.

> which very
>roughly approximates RM's solution (prints char-based keys which
>happen to encode thing's name also),

Not "happen to" - I deliberately chose to use a natural key, even though I am aware that this would hamper the performance. Please check a soon-to-be-posted other message to find out why I have made an alternate implementation (using artificial keys, like you so much seem to desire) and how that actually increased my performance!

> takes as high as 2.23 ms. XDb1's
>generic solution (which is normalized down to atomic symbols and
>accesses the same symbol 'o' in the db when printing 'john' and 'god'
>on the report and will print an unnamed thing's classes and
>properties), takes 2 or 3 ms.
>
>Summary of processing time for a small data set (8 things):
>RM's simple solution: 65 ms (typical)
>XDb1's simpler solution: 2.23 ms (max)
>XDb1's generic solution: 2 or 3 ms (typical)
>
>On a larger hierarchy consisting of 200 goats (5 generations x 40
>goats/generation, each goat having two parents):
>RM's simple solution: 11 sec (lowest of 11,16,15,15,14,11,13,14,14,13)
>XDb1's generic solution: 5.438 sec (only 1 run thus far with beta
>v4.5.0)
>
>Please note: the above are apple-to-orange comparisions.

Yes - especially if you have SQL Server set to do case insensitive string comparisons, that incur quite some overhead. I just checked what happend if I enter "hugo isa person." / "Hugo is 35." (note the capital H in the second Hugo!) in XDb1; it asks me "What is 'Hugo'?" so I can conclude that XDb1 uses case sensitive string comparisons. There's no reason to deny SQL Server that priviledge.

Best, Hugo

-- 

(Remove _NO_ and _SPAM_ to get my e-mail address)
Received on Tue May 25 2004 - 00:10:28 CEST

Original text of this message