Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Hugo Kornelis <hugo_at_pe_NO_rFact.in_SPAM_fo>
Date: Tue, 25 May 2004 00:48:07 +0200
Message-ID: <mts4b09m5olm07v0b9ppienrdk8au4md6c_at_4ax.com>


Hi Neo,

I promised a second post - here it is.

When I was testing my algorith against larger amounts of data, I noticed that it was bugged. The algorithm to determine all ancestors of each thing may sometimes try to insert the same thing/ancestor combination with two different path lengths. I had to use a GROUP BY and a MIN(..) to solve this.

I didn't like the performance hit I took (up to 14.3 ms avg) - still better than the 16 ms that XDb1 needs on this same machine but I figured I could do better.

I changed my model to use integer columns with the IDENTITY property (SQL Server's name for autoincrementing artificial ID) as primary keys. This improved the performance of my report generation; the average execution time is now down to 11.0 ms. Note that my report still shows the names of the things in the hierarchies, not the ID's. (SQL and output are down below).

Incidentally, this quicker implementation is also better equipped to adapt to many of the changes you suggested after your original post. You want different things to share the same name? Easy - just remove the unique constraint on thingName and you're done. You want things without name? Remove thingName from the things table, create a new table thingNames with columns thing (primary key / foreign key to things) and thingName.
You want things with multiple names? As above, but primary key on thingNames table should span both columns.

All this doesn't matter for the challenge you set, as these were all requirements that you brought into the equation at a later time. What does matter is my improved model, that you'll find below. The average execution time is 11.0 ms (as you can see from the output). This is based on an execution on my desktop system (1.3 GHz Athlon, 256 MB RAM, 2 harddisks at 40 GB / 7200 rpm each) with several open applications. XDb1 takes 16 ms for generating the report on the same machine with the same applications still active.

(Start of SQL for improved model)

  • define some tables to hold the user input create table things (thing int not null identity, thingName varchar(20) not null unique, primary key (thing)) create table classes (class int not null identity, className varchar(20) not null unique, primary key (class)) create table HierarchyTypes (hierarchy int not null identity, hierarchyName varchar(20) not null unique, primary key (hierarchy)) create table ClassOfThings (thing int not null references things on update cascade on delete cascade, class int not null references classes on update cascade on delete cascade, primary key (thing)) create table ClassHierarchy (SubClass int not null references classes on update cascade on delete cascade, SuperClass int not null, -- references classes on update cascade on delete cascade, primary key(SubClass, SuperClass))
  • update and delete triggers needed to cascade updates and deletes of classes to SuperClass;
  • SQL Server doesn't accept this second DRI action, unfortunately. create table attributes_int (thing int not null references things on update cascade on delete cascade, attribute varchar(10) not null, value int not null, primary key (thing, attribute)) create table attributes_char (thing int not null references things on update cascade on delete cascade, attribute varchar(10) not null, value varchar(100) not null, primary key (thing, attribute)) create table hierarchies (thing int not null references things on update cascade on delete cascade, otherthing int not null, -- references things on update cascade on delete cascade, hierarchy int not null references HierarchyTypes on update cascade on delete cascade, primary key (hierarchy, thing, otherthing), check (thing <> otherthing))
  • update and delete triggers needed to cascade updates and deletes of things to otherthing;
  • SQL Server doesn't accept this second DRI action, unfortunately.
  • this table is used to generate the report create table ancestors (thing int not null, ancestor int not null, dist int not null, primary key (thing, ancestor)) create unique nonclustered index testindex on ancestors(ancestor, thing)
  • this table will hold the report - here, the things' name is used create table NCancestor (ThingX varchar(20) not null, ThingY varchar(20) not null, CmnAnc varchar(20) not null, Dist int not null, primary key (ThingX, ThingY, CmnAnc)) go
  • this procedure creates the report create proc CommonAncestors (_at_hierarchyName varchar(20)) as declare _at_hierarchy int select _at_hierarchy = hierarchy from HierarchyTypes where hierarchyName = _at_hierarchyName
  • delete data from previous execution truncate table ancestors truncate table NCancestor
  • starting data: thing itself is dist 0, direct leader is dist 1 insert ancestors with (tablockx holdlock) (thing, ancestor, dist) select distinct thing, thing, 0 from things with (tablockx holdlock) union all select thing, otherthing, 1 from hierarchies with (tablockx holdlock) where hierarchy = _at_hierarchy
  • use iteration to determine indirect leaders while (_at__at_rowcount<>0) insert ancestors with (tablockx holdlock) (thing, ancestor, dist) select a1.thing, a2.ancestor, min(a1.dist + a2.dist) from ancestors AS a1 join ancestors AS a2 on a2.thing = a1.ancestor where not exists (select * from ancestors AS a3 where a3.thing = a1.thing and a3.ancestor = a2.ancestor) group by a1.thing, a2.ancestor
  • now find nearest common ancestor for each pair of things insert NCancestor with (tablockx holdlock) (ThingX, ThingY, CmnAnc, Dist) select t1.thingName, t2.thingName, ta.thingName, a1.dist + a2.dist from ancestors AS a1 with (tablockx holdlock) join ancestors AS a2 on a1.ancestor = a2.ancestor and a1.thing < a2.thing join things AS t1 on t1.thing = a1.thing join things AS t2 on t2.thing = a2.thing join things AS ta on ta.thing = a1.ancestor where not exists (select * from ancestors AS a3 join ancestors AS a4 on a3.ancestor = a4.ancestor and a3.thing < a4.thing where a3.thing = a1.thing and a4.thing = a2.thing and a4.dist+a3.dist < a1.dist+a2.dist) go
  • dummy execution on empty table forces compilation of stored proc
  • (I assume that Xdb1 uses a precompiled procedure as well) exec CommonAncestors _at_hierarchyName = 'whatever' go
    --
  • now add the predefined data insert things (thingName) values ('god') insert things (thingName) values ('army') insert things (thingName) values ('trinity') insert things (thingName) values ('john') insert things (thingName) values ('mary') insert things (thingName) values ('luke') insert things (thingName) values ('fido') insert things (thingName) values ('laptop1') go insert classes (className) values ('force') insert classes (className) values ('church') insert classes (className) values ('person') insert classes (className) values ('dog') insert classes (className) values ('computer') go insert HierarchyTypes (hierarchyName) values ('leader') go insert ClassOfThings (thing, class) values (2, 1) insert ClassOfThings (thing, class) values (3, 2) insert ClassOfThings (thing, class) values (4, 3) insert ClassOfThings (thing, class) values (5, 3) insert ClassOfThings (thing, class) values (6, 3) insert ClassOfThings (thing, class) values (7, 4) insert ClassOfThings (thing, class) values (8, 5) go insert attributes_int (thing, attribute, value) values (4, 'age', 35) insert attributes_int (thing, attribute, value) values (5, 'weight', 130) insert attributes_char (thing, attribute, value) values (6, 'color', 'red') go insert hierarchies (thing, otherthing, hierarchy) values (2, 1, 1) insert hierarchies (thing, otherthing, hierarchy) values (3, 1, 1) insert hierarchies (thing, otherthing, hierarchy) values (4, 2, 1) insert hierarchies (thing, otherthing, hierarchy) values (5, 2, 1) insert hierarchies (thing, otherthing, hierarchy) values (5, 3, 1) insert hierarchies (thing, otherthing, hierarchy) values (6, 3, 1) insert hierarchies (thing, otherthing, hierarchy) values (8, 4, 1) insert hierarchies (thing, otherthing, hierarchy) values (8, 5, 1) insert hierarchies (thing, otherthing, hierarchy) values (7, 4, 1) insert hierarchies (thing, otherthing, hierarchy) values (7, 5, 1) insert hierarchies (thing, otherthing, hierarchy) values (7, 6, 1) go
    --
  • random name / hierarchy generation for lots of data goes here
    --
    print 'Initialization done - starting actual test' go
    --
  • generate the report - show elapsed time declare _at_started datetime, @ended datetime set _at_started = current_timestamp exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' exec CommonAncestors _at_hierarchyName = 'leader' set _at_ended = current_timestamp select convert(char(24), _at_started, 121) as started, convert(char(24), _at_ended, 121) as ended, convert(char(14), (_at_ended - @started), 114) as "elapsed (10 executions)"
  • display the results select * from NCancestor order by ThingX, ThingY go
    --
  • cleanup after execution drop proc CommonAncestors drop table NCancestor drop table ancestors drop table hierarchies drop table attributes_int drop table attributes_char drop table ClassHierarchy drop table ClassOfThings drop table HierarchyTypes drop table classes drop table things go

(Output after running the above script)

Initialization done - starting actual test

started                  ended                    elapsed (10 executions) 

------------------------ ------------------------ -----------------------
2004-05-25 00:37:03.780 2004-05-25 00:37:03.890 00:00:00:110

ThingX ThingY CmnAnc Dist
-------------------- -------------------- -------------------- -----------

army                 fido                 army                 2
army                 john                 army                 1
army                 laptop1              army                 2
army                 luke                 god                  3
army                 mary                 army                 1
army                 trinity              god                  2
fido                 laptop1              john                 2
fido                 laptop1              mary                 2
god                  army                 god                  1
god                  fido                 god                  3
god                  john                 god                  2
god                  laptop1              god                  3
god                  luke                 god                  2
god                  mary                 god                  2
god                  trinity              god                  1
john                 fido                 john                 1
john                 laptop1              john                 1
john                 luke                 god                  4
john                 mary                 army                 2
luke                 fido                 luke                 1
luke                 laptop1              trinity              3
mary                 fido                 mary                 1
mary                 laptop1              mary                 1
mary                 luke                 trinity              2
trinity              fido                 trinity              2
trinity              john                 god                  3
trinity              laptop1              trinity              2
trinity              luke                 trinity              1
trinity              mary                 trinity              1


-------------------------------------------------------------------------

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address) Received on Tue May 25 2004 - 00:48:07 CEST

Original text of this message