Re: Nested Sets vs. Nested Intervals

From: Mikito Harakiri <mikharakiri_nospaum_at_yahoo.com>
Date: 3 Dec 2005 11:58:07 -0800
Message-ID: <1133639887.429237.193880_at_o13g2000cwo.googlegroups.com>


asdf wrote:
> I don't want this:
>
> CREATE TABLE `directorylistings` (
> `lid` int(8) unsigned NOT NULL,
> `cid` int(8) unsigned NOT NULL,
> `name` varchar(100) NOT NULL,
> `a11` int(8) unsigned NOT NULL,
> `a12` int(8) unsigned NOT NULL,
> `a21` int(8) unsigned NOT NULL,
> `a22` int(8) unsigned NOT NULL,
> `url` varchar(150) NOT NULL,
> `title` varchar(100) NOT NULL,
> `description` text NOT NULL,
> PRIMARY KEY (`lid`),
> KEY `right_col` (`a12`,`a22`),
> KEY `atom` (`a11`,`a12`,`a21`,`a22`),
> KEY `left_col_name` (`name`,`a12`,`a22`),
> KEY `cid_name` (`cid`,`name`),
> KEY `url` (`url`),
> FULLTEXT KEY `description` (`description`),
> FULLTEXT KEY `title` (`title`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
>
> because it's a de-normalized representation of the data and wastes
> space.

Tree leaves and tree nodes in your model serve the different purpose, so that you indeed can insist storing them in a different tables. Space argument, though, is not especially convincing. The most of the space would be taken by the text columns: description, title, and url, and those are represented by variable length strings. The main justification in favor of a single table is conceptual simplicity of having tree nodes and leaves together.

> I would do something like this:
>
> CREATE TABLE `directory` (
> `id` int(6) unsigned NOT NULL auto_increment,
> `name` varchar(100) NOT NULL,
> `a11` int(8) unsigned NOT NULL,
> `a12` int(8) unsigned NOT NULL,
> `a21` int(8) unsigned NOT NULL,
> `a22` int(8) unsigned NOT NULL,
> PRIMARY KEY (`id`),
> UNIQUE KEY `name` (`name`,`a11`,`a12`),
> KEY `parent` (`a21`,`a22`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

Be warned about foreign key constraint: you can't actually declare it. There is a subtle but important detail: root node

[1 -1]
[1 0]

refers to a parent

[1 ?]
[0 ?]

which doesn't exist. In the adjacency relation the root refers to null parent, which allows to declare a formal foreign key integrity constraint.

> CREATE TABLE `listings` (
> `listing_id` int(8) unsigned NOT NULL auto_increment,
> `cat_id` int(6) unsigned NOT NULL default '0',
> `url` varchar(150) NOT NULL,
> `title` varchar(100) NOT NULL,
> `description` text NOT NULL,
> `rating` tinyint(2) unsigned NOT NULL default '0',
> `lft_index` double unsigned NOT NULL default '0',
> `rgt_index` double unsigned NOT NULL default '0',
> PRIMARY KEY (`listing_id`),
> KEY `cat_id` (`cat_id`),
> KEY `lft_index` (`lft_index`),
> KEY `rgt_index` (`rgt_index`),
> KEY `rating` (`rating`),
> FULLTEXT KEY `description` (`description`),
> FULLTEXT KEY `title` (`title`)
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
>
> so you can select the 10 top rated listings in a specific category
> efficiently by doing this:
>
> SELECT listings.*
> FROM listings
> WHERE directory.a11*node.a21 >= directory.a21*node.a11
> AND directory.a11*node.a22 >= directory.a21*node.a12
> AND listings.lft_index
> BETWEEN node.a11/node.a21 - 0.00000001 AND
> (node.a11-node.a12)/(node.a21-node.a22) + 0.0000001
> ORDER BY listings.rating DESC LIMIT 0, 10
>
> so you can select the websites within a specific category that contains
> "Software" by doing this:
>
> SELECT listings.*
> FROM listings
> WHERE directory.a11*node.a21 >= directory.a21*node.a11
> AND directory.a11*node.a22 >= directory.a21*node.a12
> AND listings.lft_index
> BETWEEN node.a11/node.a21 - 0.00000001 AND
> (node.a11-node.a12)/(node.a21-node.a22) + 0.0000001
> AND listings.title LIKE '%Software%'
>
> Is there a better way to do this?

This is an interesting question that I don't know the answer yet. Your first query as it is written would find all the children by the index range scan and then reorder and filter top 10. Efficiency depends on the size of the subtree. Even if we simplify your query to find a single top rated page, it is still going to be executed this way. I can imagine (but fail to figure out) a more sophisticated index on combination of interval columns and ranking column that would find the top ranking page within the interval fast. Received on Sat Dec 03 2005 - 20:58:07 CET

Original text of this message