Re: More pain and sufferring with Tropashko's materialized path...

From: Vadim Tropashko <vadimtro_at_yho.cm>
Date: Mon, 20 Oct 2003 16:47:41 -0700
Message-ID: <OD_kb.30$Rt2.207_at_news.oracle.com>


VC wrote:

> Would you be kind enough to provide more detail about the new dense
> encoding schema ?

Nested Intervals Model (http://www.dbazine.com/tropashko4.html) labeled each node with binary rational number, which is essentially a pair of integer numbers. Given that a set of all pairs of integer numbers can be enumerated, it easily follows that we can code each node with one integer only. This is not a very interesting idea, however. We save some storage, but for all practical querying purposes we would have to translate into binary rational numbers. Plus who cares about storage limitations nowadays?

We'll explore a different approach. We employ a triangular navigation method which is similar to the enumeration mentioned in the previous paragraph, but we not refer to numerator and denominator values explicitly.

We start with node <x=1,y=1/2> which we label with 1, move diagonally left down along the "red" line (picture included into forthcoming dbazine article), find no more nodes. Next, we navigate along the red diagonally aligned line just above it. Our starting point is <x=1,y=3/4> which we label with the next integer 2. On the next step we move diagonally left down along the "red" line, find node <x=1/2,y=1/4> and label it with integer 3. On the following step we'd navigate the next red diagonal just above our last one, and label nodes 4,5,6, and 7.

Here is labeling of the first 14 nodes in the tree:

  N PATH
-- ----------

  1 .1
  2 .1.1
  3 .2
  4 .1.1.1
  5 .1.2
  6 .2.1
  7 .3
  8 .1.1.1.1
  9 .1.1.2
10 .1.2.1
11 .1.3
12 .2.1.1

13 .2.2
14 .3.1

In other words numbers at every diagonally aligned line form an arithmetic progression. Hence, we'll call them arithmetic siblings. Reader shouldn't confuse those with real siblings, which are, as you might recall from the introductory article, aligned along the lines that meet at breadth-first convergence point. The other distinction is that each line drawn trough the sibling nodes has a slope less than that of the diagonal.

Reader might have already noticed that every first (or left-most) child is twice as big as its parent. Therefore, we call all nodes on each vertical line as geometric descendants.

An immediate property of the above labeling schema is density: we have all integer positive numbers utilized. Since some database implementations have integers of limited range only, density might become an attractive feature, because one would be able to store and manipulate bigger trees before hitting arithmetic overflow exception.

Partial Order Relation

Single integer labeling allows very simple definition of transitive closure relation. Given 2 nodes i and j such that i <= j let k be the maximal integer satisfying i*2^k <= j. For example, if i=1 and j=11, then k=3 and i*2^k=8. Geometrically, the i*2^k is the intersection of arithmetic siblings line originated in j and geometric descendants line originated in i.

Consider binary relation ancestor of satisfying the following condition:

i ancestor of j <=> j-i*2^k < 2^k-1

Informally, we require j to be closer to i*2^k than i to k, if we adjust for "geometric" distance between i and i*2^k. For example, let i=1, j=12, then k=3 and i*2^k=8. We have j-i*2^k = 4 and 2^k-1 = 4, therefore 12 is not an ancestor of 1. Indeed, 12 = 3*22 is an ancestor of node 3, which is a sibling of node 1. (If we allowed to consider the roots of the forest as siblings).

<proof that ancestor of is indeed a transitive relation snipped>

Distance function

Distance function is a straightforward implementation of ancestor relation defined in the previous section

function dist ( child integer, parent integer ) RETURN integer IS

    ret integer;
    arifgeom integer;
    pwr integer;
BEGIN
    if parent > child then

       RETURN -1;
    end if;
    if parent = child then

       RETURN 0;
    end if;
    arifgeom := parent;
    pwr := 0;
    while arifgeom*2 <= child loop

       arifgeom := arifgeom*2;
       pwr := pwr+1;

    end loop;
    if child-arifgeom < arifgeom/(parent*2) then

       RETURN pwr;
    end if;
    RETURN -1;
END; select dist(11,3), dist(12,3), dist(13,3), dist(14,3) from dual -1 2 2 -1

In theory, distance function is all that is required in order to answer hierarchical queries. In practice, however, query with distance function implies scanning the whole hierarchy table and applying ancestor of predicate to every row. With this method we obviously couldn't advance beyond small hierarchies.

Querying node's Ancestors

A typical query, that doesn't require scanning the whole hierarchy might ask to find the list of node's ancestors. Similar to materialized path, nested intervals, and binary rational labelings answering such a query doesn't involve querying hierarchy relation at all. Parent label calculation in our new integer labeling schema is simplicity itself. If the child number is even, then, child divided by 2 gives us the parent. Otherwise we don't jump to the parent right away, but instead, navigate to adjacent (real) sibling with smaller encoding. Technically this means, subtracting 1, and dividing the result by 2:

function immediate_parent ( child integer ) RETURN integer IS

    ret integer;
BEGIN
    ret := child;
    while mod(ret,2)=1 loop

       ret := (ret-1)/2;
    end loop;
    RETURN ret/2;
END;
/

select immediate_parent(21) from dual
5

Therefore, queries involving ancestors can be answered efficiently by iterative navigation through the ancestors chain. Technically, this iterating calls to the immediate_parent function can be wrapped inside a table function. (Another common name for table function is virtual view).

If a list of ancestors is used in as a part of more complex report, then, it would make sense creating index on a node number column. (Primary key on the node number might be a good idea, anyway). Assuming that the list of ancestors is small compared to total number of rows in the hierarchy table, then it would be more efficient to access each individual node on the ancestors list by unique index scan . Received on Tue Oct 21 2003 - 01:47:41 CEST

Original text of this message