Re: how to suppress carefully a recursive tree

From: David BL <davidbl_at_iinet.net.au>
Date: Thu, 24 Jan 2008 07:38:09 -0800 (PST)
Message-ID: <0b3ac00a-0809-4cdb-9b6b-f3ebb9df6de1_at_s19g2000prg.googlegroups.com>


On Jan 24, 8:49 pm, fj <francois.j..._at_irsn.fr> wrote:
> On 24 jan, 01:43, David BL <davi..._at_iinet.net.au> wrote:
>
>
>
>
>
> > On Jan 24, 1:24 am, fj <francois.j..._at_irsn.fr> wrote:
>
> > > On 23 jan, 16:35, fj <francois.j..._at_irsn.fr> wrote:
>
> > > > On 23 jan, 14:32, David BL <davi..._at_iinet.net.au> wrote:
>
> > > > > On Jan 23, 4:10 pm, fj <francois.j..._at_irsn.fr> wrote:
>
> > > > > > On 23 jan, 01:12, David BL <davi..._at_iinet.net.au> wrote:
>
> > > > > > > On Jan 23, 1:10 am, fj <francois.j..._at_irsn.fr> wrote:
>
> > > > > > > > On 22 jan, 15:52, Jan Hidders <hidd..._at_gmail.com> wrote:
>
> > > > > > > > > On 22 jan, 12:04, fj <francois.j..._at_irsn.fr> wrote:
>
> > > > > > > > > > I know how to suppress a normal tree but I meet the following kind of
> > > > > > > > > > situation :
>
> > > > > > > > > I'm guessing that when you say "suppress" you mean "represent in a
> > > > > > > > > database". Correct?
>
> > > > > > > > No : I want to destroy, remove, kill ... a part of the data (a
> > > > > > > > complete tree or just a branch), but without destroying data shared by
> > > > > > > > other trees or branches.
>
> > > > > > > Would the mark and sweep algorithm suit you purposes?
>
> > > > > > No
>
> > > > > > the mark and sweep algorithm needs to know all the tree roots in order
> > > > > > to mark all the used objects.
>
> > > > > > In my example I indicates that the node b3 belongs to the root r2
> > > > > > just for information to explain the storage count. But the deletion
> > > > > > routine I want to write just receives "r1" as argument, nothing else.
> > > > > > It does not know that r2 exists too ... It is just able to detect the
> > > > > > existence of other objects via the storage count of b3 which is a
> > > > > > little bit too high for being just referenced by r1 !
>
> > > > > For a directed edge from node n1 towards node n2, let n2 be referred
> > > > > to as the "destination" node.
>
> > > > > n1 ---->---- n2
> > > > > source destination
>
> > > > > Maybe the following approach would work:
>
> > > > > It is assumed that node r1 is the root of the "tree" to be deleted.
>
> > > > > Step 1: Perform a trace (in the manner of a mark phase) treating r1
> > > > > as the one and only root and decrementing the storage counter on the
> > > > > destination node for each edge that is traversed. ie find the set X
> > > > > of nodes that are reachable from r1. Nodes that have already been
> > > > > visited are flagged to ensure each reachable node is visited exactly
> > > > > once. For each node during the recurse, retrieve the set of outgoing
> > > > > edges yielding a set of destination nodes. Decrement each destination
> > > > > node's storage counter (this must be done for all the destination
> > > > > nodes - ie including the nodes that have already been visited). Then
> > > > > recurse into the destination nodes that haven't been visited before.
>
> > > > > Step 2: Find the subset R of X corresponding to nodes whose storage
> > > > > counter hasn't fallen to zero.
>
> > > > > Step 3: Run a mark and sweep using roots R and sweeping over X. For
> > > > > each edge traversed in the mark phase, increment the storage counter
> > > > > on the destination node.
>
> > > > > In your example I think step 1 will find X = {r1,b1,b2,b3,b4} and only
> > > > > b3 will end up with a nonzero storage counter, so step 2 yields R =
> > > > > {b3}. Then in step 3, a trace from roots {b3} will bump the storage
> > > > > counter of b4 back up to 1 and only {r1,b1,b2} will be deleted.
>
> > > > > Am I making sense?
>
> > > > Yes.
>
> > > > I was arrived to a very similar solution as I explained in my second
> > > > message. Nevertheless, several steps are expensive, especially if I
> > > > suppose forbidden to add new pieces of information in the nodes
> > > > themselves : only the creation a new local tables of linked lists is
> > > > authorized. For instance :
>
> > > > - in the step 1, flagging a visited node to avoid recursion might be
> > > > expensive. At the moment, I create a linked list of nodes in
> > > > traversing the sub-tree. So when I visit a node then I start by
> > > > looking for it in the linked list. This simple research is expensive
> > > > (cost in O(n2) if n is the number of nodes in the subtree). If the
> > > > node is not found, then I add a new term in my linked list which
> > > > contains the node index and its storage count minus 1 (at the end,
> > > > this count will be the external count). If I find it then I just
> > > > decrement the external count in the found term.
>
> > > > - the step 2 is OK (cost in O(n))
>
> > > > - the step 3 may be expensive too (all depends on the number of nodes
> > > > with an external count greater that 0).
>
> > > > In my solution, the step 3 was replaced by traversing again the tree
> > > > from r1 and going down a node only if its external count is equal to
> > > > zero. During this step, the true storage count of nodes is decremented
> > > > and nodes with a count equal to zero are destroyed. But this solution
> > > > has again a drawback : looking for the current visited node in the
> > > > linked list in order to get its external count. A possible improvement
> > > > was to create a new field in each node used to store the external
> > > > count there rather than in a parallel linked list.
>
> > > > Another solution might be to create an array which size is the total
> > > > number of nodes. This array will replace the linked list. The
> > > > advantage is clear : looking for the external count in this array is
> > > > immediate, the node index corresponding to the array index. The
> > > > drawback is the size of the array, the total number of nodes being
> > > > potentially huge ( > 1 million) whereas the number of nodes in a
> > > > deleted branch is generally several orders of magnitude lower (but
> > > > often greater than 1000).
>
> > > Oups ! I think that the main drawback is the fact that this solution
> > > does not work correctly !
> > > So the step 3 of David is much better !
>
> > > After looking for GC on internet, I found the Python GC uses a
> > > technique similar to the David's solution.
>
> > Interesting. I had thought Python only used basic reference counting.
>
> > > > An intermediate solution consists in using an extensible hash
> > > > table ...
>
> > > > In any case, I would like a solution in O(n) else such algorithm is
> > > > not tractable.
>
> > The solution I suggested is O(n) where n = |X| if all of the following
> > are O(1)
> > * Retrieving the set of outgoing edges for a given node
> > * Retrieving the destination node for a given edge
> > * Reading/writing the visited flag for a given node
> > * Reading/writing the storage count for a given node
>
> > I would avoid linked lists. Assuming hash lookup is O(1), I think my
> > suggested solution can be O(n) despite constraints on what information
> > may be stored in a Node.
>
> How could you warrant that the step 3 is O[n] ? If the number of roots
> is high, this is not obvious because each mark and sweep algorithm is
> potentialy O(n).
>
> Or except perhaps if one avoids to visit other roots when dealing with
> a particular root ... Yes this could be O(n) finally.

Step 3 only involves a *single* mark and sweep. It begins by flagging all the roots as visited and pushing them on a stack. The mark phase involves popping a node from the stack and pushing adjacent nodes that haven't yet been visited, until the stack is empty. Note that it is best to mark nodes as visited as they are pushed onto the stack (rather than when they're popped). So at all times the stack only contains nodes that have been marked as visited. This makes it clear that the algorithm will never process a node more than once.

Step 3 will typically be faster than step 1 even though it may start with multiple roots. The time taken by a trace has more to do with the total number of visited nodes than the initial number of roots, and step 3 always visits a subset of the nodes that were visited in step 1. Received on Thu Jan 24 2008 - 16:38:09 CET

Original text of this message