Re: Relational/hierarchical data problem

From: Bob Badour <bbadour_at_golden.net>
Date: Fri, 21 Feb 2003 01:56:22 -0500
Message-ID: <esk5a.162$Yh.17391651_at_mantis.golden.net>


Will,

I don't know what back-end you are using but thousands of animals and hundreds of births should not be a problem. You might have a problem if you had 50,000 animals in your herd and if each breeding female had a litter of 15. A hundred twenty goats bearing a couple kids at a time shouldn't be a problem at all.

Going back 5 generations, you are only dealing with 64 transitive closure records per animal. Using 32-bit integers for ID's etc, that works out to only 1k per animal before indexing. The last time I bought a hard-drive, which was several years ago now, it only cost $10 Cdn per gigabyte, which works out to about $6.50 US per gigabyte. Being generous and assuming 4 times as much storage for indexing as for the heap, $6.50 will buy enough storage for 200,000 animals.

The way I see it, you can calculate the transitive closure once then use it over and over, or you can re-calculate it each time you use it. I strongly suspect that calculating it once will cost less overall.

It's possible I used the wrong term when I named one column Consang. I did not mean it to represent the consanguinity between the descendant and the ancestor per se or to represent any standard method of calculation. I simply reasoned thus: If you breed two half-siblings, the progeny will only have three grandparents. For instance, if the half-siblings share a mother, their progeny will have twice the genetic contribution from their one and only grandmother.

It seems to me: If you are considering breeding one of those progeny with another descendant of the grandmother, the prospective pair are more consanguinous than might appear from simply sharing the female ancestor because her genetic contribution already appears twice for one of the prospective animals.

I do not know how breeders and geneticists calculate consanguinity, but I assumed that the doubling of an ancestor at a particular generation would have some significance so I included it and called it Consang. Perhaps, there is another name for the attribute or perhaps breeders simply ignore it. I don't know.

I assume the original table that records facts about each animal will also have a column indicating the consanguinity of each animal's birth. Sharing an ancestor seems riskier to me if the ancestor is already highly consanguinous. But, then again, as I say above, I know little about genetics and even less about husbandry so breeders and geneticists may know better and do something entirely different.

Don't forget that you can calculate most of the transitive closure for each new birth from the transitive closures of the parents using an "INSERT SELECT statement. The query, restricted to the number of generations required, would group by the ancestor and generation and add one to the generation. Maintaining the column I called Consang would be a SUM aggregate. Then an "INSERT VALUES" statement can create the "self" record with generation=0 and what I called Consang=1.

Again, I don't know what back-end you are using, but I expect sub-second response for those two statements.

The queries for pairing animals will still take much longer than that. Looking for NOT EXISTS will give the least consanguinity. Once all the animals share a common ancestor, though, the NOT EXISTS query will not return any results.

That leaves you with a minimax problem, which probably means doing a group by of an already grouped result. Regardless, both the NOT EXISTS and the double grouping are expensive operations.

You will want to restrict the query as much as possible. I assume your original table records which animals are breeding candidates ie. neither slaughtered nor sold, having reached sexual maturity and still fertile etc.

I would probably create temporary tables for the breeding males and the breeding females--you only need the id's. I would try these indexed and unindexed. They are likely to fit on a single database page, so indexing them might slow the query down.

I assume you will report the potential breeding males for each female by increasing consanguinity. If you have multiple breeding males, you may want to breed all of them even if they are never the least consanguinous.

You can purge old records from the transitive closure table too. Once all of a slaughtered animal's progeny are born and their transitive closures calculated, you may delete the transitive closure records where the descendant is the slaughtered animal.

Or, if you have a short list of breeding animals, you can extract their transitive closures into a temporary table and index it for the query.

Good luck,
Bob

"Will" <mrbrown_1998_at_yahoo.com> wrote in message news:8792b468.0302200353.694394c6_at_posting.google.com...
> Thanks to all who have replied.
>
> Bob - I found what you said very interesting. The number of animals in
> this case (goats, incidentally) is likely to be large. There are 41
> goats in the current herd and this equates to 208 animal records at
> present. The herd will stabilise at about 120 animals and will have a
> high turnover (they are meat goats). We are legally obliged to track
> the past 5 generations for any given animal, so in a fairly short time
> we will be dealing with thousands of animal records.
>
> > It might be possible
> > to
> > > pre-compute the transitive closure in another table then use that to
> > > evaluate the shared ancestors.
>
> I'd agree with that - it seems the best way to retain as much
> performance as possible.
>
> According to what my understanding, collateral consanguinity can be
> determined by two methods:
>
> 1) Canon/common law method - The degree of the more remote of the two
> subjects from their nearest common ancestor.
> E.g.
> Entity 1, Entity 2, Common Ancestor, Consang
> --------------------------------------------
> sibling, sibling, parent, 1
> child, parent, parent, 1
> child, grandparent, grandparent, 2
> nephew, uncle, grandparent, 2
> first cousin, first cousin, grandparent, 2
>
> 2) Civil law method - Add the degree of both entities from the nearest
> common ancestor.
> E.g.
> Entity 1, Entity 2, Common Ancestor, Consang
> --------------------------------------------
> sibling, sibling, parent, (1 + 1) = 2
> child, parent, parent, (1 + 0) = 1
> child, grandparent, grandparent, (2 + 0) = 2
> nephew, uncle, grandparent, (2 + 1) = 3
> first cousin, first cousin, grandparent, (2 + 2) = 4
>
> Evidently method 2 is a better one to use because there is less
> amibiguity.
>
> Bearing the above in mind, I have a question about the consanguinity
> values you stated:
>
> > > If parents are half-siblings, one of the grandparents would have
> > Consang=2.
>
> Taking one of the half-siblings as the subject, wouldn't the following
> be true?
> - parent -> any grandparent would be Consang=1
> - children of either parent -> any grandparent would be Consang=2? I
> don't see why one grandparent would have a different degree of
> consanguinity to any of the others from the perspective of either
> their children or their children's children. How this is affected by
> the fact that the parents are half-siblings?
>
> Thanks for your help so far!
>
> Regards
>
> Will
Received on Fri Feb 21 2003 - 07:56:22 CET

Original text of this message