Re: Hashing for DISTINCT or GROUP BY in SQL
From: -CELKO- <jcelko212_at_earthlink.net>
Date: Wed, 13 Oct 2010 19:15:55 -0700 (PDT)
Message-ID: <0cf5fec0-2f2b-4d23-88d1-9197cd60e073_at_g8g2000yqa.googlegroups.com>
Date: Wed, 13 Oct 2010 19:15:55 -0700 (PDT)
Message-ID: <0cf5fec0-2f2b-4d23-88d1-9197cd60e073_at_g8g2000yqa.googlegroups.com>
Let me see if I can be clearer.
There is a proof that for KNOWN set, we can always find a perfect hashing function; a stronger proof for a minimal perfect hashing function also exists. The gimmick is that you have to know the set in advance, which is fine for things like keywords in a programming language. This makes a very fast symbol table. But in general they are complicated and ugly; no more simple MOD() and shift stuff
But if I have a multi-set S of tuples {s1, s2, ..sn} that is not known in advance, can I find a family of hash functions that I can apply something like this, in parallel
- Do Hash_1(S) => bucket[1], bucket[2] .., bucket[n]
- keep all bucket[i] where MIN(s) = MAX(s); they have one value
- Do Hash_2 (S) on buckets where MIN(s) <> MAX(s)
- repeat until all buckets are single valued on the hash key s
The idea is to re-hash until all buckets have one value for the hash key columns. Received on Thu Oct 14 2010 - 04:15:55 CEST