Re: Hashing for DISTINCT or GROUP BY in SQL

From: -CELKO- <jcelko212_at_earthlink.net>
Date: Wed, 13 Oct 2010 19:15:55 -0700 (PDT)
Message-ID: <0cf5fec0-2f2b-4d23-88d1-9197cd60e073_at_g8g2000yqa.googlegroups.com>


Let me see if I can be clearer.

There is a proof that for KNOWN set, we can always find a perfect hashing function; a stronger proof for a minimal perfect hashing function also exists. The gimmick is that you have to know the set in advance, which is fine for things like keywords in a programming language. This makes a very fast symbol table. But in general they are complicated and ugly; no more simple MOD() and shift stuff

But if I have a multi-set S of tuples {s1, s2, ..sn} that is not known in advance, can I find a family of hash functions that I can apply something like this, in parallel

  1. Do Hash_1(S) => bucket[1], bucket[2] .., bucket[n]
  2. keep all bucket[i] where MIN(s) = MAX(s); they have one value
  3. Do Hash_2 (S) on buckets where MIN(s) <> MAX(s)
  4. repeat until all buckets are single valued on the hash key s

The idea is to re-hash until all buckets have one value for the hash key columns. Received on Thu Oct 14 2010 - 04:15:55 CEST

Original text of this message