Re: Hashing for DISTINCT or GROUP BY in SQL

From: Razvan Socol <rsocol_at_gmail.com>
Date: Tue, 12 Oct 2010 12:25:22 -0700 (PDT)
Message-ID: <7d9ca287-31f8-4d31-ad8b-e555b8409c0b_at_a15g2000yqm.googlegroups.com>


I assume that the hashing functions are applied to the same input (which is the entire row) and the hashes are somehow concatenated. As long as the size of the combined hashes is less then the size of the data row, there will always be the possibility to have two different rows which have the same hash. The pair may be difficult to find, one or both rows may be non-sensical, but they theoretically exist, so the answer to the last question is no (at least no practical hashing functions, that reduce the size of the input).

Regarding the other question, I think hashing would be faster than sorting (on today's parallel hardware), but it needs to be accompanied by a normal comparison afterwards, to produce the correct results.

Razvan Received on Tue Oct 12 2010 - 21:25:22 CEST

Original text of this message