Re: Hashing for DISTINCT or GROUP BY in SQL

From: Tegiri Nenashi <tegirinenashi_at_gmail.com>
Date: Tue, 12 Oct 2010 13:08:10 -0700 (PDT)
Message-ID: <80313ee0-3396-4c09-9c00-403159459167_at_26g2000yqv.googlegroups.com>


On Oct 12, 8:35 am, -CELKO- <jcelko..._at_earthlink.net> wrote:
> In the old days, a DISTINCT or GROUP BY in an SQL engine were done by
> a sort. Back then we built early SQLs on top of existing file systems
> and we had pretty good sort procedures in the library. How many kids
> today have ever seen a polyphase merge sort on tape drives?
>
> But today with parallel hardware and good hashing algorithms, would it
> be faster to use hashing to cluster equivalent classes of data
> together?
>
> Obviously, if two data values are equal, they will have the same hash
> for all hashing functions. But two different values can have the same
> hash for any one hashing function.
>
> Does there exist a set of hashing functions, H1(), H2(), .., Hn()
> which will produce at least one different result for any pair of data
> values?

This days generics programming is ubiquitous. In C++ there is Standard Templates Library, while in Java there is Collections Library. Either way one usually declares a Set object, and not SortedSet, or HashSet. This way one can easily switch from one implementation to the other, and delegate hard concurrency and performance issues to masters like Doug Lee. Speaking SQL, some of the "big 3" vendors still code in C, but that is another story... Received on Tue Oct 12 2010 - 22:08:10 CEST

Original text of this message