Re: Hashing for DISTINCT or GROUP BY in SQL

From: Sampo Syreeni <decoy_at_iki.fi>
Date: Tue, 19 Oct 2010 17:18:21 -0700 (PDT)
Message-ID: <01309eab-9da6-4c9f-8532-c6625cd3de72_at_h7g2000yqn.googlegroups.com>


On Oct 12, 7:35 pm, -CELKO- <jcelko..._at_earthlink.net> wrote:

> In the old days, a DISTINCT or GROUP BY in an SQL engine were done by
> a sort. Back then we built early SQLs on top of existing file systems
> and we had pretty good sort procedures in the library. How many kids
> today have ever seen a polyphase merge sort on tape drives?

Simulated that on disk; been there. You have to be rather careful with the "tape" labels unless you were fancy enough to do proper error checking in code.

> But today with parallel hardware and good hashing algorithms, would it
> be faster to use hashing to cluster equivalent classes of data
> together?

In the case of flat distributions and good enough caching, often yes. That is why under e.g. Oracle we rarely see sort-merge joins and often see hash joins (I hope Grace, zigzag or one of the other more pipelined and better-cached variants). When there is skew, adaptive subdivision algorithms work better, as in dynamic hashing or 0-complete trees of pages. For flat distributions with limited total cardinality, binning (and by extension dynamic radix-sort-like algorithms) works better still. In any case, buffer trees are near-optimal in aggregate, while splay trees and other near-cache-oblivious designs can lead to both near-linear access and cache locality.
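To make that concrete, here is a minimal sketch of hash aggregation next to sort-based aggregation (Python, all names mine; a real engine would of course spill partitions to disk Grace-style once the input outgrows memory):

    from collections import defaultdict

    def group_by_sort(rows, key, agg):
        # Sort-based grouping: order by key, then scan runs of equal keys.
        rows = sorted(rows, key=key)              # O(n log n)
        out, run, current = [], [], None
        for r in rows:
            k = key(r)
            if run and k != current:
                out.append((current, agg(run)))
                run = []
            current = k
            run.append(r)
        if run:
            out.append((current, agg(run)))
        return out

    def group_by_hash(rows, key, agg):
        # Hash-based grouping: one pass, equal keys cluster into one bucket.
        buckets = defaultdict(list)               # O(n) expected
        for r in rows:
            buckets[key(r)].append(r)
        return [(k, agg(v)) for k, v in buckets.items()]

    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    total = lambda g: sum(v for _, v in g)
    print(sorted(group_by_hash(rows, key=lambda r: r[0], agg=total)))
    print(group_by_sort(rows, key=lambda r: r[0], agg=total))

The hash version never orders anything; it only needs the equivalence classes, which is exactly what DISTINCT and GROUP BY ask for.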

> Does there exist a set of hashing functions, H1(), H2(), .., Hn()
> which will produce at least one different result for any pair of data
> values?

In general, obviously not when the number of different inputs exceeds the number of possible outputs. That is the pigeon-hole principle right there.

At the very limit, when the possible numbers of inputs and outputs are precisely equal, it's a definite yes. Then you'd be aiming at a pseudorandom permutation over an arbitrary domain. That can always be done by choosing some suitable distribution of cycle lengths which sum up to the total width and then filling the cycles with randomly mixed elements. It's messy and expensive, but yes, it can be done.
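A toy sketch of that construction (Python, purely illustrative, names mine): draw cycle lengths that sum to the domain size, fill each cycle with randomly shuffled elements, and read the permutation off the cycles.

    import random

    def random_permutation_by_cycles(n, seed=0):
        # Build a permutation of {0, ..., n-1}: draw cycle lengths summing
        # to n, then fill the cycles with randomly mixed elements.
        rng = random.Random(seed)
        lengths, left = [], n
        while left:
            c = rng.randint(1, left)
            lengths.append(c)
            left -= c
        elems = list(range(n))
        rng.shuffle(elems)
        perm, i = [0] * n, 0
        for c in lengths:
            cycle = elems[i:i + c]
            for j, x in enumerate(cycle):
                perm[x] = cycle[(j + 1) % c]  # each element maps to the next on its cycle
            i += c
        return perm

    p = random_permutation_by_cycles(16)
    assert sorted(p) == list(range(16))       # a bijection: every output hit exactly once
    print(p)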

If the number of different inputs is below the number of possible outputs, obviously yes. You could just use a potentially *huge* lookup table. But what you're likely looking for here, and in the previous case, is a computationally efficient means of distributing the values. In all cases you'd also want some extra, global metric of "good redistribution" on top.
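When the set of actual inputs is known and small, the huge lookup table can be traded for a cheap parameter search that happens to be collision-free on just that set. A rough sketch (Python, parameters and names mine; a trial-and-error stand-in for proper perfect-hash constructions such as FKS or CHD):

    import random

    def find_collision_free_multiplier(keys, slots, tries=100_000, seed=1):
        # Look for a multiplier a such that (a * k) % p % slots is
        # injective on this particular key set.
        p = 2_147_483_647                     # a large prime well above any key
        rng = random.Random(seed)
        for _ in range(tries):
            a = rng.randrange(1, p)
            images = {(a * k) % p % slots for k in keys}
            if len(images) == len(keys):      # no two keys collided
                return a, p
        raise RuntimeError("no collision-free multiplier found; try more slots")

    keys = [3, 17, 42, 19, 1012, 2010]
    a, p = find_collision_free_multiplier(keys, slots=8)
    print([(k, (a * k) % p % 8) for k in keys])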

This sort of thing is then highly nontrivial. The physicists have formalized this sort of intuition pretty well within ergodic theory, using continuous variables; there it's about globally measure-preserving mappings. But the discrete case remains nasty: in small cases proven to be insoluble, in very small cases proven to be inapproximable as well, and even in larger cases still computationally intractable when phrased as an optimization problem. Cryptographers battle with this stuff every day, because what you're probably asking for is also the ideal mixing function within a symmetric cipher, given a constant key.
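One practical way to get such a keyed mixing permutation is a Feistel network: whatever the round function computes, the construction stays invertible, so distinct inputs can never collide. A toy sketch (Python, round function and constants made up by me, not a real cipher):

    def feistel32(x, rounds=4, key=0x9E3779B9):
        # Split a 32-bit value into 16-bit halves and run a few Feistel
        # rounds; the network is invertible no matter what F computes,
        # so distinct inputs can never map to the same output.
        L, R = (x >> 16) & 0xFFFF, x & 0xFFFF
        for i in range(rounds):
            F = ((R * 0x5BD1 + key + i) ^ (R >> 7)) & 0xFFFF  # arbitrary made-up mixing
            L, R = R, L ^ F
        return (L << 16) | R

    # Injective on any subset of the 32-bit domain: no collisions at all.
    outs = {feistel32(x) for x in range(100_000)}
    assert len(outs) == 100_000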

Finally, let's not forget that this framework perhaps isn't the best one for us. Most of our data isn't random, and it most certainly isn't processed in a random order. That is why a simple linear congruence relation works so well as a hash function: two separate inputs will not lead to the same outcome unless they are an exact multiple of the number of possible outcomes apart. Increasing input values lead to increasing output values, except when we wrap around, so we can utilize linear instead of random scans further down the stack. We can always choose the wraparound to be mutually prime to most of the known cycles our data might have, like day-month-year cycles, or estimated peaks in name statistics if those should statistically affect the input. And as long as you hash, you just can't avoid ones, zeroes, nulls and the like being rather singular in the input; you have to do something different with them in any case.
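To close the loop, a small illustration of those last points (Python, the numbers are mine): with h(x) = x mod m, two keys collide only when they are an exact multiple of m apart, increasing keys land in increasing buckets until the wrap, and picking m coprime to the known periodicities (7, 12, 365, ...) keeps those cycles from piling up into a few buckets.

    from math import gcd

    def lc_hash(x, m):
        # Plain modular hash: order-preserving between wraparounds, and
        # x collides with y only when (x - y) is an exact multiple of m.
        return x % m

    # Pick a table size coprime to the known cycles in the data.
    cycles = (7, 12, 24, 60, 365)
    m = 1021                                  # prime, hence coprime to all of them
    assert all(gcd(m, c) == 1 for c in cycles)

    days = list(range(0, 3650, 7))            # weekly events over roughly ten years
    buckets = {lc_hash(d, m) for d in days}
    print(len(buckets), "distinct buckets for", len(days), "keys")

With m = 1021 every one of those 522 weekly keys gets its own bucket; with m = 364 they would all pile into 52 of them.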
