Good way to implement string interning?

From: Peter Seibel <peter_at_gigamonkeys.com>
Date: Tue, 28 Feb 2006 17:36:21 GMT
Message-ID: <m2wtffv10v.fsf_at_gigamonkeys.com>



Suppose I have a file containing several million strings (say URIs) and I need to assign each unique string a numeric ID and store the string-to-ID and ID-to-string mapping as (time) efficiently as possible. Is there some clever technique databases use for this kind of task. What about if we know that many of the strings are going to have common prefixes (e.g. "http://www.gigamonkeys.com/book/" and "http://www.gigamonkeys.com/lispbox/")

The obvious solution is simply to use a btree with the strings as keys and the IDs as values, then each time we see a string we look it up in the btree and if it's there were done, otherwise we insert it with the next ID as the value. I've also considered using linear hashing (thus my post of the other day) since I don't need to be able to do range queries on the keys--just quickly determine what the ID associated with a string is. Another possibility, that I'm not really comfortable with, is to computer some sufficiently large hash of each string, such as a 96- or 128-bit CRC, and using that as the ID, living with the rather low probability of having collisions.[1]

Is there some other technique that the big boys actually use for this kind of thing that I'm missing?

-Peter

[1] Per the birthday paradox, you need to have 2^38 unique strings before the probability of having a collision in a uniformly distrbuted 128-bit hash is enough greater than 0 to be expressed as a double float. And then the probability is only around one in 10^16. Of course if you *do* have a collision then your data is good and corrupted which is why I'm not so crazy about this idea.

-- 
Peter Seibel           * peter_at_gigamonkeys.com
Gigamonkeys Consulting * http://www.gigamonkeys.com/
Practical Common Lisp  * http://www.gigamonkeys.com/book/
Received on Tue Feb 28 2006 - 18:36:21 CET

Original text of this message