Re: Good way to implement string interning?

From: Jan Hidders <jan.hidders_at_REMOVETHIS.pandora.be>
Date: Tue, 28 Feb 2006 20:47:00 GMT
Message-ID: <8F2Nf.285618$tz5.9107571_at_phobos.telenet-ops.be>


Peter Seibel wrote:
> Jan Hidders <jan.hidders_at_REMOVETHIS.pandora.be> writes:
>
>

>>Peter Seibel wrote:
>>
>>>Suppose I have a file containing several million strings (say URIs)
>>>and I need to assign each unique string a numeric ID and store the
>>>string-to-ID and ID-to-string mapping as (time) efficiently as
>>>possible. Is there some clever technique databases use for this kind
>>>of task. What about if we know that many of the strings are going to
>>>have common prefixes (e.g. "http://www.gigamonkeys.com/book/" and
>>>"http://www.gigamonkeys.com/lispbox/")
>>
>>An often used solution for the string -> id direction is a trie. For
>>a short introduction see Wikipedia:
>>
>>http://en.wikipedia.org/wiki/Trie

>
> Yup. Actually my current implementation uses an in-memory trie and is
> quite speedy but the memory usage gets a bit out of control when we're
> talking about millions or billions of unique strings. And there's the
> problem of making the trie persistent efficiently.
>
> Is there some obvious mapping from a trie to an updatable on-disk
> structure that I just don't know about?

Did you try a mixed approach? Like, taking a hash function as you already suggested, and then use a B-tree to store the association between the result of the hash function and the actual string (or strings, if there is a collision). That way you remove the strings from the index tree, which is what makes the trie probably so big, so maybe you could then still keep large parts of it in main memory.

Sorry, for not being more helpful. This is not my specialty, but I'm certainly interested.

  • Jan Hidders
Received on Tue Feb 28 2006 - 21:47:00 CET

Original text of this message