Re: Good way to implement string interning?

From: Peter Seibel <peter_at_gigamonkeys.com>
Date: Tue, 28 Feb 2006 18:58:14 GMT
Message-ID: <m2slq3ux8a.fsf_at_gigamonkeys.com>


Jan Hidders <jan.hidders_at_REMOVETHIS.pandora.be> writes:

> Peter Seibel wrote:
>> Suppose I have a file containing several million strings (say URIs)
>> and I need to assign each unique string a numeric ID and store the
>> string-to-ID and ID-to-string mapping as (time) efficiently as
>> possible. Is there some clever technique databases use for this kind
>> of task. What about if we know that many of the strings are going to
>> have common prefixes (e.g. "http://www.gigamonkeys.com/book/" and
>> "http://www.gigamonkeys.com/lispbox/")
>
> An often used solution for the string -> id direction is a trie. For
> a short introduction see Wikipedia:
>
> http://en.wikipedia.org/wiki/Trie

Yup. Actually my current implementation uses an in-memory trie and is quite speedy but the memory usage gets a bit out of control when we're talking about millions or billions of unique strings. And there's the problem of making the trie persistent efficiently.

Is there some obvious mapping from a trie to an updatable on-disk structure that I just don't know about?

> I've also heard of a related hashing technique called "trie hashing"
> introduced by ... drumroll ... Witold Lidwin. See:
>
> http://portal.acm.org/citation.cfm?id=582322
>
> But I'm not really familiar with that myself. If you don't have access
> to the ACM portal, let me know and I'll see what I can do.

Yeah, I bought that paper too. I skimmed it but it didn't seem to be going anywhere useful for my problem. Maybe I'll give it another look.

-Peter

-- 
Peter Seibel           * peter_at_gigamonkeys.com
Gigamonkeys Consulting * http://www.gigamonkeys.com/
Practical Common Lisp  * http://www.gigamonkeys.com/book/
Received on Tue Feb 28 2006 - 19:58:14 CET

Original text of this message