Re: data structure for strings

From: Marc Tardif <intmktg_at_Gloria.CAM.ORG>
Date: 2000/08/13
Message-ID: <Pine.LNX.4.10.10008131307010.23762-100000_at_Gloria.CAM.ORG>#1/1


> I would suggest you use a hash table with a hash function chosen
> depending on what you mean by the comparison operator *like*.
>
Sorry, I meant "identical" rather than "like". The only option I would like to provide is case insensitivity, which isn't much of a problem (simply store the records in lower-case and make sure to convert strings to lower-case before retrieving from the table). I have indeed decided to go for the hash table approach. I would've liked something like a perfect hashing function, like that provided by gperf, but I must support anagrams which gperf doesn't.

> The hash function and table size can be set up to give you an
> average of one function calculation and one word comparison per
> lookup.
>
Also, depending on the hash function chosen, I don't even need to hash each character in the string. As Cichelli's article in CACM explains, only hashing first and last characters as well as the length is sometimes enough, although it necessarily requires a strcmp afterwards. In any case, regardless of the hashing function, it should always be necessary to strcmp afterwards since it will always be possible for two strings to hash to the same value.

> If you mean *like* and not /identical/ your hash function should
> give similar outputs with similar words, you could use the
> Soundex or a similar phonetic encoding before hashing, you could
> generate a bit string with bits indicating letters present as
> input to the hash function; or forget the hashing and compute
> something like a bulls and cows score for each word in the table
> relative to the input word, with the highest bulls and cows
> scores giving the nearest matches to the input word.
>
Interesting, I've never heard of this "bulls and cows score". Can you explain some more, I don't quite understand. Received on Sun Aug 13 2000 - 00:00:00 CEST

Original text of this message