Re: data structure for strings
Date: 2000/08/13
Message-ID: <Pine.LNX.4.10.10008131307010.23762-100000_at_Gloria.CAM.ORG>#1/1
> I would suggest you use a hash table with a hash function chosen
> depending on what you mean by the comparison operator *like*.
>
Sorry, I meant "identical" rather than "like". The only option I would
like to provide is case insensitivity, which isn't much of a problem
(simply store the records in lower-case and make sure to convert strings
to lower-case before retrieving from the table). I have indeed decided to
go for the hash table approach. I would've liked something like a perfect
hashing function, like that provided by gperf, but I must support anagrams
which gperf doesn't.
> The hash function and table size can be set up to give you an
> average of one function calculation and one word comparison per
> lookup.
>
Also, depending on the hash function chosen, I don't even need to hash
each character in the string. As Cichelli's article in CACM explains, only
hashing first and last characters as well as the length is sometimes
enough, although it necessarily requires a strcmp afterwards. In any case,
regardless of the hashing function, it should always be necessary to
strcmp afterwards since it will always be possible for two strings to hash
to the same value.
> If you mean *like* and not /identical/ your hash function should
> give similar outputs with similar words, you could use the
> Soundex or a similar phonetic encoding before hashing, you could
> generate a bit string with bits indicating letters present as
> input to the hash function; or forget the hashing and compute
> something like a bulls and cows score for each word in the table
> relative to the input word, with the highest bulls and cows
> scores giving the nearest matches to the input word.
>
Interesting, I've never heard of this "bulls and cows score". Can you
explain some more, I don't quite understand.
Received on Sun Aug 13 2000 - 00:00:00 CEST