Re: Basic help needed: how does indexing work...?!

From: Kendall <kendallwillets_at_yahoooo.com>
Date: Wed, 13 Feb 2002 10:20:45 -0800
Message-ID: <u6lbju50q1nlbc_at_news.supernews.com>


In article <3C69B745.7040706_at_netside.de>, "Lars Grunewaldt" <lg_at_netside.de> wrote:

> Thx for replying.
>
> we'd like to search as comfortable as possible, so, really great would
> be an indexing mechanism which makes something like '%searchstring%'
> possible (or at least, like a wordmatch in fulltext search, '%
> searchstring %'). At least we'd implement 'searchstring%'. We're going
> to have large databases on small memory size, so the index size is
> important to. I just have no experiences in this sector, what makes it
> even difficult to define our database structures - I could change them
> "this" or "that", if I'd know what I'm looking for. So, thats why I ask
> for something like an "howto-Index" or so... but all I found on the web
> where things like "howto-develop-beginner-mysql&php-websites" - done
> that for 3 years now, so I know, basically, how indicies work from the
> user side. But from the database side....? *sigh*
>
>
I went and looked at the description of your project. It looks like you're working on a translation (Japanese-English) dictionary.

One thing that might work is a trie for each language. You look up a word in a trie by starting at the root node, finding the child node corresponding to the first character of the word, finding the child for that node corresponding to the second character, and so on until you reach a leaf node designating the end of a legit word (each word corresponds 1-1 to a leaf node). You can find this description in lots of places, so I won't get into it too much.

For two-way translation, the following scheme might help: have a trie for each language, with each leaf node (representing a word) pointing to a leaf node on the other tree (representing the corresponding word in the other language). If each node on the trie has a back pointer to its parent, you can work your way up from a leaf node and get the characters of the word it represents (in reverse order). That way you can put an English word in the English trie, find the corresponding leaf node, follow the pointer to the corresponding Japanese word/node, and work your way back up the Japanese trie to find the characters of the translated word.

I'm not sure how well the trie would compress things, given the size of pointers these days, etc. You have the advantage that the data is static, so you could crunch pointers down to a few bits for a relative offset in an array, and of course you won't have any dead space.

A trie can also be augmented to support substring searching. Just link all the nodes for the same letter together (with an extra pointer on each node). Then, for a given substring, eg "%ab%", get all the "a" nodes and all the "b" nodes and look for pointers from the former to the latter. Once those are found expand them to get all the words containing the substring.

Also, look in the book "Programming Pearls" for info on how the original unix spell dictionary was compressed to 32k. Received on Wed Feb 13 2002 - 19:20:45 CET

Original text of this message