Re: Basic help needed: how does indexing work...?!

From: Robert Stearns <is_at_uga.edu>
Date: Wed, 13 Feb 2002 10:28:05 -0500
Message-ID: <3C6A8605.9A0D00B4_at_uga.edu>


If you want to index everything and compress the data at the same time, then some type of dictionary compression might be best. It is a two-phase process.

In the first pass, you tokenize the input and count the frequency of each token, say T_i. Sort on frequency descending and assign new tokens, say T'_i, to the old ones in a pseudo-Huffman scheme: the first 128 tokens are assigned codes 128-255, the next 16,384 codes 16384-32767, and the remainder codes 131072 and up. That is 1 byte, 2 bytes, and 3 bytes respectively.
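
As a rough sketch of this first pass in Python (the tokenizer and all names here are my own invention, just to make the idea concrete):

    import re
    from collections import Counter

    def tokenize(text):
        # Assumed tokenizer: words plus single punctuation characters.
        return re.findall(r"\w+|[^\w\s]", text)

    def build_code_table(definitions):
        freq = Counter()
        for d in definitions:
            freq.update(tokenize(d))
        # Most frequent tokens get the shortest codes (pseudo-Huffman).
        code = {}
        for i, (tok, _) in enumerate(freq.most_common()):
            if i < 128:                       # 1-byte codes: 128-255
                code[tok] = 128 + i
            elif i < 128 + 16384:             # 2-byte codes: 16384-32767
                code[tok] = 16384 + (i - 128)
            else:                             # 3-byte codes: 131072 and up
                code[tok] = 131072 + (i - 128 - 16384)
        return code

Written big-endian, those three ranges make the leading bits of the first byte 1, 01, and 00 respectively, which is what lets the display loop below find each token's length.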

In the second pass, using the same tokenizer, for each T_i output two things: T'_i, and the triple {length(T'_i), T_i, L_j}, where L_j is the location of this token in the corpus (perhaps the defined word number). Store the stream of T'_i for reconstituting the definition on retrieval, along with a short table of where each definition starts, say D_k. Sort the stream of {length(T'_i), T_i, L_j} and store the results as three separate tables of {T_i, L_j} pairs, one for each length of T'_i.
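
Continuing the sketch (again, names are mine, not gospel), the second pass might emit the stream and build the three tables like so:

    def encode(c):
        # Big-endian, so the first byte's leading bits give the length:
        # 1xxxxxxx = 1 byte, 01xxxxxx = 2 bytes, 00xxxxxx = 3 bytes.
        n = 1 if c < 256 else (2 if c < 32768 else 3)
        return c.to_bytes(n, "big")

    def second_pass(definitions, code):
        stream = bytearray()              # the stream of T'_i
        D = []                            # where each definition starts
        tables = {1: [], 2: [], 3: []}    # {T_i, L_j} by length of T'_i
        for j, d in enumerate(definitions):
            D.append(len(stream))
            for tok in tokenize(d):
                enc = encode(code[tok])
                # One pair per occurrence; here L_j is the word number j.
                tables[len(enc)].append((tok, j))
                stream += enc
        for t in tables.values():
            t.sort()                      # ready for binary search
        return bytes(stream), D, tables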

Searching, which only supports 'searchstring%', consists of:

  1 Tokenizing the query with the same tokenizer as in passes 1 and 2,
    yielding Q_m, the query terms, which might include boolean operators
  2 Searching (binary search is fastest) the three tables (in size order)
    for each Q_m, getting for each one a set S_n of the L_j which
    correspond to the query term (sketched in code after the loop below)
  3 Using the boolean operators (and %), if present, to merge the S_n
    appropriately, getting S_0, the list of desired words

Displaying the results is even simpler:

  for each L_j in S_0 do

     p = L_j
     while p < the start of definition j+1 (found via D from pass two)
        retrieve token T'_i, using the leading bits of the first byte
        to determine its length
        retrieve T_i from the appropriate table by subscript
        display T_i
        increment p by the current token's length
     display a record separator
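
In the running sketch, the binary search and the display loop above come out roughly as follows (bisect is the standard-library binary search; by_code is the inverse of the pass-one code table, e.g. {v: k for k, v in code.items()}; and since the sketch's L_j is a word number, the starting offset comes from D):

    import bisect

    def lookup(tables, term):
        # 'term%' prefix match: find the first entry >= term in each
        # sorted table, then collect L_j while the prefix still holds.
        S = set()
        for table in tables.values():
            i = bisect.bisect_left(table, (term,))
            while i < len(table) and table[i][0].startswith(term):
                S.add(table[i][1])
                i += 1
        return S

    def display(stream, D, by_code, S0):
        for j in sorted(S0):
            p, end = D[j], (D[j + 1] if j + 1 < len(D) else len(stream))
            words = []
            while p < end:
                # Leading bits of the first byte give the token length.
                b = stream[p]
                n = 1 if b & 0x80 else (2 if b & 0x40 else 3)
                words.append(by_code[int.from_bytes(stream[p:p + n], "big")])
                p += n
            print(" ".join(words))        # record separator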

Needless to say, I have glossed over many details, but I have used this general scheme several times in my career, on everything from a TRS-80 Model III to the most current IBM S/390, with a modicum of success. The compression is not as good as ZIP, but you can decompress starting anywhere. The speed is fairly fast for search and very fast for display. The programming is not complex, with the possible exception of the tokenizer. There are many possible areas for further compression, the most obvious being the storage of the L_j in the three tables.
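
For instance, one common trick (not necessarily what I used) is to store each token's sorted L_j as gaps in a variable-length byte code:

    def delta_varint(locations):
        # Sorted L_j stored as deltas, 7 bits per byte, high bit = "more".
        out, prev = bytearray(), 0
        for loc in sorted(locations):
            gap, prev = loc - prev, loc
            while gap >= 0x80:
                out.append(0x80 | (gap & 0x7F))
                gap >>= 7
            out.append(gap)
        return bytes(out)

Since consecutive locations tend to be close together, most gaps fit in a single byte.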

Lars Grunewaldt wrote:

>
> Thx for replying.
>
> Well, we just started the project, so we're not sure about anything
> right now. Let's see:
> - the databases are converted on the PC side, so there is a fast
> processor for the indexing process
> - the database is not planned to change on the PDA, just uploaded and
> used. Maybe it'll come to "add this to your dictionary", but not for
> now. If searches could be made faster by dropping that, ... no problem.
> - the database records will be compressed, or, in fact, some may be and
> others not (it depends on the database). Indexing could be done before
> or after; we use Huffman encoding, so the search string could simply be
> compressed with the same algorithm and matched after compression
> - our database records might look something like this:
> varchar language1_1
> varchar language1_2
> varchar language2
>
> maybe some integer/number values, but they're not so hard to compare, so
> I don't think we'll have to use an index there.
>
> we'd like to make searching as comfortable as possible, so really great
> would be an indexing mechanism which makes something like
> '%searchstring%' possible (or at least a word match in fulltext search,
> '% searchstring %'). At the very least we'd implement 'searchstring%'.
> We're going to have large databases on a small memory size, so the
> index size is important too. I just have no experience in this area,
> which makes it even more difficult to define our database structures -
> I could change them "this" way or "that", if I knew what I was looking
> for. So, that's why I ask for something like a "howto-Index" or so...
> but all I found on the web were things like
> "howto-develop-beginner-mysql&php-websites" - I've done that for 3
> years now, so I know, basically, how indices work from the user side.
> But from the database side....?
> *sigh*
>
> hope this helps to answer my questions :)
>
> regards,
> lars
>
> Robert Stearns wrote:
>
> > I would need to know more about your needs before I could make a
> > reasonable recommendation. Are you trying to index just the headwords of
> > your dictionary or all of the definitions (quotes, examples, etc.; see
> > OEDV2 on CD to see what complete indexing is like). You mentioned a PDA:
> > are you going to compress your data, either before or after indexing? Is
> > speed of retrieval more important than either the space or the time
> > of indexing? Is the dictionary static, slow-growth only, or subject
> > to adds and deletes? Does the indexing have to be done on the PDA, or
> > will it be done on some type of faster processor?
> >
> > Lars Grunewaldt wrote:
> >
> >>OK, in short: I'm developing a database application (in fact, call it
> >>a dictionary). No, not just an application running on MySQL, Sybase,
> >>Ingres, or Oracle. I *must* do this without them (let's just say, PDA
> >>stuff; have a look at SourceForge if you'd like:
> >>www.sourceforge.net/projects/padict).
> >>
> >>It may be possible that my database size gets so large that I'd need
> >>indices on my text entries. So, I would need some clues about "how to
> >>index".
> >>
> >>I know a little bit about hashing theory, but what I need is a book
> >>recommendation, some newsgroups I could dig through, websites, whatever.
> >>
> >>I searched Google, which is where I could find everything else 'til now.
> >>
> >>But: I'm stuck.
> >>
> >>If you know some good books, websites, docs to read, samples, or
> >>open-source libraries...?
> >>
> >>regards,
> >> Lars
> >>