Re: Search Engine

From: Art Pollard <pollarda_at_lextek.com>
Date: 8 Nov 2002 11:03:54 -0600
Message-ID: <3dcbd11e$0$17555$45beb828_at_newscene.com>


<SNIP>
> So an inverted file is just a database table sorted by the search string?
> Or are you meaning a file eg PC text file? Some people refer to database
> tables as files (esp. AS/400 ppl)

Well, an "inverted file" is also called an "inverted index." The terms are interchangeable.

What you are trying to accomplish is to take out all the extra stuff that you
would have with a traditional database system. For example, with a traditional
database index, you are going to have say 60% of your index space being empty space which is there because of BTree page splits and to allow for easy updates to the index. If you have very many occurances of a word then it also means that your index data for a given word will probably span non-contiguous pages. Furthermore, you have a key / value pair for each index entry. You really only need / want the value part because all the keys
will be the same for extended periods. (For example, if you have 25 records that have "apple" in it then you have 25 occurances of "apple" in your index.
It is much more efficient to only have "apple" in your index once followed by twenty five values.)

So, for example:

If you use a BTree for your index (which is what most / many databases use internally) you have:

25 words ("apple" (5 chars * 25) = 125 bytes 25 values (4 bytes * 25) = 100 bytes
25 internal values to the btree for management of data (4 bytes * 25) = 100 bytes.



325 bytes total for data + 40% for the empty space that is typically in a BTree

455 bytes.

Now if it is stored in the manner I suggested you have : 1 occurance of the word ("apple") = 5 bytes 25 values (4 bytes * 25) = 125 bytes.
1 value (tells how many records apple is in): 4 bytes



Total Length : 134 bytes

Now, for very small numbers of terms, it isn't going to make much difference since either in this case can probably be read with a single disk read unless
the first spans a Btree page boundry in which case it will take two seeks and reads.

> > A traditional database has far too many inefficiencies in it to make it
> > appropriate for
> > doing searches such as you are describing.. At least that is the case
if
> > your database
> > gets to be of any significant size.
>
> From this I take it you are suggesting moving to a file file! (not a DB
> table). This would be quite a large change to any system. Any
experiences
> of actual performance gains or problems encountered?

Yes, I am saying that you should use an inverted index instead of a regular database if this is what you want to do.

As far as performance gains here is an example I ran awhile back:

We had an index of 150,000 record. The query had 8 terms in it that were part of a boolean expression. The catcher is that 3 of the terms occurred in 80% of the records. (So they each occurred 140,000 times.) To process this query, our text indexing and retrieval engine (Onix: see: http://www.lextek.com/onix/ ) was able to do it in less than 100ms on a 500MHZ machine with standard IDE drives. To solve this same query with a regular database would have taken I'd guess 10-20 minutes.

So, yes, the speedups are considerable. On the other hand, there were quite a few optimizations on top of what I have told you that we performed. But the first major step is to build an inverted index.

-Art Received on Fri Nov 08 2002 - 18:03:54 CET

Original text of this message