Re: Search Engine
Date: 8 Nov 2002 11:03:54 -0600
Message-ID: <3dcbd11e$0$17555$45beb828_at_newscene.com>
<SNIP>
> So an inverted file is just a database table sorted by the search string?
> Or are you meaning a file eg PC text file? Some people refer to database
> tables as files (esp. AS/400 ppl)
Well, an "inverted file" is also called an "inverted index." The terms are interchangeable.
What you are trying to accomplish is to take out all the extra stuff that
you
would have with a traditional database system. For example, with a
traditional
database index, you are going to have say 60% of your index space being
empty space which is there because of BTree page splits and to allow for
easy updates to the index. If you have very many occurances of a word then
it also means that your index data for a given word will probably span
non-contiguous pages. Furthermore, you have a key / value pair for each
index entry. You really only need / want the value part because all the
keys
will be the same for extended periods. (For example, if you have 25 records
that have "apple" in it then you have 25 occurances of "apple" in your
index.
It is much more efficient to only have "apple" in your index once followed
by twenty five values.)
So, for example:
If you use a BTree for your index (which is what most / many databases use internally) you have:
25 words ("apple" (5 chars * 25) = 125 bytes
25 values (4 bytes * 25) = 100 bytes
25 internal values to the btree for management of data (4 bytes * 25) = 100
bytes.
325 bytes total for data + 40% for the empty space that is typically in a BTree
455 bytes.
Now if it is stored in the manner I suggested you have :
1 occurance of the word ("apple") = 5 bytes
25 values (4 bytes * 25) = 125 bytes.
1 value (tells how many records apple is in): 4 bytes
Total Length : 134 bytes
Now, for very small numbers of terms, it isn't going to make much difference
since either in this case can probably be read with a single disk read
unless
the first spans a Btree page boundry in which case it will take two seeks
and reads.
> > A traditional database has far too many inefficiencies in it to make it
> > appropriate for
> > doing searches such as you are describing.. At least that is the case
if
> > your database
> > gets to be of any significant size.
>
> From this I take it you are suggesting moving to a file file! (not a DB
> table). This would be quite a large change to any system. Any
experiences
> of actual performance gains or problems encountered?
Yes, I am saying that you should use an inverted index instead of a regular database if this is what you want to do.
As far as performance gains here is an example I ran awhile back:
We had an index of 150,000 record. The query had 8 terms in it that were part of a boolean expression. The catcher is that 3 of the terms occurred in 80% of the records. (So they each occurred 140,000 times.) To process this query, our text indexing and retrieval engine (Onix: see: http://www.lextek.com/onix/ ) was able to do it in less than 100ms on a 500MHZ machine with standard IDE drives. To solve this same query with a regular database would have taken I'd guess 10-20 minutes.
So, yes, the speedups are considerable. On the other hand, there were quite a few optimizations on top of what I have told you that we performed. But the first major step is to build an inverted index.
-Art Received on Fri Nov 08 2002 - 18:03:54 CET