Re: Help with our Search Engine

From: Art Pollard <ctir_at_lextek.com>
Date: 8 Feb 2003 19:02:13 -0600
Message-ID: <5.1.0.14.2.20030208174842.03b568c0_at_mail.thinkware.net>



At 06:57 AM 2/8/2003 -0800, you wrote:
>We've built a search engine (Platform = ASP .NET/SQL server 2000) that
>is composed with the following characteristics:
><SNIP>
>-We are using SQL server full-text to index the data and then to
>perform the searches.
>
>-After the search is performed we have a ranking to display the website.
>The ranking is performed basically according to where the string was
>found, for example if the string was found in the title it has a
>higher point than if it was found in the body of the page.
>
>The Engine outputs results ok, the problem is that it is slow. (It
>takes from 5 to 60 seconds in a server with 1,2 Gigas of CPU)

I am actually surprised that it works as quickly as it does. SQL Server is not an appropriate tool for what you are trying to do. Why? Because the indexes in regular database systems are optimized for high volumes of insertions and deletions, but with only a few keys per record. Regular text, however, has hundreds and sometimes thousands of keys (in this case, words) per record. This makes the indexes in standard database systems choke quite quickly.

What you need is a text retrieval system that is designed specifically for what you are trying to do.

A text retrieval system typically organizes its indexes much like the index of a book, i.e.,

Apple: 1, 5, 34, 567, 2902, .... etc.

with all the index entries for a word stored contiguously,

while a standard database organizes the index into key/value pairs, i.e.,

Apple 1
Apple 5
Apple 34
Apple 567
Apple 2902
etc...

On top of that, approximately 20% of the index is empty space, and the key/value pairs for a single word are spread across many different blocks, which slows access further. The extra space, BTW, is one of the reasons why normal database systems accept insertions and deletions as well as they do. In an information retrieval system, that same layout is often fatal to performance.
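To make the difference between the two layouts concrete, here is a minimal Python sketch. It is only an in-memory illustration, not how SQL Server or any particular retrieval engine stores its index on disk. The function names and the sample documents are made up for the example.

```python
import bisect
from collections import defaultdict

# Hypothetical sample data: document id -> text.
SAMPLE_DOCS = {
    1: "apple pie",
    5: "apple tart",
    34: "green apple",
    567: "pear cake",
}

def build_postings(docs):
    """Book-style index: each word maps to ONE contiguous, sorted list of doc ids."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].lower().split()):
            postings[word].append(doc_id)
    return dict(postings)

def build_pairs(docs):
    """Database-style index: one separate (word, doc_id) entry per word per document."""
    return sorted(
        (word, doc_id)
        for doc_id, text in docs.items()
        for word in set(text.lower().split())
    )

def lookup_postings(postings, word):
    """One lookup returns the whole list for the word."""
    return postings.get(word, [])

def lookup_pairs(pairs, word):
    """Find the first entry for the word, then walk entry by entry."""
    i = bisect.bisect_left(pairs, (word,))
    result = []
    while i < len(pairs) and pairs[i][0] == word:
        result.append(pairs[i][1])
        i += 1
    return result

postings = build_postings(SAMPLE_DOCS)
pairs = build_pairs(SAMPLE_DOCS)
print(lookup_postings(postings, "apple"))
print(lookup_pairs(pairs, "apple"))
```

Both lookups return the same document ids. The point is the access pattern: the first reads one contiguous list, while the second touches one entry per document, and in an on-disk B-tree those entries may sit in different blocks.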

You may want to check out http://www.searchtools.com, which has a listing of the commercial and non-commercial solutions that are available. It is also a very good resource in its own right.

-Art
Art Pollard
(Moderator, comp.theory.info-retrieval)

Received on Sun Feb 09 2003 - 02:02:13 CET
