Re: Search Engine Design

From: Kieran Elby <kieran_at_dunelm.org.uk>
Date: Thu, 27 Jun 2002 23:09:52 +0100
Message-ID: <3D1B8D30.D2903B24_at_dunelm.org.uk>


Matthew Paterson wrote:
>
> Hello all,
> I have a few questions about designing a search engine. This will be
> much like a number of other search engines in which the search will be
> based on a number of keywords. As well an advanced search will allow the
> user to search by a number of criteria. So here is my main question:
>
> What would be best:
> 1. To have one large table with all the keywords comma separated in one
> column and then when a user searches it goes through the whole table
> looking in the keyword column and parsing the list to see if it exists.
>
> OR
>
> 2. In the keyword table have a list of all the URL's that match that
> keyword. So if the user is searching for computer then it goes to the
> keyword table finds the keyword computer and returns a list of all URL's
> that match.
>
> The first way would be searching the whole table and the second
> requires the database to search the keyword table - which will be much
> smaller. There will possibly be up to 20,000 urls with 500 keywords - I
> just want to account for scalability.
>
> Any help you be appriciated. If there is a better way or anyone has any
> suggestions or resources please reply.
>

Wouldn't it be easier to do something along the lines of:

create table tSite (

    url char(255),
    keyword char(31)
) primary key (url, keyword);

Each tuple in tSite would indicate that the site at url has the word keyword associated with it - e.g.

tSite
url keyword

www.amazon.com   books
www.amazon.com   movies
www.amazon.com   buy
www.imdb.com     movies
www.imdb.com     database

...

To search it -

A 'match any' search is very easy and extremely efficient for the database to perform:

select url, keyword from tSite where keyword in ('buy','movies');

3 rows returned:

www.amazon.com, movies
www.amazon.com, buy
www.imbd.com, movies

Most database engines will create an index of some sort on keyword when you create the table (since it forms part of the primary key) so the database will be able to locate the entries very quickly, and without scanning the whole table.

A 'match all words' search is a bit trickier - this is the first approach I thought of:

(excuse the clumsy join syntax!)

select url from
  tSite s1,
  tSite s2,
  tSite s3
where

    s1.keyword = 'buy'
and s2.keyword = 'movies'
and s3.keyword = 'books'
and s1.url = s2.url
and s1.url = s3.url;

1 row returned:
www.amazon.com

Again, any decent database should be able to do this pretty efficiently.

Of course, if you want to start doing fancy searches like "(movies || books) & buy & not database", then you'd need a fairly clever algorithm to parse your search expression and generate the sql on the fly. Alternatively, you may want to do a simpler query and restrict it further in the application.

I'm sure there must be some academic literature out there on this kind of thing.

Regards,
Kieran Elby Received on Fri Jun 28 2002 - 00:09:52 CEST

Original text of this message