Re: Search Engine Design
Date: Thu, 27 Jun 2002 23:09:52 +0100
Message-ID: <3D1B8D30.D2903B24_at_dunelm.org.uk>
Matthew Paterson wrote:
>
> Hello all,
> I have a few questions about designing a search engine. This will be
> much like a number of other search engines in which the search will be
> based on a number of keywords. As well an advanced search will allow the
> user to search by a number of criteria. So here is my main question:
>
> What would be best:
> 1. To have one large table with all the keywords comma separated in one
> column and then when a user searches it goes through the whole table
> looking in the keyword column and parsing the list to see if it exists.
>
> OR
>
> 2. In the keyword table have a list of all the URL's that match that
> keyword. So if the user is searching for computer then it goes to the
> keyword table finds the keyword computer and returns a list of all URL's
> that match.
>
> The first way would be searching the whole table and the second
> requires the database to search the keyword table - which will be much
> smaller. There will possibly be up to 20,000 urls with 500 keywords - I
> just want to account for scalability.
>
> Any help you be appriciated. If there is a better way or anyone has any
> suggestions or resources please reply.
>
Wouldn't it be easier to do something along the lines of:
create table tSite (
url char(255),
keyword char(31)
) primary key (url, keyword);
Each tuple in tSite would indicate that the site at url has the word keyword associated with it - e.g.
tSite
url keyword
www.amazon.com books www.amazon.com movies www.amazon.com buy www.imdb.com movies www.imdb.com database
...
To search it -
A 'match any' search is very easy and extremely efficient for the database to perform:
select url, keyword from tSite where keyword in ('buy','movies');
3 rows returned:
www.amazon.com, movies www.amazon.com, buy www.imbd.com, movies
Most database engines will create an index of some sort on keyword when you create the table (since it forms part of the primary key) so the database will be able to locate the entries very quickly, and without scanning the whole table.
A 'match all words' search is a bit trickier - this is the first approach I thought of:
(excuse the clumsy join syntax!)
select url from
tSite s1,
tSite s2,
tSite s3
where
s1.keyword = 'buy' and s2.keyword = 'movies' and s3.keyword = 'books' and s1.url = s2.url and s1.url = s3.url;
1 row returned:
www.amazon.com
Again, any decent database should be able to do this pretty efficiently.
Of course, if you want to start doing fancy searches like "(movies || books) & buy & not database", then you'd need a fairly clever algorithm to parse your search expression and generate the sql on the fly. Alternatively, you may want to do a simpler query and restrict it further in the application.
I'm sure there must be some academic literature out there on this kind of thing.
Regards,
Kieran Elby
Received on Fri Jun 28 2002 - 00:09:52 CEST