Creating the right index for multiple keyword searching
Date: 11 Apr 2003 03:56:25 -0700
Message-ID: <e6dbc8db.0304110026.421ca5e9_at_posting.google.com>
Hi,
I am developing a search engine (nothing special about that, I know ;) ) and am now at the stage of creating the right index to search. The spider/crawler and the ranking rules are done and run OK (implemented in PHP; a single machine handles about 200K fully parsed webpages per day).
One thing I do know now is that a SQL database is not fast enough to
return the search results for a query.
I want to use some sort of inverted file, but I'm drawing a blank here.
This is what I have:
The inverted file looks like this (only docids with rank, in fixed-size records: docid followed by a 3-digit rank):
000000002012
000000001009
000000003001
000000009089
000000005021
000000002008
plus an index file (wordid and pointer into the inverted file), which I keep in memory:
1,0 2,3 3,5
So searching for the word 'something', which happens to have wordid 2, results in docids 9 (rank 89) and 5 (rank 21), and I know from the index file that there are 2 (5-3) docids for that wordid.
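To make the layout concrete, here is a minimal Python sketch of the single-word lookup described above. It assumes each 12-character record is a 9-digit docid followed by a 3-digit rank (which matches the example values), and that a word's list ends where the next word's list begins. The names `INVERTED`, `INDEX` and `postings` are made up for illustration; a real version would seek into the file instead of slicing a string.

```python
RECORD_WIDTH = 12  # assumed: 9 digits of docid followed by 3 digits of rank

# The six example records, concatenated as they would sit in the inverted file.
INVERTED = (
    "000000002012" "000000001009" "000000003001"
    "000000009089" "000000005021" "000000002008"
)

# In-memory index: wordid -> number of the first record for that word.
INDEX = {1: 0, 2: 3, 3: 5}


def postings(wordid):
    """Return [(docid, rank), ...] for one wordid, in file order."""
    start = INDEX[wordid]
    # The list ends where the next word's list starts, or at end of file.
    later_starts = [s for s in INDEX.values() if s > start]
    end = min(later_starts) if later_starts else len(INVERTED) // RECORD_WIDTH
    result = []
    for i in range(start, end):
        record = INVERTED[i * RECORD_WIDTH:(i + 1) * RECORD_WIDTH]
        result.append((int(record[:9]), int(record[9:])))
    return result


print(postings(2))  # the two records for wordid 2
```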
Now for my problem: I want to be able to search with more than 1 keyword (AND and NOT search).
Using the same setup for the inverted file and index file, a search for 2 or more keywords would result in a long process of checking ALL possibilities (to get the optimal result set). Right now the docids are sorted by rank, which makes merging several doclists even harder. Sorting by docid would make merging easier, but it would still be unthinkable (read: take too much time) to check for optimal results when searchword1 has 150,000 results and searchword2 has 1.5 million.
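For reference, one common way to attack this (a sketch under assumptions, not part of the setup above) is to keep each doclist sorted by docid and, for AND, walk the shorter list while binary-searching the longer one. That costs roughly len(short) * log2(len(long)) comparisons: for 150,000 against 1.5 million that is on the order of 3 million comparisons rather than a scan of both lists. Ranking is then applied only to the survivors. The parallel `ids`/`ranks` lists, the function names, and the idea of summing the two ranks are all assumptions made for illustration.

```python
from bisect import bisect_left
import heapq


def intersect(a_ids, a_ranks, b_ids, b_ranks):
    """AND: docids present in both lists. Each *_ids list is sorted ascending."""
    if len(a_ids) > len(b_ids):  # always iterate over the shorter list
        a_ids, a_ranks, b_ids, b_ranks = b_ids, b_ranks, a_ids, a_ranks
    out = []
    lo = 0
    for docid, rank in zip(a_ids, a_ranks):
        # Search only the part of the long list not yet passed.
        lo = bisect_left(b_ids, docid, lo)
        if lo == len(b_ids):
            break
        if b_ids[lo] == docid:
            out.append((docid, rank + b_ranks[lo]))  # assumed: combine by sum
    return out


def exclude(a_ids, a_ranks, banned_ids):
    """NOT: keep docids from the first list that are absent from banned_ids."""
    out = []
    for docid, rank in zip(a_ids, a_ranks):
        i = bisect_left(banned_ids, docid)
        if i == len(banned_ids) or banned_ids[i] != docid:
            out.append((docid, rank))
    return out


def top(results, k):
    """Best k (docid, rank) pairs by rank, without sorting everything."""
    return heapq.nlargest(k, results, key=lambda r: r[1])


# Doclists for wordid 1 and wordid 3 from the example, re-sorted by docid.
w1_ids, w1_ranks = [1, 2, 3], [9, 12, 1]
w3_ids, w3_ranks = [2], [8]
print(intersect(w1_ids, w1_ranks, w3_ids, w3_ranks))
print(top(exclude(w1_ids, w1_ranks, w3_ids), 1))
```

The trade-off is that the rank order stored on disk is lost, so the top results have to be picked after merging; `heapq.nlargest` keeps that step cheap when only the first page of results is needed.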
Does anyone have a brilliant idea? I'm sure there is a fantastic solution for this; I am just unable to find/think of it.
thx
[also posted in comp.theory.info-retrieval, but that one seems to be somewhat inactive, hence the post here]
Received on Fri Apr 11 2003 - 12:56:25 CEST