comp.databases.theory: Re: Searching Google n-gram corpus
bobterwillinger_at_gmail.com wrote:
>>In actuality, this is much better handled by a custom search engine
>>designed along the same lines but with a lot of compression. If you are
>>interested in the latter, I will be willing to explain further.
The word lists can be stored one record per word:
wordid as described above, position in ngram, number of ngrams, list of ngrams
The list of ngrams starts with a marker indicating whether what follows is a list of ngramids or a bit list with each bit position representing an ngram. If it is a bit list, several types of RLE can be used to shorten it.
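To make the record layout concrete, here is a minimal Python sketch of one word record with the marker and both list forms. Everything named here (WordRecord, ID_LIST, BIT_LIST, the function names) is made up for illustration, and the rule for choosing between the two forms (take whichever has fewer entries) is only an assumption. The RLE shown is the simplest kind, alternating run lengths starting with a run of zero bits; a real implementation would pack these into bytes.

```python
from dataclasses import dataclass

# Hypothetical marker values; the post only says a marker exists.
ID_LIST = 0   # payload is a sorted list of ngramids
BIT_LIST = 1  # payload is a run-length-encoded bit list


@dataclass
class WordRecord:
    word_id: int
    position: int     # position of the word within the ngram
    num_ngrams: int   # number of ngrams in the list
    marker: int       # ID_LIST or BIT_LIST
    payload: list     # ngramids, or run lengths of the bit list


def rle_encode_bits(bits):
    """Encode 0/1 values as alternating run lengths, starting with zeros."""
    runs, current, count = [], 0, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs


def rle_decode_bits(runs):
    """Inverse of rle_encode_bits."""
    bits, value = [], 0
    for n in runs:
        bits.extend([value] * n)
        value ^= 1
    return bits


def make_record(word_id, position, ngram_ids, total_ngrams):
    """Build a record, picking whichever list form has fewer entries."""
    ids = sorted(set(ngram_ids))
    bits = [0] * total_ngrams
    for i in ids:
        bits[i] = 1
    runs = rle_encode_bits(bits)
    if len(runs) < len(ids):
        return WordRecord(word_id, position, len(ids), BIT_LIST, runs)
    return WordRecord(word_id, position, len(ids), ID_LIST, ids)


def ngram_ids_of(record):
    """Recover the sorted ngramids from either representation."""
    if record.marker == ID_LIST:
        return list(record.payload)
    bits = rle_decode_bits(record.payload)
    return [i for i, b in enumerate(bits) if b]
```

With this sketch, a rare word that appears in only a couple of ngrams stays as a short id list, while a very common word whose ngrams form long contiguous runs collapses to a handful of run lengths.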
I have used these techniques on everything from a TRS-80 to a 16-way 3090 (currently called, I think, X machines in IBM parlance) to great effect. They speed up, and in some cases make possible at all, searches over large corpuses such as a 4 million volume library searchable on every title, author, publisher, etc.

Received on Wed Sep 12 2007 - 01:37:57 CDT