Re: Searching Google n-gram corpus
Date: Wed, 12 Sep 2007 02:37:57 -0400
>>In actuality, this much better handled by a custom search engine >>designed along the same lines but with a lot of compression. If you are >>interested in the latter, I will be willing to explain further.
> thanks, that makes sense.. what kind of compression do you mean?
Well there are at least two different directions of compression: words in the n-grams and the word lists. since all of the words in the 2-grams through 5-grams are guaranteed to be in the 1-grams, consider sorting the 1-grams in frequency order and assigning each word a compression id; the first 127 get 1 byte ids x'01'-x'7f', the next 16328 get 2 byte ids x'8000'-x'bfff', and the remainder 3 byte ids from x'c00000'-x'dfffff' (that is 2 million words, much more than in english -- if foreign languages are included the 4 byte ids go from x'e0000000'-x'efffffff', an additional 256 million words, which should be enough). Now record the 2-5 grams with the corresponding 1-gram ids which should shorten them to an average of around two bytes per word. If that is not enough we could look at prefix trees to reduce the storage even further: every n-gram's leading (n-1)-gram is guaranteed to exist in the corpus, so the representation of an n-gram is (n-1)-gram id and the word id of the last word.
The word lists can be stored one record per word:
wordid as described above, position in ngram, number of ngrams, list of ngrams
The list of ngrams starts with a marker indicating whether the list of ngramids or a bit list with each bit position representing an ngram. If it a bit list, several types of RLE can be used to shorten it.
I have used these techniques on everything from a TRS-80 to a 16-way 3090, currently called (I think) X machines in IBM parlance, to great effect to speed up (and just make possible) searches over corpuses that would otherwise not even be possible (like a 4 million volume library, searchable on every title, author, publisher, etc.). Received on Wed Sep 12 2007 - 08:37:57 CEST