Re: Searching Google n-gram corpus

From: Bob Stearns <>
Date: Sat, 08 Sep 2007 14:59:56 -0400
Message-ID: <5jCEi.288$ZU7.16_at_newsfe04.lga>
> (also posted in sql group but got no replies, apologies if that's bad
> etiquette)
> Hi,
> Google released a corpus of n-grams collected from the Web.
> It contains all 1..5grams that occur more than 40 times in their web
> crawl. It comes as 5 folders, each folder containing around 120 files.
> Each file contains 10,000,000 (10^7) lines. A line looks like:
> "this is a four gram 65"
> where the last number is the frequency of that exact phrase.
> The total unzipped size of the 3 grams alone is 19GB, each individual
> file around 200MB.
> All the unzipped data is around 100GB.
> I would like to be able to search through all this and return all
> lines that contain a particular word or phrase.
> I have no idea where to start with this, but I was wondering would an
> SQL database be feasible. For the 5-grams I would need a billion rows
> of 6 columns. What sort of hard disk space would I need, and what
> kind of time would I be looking at per search on an ordinary machine?
> I would like to be able to find every line where a particular word
> occurs, no matter which position it occurs in, and ideally I would
> like to be able to find particular bigrams as well.
> thanks.
I believe that your approach (one wide table per n-gram order) is probably inappropriate for this data. If you want to remain with SQL, consider a design like:

create table ngrams (
    ngramid bigint not null generated always as identity,
    ngram   varchar(300),
    primary key (ngramid)
)

create table words (
    word    varchar(25) not null,
    pos     smallint    not null,
    ngramid bigint      not null,
    primary key (word, pos, ngramid)
)

Insert all the n-grams into the first table, then for each word in each n-gram insert a row into words recording the word, its position, and the id of the n-gram it came from.

Search words to find every n-gram containing a given word, no matter its position. For bigrams, self-join words on t1.word = 'firstword' and t2.word = 'secondword' and t1.pos = t2.pos - 1 and t1.ngramid = t2.ngramid.
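To make the design above concrete, here is a minimal sketch of the two-table layout and both searches, using Python's sqlite3 on an in-memory database. Column types differ from the DDL above because SQLite has no bigint identity (an INTEGER PRIMARY KEY auto-assigns ids), the insert_ngram helper and the sample lines are hypothetical, and the frequency count is discarded since the ngrams table above does not store it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table ngrams (ngramid integer primary key, ngram text)")
cur.execute("create table words (word text, pos integer, ngramid integer,"
            " primary key (word, pos, ngramid))")

def insert_ngram(line):
    """Split a corpus line like 'this is a four gram 65' into phrase and
    frequency, store the phrase in ngrams, and index each word with its
    position in words.  (Frequency is dropped, matching the schema above.)"""
    *tokens, _freq = line.split()
    cur.execute("insert into ngrams (ngram) values (?)", (" ".join(tokens),))
    ngramid = cur.lastrowid
    cur.executemany("insert into words (word, pos, ngramid) values (?, ?, ?)",
                    [(w, i, ngramid) for i, w in enumerate(tokens)])

for line in ["this is a four gram 65", "four score and seven years 41"]:
    insert_ngram(line)

# Every n-gram containing a given word, regardless of position:
rows = cur.execute(
    "select n.ngram from words w join ngrams n on n.ngramid = w.ngramid"
    " where w.word = ? order by n.ngramid", ("four",)).fetchall()

# Bigram search via self join: 'four' immediately followed by 'gram':
bigram = cur.execute(
    "select n.ngram from words t1"
    " join words t2 on t1.ngramid = t2.ngramid and t1.pos = t2.pos - 1"
    " join ngrams n on n.ngramid = t1.ngramid"
    " where t1.word = ? and t2.word = ?", ("four", "gram")).fetchall()
```

The composite primary key on (word, pos, ngramid) doubles as the search index, so the word lookup never touches the ngrams table until the final join.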

In actuality, this is much better handled by a custom search engine designed along the same lines but with a lot of compression. If you are interested in the latter, I will be willing to explain further.

Received on Sat Sep 08 2007 - 20:59:56 CEST