Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Newsgroup: comp.databases.theory

Searching Google n-gram corpus

From: <bobterwillinger_at_gmail.com>
Date: Sat, 08 Sep 2007 14:36:16 -0000
Message-ID: <1189262176.498116.293040@g4g2000hsf.googlegroups.com>


(Also posted in the SQL group but got no replies; apologies if that's bad etiquette.)

Hi,

Google released a corpus of n-grams collected from the Web.

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-...

It contains all 1- to 5-grams that occur more than 40 times in their web crawl. It comes as 5 folders, each containing around 120 files. Each file contains 10,000,000 (10^7) lines. A line looks like:

"this is a four gram 65"

where the last number is the frequency of that exact phrase. The total unzipped size of the 3-grams alone is 19GB, with each individual file around 200MB. All the unzipped data comes to around 100GB.
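To make the line format concrete, here is a minimal Python sketch of reading one line. It assumes the count is simply the last whitespace-separated field, as in the example line above; the function names `parse_ngram_line` and `lines_containing` are hypothetical, and the real files may use a different separator between the phrase and the count.

```python
def parse_ngram_line(line):
    """Split one corpus line into (tokens, count).

    Assumption: the frequency is the last whitespace-separated field
    and everything before it is the n-gram itself.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"not an n-gram line: {line!r}")
    *tokens, count = parts
    return tuple(tokens), int(count)


def lines_containing(lines, word):
    """Yield (tokens, count) for every line whose n-gram contains `word`.

    This is a plain linear scan, so it reads every line once.
    """
    for line in lines:
        tokens, count = parse_ngram_line(line)
        if word in tokens:
            yield tokens, count


if __name__ == "__main__":
    sample = ["this is a four gram 65"]
    for tokens, count in lines_containing(sample, "gram"):
        print(" ".join(tokens), count)
```

A scan like this needs no extra disk space, but it has to read the whole corpus for every query, which is why some kind of index is the more interesting option.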

I would like to be able to search through all this and return all lines that contain a particular word or phrase. I have no idea where to start with this, but I was wondering whether an SQL database would be feasible. For the 5-grams I would need about a billion rows of 6 columns each. What sort of hard disk space would I need, and what kind of time would I be looking at per search on an ordinary machine?
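As a rough way to frame the disk-space question, the row count follows from the figures above (about 120 files of 10^7 lines each), and a lower bound on table size follows from a chosen row layout. The sketch below assumes each word is stored as a 4-byte integer ID pointing into a separate dictionary table and the frequency as an 8-byte integer; those sizes are illustrative assumptions, not properties of the corpus or of any particular database.

```python
# Figures taken from the post above.
FILES_PER_FOLDER = 120        # "around 120 files" per folder
LINES_PER_FILE = 10_000_000   # 10^7 lines per file
WORDS_PER_ROW = 5             # one column per word of a 5-gram

# Assumptions (not from the post): each word is replaced by a 4-byte
# integer ID and the frequency is stored as an 8-byte integer.
WORD_ID_BYTES = 4
COUNT_BYTES = 8


def estimate_rows(files=FILES_PER_FOLDER, lines_per_file=LINES_PER_FILE):
    """Approximate number of rows for one n-gram order."""
    return files * lines_per_file


def estimate_raw_bytes(rows, words_per_row=WORDS_PER_ROW,
                       word_id_bytes=WORD_ID_BYTES, count_bytes=COUNT_BYTES):
    """Payload bytes only: no per-row overhead, no indexes, no dictionary."""
    return rows * (words_per_row * word_id_bytes + count_bytes)


rows = estimate_rows()
raw_bytes = estimate_raw_bytes(rows)

if __name__ == "__main__":
    print(f"{rows:,} rows")
    print(f"{raw_bytes / 1e9:.1f} GB of payload before indexes and overhead")
```

Under these assumptions the 5-gram table is about 1.2 billion rows. A real database adds per-row overhead, and an index on each word column adds more on top, so the actual footprint would be noticeably larger than the payload figure.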

I would like to be able to find every line where a particular word occurs, no matter which position it occurs in, and ideally I would like to be able to find particular bigrams as well.
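One way to picture the SQL option is a table with one column per word position and an index on each position. The sketch below uses Python's built-in `sqlite3` module with an in-memory database and a few made-up rows; the table name `fivegrams`, the column names `w1`..`w5`, and the helpers `find_word` and `find_bigram` are all hypothetical. It says nothing about how this would perform at a billion rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fivegrams "
    "(w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, w5 TEXT, freq INTEGER)"
)
# One index per word position, so a lookup on any single column can use it.
for i in range(1, 6):
    conn.execute(f"CREATE INDEX idx_w{i} ON fivegrams (w{i})")

# Made-up sample rows, for illustration only.
sample_rows = [
    ("this", "is", "a", "five", "gram", 65),
    ("a", "five", "gram", "looks", "like", 48),
    ("nothing", "to", "see", "here", "folks", 41),
]
conn.executemany("INSERT INTO fivegrams VALUES (?, ?, ?, ?, ?, ?)", sample_rows)


def find_word(conn, word):
    """Return every row in which `word` occurs at any position."""
    where = " OR ".join(f"w{i} = ?" for i in range(1, 6))
    sql = f"SELECT w1, w2, w3, w4, w5, freq FROM fivegrams WHERE {where}"
    return conn.execute(sql, (word,) * 5).fetchall()


def find_bigram(conn, first, second):
    """Return every row in which `first` is immediately followed by `second`."""
    where = " OR ".join(f"(w{i} = ? AND w{i + 1} = ?)" for i in range(1, 5))
    sql = f"SELECT w1, w2, w3, w4, w5, freq FROM fivegrams WHERE {where}"
    return conn.execute(sql, (first, second) * 4).fetchall()


if __name__ == "__main__":
    for row in find_word(conn, "gram"):
        print(row)
    for row in find_bigram(conn, "five", "gram"):
        print(row)
```

Whether the query planner actually uses the per-column indexes for the `OR` conditions is worth checking on real data, for example with SQLite's `EXPLAIN QUERY PLAN`.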

Thanks.

Received on Sat Sep 08 2007 - 09:36:16 CDT
