Word Frequencies

From: John Paulett <jmpcrew_at_hotmail.com>
Date: Tue, 06 Jun 2006 08:35:53 -0400
Message-ID: <e63sr9$5c8e$1@netnews.upenn.edu>

Hello,

I am trying to analyze word frequencies in articles (stored in CLOBs). There are several million articles, each of which has an average length of about 150 words. My goal is eventually to break the data down by the author and category of each article and see which unique words are used (I would need a count of the number of times each unique word appears).

I was initially thinking of just processing all of this in Java, but I was wondering whether there is some way to do it natively in Oracle / SQL, possibly using Oracle interMedia or Oracle Text. I have 10g on Windows. I started trying to use the index tables (e.g. DR$BOOKS_TITLE$K), but I do not want to ignore words like "the" and "a," and from what I could see the index tables would not give me the option of breaking the analysis down by author and category.
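
For comparison, the Java route could look roughly like the sketch below. It is only an illustration under assumptions: the class and method names (WordFrequencies, addArticle, countsFor) are hypothetical, the articles are assumed to have already been read out of the CLOBs as plain strings along with their author and category, and the tokenizing rule (lowercase, split on anything that is not a letter, digit or apostrophe) is one simple choice among many. Stop words such as "the" and "a" are deliberately kept.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; none of these names come from Oracle or from the post.
class WordFrequencies {
    // Key "author|category" maps to (word -> number of occurrences).
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Add one article's text to the running totals for its author and category.
    void addArticle(String author, String category, String text) {
        Map<String, Integer> bucket =
            counts.computeIfAbsent(author + "|" + category, k -> new TreeMap<>());
        // Lowercase, then split on runs of characters that are not
        // letters, digits or apostrophes. No stop words are removed.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+");
        for (String word : words) {
            if (!word.isEmpty()) {
                bucket.merge(word, 1, Integer::sum);
            }
        }
    }

    // Word counts for one author/category pair (empty if nothing was added).
    Map<String, Integer> countsFor(String author, String category) {
        return counts.getOrDefault(author + "|" + category, new TreeMap<>());
    }
}
```

The size of the map returned by countsFor would be the number of unique words for that author and category. With several million articles the rows would need to be streamed from the database rather than loaded all at once.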

Any help is greatly appreciated -- even just a point in the right direction (or if you think I should do this some other way, like Java or PL/SQL). Thanks,

John
jmpcrew_at_hotmail.com

Received on Tue Jun 06 2006 - 07:35:53 CDT