From: John Paulett <jmpcrew@hotmail.com>
Newsgroups: comp.databases.oracle.misc
Subject: Word Frequencies
Date: Tue, 06 Jun 2006 08:35:53 -0400
Organization: University of Pennsylvania

Hello,

I am trying to analyze word frequencies in articles (stored in CLOBs).
There are several million articles, with an average length of about 150
words each.  My goal is to eventually break the data down by the author
and category of the article to see how many unique words are used (I
would need a count of the number of times each unique word appears).
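
To make the goal concrete, here is a rough Java sketch of the counting I
have in mind.  The class and method names and the sample articles are made
up for illustration, and the tokenizing rule (lowercase, then split on
anything that is not a letter, digit, or apostrophe) is only an assumption.
The real job would read the CLOBs from the database rather than take
strings, and it needs a newer Java than I may have (lambdas).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordFrequencies {

    // Hypothetical grouping key: one bucket per (author, category) pair.
    static String groupKey(String author, String category) {
        return author + "|" + category;
    }

    // Adds the words of one article to the running counts for its group.
    // No stop words are dropped, so "the" and "a" are counted like any other word.
    static void addArticle(Map<String, Map<String, Integer>> counts,
                           String author, String category, String text) {
        Map<String, Integer> perWord =
            counts.computeIfAbsent(groupKey(author, category), k -> new TreeMap<>());
        // Assumed tokenizer: split on runs of characters that are not
        // letters, digits, or apostrophes.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                perWord.merge(token, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        // Made-up sample rows standing in for (author, category, clob text).
        addArticle(counts, "smith", "sports", "The cat sat on the mat.");
        addArticle(counts, "smith", "sports", "A cat is a cat.");
        addArticle(counts, "jones", "politics", "The vote.");
        counts.forEach((group, words) -> System.out.println(group + " -> " + words));
    }
}
```

The number of unique words for a group is then just the size of that
group's inner map, and the per-word counts are the map's values.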

I was initially thinking of just processing all of this using Java, but
I was wondering if there is some way to do it natively in Oracle / SQL,
possibly using Oracle interMedia or Oracle Text.  I have 10g on Windows.
I started trying to use the index tables (e.g. DR$BOOKS_TITLE$K), but I
do not want to ignore words like "the" and "a", and from what I could
see this would not give me the option of breaking the analysis down by
author and category.

Any help is greatly appreciated -- even just a point in the right
direction (or if you think I should do this some other way, like Java or
PL/SQL).

Thanks,

John
jmpcrew@hotmail.com
