From: "Mark D Powell" <Mark.Powell@eds.com>
Newsgroups: comp.databases.oracle.misc
Subject: Re: Word Frequencies
Date: 6 Jun 2006 09:34:36 -0700
Message-ID: <1149611676.337028.19840@i40g2000cwc.googlegroups.com>
References: <e63sr9$5c8e$1@netnews.upenn.edu>


John Paulett wrote:
> Hello,
>
> I am trying to analyze word frequencies in articles (stored in CLOBs).
> There are several million articles, each of which has an average length of
> about 150 words.  My goal is to eventually break the data down by the
> author and category of the article to see how many unique words are used
> (I would need a count of the number of times each unique word appears).
>
> I was initially thinking of just processing all of this using Java, but
> I was wondering if there is some way to natively do it in Oracle / SQL,
> possibly using Oracle Intermedia or Text.  I have 10g on Windows.  I
> started trying to use the index tables (e.g. DR$BOOKS_TITLE$K), but I do
> not want to ignore words like "the" and "a," and from what I could see
> this would not give me the option of breaking the analysis down by
> author and category.
>
> Any help is greatly appreciated -- even just a point in the right
> direction (or if you think I should do this some other way, like Java or
> PL/SQL).
>
> Thanks,
>
> John
> jmpcrew@hotmail.com

I am not very familiar with Oracle Text, but if I had to do something
like this I think I would parse each document into words and, for each
word, issue a MERGE statement against an IOT (index-organized table)
that holds one row per unique word together with its running count.
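
A rough, untested sketch of what I have in mind follows. The table
name, column names, and bind variables are all made up for
illustration, and I am assuming the author and category are available
as numeric ids at the time each word is parsed. Folding words to lower
case is also just my assumption; drop the LOWER() if case matters.

```sql
-- Hypothetical IOT: one row per (author, category, word) with a count.
-- Every name here is an assumption, not something from the original schema.
CREATE TABLE word_counts (
  author_id   NUMBER        NOT NULL,
  category_id NUMBER        NOT NULL,
  word        VARCHAR2(100) NOT NULL,
  cnt         NUMBER        NOT NULL,
  CONSTRAINT word_counts_pk
    PRIMARY KEY (author_id, category_id, word)
) ORGANIZATION INDEX;

-- Run once per parsed word; the binds are supplied by whatever
-- does the parsing (PL/SQL, Java, etc.).
MERGE INTO word_counts wc
USING (SELECT :author_id   AS author_id,
              :category_id AS category_id,
              LOWER(:word) AS word
         FROM dual) src
   ON (    wc.author_id   = src.author_id
       AND wc.category_id = src.category_id
       AND wc.word        = src.word)
 WHEN MATCHED THEN
   -- Word already seen for this author/category: bump the count.
   UPDATE SET wc.cnt = wc.cnt + 1
 WHEN NOT MATCHED THEN
   -- First occurrence: start the count at 1.
   INSERT (author_id, category_id, word, cnt)
   VALUES (src.author_id, src.category_id, src.word, 1);

-- Unique and total word counts per author and category.
SELECT author_id,
       category_id,
       COUNT(*) AS unique_words,
       SUM(cnt) AS total_words
  FROM word_counts
 GROUP BY author_id, category_id;
```

Because this goes through your own table rather than the Text index
tables, no stoplist is involved, so words like "the" and "a" are
counted like any other. With several million articles, one MERGE per
word is a lot of statements, so you would probably want to batch the
work and commit in chunks.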

HTH -- Mark D Powell --

