Re: Word Frequencies

From: Frank van Bortel <frank.van.bortel_at_gmail.com>
Date: 7 Jun 2006 02:10:24 -0700
Message-ID: <1149671424.470859.33620@y43g2000cwc.googlegroups.com>

John Paulett wrote:

> Hello,
>
> I am trying to analyze word frequencies in articles (stored in CLOBs).
> There are several million articles, each of which has an average length of
> about 150 words. My goal is to eventually break the data down by the
> author and category of the article to see how many unique words are used
> (I would need a count of the number of times each unique word appears).
>
> I was initially thinking of just processing all of this using Java, but
> I was wondering if there is some way to natively do it in Oracle / SQL,
> possibly using Oracle Intermedia or Text. I have 10g on Windows. I
> started trying to use the index tables (e.g. DR$BOOKS_TITLE$K), but I do
> not want to ignore words like "the" and "a," and from what I could see
> this would not give me the option of breaking the analysis down by
> author and category.
>
> Any help is greatly appreciated -- even just a point in the right
> direction (or if you think I should do this some other way, like Java or
> PL/SQL).
>

Mark pointed you in the right direction, and you already mentioned it yourself:
Oracle Text.
The default (US) stoplist contains 'a' and 'the', so these words are ignored.
You can query and extend the default stoplist, or create one of your own altogether.
The problem is that you would need a list of words first, before you can count 'em.
Don't really see how to approach that...
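
For what it's worth, if the counting is done outside the database, no word list is needed up front: the list falls out of tokenising the text itself. Below is a minimal sketch in Python, not an Oracle Text solution. The sample rows, the tokenising regex, and the names ARTICLES, tokenize and count_words are all assumptions for illustration; in practice the rows would be fetched from the table holding the CLOBs.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows: (author, category, article_text).
# In a real run these would come from the article table, not a literal list.
ARTICLES = [
    ("alice", "news", "The cat sat on the mat."),
    ("alice", "news", "A cat is a cat."),
    ("bob", "sport", "The match was the best of the season."),
]

# Assumed definition of a "word": a run of letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lower-case the text and split it into words; no stoplist is applied."""
    return WORD_RE.findall(text.lower())


def count_words(rows):
    """Return {(author, category): Counter mapping word -> occurrences}."""
    counts = defaultdict(Counter)
    for author, category, text in rows:
        counts[(author, category)].update(tokenize(text))
    return counts


counts = count_words(ARTICLES)

# Print each group with its number of unique words and the top three words.
for (author, category), words in sorted(counts.items()):
    print(author, category, len(words), words.most_common(3))
```

Since nothing is filtered, words like "the" and "a" are counted like any other, and the number of unique words per author and category is simply the size of each counter.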