Re: Indexing pdf/doc file contents AND other text data

From: www.douglassdavis.com <douglass_davis_at_earthlink.net>
Date: Thu, 16 Oct 2008 19:54:48 -0700 (PDT)
Message-ID: <4817e2c7-228b-488b-ad4d-cba4149c9fc0@l77g2000hse.googlegroups.com>


On Oct 16, 4:18 pm, "Tim Arnold" <timkarn..._at_comcast.net> wrote:
> "www.douglassdavis.com" <douglass_da..._at_earthlink.net> wrote in message
>
> news:c0e9bd0a-9ee5-43f6-bf6b-bcaa1aa59c03_at_u65g2000hsc.googlegroups.com...
>
> > Hello,
>
> > I am indexing the text from some pdfs and doc files uploaded by the
> > user.
>
> > This is easy enough to do with
>
> > CREATE INDEX file_index ON reports(file_path) INDEXTYPE IS
> > ctxsys.context
> > PARAMETERS ('datastore ctxsys.file_datastore format column file_path')
>
> > However, I would also like to store other text data related to the
> > file in the same index.  For example, a short description of the file,
> > the name of the person in charge of maintaining the document etc.
> > What is the easiest way to do this?
>
> > I am considering extracting the text, adding the other metadata,
> > putting it in a column, then using a CONTEXT index on the column.
>
> > I looked at CTX_DOC.POLICY_FILTER
>
> > Which it says "Generates a plain text or an HTML version of a
> > document."
>
> > The downside of this seems to be, I have to load the entire file in a
> > BLOB, then pass it to the procedure.  Also, I do not know what file
> > types it supports.
>
> > Also, I am considering the ctxhx executable.  The downside is this
> > outputs HTML to a file (I'm using 10g express on Windows and it
> > doesn't give a text output option),  so I would still have to take the
> > HTML, put it in a column, add the other metadata, and even after that,
> > I am not sure if it will try to index the HTML markup.
>
> > Any suggestions?
>
> > Thanks
>
> 'Reports' is the table name and file_path is the column you are indexing,
> right?

yes

> I'd just add some additional columns to 'Reports' and create as many
> additional indices as required.

I thought about that, but how would I combine the two? Would I just add the scores from the two CTXSYS.CONTAINS clauses and sort by that? How well would that work?

Received on Thu Oct 16 2008 - 21:54:48 CDT
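
For reference, a rough sketch of what "adding the scores" from two CONTAINS clauses could look like. The description column, the desc_index name, and the search term 'budget' are assumptions made up for illustration; only reports, file_path and file_index come from the post. Each CONTAINS takes a numeric label as its third argument, and SCORE(label) returns the relevance for that clause.

```sql
-- Assumption: reports has an extra VARCHAR2 column named description
-- (hypothetical, not in the original post) holding the metadata text.
CREATE INDEX desc_index ON reports(description)
  INDEXTYPE IS ctxsys.context;

-- Label each CONTAINS (1 and 2) so SCORE(label) can refer to it.
-- With OR, a row only needs to match one of the two indexes; the
-- clause that did not match should contribute a score of 0.
SELECT file_path,
       SCORE(1) AS file_score,
       SCORE(2) AS desc_score,
       SCORE(1) + SCORE(2) AS total_score
  FROM reports
 WHERE CONTAINS(file_path, 'budget', 1) > 0
    OR CONTAINS(description, 'budget', 2) > 0
 ORDER BY total_score DESC;
```

Each score is computed against its own index, so a short description and a long PDF are not weighted on the same basis. The sum is probably best treated as a rough ranking rather than a calibrated one, and it may be worth multiplying one side by a constant if metadata hits should count for more.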
