Indexing pdf/doc file contents AND other text data

From: www.douglassdavis.com <douglass_davis_at_earthlink.net>
Date: Thu, 16 Oct 2008 06:25:26 -0700 (PDT)
Message-ID: <c0e9bd0a-9ee5-43f6-bf6b-bcaa1aa59c03@u65g2000hsc.googlegroups.com>

Hello,

I am indexing the text from some PDF and DOC files uploaded by the user.

This is easy enough to do with:

CREATE INDEX file_index ON reports(file_path) INDEXTYPE IS ctxsys.context
PARAMETERS ('datastore ctxsys.file_datastore format column file_path');

However, I would also like to store other text data related to the file in the same index: for example, a short description of the file, the name of the person in charge of maintaining the document, and so on. What is the easiest way to do this?

I am considering extracting the text, adding the other metadata, putting it in a column, then using a CONTEXT index on the column.
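A minimal sketch of this approach, assuming hypothetical columns description, maintainer and combined_text on the reports table (none of these names come from the setup above), and assuming the extracted file text has already been written to combined_text:

```sql
-- Hypothetical column that will hold extracted file text plus metadata.
ALTER TABLE reports ADD (combined_text CLOB);

-- Prepend the metadata to the extracted text (assumed to be in
-- combined_text already). description and maintainer are assumed columns.
UPDATE reports
   SET combined_text = description || ' ' || maintainer || ' ' || combined_text;

-- With no datastore specified, the index reads the text directly
-- from the column.
CREATE INDEX reports_combined_idx ON reports(combined_text)
  INDEXTYPE IS ctxsys.context;
```

A query such as SELECT file_path FROM reports WHERE CONTAINS(combined_text, 'budget') > 0 would then match on either the file text or the metadata. The trade-off is that combined_text has to be rebuilt, and the index synchronized, whenever the file or its metadata changes.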

I looked at CTX_DOC.POLICY_FILTER, which the documentation says "Generates a plain text or an HTML version of a document."

The downside seems to be that I have to load the entire file into a BLOB and then pass it to the procedure. Also, I do not know which file types it supports.
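For reference, a rough sketch of what such a call might look like. It assumes a hypothetical directory object DOC_DIR pointing at the upload folder, a placeholder file name report.pdf, and a hypothetical policy name extract_policy. It also assumes the documented POLICY_FILTER signature, which lists BFILE among the accepted document types, and the predefined ctxsys.auto_filter preference (releases before 10.2 ship ctxsys.inso_filter instead). Treat it as untested:

```sql
-- One-time setup: a policy that uses the automatic document filter.
-- extract_policy is a hypothetical name.
BEGIN
  ctx_ddl.create_policy('extract_policy', 'ctxsys.auto_filter');
END;
/

DECLARE
  -- DOC_DIR and report.pdf are placeholders for a real directory
  -- object and an uploaded file.
  src BFILE := BFILENAME('DOC_DIR', 'report.pdf');
  txt CLOB;
BEGIN
  -- The fourth argument (plaintext) set to TRUE requests plain text
  -- output instead of HTML.
  ctx_doc.policy_filter('extract_policy', src, txt, TRUE);

  -- Show the first 200 characters of the extracted text.
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(txt, 200, 1));

  -- The result CLOB is temporary and should be released by the caller.
  DBMS_LOB.FREETEMPORARY(txt);
END;
/
```

If the BFILE form works in a given release, the file would not need to be copied into a BLOB first. The extracted text in txt could then be combined with the other metadata before being stored in an indexed column.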

Also, I am considering the ctxhx executable. The downside is this outputs HTML to a file (I'm using 10g express on Windows and it doesn't give a text output option), so I would still have to take the HTML, put it in a column, add the other metadata, and even after that, I am not sure if it will try to index the HTML markup.

Any suggestions?

Thanks

Received on Thu Oct 16 2008 - 08:25:26 CDT