Indexing pdf/doc file contents AND other text data
Date: Thu, 16 Oct 2008 06:25:26 -0700 (PDT)
I am indexing the text from some pdfs and doc files uploaded by the user.
This is easy enough to do with:
CREATE INDEX file_index ON reports(file_path) INDEXTYPE IS ctxsys.context
PARAMETERS ('datastore ctxsys.file_datastore format column file_path');
However, I would also like to store other text data related to the file in the same index, for example a short description of the file or the name of the person in charge of maintaining the document. What is the easiest way to do this?
I am considering extracting the text, adding the other metadata, putting it in a column, then using a CONTEXT index on the column.
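A minimal sketch of that approach, assuming the extracted text and the metadata have already been concatenated into a single CLOB. The column name combined_text, the index name reports_text_idx, and the search term are hypothetical; only the reports table and its file_path column come from the post.

```sql
-- Hypothetical column that holds extracted file text plus metadata.
ALTER TABLE reports ADD (combined_text CLOB);

-- Plain CONTEXT index on the stored text (default DIRECT_DATASTORE).
CREATE INDEX reports_text_idx ON reports(combined_text)
  INDEXTYPE IS ctxsys.context;

-- CONTEXT indexes are not maintained on commit by default,
-- so sync after loading or changing rows.
BEGIN
  ctx_ddl.sync_index('reports_text_idx');
END;
/

-- One query now matches file contents and metadata alike.
SELECT file_path
  FROM reports
 WHERE CONTAINS(combined_text, 'budget') > 0;
```

The trade-off in this design is that the extracted text is stored twice (once in the file, once in the column), and the column has to be refreshed whenever the file or its metadata changes.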
I looked at CTX_DOC.POLICY_FILTER, which according to the documentation "Generates a plain text or an HTML version of a document."
The downside seems to be that I have to load the entire file into a BLOB and then pass it to the procedure. Also, I do not know which file types it supports.
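For reference, a hedged sketch of how a POLICY_FILTER call might look. As far as I recall, the documented signature accepts a BFILE for the document argument as well as VARCHAR2, CLOB, and BLOB, so pointing it at a BFILE locator may avoid copying the file into a BLOB first; this should be checked against the CTX_DOC reference for the release in use. The policy name report_filter_policy, the directory object REPORT_DIR, and the file name example.pdf are assumptions for illustration. The filter preference is named AUTO_FILTER in 10g Release 2 and INSO_FILTER in earlier releases.

```sql
-- One-time setup: a policy that uses the automatic document filter.
BEGIN
  ctx_ddl.create_policy(
    policy_name => 'report_filter_policy',   -- hypothetical name
    filter      => 'ctxsys.auto_filter');
END;
/

DECLARE
  -- REPORT_DIR is an assumed directory object pointing at the upload folder.
  v_file BFILE := BFILENAME('REPORT_DIR', 'example.pdf');
  v_text CLOB;   -- left NULL so the procedure allocates a temporary CLOB
BEGIN
  ctx_doc.policy_filter(
    policy_name => 'report_filter_policy',
    document    => v_file,
    restab      => v_text,
    plaintext   => TRUE);                    -- plain text instead of HTML

  -- Show the first 200 characters of the filtered text.
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(v_text, 200, 1));

  -- The caller is responsible for releasing the temporary CLOB.
  DBMS_LOB.FREETEMPORARY(v_text);
END;
/
```

The set of file types handled depends on the filter shipped with the database; the Oracle Text Reference lists the supported document formats in an appendix.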
I am also considering the ctxhx executable. The downside is that it outputs HTML to a file (I'm using 10g Express on Windows, and it doesn't offer a text output option), so I would still have to take the HTML, put it in a column, and add the other metadata. Even then, I am not sure whether the index would try to index the HTML markup.
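On the markup question, a sketch under the assumption that the HTML ends up in a hypothetical CLOB column html_text: Oracle Text ships an HTML_SECTION_GROUP that, combined with NULL_FILTER, is the documented way to index HTML so that tags are treated as markup and not as searchable words. The index name below is also hypothetical.

```sql
-- html_text is an assumed CLOB column containing the ctxhx HTML output
-- with the metadata appended.
CREATE INDEX reports_html_idx ON reports(html_text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter ctxsys.null_filter
               section group ctxsys.html_section_group');
```

NULL_FILTER is used here because the column already holds text and needs no binary-format filtering.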
Thanks
Received on Thu Oct 16 2008 - 08:25:26 CDT