Re: Indexing pdf/doc file contents AND other text data

From: Tim Arnold <timkarnold_at_comcast.net>
Date: Thu, 16 Oct 2008 16:18:25 -0400
Message-ID: <T7idnYen4vQMPGrVnZ2dnUVZ_s7inZ2d@comcast.com>

"www.douglassdavis.com" <douglass_davis_at_earthlink.net> wrote in message news:c0e9bd0a-9ee5-43f6-bf6b-bcaa1aa59c03_at_u65g2000hsc.googlegroups.com...
>
>
> Hello,
>
>
> I am indexing the text from some pdfs and doc files uploaded by the
> user.
>
> This is easy enough to do with
>
> CREATE INDEX file_index ON reports(file_path) INDEXTYPE IS
> ctxsys.context
> PARAMETERS ('datastore ctxsys.file_datastore format column file_path')
>
>
> However, I would also like to store other text data related to the
> file in the same index. For example, a short description of the file,
> the name of the person in charge of maintaining the document etc.
> What is the easiest way to do this?
>
>
> I am considering extracting the text, adding the other metadata,
> putting it in a column, then using a CONTEXT index on the column.
>
> I looked at CTX_DOC.POLICY_FILTER, which is described as follows:
> "Generates a plain text or an HTML version of a document."
>
> The downside of this seems to be, I have to load the entire file in a
> BLOB, then pass it to the procedure. Also, I do not know what file
> types it supports.
>
> Also, I am considering the ctxhx executable. The downside is this
> outputs HTML to a file (I'm using 10g express on Windows and it
> doesn't give a text output option), so I would still have to take the
> HTML, put it in a column, add the other metadata, and even after that,
> I am not sure if it will try to index the HTML markup.
>
> Any suggestions?
>
> Thanks

'Reports' is the table name and file_path is the column you are indexing, right?

I'd just add some additional columns to 'Reports' and create as many additional indices as required.
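
A rough sketch of what I mean, assuming the index from your post already exists. The column names (description, maintainer) and index names here are made up for illustration, so adjust them to your schema:

```sql
-- Hypothetical metadata columns added alongside file_path.
ALTER TABLE reports ADD (
  description VARCHAR2(4000),
  maintainer  VARCHAR2(100)
);

-- A second CONTEXT index for free-text searches on the description.
CREATE INDEX reports_desc_idx ON reports(description)
  INDEXTYPE IS ctxsys.context;

-- An ordinary index is probably enough for exact matches on a name.
CREATE INDEX reports_maint_idx ON reports(maintainer);

-- Search the file contents and the description in one query.
-- The third argument to CONTAINS is a label that ties each
-- CONTAINS call to its own SCORE() call.
SELECT file_path, SCORE(1) AS file_score, SCORE(2) AS desc_score
  FROM reports
 WHERE CONTAINS(file_path, 'budget', 1) > 0
    OR CONTAINS(description, 'budget', 2) > 0;
```

The file contents and the metadata stay in separate indexes this way, and the query combines them, so you never have to extract the text from the files yourself.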

Tim

Received on Thu Oct 16 2008 - 15:18:25 CDT
