Re: Indexing pdf/doc file contents AND other text data

From: Vladimir M. Zakharychev <vladimir.zakharychev_at_gmail.com>
Date: Fri, 17 Oct 2008 01:19:25 -0700 (PDT)
Message-ID: <3e7db624-fa89-4143-87f1-190639d72272@8g2000hse.googlegroups.com>


On Oct 16, 5:25 pm, "www.douglassdavis.com" <douglass_da..._at_earthlink.net> wrote:
> Hello,
>
> I am indexing the text from some pdfs and doc files uploaded by the
> user.
>
> This is easy enough to do with
>
> CREATE INDEX file_index ON reports(file_path) INDEXTYPE IS
> ctxsys.context
> PARAMETERS ('datastore ctxsys.file_datastore format column file_path')
>
> However, I would also like to store other text data related to the
> file in the same index.  For example, a short description of the file,
> the name of the person in charge of maintaining the document etc.
> What is the easiest way to do this?
>
> I am considering extracting the text, adding the other metadata,
> putting it in a column, then using a CONTEXT index on the column.
>
> I looked at CTX_DOC.POLICY_FILTER
>
> Which it says "Generates a plain text or an HTML version of a
> document."
>
> The downside of this seems to be, I have to load the entire file in a
> BLOB, then pass it to the procedure.  Also, I do not know what file
> types it supports.
>
> Also, I am considering the ctxhx executable.  The downside is this
> outputs HTML to a file (I'm using 10g express on Windows and it
> doesn't give a text output option),  so I would still have to take the
> HTML, put it in a column, add the other metadata, and even after that,
> I am not sure if it will try to index the HTML markup.
>
> Any suggestions?
>
> Thanks

If all this information is stored in the same table, MULTI_COLUMN_DATASTORE may do for you. Oracle Text will concatenate all the specified columns into a synthetic document and index it automatically. It does not support joins, though (it can call functions, so you may be able to employ implicit joins, but that's not very efficient), so if you have several tables you want to merge into the index, you'll need to resort to USER_DATASTORE and write your own document synthesizer procedure. Look up these datastore types in the docs for the details; I recall posting an example of a USER_DATASTORE solution in this very group a few years ago.
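
To make the MULTI_COLUMN_DATASTORE suggestion a bit more concrete, here is a rough sketch, not a tested solution. The preference name reports_mcds, the index name reports_text_idx and the columns description, maintainer and file_text are all made up for illustration; file_text is assumed to be a CLOB that already holds the text extracted from the file.

```sql
-- Sketch only: the columns description, maintainer and file_text are
-- hypothetical. file_text is assumed to contain already-extracted text.
BEGIN
  ctx_ddl.create_preference('reports_mcds', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('reports_mcds', 'columns',
                        'description, maintainer, file_text');
END;
/

-- The index is created on one column, but the datastore preference
-- decides which columns are concatenated into the indexed document.
-- The auto section group makes the per-column tags that the datastore
-- adds be treated as markup rather than indexed as ordinary words.
CREATE INDEX reports_text_idx ON reports(description)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore reports_mcds
               section group ctxsys.auto_section_group');

-- Query against the column the index was created on; the search
-- covers the text of all columns listed in the preference.
SELECT file_path
  FROM reports
 WHERE CONTAINS(description, 'budget') > 0;
```

Keep in mind that the row is only queued for reindexing when the column the index is created on (description here) is updated, so a change to one of the other listed columns needs to touch that column as well.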

Also, how do you maintain the index? Since the content is external, Oracle can't track changes made to it outside the database. How do you deal with this? I mean, when someone changes the file, Oracle has no way of figuring out that it was changed, because it only stores a pointer to the file and the pointer didn't change. Any nontransactional change breaks the index, because it still holds [wrong] data for the previous version of the changed document.
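
One common workaround, sketched here under assumptions, is to update the indexed column to itself whenever you learn that a file has changed, which queues the row for reindexing, and then synchronize the index. The report_id key column and the :changed_id bind variable are hypothetical; file_index is the index name from your CREATE INDEX statement. Detecting the change on disk still has to happen outside the database.

```sql
-- Sketch only: report_id and :changed_id are hypothetical names for the
-- key of the row whose file was modified on disk.
-- Updating the indexed column, even to the same value, marks the row
-- as pending for the CONTEXT index.
UPDATE reports
   SET file_path = file_path
 WHERE report_id = :changed_id;
COMMIT;

-- Process the pending rows so the index re-reads the changed file.
BEGIN
  ctx_ddl.sync_index('file_index');
END;
/
```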

Hth,

   Vladimir M. Zakharychev
   N-Networks, makers of Dynamic PSP(tm)    http://www.dynamicpsp.com

Received on Fri Oct 17 2008 - 03:19:25 CDT