Re: Indexing pdf/doc file contents AND other text data
Date: Fri, 17 Oct 2008 22:07:03 -0400
"www.douglassdavis.com" <douglass_davis_at_earthlink.net> wrote in message
On Oct 16, 4:18 pm, "Tim Arnold" <timkarn..._at_comcast.net> wrote:
> "www.douglassdavis.com" <douglass_da..._at_earthlink.net> wrote in message
> > Hello,
> > I am indexing the text from some pdfs and doc files uploaded by the
> > user.
> > This is easy enough to do with
> > CREATE INDEX file_index ON reports(file_path) INDEXTYPE IS
> > ctxsys.context
> > PARAMETERS ('datastore ctxsys.file_datastore format column file_path')
> > However, I would also like to store other text data related to the
> > file in the same index. For example, a short description of the file,
> > the name of the person in charge of maintaining the document etc.
> > What is the easiest way to do this?
> > I am considering extracting the text, adding the other metadata,
> > putting it in a column, then using a CONTEXT index on the column.
> > I looked at CTX_DOC.POLICY_FILTER
> > Which it says "Generates a plain text or an HTML version of a
> > document."
> > The downside of this seems to be, I have to load the entire file in a
> > BLOB, then pass it to the procedure. Also, I do not know what file
> > types it supports.
> > Also, I am considering the ctxhx executable. The downside is this
> > outputs HTML to a file (I'm using 10g express on Windows and it
> > doesn't give a text output option), so I would still have to take the
> > HTML, put it in a column, add the other metadata, and even after that,
> > I am not sure if it will try to index the HTML markup.
> > Any suggestions?
> > Thanks
> 'Reports' is the table name and file_path is the column you are indexing.
> I'd just add some additional columns to 'Reports' and create as many
> additional indices as required.
I thought about that, but how would I combine the two? Would I just add the scores from the two CTXSYS.CONTAINS clauses and sort by that? How well would that work?
Well, how would you deal with the two scores in any event? If they are in separate tables, you'll have the problem of different primary keys and have to mess with a join, which I'd expect to be much more unwieldy.
Received on Fri Oct 17 2008 - 21:07:03 CDT
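For illustration, here is a minimal sketch of the approach discussed in this thread: extra columns on the same table, one CONTEXT index per column, and the two scores added together. The column name description, the index name desc_index, and the search term 'budget' are hypothetical and do not come from the thread. Each CONTAINS clause carries a numeric label so that SCORE can report its relevance separately.

```sql
-- Hypothetical metadata column added to the existing reports table.
ALTER TABLE reports ADD (description VARCHAR2(4000));

-- A second CONTEXT index, this one on the plain-text column
-- (the index name is illustrative).
CREATE INDEX desc_index ON reports(description)
  INDEXTYPE IS ctxsys.context;

-- Labels 1 and 2 tie each SCORE() to its own CONTAINS clause.
-- A row matches if either the file contents or the description hits.
SELECT r.file_path,
       SCORE(1)            AS file_score,
       SCORE(2)            AS desc_score,
       SCORE(1) + SCORE(2) AS total_score
FROM   reports r
WHERE  CONTAINS(r.file_path, 'budget', 1) > 0
   OR  CONTAINS(r.description, 'budget', 2) > 0
ORDER  BY total_score DESC;
```

Because everything stays in one table, no join is needed. A plain sum weights both sources equally; whether that ranks results well depends on the data, so a weighted sum may be worth trying. Further metadata columns would follow the same pattern with their own index and label.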