Re: Complex CONTEXT index

From: Nigel Thomas <nigel.cl.thomas_at_googlemail.com>
Date: Fri, 23 Jan 2009 16:13:32 +0000
Message-ID: <53258cd50901230813u1b53dc1ey83860e640dcb95b8_at_mail.gmail.com>

Bill

Don't you need to translate the BLOB content into indexable text before you index it? A simple transliteration of hex values is no help; you need something that would convert the enclosed encoded Word or PDF into real words.

PDF to text - there are some solutions out there (e.g. PDFBox <http://www.pdfbox.org/> - an OSS Java toolkit; found by Google, no idea if it really works).
Word to text - you could try e.g. Apache POI <http://poi.apache.org/> (same reservations, and it looks like old Word formats may be poorly served). Obviously this will be much easier once you get to the Office Open XML file formats - you can just take the XML and dump the text without markup into your CLOB.
In both cases, you'd build a BLOB-to-CLOB converter using a Java stored proc.
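
For the Office Open XML case, the text-extraction half of such a converter might look roughly like the sketch below. It is not a tested stored procedure. It assumes the document is a .docx (a ZIP package whose main body lives in word/document.xml) and simply collects the contents of the w:t text runs, ending each w:p paragraph with a newline. The class and method names (DocxText, extract) are made up for illustration, and it uses only JDK classes. The JVM embedded in older database releases supports an older Java language level, so the syntax may need adapting there.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls the plain text out of a .docx byte stream.
final class DocxText {

    // Scans the ZIP package for word/document.xml and returns its text;
    // returns an empty string if that entry is not present.
    static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(zip);
                }
            }
        }
        return "";
    }

    // Parses the XML and keeps only character data found inside <w:t> elements.
    private static String textOf(InputStream xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Refuse DOCTYPE declarations, since the documents are untrusted input.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("w:t".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("w:t".equals(qName)) {
                    inText = false;
                } else if ("w:p".equals(qName)) {
                    out.append('\n'); // one line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }
}
```

In a stored proc you would presumably feed this from the BLOB's binary stream (java.sql.Blob.getBinaryStream()) and write the returned string into the CLOB. PDF and the old binary Word formats would still need a real parser such as the toolkits mentioned above.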

Once you have indexed the text representation, you can of course discard it (or save some/all of it for preview purposes...)

Regards Nigel

2009/1/23 Bill Zakrzewski <bill_at_intactus.com>

> Listers -
> Oracle 10.2.0.4.0
> RH Linux
>
> I have a table (see below) that I would like to create a Context/Intermedia
> index on the title, short_desc, long_desc and the document (BLOB column). I
> have created a similar index on a different table that contained a CLOB by
> concatenating all of the fields into a single CLOB and creating the CONTEXT
> index using the pl/sql package/procedure (see below). I would like to do
> the same thing using the BLOB column, but not sure what values to use in the
> parameters for the DBMS_LOB.CONVERTTOCLOB procedure, specifically the
> BLOB_CSID and LANG_CONTEXT. My concern is the defaults will cause it to
> copy the data in binary format and not convert correctly, as the document
> may be a PDF or WORD Document or Excel Spreadsheet, etc. Thanks in advance
> for your help.
>
>


Received on Fri Jan 23 2009 - 10:13:32 CST
