Re: Complex CONTEXT index
Date: Fri, 23 Jan 2009 16:13:32 +0000
Don't you need to translate the BLOB content into indexable text before you index it? A simple transliteration of hex values is no help; you need something that would convert the enclosed encoded Word or PDF into real words.
- PDF to text - there are some solutions out there (e.g. PDFBox <http://www.pdfbox.org/>, an OSS Java toolkit; found via Google, no idea if it really works).
- Word to text - you could try e.g. Apache POI <http://poi.apache.org/> (same reservations, and it looks like old Word formats may be poorly served). Obviously this will be much easier once you get to Office Open XML file formats: you can just take the XML and dump the text without markup into your CLOB.
- In both cases, you'd build a BLOB-to-CLOB converter using a Java stored proc.
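To make the Office Open XML case concrete, here is a rough sketch of the "take the XML and dump the text without markup" step, using only the standard Java library. A .docx file is a ZIP package whose main body lives in the entry word/document.xml. The class and method names (DocxText, extract, stripMarkup) are made up for illustration, the regex-based tag stripping is deliberately crude, and headers, footnotes and the like are ignored. It is written for a current JDK; the JVM inside a 10.2 database is older, so a real stored proc would probably need adapting, plus the Blob/Clob plumbing, which is left out here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: pulls plain text out of the bytes of a .docx file.
final class DocxText {

    // Returns the text of word/document.xml, or "" if that entry is absent.
    public static String extract(byte[] docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    String xml = new String(readAll(zip), StandardCharsets.UTF_8);
                    return stripMarkup(xml);
                }
            }
        }
        return "";
    }

    // Crude markup removal: paragraph ends become newlines, every other tag
    // is dropped, and the five predefined XML entities are decoded.
    static String stripMarkup(String xml) {
        return xml.replace("</w:p>", "\n")
                  .replaceAll("<[^>]+>", "")
                  .replace("&lt;", "<")
                  .replace("&gt;", ">")
                  .replace("&quot;", "\"")
                  .replace("&apos;", "'")
                  .replace("&amp;", "&")   // last, so "&amp;lt;" stays "&lt;"
                  .trim();
    }

    // Reads the current ZIP entry to its end.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

In the stored proc you would read the BLOB into a byte array, call DocxText.extract on it, and write the returned string into the CLOB that gets concatenated with title, short_desc and long_desc before indexing.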
Once you have indexed the text representation, you can of course discard it (or save some/all of it for preview purposes...)
2009/1/23 Bill Zakrzewski <bill_at_intactus.com>
> Listers -
> Oracle 10.2.0.4.0
> RH Linux
> I have a table (see below) that I would like to create a Context/Intermedia
> index on the title, short_desc, long_desc and the document (BLOB column). I
> have created a similar index on a different table that contained a CLOB by
> concatenating all of the fields into a single CLOB and creating the CONTEXT
> index using the pl/sql package/procedure (see below). I would like to do
> the same thing using the BLOB column, but not sure what values to use in the
> parameters for the DBMS_LOB.CONVERTTOCLOB procedure, specifically the
> BLOB_CSID and LANG_CONTEXT. My concern is the defaults will cause it to
> copy the data in binary format and not convert correctly, as the document
> may be a PDF or WORD Document or Excel Spreadsheet, etc. Thanks in advance
> for your help.