Re: Oracle Text: Indexing UTF8 or UTF16
Server Applications wrote:
> Hello
>
> I am trying to build a system where I can full-text index documents with
> UTF8 or UTF16 data using Oracle Text. I am doing the filtering in a
> third-party component outside the database, so I don't need filtering in
> Oracle, but only indexing.
> If I put file references to the filtered files in the database and index
> these (using FILE_DATASTORE), everything works fine. But I would rather put the
> filtered data in the database and index it from there (using the
> PROCEDURE_FILTER). But this gives me some problems when the data is actually
> Unicode data.
> The interface for the procedure in the PROCEDURE_FILTER does not allow the
> data to be output as NCLOB or NVARCHAR, but only CLOB or VARCHAR. Indexing
> the data directly in the table (using e.g. a NULL_FILTER or CHARSET_FILTER)
> has the same effect. If I try to index a column of the type NCLOB or
> NVARCHAR, the index-creation gives me an error telling me that it is an
> invalid column-type.
>
> I have tried to create a database with the UTF8 character set, expecting
> that the CLOB column type then could contain the UTF8 data, and that the
> indexing would then recognize the Unicode characters in the data. This does
> not give any errors, but none of the Unicode strings in the data are
> contained in the index; only the strings in English (or ASCII, i.e. strings
> whose characters all fit within 1 byte) are contained in the index afterwards.
>
> Is it not possible to index data directly in a column (using either
> CHARSET_FILTER, NULL_FILTER or PROCEDURE_FILTER) that is in UTF8 or UTF16
> format?
>
>
> Thanks in advance for any comments.
>
> /David
>
>
What language did you install for Oracle Text?
The default is (US) English. You probably want to install
multiple languages.
If I understand your post correctly, the data loaded is *not* UTF, so this is actually not about Text but about character sets (UTF versus a fixed-width 8-bit character set).
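
One way to check what was actually loaded is to look at the raw bytes of a filtered document before it goes into the database. The following is only a rough sketch in Python, not anything Oracle provides: the function name classify_encoding and the sample word are made up for illustration, and the UTF-16 check assumes the data starts with a byte order mark.

```python
import codecs


def classify_encoding(data: bytes) -> str:
    """Roughly classify raw bytes as 'utf-16', 'ascii', 'utf-8' or 'other'."""
    # Assumption: UTF-16 data starts with a byte order mark (BOM).
    # UTF-16 without a BOM is not detected by this sketch.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; possibly a single-byte character set.
        return "other"
    # Pure 7-bit data is valid UTF-8 too, so report it separately.
    return "ascii" if text.isascii() else "utf-8"


# Hypothetical sample: the same word in three encodings.
word = "søgning"
results = {
    "utf-8": classify_encoding(word.encode("utf-8")),
    "latin-1": classify_encoding(word.encode("latin-1")),
    "utf-16": classify_encoding(word.encode("utf-16")),
    "plain": classify_encoding(b"search"),
}
print(results)
```

If the filtered files come back as "other", the bytes are not UTF-8 at all, and that would point at the filter output or the load step rather than at the index.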
Please post versions.
-- Regards, Frank van Bortel
Received on Thu May 19 2005 - 05:49:42 CDT