inso_filter with cyrillic characters [message #161699] Mon, 06 March 2006 15:50
user493084
Messages: 7
Registered: March 2006
Junior Member
I'm trying to index a column containing a mix of Word documents and text documents (in a combination of KOI8-R, ISO-8859-5 and UTF-8) stored in a BLOB column via Oracle Text.
(The database has a native charset of AL32UTF8.)
The table looks like this:

BLOB_TEST (
ID NUMBER,
DATA BLOB,
FMT VARCHAR2,
CSET VARCHAR2,
LANG VARCHAR2(10)
)

I set up my index to use inso_filter with the format and charset options.

CREATE INDEX blob_ling ON blob_test(data)
indextype IS ctxsys.CONTEXT
PARAMETERS('datastore ctxsys.direct_datastore
lexer ctxsys.world_lexer
filter ctxsys.inso_filter
stoplist ctxsys.default_stoplist
language column lang
format column fmt
charset column cset' );

However, when I try to run full-text queries using non-ASCII (in this case Cyrillic) search terms, only the Word documents ever have hits.
I.e., given a Word document and a text document containing the exact same Cyrillic string, a CONTAINS query for that exact string returns only the Word document.

What could be causing this behavior, and how do I actually enable character set filtering?
Re: inso_filter with cyrillic characters [message #164551 is a reply to message #161699] Thu, 23 March 2006 23:38
rhardman
Messages: 25
Registered: April 2005
Junior Member
Hi,

I noticed that you're using the world_lexer and inso_filter. The most recent versions of Oracle (your use of world_lexer would indicate a 10g implementation) use auto_filter (in fact the Verity KeyView filter, but specified as auto_filter in the index creation). This is one problem.

If changing this doesn't resolve the issue for you, check out the tokens that are actually generated (make sure that your client can display the proper glyphs; consider using iSQL*Plus with browser encoding set to UTF-8). If the tokens don't exist as you expect, then retrieve the doc and check the contents, since I'd suspect that the contents were corrupted in some way.

Hope it helps,

Ron
Re: inso_filter with cyrillic characters [message #164647 is a reply to message #164551] Fri, 24 March 2006 08:58
user493084
The database I'm using is a 10g Release 1 database and, if I'm reading the Oracle documentation right, auto_filter was introduced in Release 2. If I try to change from inso_filter to auto_filter, Oracle returns an error indicating that it doesn't know what auto_filter is.

As for the tokens, I'm looking at the table dr$blob_ling$i, and the tokens are all in the original charsets (i.e. a mixture of KOI8-R, ISO-8859-5, Win1251, and UTF-8), which indicates once again that inso_filter is not transcoding the other character sets to unicode.
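
To illustrate why untranscoded tokens cannot match a UTF-8 search term, here is a small sketch in plain Python (it runs outside Oracle and says nothing about how inso_filter works internally). The sample word and the variable names are assumptions made for the illustration, not taken from the thread. It encodes one Cyrillic word in the four charsets mentioned above and checks that the byte sequences all differ.

```python
# Illustration only: plain Python, not Oracle Text.
# The sample word is hypothetical; any Cyrillic word behaves the same way.
word = "привет"

# Python codec names for KOI8-R, ISO-8859-5, Win1251 and UTF-8.
charsets = ["koi8-r", "iso-8859-5", "cp1251", "utf-8"]

# Encode the same word once per charset.
encoded = {cs: word.encode(cs) for cs in charsets}

for cs, raw in encoded.items():
    print(f"{cs:12} {raw.hex()}")

# Every charset yields a different byte sequence for the same word...
distinct = len(set(encoded.values())) == len(charsets)

# ...so a byte-wise comparison against the UTF-8 form matches only UTF-8.
matches_utf8 = [cs for cs, raw in encoded.items() if raw == encoded["utf-8"]]
```

If the stored tokens keep their original bytes, a term typed into a UTF-8 client can only line up with the documents that were UTF-8 (or already converted) to begin with.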

For kicks, I changed inso_filter to null_filter, and for the plain-text documents I got exactly the same tokens as inso_filter returned.

Also, if I try to do a CTX_DOC.FILTER on the data, all the non-ASCII characters come back as unknowns.

If I change the lexer to a basic_lexer, I get the same behavior.
Re: inso_filter with cyrillic characters [message #164651 is a reply to message #164647] Fri, 24 March 2006 09:22
rhardman
"which indicates once again that inso_filter is not transcoding the other character sets to unicode."

That isn't what the filter is or does. The filter is for extracting text from docs before the lexer breaks it up into tokens. It has nothing to do with character sets. Since you were describing the problem as a filter issue, I assumed that the docs needed filtering. Plain text doesn't.

Since the tokens are correct, it is simply a query issue. What are you using for your client?

[Updated on: Fri, 24 March 2006 09:24]


Re: inso_filter with cyrillic characters [message #164658 is a reply to message #164651] Fri, 24 March 2006 09:48
user493084
All I'm trying to do is replicate the behavior described in section 2.3.2.3, "Character Set Conversion With Inso", of the Oracle Text Reference documentation, accessible at: http://download-west.oracle.com/docs/cd/B14117_01/text.101/b10730/cdatadic.htm#sthref411

It says: "The INSO_FILTER converts documents to the database character set when the document format column is set to TEXT. In this case, the INSO_FILTER looks at the charset column to determine the document character set."

You say that this isn't supposed to be the function of the filter; I say that it is explicitly what the filter is supposed to do. It is supposed to transform the documents (whether they be Word documents or plain text in a non-Unicode character set) and extract Unicode text from them, which is then tokenized, lexed, etc.

I can foresee a workaround for this problem: to do a full-text query for a term, I would have to run a full-text query for each charset representation of the search term. Not an elegant or remotely efficient solution.

Re: inso_filter with cyrillic characters [message #164660 is a reply to message #164651] Fri, 24 March 2006 09:55
user493084
In addition, the Text Reference says: "If you do specify the charset column and do not specify the format column, the INSO_FILTER works like the CHARSET_FILTER, except that in this case there is no Japanese character set auto-detection."

As you can see from my original post, I do supply a charset column. If I change from inso_filter to charset_filter, the Unicode conversion works correctly, i.e. all tokens are in Unicode; however, it no longer indexes the non-plain-text documents.
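
The per-row conversion that the documentation describes can be pictured with a rough Python model. This is only a sketch of the idea (decode each document with the charset named in its own column, then re-encode in the database charset), not Oracle's implementation; the sample rows, the `to_utf8` helper, and the use of Python codec names instead of Oracle charset names are all assumptions for illustration.

```python
# Illustration only: a plain-Python model of charset-driven conversion.
# Rows mimic (id, data, cset); contents and codec names are hypothetical.
rows = [
    (1, "привет".encode("koi8-r"), "koi8-r"),
    (2, "привет".encode("iso-8859-5"), "iso-8859-5"),
    (3, "привет".encode("utf-8"), "utf-8"),
]


def to_utf8(data: bytes, cset: str) -> bytes:
    """Decode with the row's declared charset, then re-encode as UTF-8."""
    return data.decode(cset).encode("utf-8")


# Normalize every row to the target charset (UTF-8 here, standing in for AL32UTF8).
normalized = {row_id: to_utf8(data, cset) for row_id, data, cset in rows}

# A UTF-8 search term now matches all three rows byte for byte.
query = "привет".encode("utf-8")
hits = [row_id for row_id, raw in normalized.items() if raw == query]
```

Once every document is normalized this way, a single query term is enough, which is what makes the one-query-per-charset workaround unnecessary.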
Re: inso_filter with cyrillic characters [message #164668 is a reply to message #164658] Fri, 24 March 2006 10:24
rhardman
AHHHH!

I finally understand your question!!

From your original post and the subsequent one, I understood that you were fine with the tokens and that the search was just not returning what you wanted, and that somehow the filter was supposed to interact with the query to do some interactive filtering for character sets. Hence my confusion about your purpose in focusing on the filter.

So, to back up a bit...

The beginning of that section says:
"This filter automatically bypasses plain-text, HTML, and XML documents."

As a test, toss in a format column and specify binary for all of them. Include that format column as a parameter and see if it will in fact run it through the filter.

If that doesn't work, do check out the auto_filter. It is in 9.2.0.7, 10.1.0.4, and 10.2. You said that you're on 10.1, but evidently not patched up high enough to get the new filter, since you're getting an error.

-Ron

[Updated on: Fri, 24 March 2006 10:25]


Re: inso_filter with cyrillic characters [message #164676 is a reply to message #164668] Fri, 24 March 2006 11:05
user493084
You may be on to something. No matter what I put in the format column, whether it be 'IGNORE', 'TEXT', 'BINARY', or a random word, the index seems to ignore the column and returns the same results in each case.

Re: inso_filter with cyrillic characters [message #164678 is a reply to message #164676] Fri, 24 March 2006 11:34
rhardman
If you keep the charset column in the parameters but drop the fmt column from index creation (leaving inso_filter in there), what happens? According to the Oracle docs and your earlier post, it should behave like the charset filter.

I'm wondering if the result will be the same as what you noted earlier when explicitly changing the filter (only plain text is indexed) or if it will handle all doc types.

Just a thought...

[Updated on: Fri, 24 March 2006 11:37]

