MULTI_STOPLIST using WORLD_LEXER [message #391152] Wed, 11 March 2009 04:15
leon_buijsman
Messages: 13
Registered: March 2009
Location: Rotterdam
Junior Member

We are indexing documents in different languages and will be using WORLD_LEXER. WORLD_LEXER identifies the language of a document automatically.

This is all OK, but what should we do about the DEFAULT_STOPLIST? By default it covers one language (English in our case). There is an option called MULTI_STOPLIST. According to the manual, it requires that you know the language of a document up front (before indexing), and its behavior during queries is not clear to me: "At query time, the session language setting determines the active stopwords, like it determines the active lexer when using the multi-lexer."

Nothing is stated about the use of MULTI_STOPLIST in combination with WORLD_LEXER.

My questions:
a) Does the MULTI_STOPLIST work together with WORLD_LEXER, and does it use the language the WORLD_LEXER has determined?

If not, is there any alternative apart from building a stoplist ourselves with stopwords from different languages in one list?

b) What happens at query time when you are using a web application? Will it default to the session language, which is English? Or is there some way we can influence that?
Re: MULTI_STOPLIST using WORLD_LEXER [message #391165 is a reply to message #391152] Wed, 11 March 2009 05:23
Barbara Boehmer
Messages: 7984
Registered: November 2002
Location: California, USA
Senior Member
Quote:

a) Does the MULTI_STOPLIST work together with WORLD_LEXER and does it use the language the WORLD_LEXER has determined?



No, the multi_stoplist requires a language column.

Quote:

If not, is there any alternative apart from building a stoplist ourselves with stopwords from different languages in one list?



You can use a multi_lexer instead of world_lexer.

Quote:

b) What happens at query time when you are using a web application? Will it default to the session language, which is English? Or is there some way we can influence that?



It defaults to the session language. If you are using a multi_lexer, you can specify the language using a query template, but the world_lexer is only affected by the session language.
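
As a rough sketch of the multi_lexer alternative and of a query template that names the language: the table docs_tab, the index docs_idx, and the preference names english_lex, german_lex, and multi_lex below are hypothetical, and using BASIC_LEXER for both sub-lexers is only an assumption for illustration. The multi_lexer needs a language column, just as the multi_stoplist does.

```sql
-- Hypothetical table with a language column (names are illustrative only).
CREATE TABLE docs_tab
  (id_col    NUMBER,
   data_col  VARCHAR2 (30),
   lang_col  VARCHAR2 (10));

BEGIN
  -- One sub-lexer per language; BASIC_LEXER is assumed here for both.
  CTX_DDL.CREATE_PREFERENCE ('english_lex', 'BASIC_LEXER');
  CTX_DDL.CREATE_PREFERENCE ('german_lex',  'BASIC_LEXER');
  -- The multi_lexer picks a sub-lexer per row from the language column.
  CTX_DDL.CREATE_PREFERENCE ('multi_lex', 'MULTI_LEXER');
  -- A DEFAULT sub-lexer handles rows whose language has no sub-lexer of its own.
  CTX_DDL.ADD_SUB_LEXER ('multi_lex', 'DEFAULT', 'english_lex');
  CTX_DDL.ADD_SUB_LEXER ('multi_lex', 'GERMAN',  'german_lex');
END;
/

CREATE INDEX docs_idx ON docs_tab (data_col)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS
  ('LEXER            multi_lex
    LANGUAGE COLUMN  lang_col');

-- Query template: the lang attribute names the query language,
-- instead of relying on the session's NLS_LANGUAGE.
SELECT * FROM docs_tab
WHERE  CONTAINS (data_col,
         '<query>
            <textquery lang="GERMAN" grammar="CONTEXT">Katze</textquery>
          </query>') > 0;
```

In a web application this would let each request pass the user's language in the template rather than issuing ALTER SESSION first.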


Re: MULTI_STOPLIST using WORLD_LEXER [message #391246 is a reply to message #391165] Wed, 11 March 2009 09:38
leon_buijsman

Hi Barbara,

Thanks for answering!

Is there any smart way to discover the language of a document without asking the person who uploads it?

If world_lexer can do it, maybe some other procedure or product feature could do that as well. Although I think I know the answer to that :(
Re: MULTI_STOPLIST using WORLD_LEXER [message #391337 is a reply to message #391246] Wed, 11 March 2009 18:10
Barbara Boehmer
If you search the internet, you can find some products and ideas that you might be able to use such as:

http://www.lextek.com/langid/li/

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

However, I would be inclined to use the world_lexer to get the benefit of automatic language detection, but also provide a language column for the multi_stoplist, and request, but not require, that the uploader supply the language of the document. With the multi_stoplist, you can specify words, such as "a", that you want to be stopwords in all languages, including for documents with no language specified. You can then specify other words for individual languages only.

These language-specific stopwords apply in two cases: to documents whose language column matches the stopword language, regardless of the session language, and to all documents, whether or not their language is identified, when the session language matches the stopword language. For example, the German word "die" is the counterpart of the English word "the", so you would want it to be a German stopword, but not an English stopword. So, you get more accurate results when the document language and/or the session language is specified, and results that include some stopwords when they are not. Please see the demonstration below that illustrates this.

SCOTT@orcl_11g> CREATE TABLE test_tab
  2    (id_col    NUMBER,
  3     data_col  VARCHAR2 (30),
  4     lang_col  VARCHAR2 (10))
  5  /

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO test_tab VALUES (1, 'Live and Let Die',  'english')
  3  INTO test_tab VALUES (2, 'Die Katze im Hut',  'german')
  4  INTO test_tab VALUES (3, 'Live and Let Die',  NULL)
  5  INTO test_tab VALUES (4, 'Die Katze im Hut',  NULL)
  6  INTO test_tab VALUES (5, 'a',  NULL)
  7  INTO test_tab VALUES (6, 'a',  'english')
  8  INTO test_tab VALUES (7, 'a',  'german')
  9  SELECT * FROM DUAL
 10  /

7 rows created.

SCOTT@orcl_11g> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_lex', 'WORLD_LEXER');
  3    CTX_DDL.CREATE_STOPLIST ('test_stop', 'MULTI_STOPLIST');
  4    CTX_DDL.ADD_STOPWORD ('test_stop', 'Die','german');
  5    CTX_DDL.ADD_STOPWORD ('test_stop', 'a','all');
  6  END;
  7  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> CREATE INDEX test_idx ON test_tab (data_col)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  PARAMETERS
  4    ('LEXER            test_lex
  5      STOPLIST         test_stop
  6      LANGUAGE COLUMN  lang_col')
  7  /

Index created.

SCOTT@orcl_11g> SELECT token_text FROM dr$test_idx$i
  2  /

TOKEN_TEXT
----------------------------------------------------------------
AND
DIE
HUT
IM
KATZE
LET
LIVE

7 rows selected.

SCOTT@orcl_11g> ALTER SESSION SET NLS_LANGUAGE = 'ENGLISH'
  2  /

Session altered.

SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'die') > 0
  2  /

    ID_COL DATA_COL                       LANG_COL
---------- ------------------------------ ----------
         1 Live and Let Die               english
         3 Live and Let Die
         4 Die Katze im Hut

SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'a') > 0
  2  /

no rows selected

SCOTT@orcl_11g> ALTER SESSION SET NLS_LANGUAGE = 'GERMAN'
  2  /

Session altered.

SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'die') > 0
  2  /

no rows selected

SCOTT@orcl_11g> SELECT * FROM test_tab WHERE CONTAINS (data_col, 'a') > 0
  2  /

no rows selected

SCOTT@orcl_11g>