Home » Server Options » Text & interMedia » Multi Language in Oracle Text (Oracle10.2)
Multi Language in Oracle Text [message #339472] Thu, 07 August 2008 10:06 Go to next message
McLaud
Messages: 15
Registered: August 2008
Junior Member
Hello All,
I have a problem and looking for the ways, how to solve it.
Basically I have a table

create table Docs (
Id number,
Text_data varchar2(200)
);

The Text_data column may contains both English and French data.
I am going to build Oracle Text on my Docs table with requirement: If user enters English word (for example ‘home’) the search result set must contains all rows, contains ‘home’ together with all French rows, contains equivalent of ‘home’ – ‘maison’
How to manage this in Oracle Text?
Re: Multi Language in Oracle Text [message #339492 is a reply to message #339472] Thu, 07 August 2008 11:24 Go to previous messageGo to next message
Barbara Boehmer
Messages: 7990
Registered: November 2002
Location: California, USA
Senior Member
You will need to create a thesaurus with english and french words. You can either create your own or purchase one from somewhere in text format. The words can be loaded individually, or you can loop through a table of words to load them or you can use ctxload to load a text file containing them in a specified format. Once you have the thesaurus, you can then use ctx_thes.trsyn to retrieve translations of words or phrases that you have entered and trsyn for searches. I have written a function that returns combinations of translations from a thesaurus, so that you can search for strings of words easily. Please see the demonstration below.


SCOTT@orcl_11g> -- table and data and index for demo:
SCOTT@orcl_11g> create table Docs (
  2  Id number,
  3  Text_data varchar2(200)
  4  );

Table created.

SCOTT@orcl_11g> INSERT ALL
  2  INTO docs VALUES (1, 'house')
  3  INTO docs VALUES (2, 'maison')
  4  INTO docs VALUES (3, 'The Cat in the Hat')
  5  INTO docs VALUES (4, 'Le Chat dans le Chapeau')
  6  SELECT * FROM DUAL
  7  /

4 rows created.

SCOTT@orcl_11g> CREATE INDEX docs_idx ON docs (text_data)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  /

Index created.

SCOTT@orcl_11g> -- create thesaurus and enter a few translations:
SCOTT@orcl_11g> BEGIN
  2    CTX_THES.CREATE_THESAURUS ('english_french');
  3    CTX_THES.CREATE_PHRASE ('english_french', 'house');
  4    CTX_THES.CREATE_TRANSLATION ('english_french', 'house', 'FRENCH', 'maison');
  5    CTX_THES.CREATE_PHRASE ('english_french', 'cat');
  6    CTX_THES.CREATE_TRANSLATION ('english_french', 'cat', 'FRENCH', 'chat');
  7    CTX_THES.CREATE_PHRASE ('english_french', 'hat');
  8    CTX_THES.CREATE_TRANSLATION ('english_french', 'hat', 'FRENCH', 'chapeau');
  9  END;
 10  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> -- select translations from thesaurus:
SCOTT@orcl_11g> SELECT CTX_THES.TRSYN ('house', 'FRENCH', 'english_french') FROM DUAL
  2  /

CTX_THES.TRSYN('HOUSE','FRENCH','ENGLISH_FRENCH')
--------------------------------------------------------------------------------
{HOUSE}|{MAISON}

SCOTT@orcl_11g> -- use translations in search of table:
SCOTT@orcl_11g> COLUMN text_data FORMAT A30 WORD_WRAPPED
SCOTT@orcl_11g> SELECT * FROM docs
  2  WHERE  CONTAINS (text_data, 'TRSYN (house, FRENCH, english_french)') > 0
  3  /

        ID TEXT_DATA
---------- ------------------------------
         1 house
         2 maison

SCOTT@orcl_11g> -- function to translate multiple words:
SCOTT@orcl_11g> create or replace function trsyns
  2    (p_words in varchar2,
  3  	p_lang	in varchar2 default 'FRENCH',
  4  	p_thes	in varchar2 default 'english_french',
  5  	p_num	in number   default 2)
  6  	return varchar2
  7  as
  8    v_words_in      varchar2 (32767) := ltrim (p_words) || ' ';
  9    v_words_out     varchar2 (32767);
 10  begin
 11    while instr (v_words_in, '  ') > 0 loop
 12  	 v_words_in := replace (v_words_in, '  ', ' ');
 13    end loop;
 14    while length (v_words_in) > 1
 15    loop
 16  	 for i in reverse 1 .. least (p_num, (length (v_words_in) - length (replace (v_words_in, ' ', ''))))
 17  	 loop
 18  	   if instr
 19  		(ctx_thes.trsyn
 20  		  (substr (v_words_in, 1, instr (v_words_in, ' ', 1, i) - 1),
 21  		   p_lang, p_thes), '|'
 22  		) > 0
 23  	     or i = 1 then
 24  	     v_words_out := v_words_out
 25  	     || ' AND ('
 26  	     || ctx_thes.trsyn
 27  		  (substr (v_words_in, 1, instr (v_words_in, ' ', 1, i) - 1),
 28  		   p_lang, p_thes)
 29  	     || ')';
 30  	     v_words_in := substr (v_words_in, instr (v_words_in, ' ', 1, i) + 1);
 31  	     exit;
 32  	   end if;
 33  	 end loop;
 34    end loop;
 35    return ltrim (v_words_out, ' AND ');
 36  end trsyns;
 37  /

Function created.

SCOTT@orcl_11g> show errors
No errors.
SCOTT@orcl_11g> -- example of what the function returns:
SCOTT@orcl_11g> SELECT trsyns ('cat hat') FROM DUAL
  2  /

TRSYNS('CATHAT')
--------------------------------------------------------------------------------
({CAT}|{CHAT}) AND ({HAT}|{CHAPEAU})

SCOTT@orcl_11g> -- usage of the function in a search:
SCOTT@orcl_11g> SELECT * FROM docs
  2  WHERE  CONTAINS (text_data, trsyns ('cat hat')) > 0
  3  /

        ID TEXT_DATA
---------- ------------------------------
         3 The Cat in the Hat
         4 Le Chat dans le Chapeau

SCOTT@orcl_11g> 

Re: Multi Language in Oracle Text [message #339803 is a reply to message #339492] Fri, 08 August 2008 10:02 Go to previous messageGo to next message
McLaud
Messages: 15
Registered: August 2008
Junior Member
Hello Barbara,
Thank you very much for your replay, I am very impressed!

I have next couple of questions

1. The thesaurus.
a. Does Oracle provide any translations (English/French) thesaurus for free?
b. I did not find any default thesaurus under Oracle database (10.2.0.3.0) - samples folder under ctx.
c. Do you know any resources, providing thesaurus for free?
2. Multi Language
As I say before, my Text_Data column will contains both English and French data. In case, if I failed to discover translation thesaurus, I need just to find rows with special keyword from both French and English rows (for example both rows in English and French, which contains ‘home’).
What is your opinion, which kind of LEXER type I need to apply for Text_Data column WORLD_LEXER or MULTI_LEXER ?
WORLD_LEXER example:

exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');
create index doc_idx on docs(text_data) indextype is CONTEXT parameters ('lexer MYLEXER’);


MULTY_LEXER example
I found some examples on http://youngcow.net/doc/oracle10g/text.102/b14218/cdatadic.htm#i1007955
They are more complicated than WORLD_LEXER.

3. The Docs table has a row:
Insert into Docs(id, text_data) values (1, ‘Mr Richard Stephen Jones comes to house’);
What is your opinion about how to implement the following requirements?
3.1 Abbreviated words. Any of the name attributes are abbreviated, so I need to identify row “Mr R.S. Jones”
3.2 The order of the name attributes are changed - i.e. Surname first “Jones Richard Smith”
3.3 Any of the name attributes are misspelt -i.e. “Richrd”, “Rchard”, “Stephn” and “Stepen”.
3.4 Any of the name attributes are a known nickname for the base name -i.e. Dick is a nickname of Richard; Steven is a nickname of Stephen.

4. Multi-Language Stoplists
If my default database NLS settings are different then English and French (for example Russian), how I can introduce stoplist for English and French?
I found stop lists under Oracle Installation folder for French (ctx\admin\defaults\drdeff.sql) and English (ctx\admin\defaults\drdefgb.sql).
What I need to change in those scripts, and how to tell Oracle text to use it for specific language?

Thank you for you reply!

[Updated on: Fri, 08 August 2008 10:11]

Report message to a moderator

Re: Multi Language in Oracle Text [message #339824 is a reply to message #339803] Fri, 08 August 2008 15:05 Go to previous messageGo to next message
Barbara Boehmer
Messages: 7990
Registered: November 2002
Location: California, USA
Senior Member
1. The thesaurus.
a. Does Oracle provide any translations (English/French) thesaurus for free?

Not that I know of.


b. I did not find any default thesaurus under Oracle database (10.2.0.3.0) - samples folder under ctx.

In 10g Enterprise Edition, you have to download the "companion cd". The supplied thesauri are in that. There is no supplied thesaurus for the 10g Express Edition. In 11g, the thesauri are in the "examples". These are very minimal and intended as examples and not complete thesauri for production usage.


c. Do you know any resources, providing thesaurus for free?

No, but if you search the web you might find something. You might search for english french dictionary, instead of thesaurus, to find what you are looking for. There are also sites like http://www.freetranslation.com/ that do free translation. You might be able to enter a numbered list of English words and obtain a numbered list of french words. You could then load those joining on the numbers. This could work if you only have a few search terms. Check your results carefully. Take your french results, re-enter them to translate from french to english, then compare to see if you got what you started with. A lot of things have multiple meanings and results can be funny, especially with things like computer terminology. For example, java programming can end up getting translated to something like a coffee beverage.


2. Multi Language
As I say before, my Text_Data column will contains both English and French data. In case, if I failed to discover translation thesaurus, I need just to find rows with special keyword from both French and English rows (for example both rows in English and French, which contains ‘home’).
What is your opinion, which kind of LEXER type I need to apply for Text_Data column WORLD_LEXER or MULTI_LEXER ?
WORLD_LEXER example:
exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');
create index doc_idx on docs(text_data) indextype is CONTEXT parameters ('lexer MYLEXER’);
MULTY_LEXER example
I found some examples on http://youngcow.net/doc/oracle10g/text.102/b14218/cdatadic.htm#i1007955
They are more complicated than WORLD_LEXER.

You could use either one. The multi_lexer requires a language column. The new world_lexer detects the languages automatically, so it is simpler to use, so I would use that if you can.


3. The Docs table has a row:
Insert into Docs(id, text_data) values (1, ‘Mr Richard Stephen Jones comes to house’);
What is your opinion about how to implement the following requirements?
3.1 Abbreviated words. Any of the name attributes are abbreviated, so I need to identify row “Mr R.S. Jones”

You can use wildcards with Oracle Text, so you could search for 'Mr R% S% Jones'. By default, punctuations such as periods are ignored and not tokenized.


3.2 The order of the name attributes are changed - i.e. Surname first “Jones Richard Smith”

No problem. Search for 'Jones AND Richard AND Smith' and it will find 'Richard Smith Jones' in any order.


3.3 Any of the name attributes are misspelt -i.e. “Richrd”, “Rchard”, “Stephn” and “Stepen”.

You can use Oracle Text features such as fuzzy to search for similar spellings or soundex to search for names with similar sound. If you use ACCUM and order by score or use progress query relaxation, if some of the words match you will still get a result, just further down on the list. You can use string comparisons like utl_match.edit_distance and utl_match.jaro_winkler_similarity to further limit the result set.


3.4 Any of the name attributes are a known nickname for the base name -i.e. Dick is a nickname of Richard; Steven is a nickname of Stephen.

Some like Dick and Richard would need to be entered as synonyms in your thesaurus. Others could be found using soundex or fuzzy.


4. Multi-Language Stoplists
If my default database NLS settings are different then English and French (for example Russian), how I can introduce stoplist for English and French?
I found stop lists under Oracle Installation folder for French (ctx\admin\defaults\drdeff.sql) and English (ctx\admin\defaults\drdefgb.sql).
What I need to change in those scripts, and how to tell Oracle text to use it for specific language?

When you install Oracle Text from the "companion cd", by default the stoplist in the same language as your database is made the default stoplist. This can be overridden by running certain files like drdeffr.sql to make French the default language or drdefus.sql to make American the default language for Oracle Text. These can be found in <your oracle home directory>\ctx\admin\defaults, once you have installed them from the companion cd. You should probably check with Oracle support before running these and they will only affect a single language default. You will need to create a single stoplist of type multi_stoplist. What I might do is copy and paste and modify some of the code just to get the words from the files I just mentioned and create my own text file to populate the stoplist. You should be aware that there are definite differences between UK English and US American and between French in France and French in Canada and there are separate sql files and stoplists for each. Another alternative is to copy and paste words from the documentation:

http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/astopsup.htm#CCREF1400

to a text file, use SQL*Loader or an external table to put them in a table, then use a pl/sql block to loop through that table to load them into your stoplist.








Re: Multi Language in Oracle Text [message #339839 is a reply to message #339472] Fri, 08 August 2008 16:19 Go to previous messageGo to next message
McLaud
Messages: 15
Registered: August 2008
Junior Member
I love you, Barbara!
Re: Multi Language in Oracle Text [message #341185 is a reply to message #339824] Sun, 17 August 2008 13:36 Go to previous messageGo to next message
McLaud
Messages: 15
Registered: August 2008
Junior Member
Hello Barbara
I play with WORLD_LEXER and it does not help, I create CONTEXT index with WORLD_LEXER, but cannot locate French data. So finally my code is
drop table Docs;

create table Docs (
    Id number,
    Text_data varchar2(200)
    );


Insert into Docs(id, text_data) values (1, 'Mr Richard Stephen Jones comes to house');
Insert into Docs(id, text_data) values (2, 'Mr R.S. Jones comes to house');
Insert into Docs(id, text_data) values (3, 'Mr Jones Richard Stephen comes to house');
Insert into Docs(id, text_data) values (4, 'Mr Richrd Stephen Jones comes to house');
Insert into Docs(id, text_data) values (5, 'Mr Richrd Stephn Jones comes to house');
--- Enter some French chars
Insert into Docs(id, text_data) values (6, 'Azerty wxcvbn');

drop index docs_idx;

create index docs_idx on docs ( text_data )
  indextype is ctxsys.context;


exec ctx_ddl.sync_index('docs_idx');

select token_text from dr$docs_idx$i;

COLUMN text_data FORMAT A30 WORD_WRAPPED


--  Steam
select * from docs where contains(text_data, '$R', 1)>0;
-- Using soundex (!) funtion
select * from docs where contains(text_data, '!Richrd', 1)>0;
select * from docs where contains(text_data, '!Richrd !Stephn', 1)>0;
select * from docs where contains(text_data, 'Richrd !Stephn', 1)>0;
select * from docs where contains(text_data, 'fuzzy(Richrd, 50, 5, weight)', 1) > 0;

--  AND
select * from docs where contains(text_data, 'Jones AND Richard AND Stephen', 1)>0;

--- Test French data

select * from docs where contains(text_data, 'wxcvbn', 1)>0;



The problem is that I cannot find right solution for a abbreviated words, locating R.S. keywords, I use STEAM function, but sure it right.
How exactly to find R.S. ?
Re: Multi Language in Oracle Text [message #341190 is a reply to message #341185] Sun, 17 August 2008 16:25 Go to previous message
Barbara Boehmer
Messages: 7990
Registered: November 2002
Location: California, USA
Senior Member
It helps if you provide the results that you are getting and not just the queries as I may get different results on my system than you get on yours, due to different versions and default settings and such.

As stated previously, if you want to search for anything that begins with "R" followed by anything that begins with "S", then you need to use a wildcard "%" in your search like "R% S%". Also, "S" is a default stopword, so if you want to be able to search for "S", then you need to make sure that it is not in your stoplist.

Stemming is for words with the same linguistic root, like plurals and conjugations, and such. For example "come", "came", and "coming" all have the same linguistic root. Stemming does not mean words that just start with the same letters.

Soundex is for words that sound alike. Soundex does not mean words that just start with the same letters. The whole words must sound alike.

If you want to use multi-language stemming with the world lexer, then you need to use a basic_wordlist with auto stemmer. In 10g, which stemmer is used at query time is determined by the nls_language for the session.

Please see the demonstration below.


SCOTT@orcl_11g> create table Docs (
  2  	 Id number,
  3  	 Text_data varchar2(200)
  4  	 )
  5  /

Table created.

SCOTT@orcl_11g> begin
  2    Insert into Docs(id, text_data) values (1, 'Mr Richard Stephen Jones comes to house');
  3    Insert into Docs(id, text_data) values (2, 'Mr R.S. Jones comes to house');
  4    Insert into Docs(id, text_data) values (3, 'Mr Jones Richard Stephen comes to house');
  5    Insert into Docs(id, text_data) values (4, 'Mr Richrd Stephen Jones comes to house');
  6    Insert into Docs(id, text_data) values (5, 'Mr Richrd Stephn Jones comes to house');
  7    --- Enter some French chars
  8    Insert into Docs(id, text_data) values (6, 'Azerty wxcvbn');
  9    -- additional test data
 10    Insert into Docs(id, text_data) values (7, 'The cats chased the mice.');
 11    Insert into Docs(id, text_data) values (8, 'Les chats ont chass les souris.');
 12    Insert into Docs(id, text_data) values (9, 'The cat was chasing a mouse.');
 13    Insert into Docs(id, text_data) values (10, 'Le chat chassait une souris.');
 14  end;
 15  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> -- world_lexer and basic_wordlist with auto stemmer
SCOTT@orcl_11g> begin
  2    ctx_ddl.create_preference ('your_world_lex', 'world_lexer');
  3    ctx_ddl.create_preference ('your_wordlist', 'basic_wordlist');
  4    ctx_ddl.set_attribute ('your_wordlist', 'stemmer', 'auto');
  5  end;
  6  /

PL/SQL procedure successfully completed.

SCOTT@orcl_11g> -- index with lexer, wordlist, and empty stoplist
SCOTT@orcl_11g> create index docs_idx on docs ( text_data )
  2    indextype is ctxsys.context
  3    parameters
  4  	 ('lexer	    your_world_lex
  5  	   wordlist	    your_wordlist
  6  	   stoplist	    ctxsys.empty_stoplist')
  7  /

Index created.

SCOTT@orcl_11g> select token_text from dr$docs_idx$i
  2  /

TOKEN_TEXT
----------------------------------------------------------------
A
AZERTY
CAT
CATS
CHASED
CHASING
CHASSAIT
CHASS
CHAT
CHATS
COMES
HOUSE
JONES
LE
LES
MICE
MOUSE
MR
ONT
R
RICHARD
RICHRD
S
SOURIS
STEPHEN
STEPHN
THE
TO
UNE
WAS
WXCVBN

31 rows selected.

SCOTT@orcl_11g> COLUMN text_data FORMAT A60 WORD_WRAPPED
SCOTT@orcl_11g> --  Stem
SCOTT@orcl_11g> select * from docs where contains(text_data, '$R', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         2 Mr R.S. Jones comes to house

SCOTT@orcl_11g> -- Using soundex (!) funtion
SCOTT@orcl_11g> select * from docs where contains(text_data, '!Richrd', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         1 Mr Richard Stephen Jones comes to house
         3 Mr Jones Richard Stephen comes to house
         4 Mr Richrd Stephen Jones comes to house
         5 Mr Richrd Stephn Jones comes to house

SCOTT@orcl_11g> select * from docs where contains(text_data, '!Richrd !Stephn', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         1 Mr Richard Stephen Jones comes to house
         3 Mr Jones Richard Stephen comes to house
         4 Mr Richrd Stephen Jones comes to house
         5 Mr Richrd Stephn Jones comes to house

SCOTT@orcl_11g> select * from docs where contains(text_data, 'Richrd !Stephn', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         4 Mr Richrd Stephen Jones comes to house
         5 Mr Richrd Stephn Jones comes to house

SCOTT@orcl_11g> select * from docs where contains(text_data, 'fuzzy(Richrd, 50, 5, weight)', 1) > 0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         1 Mr Richard Stephen Jones comes to house
         3 Mr Jones Richard Stephen comes to house
         4 Mr Richrd Stephen Jones comes to house
         5 Mr Richrd Stephn Jones comes to house

SCOTT@orcl_11g> --  AND
SCOTT@orcl_11g> select * from docs where contains(text_data, 'Jones AND Richard AND Stephen', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         1 Mr Richard Stephen Jones comes to house
         3 Mr Jones Richard Stephen comes to house

SCOTT@orcl_11g> --- Test French data
SCOTT@orcl_11g> select * from docs where contains(text_data, 'wxcvbn', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         6 Azerty wxcvbn

SCOTT@orcl_11g> -- additional tests
SCOTT@orcl_11g> select * from docs where contains(text_data, 'r% s%', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         1 Mr Richard Stephen Jones comes to house
         2 Mr R.S. Jones comes to house
         3 Mr Jones Richard Stephen comes to house
         4 Mr Richrd Stephen Jones comes to house
         5 Mr Richrd Stephn Jones comes to house

SCOTT@orcl_11g> select * from docs where contains(text_data, '$chase', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         7 The cats chased the mice.
         9 The cat was chasing a mouse.

SCOTT@orcl_11g> alter session set nls_language = 'FRENCH'
  2  /

Session altered.

SCOTT@orcl_11g> select * from docs where contains(text_data, '$chasser', 1)>0
  2  /

        ID TEXT_DATA
---------- ------------------------------------------------------------
         8 Les chats ont chass les souris.
        10 Le chat chassait une souris.

SCOTT@orcl_11g>






Previous Topic: ctxsys creation in 10g
Next Topic: How to pass the value to a contains query using a parameter
Goto Forum:
  


Current Time: Tue Sep 30 11:13:47 CDT 2014

Total time taken to generate the page: 0.09069 seconds