Home » SQL & PL/SQL » SQL & PL/SQL » tough requirement string comparison (oracle 10g)
tough requirement string comparison [message #312448] Tue, 08 April 2008 21:36 Go to next message
srinivas.k2005
Messages: 306
Registered: August 2006
Senior Member
Hi,

I have requirement wherein i need to compare two strings (with mulitple words) and it should return the %(percentage) of comparison.
e.g. "oracle infotech" and "infotech oracle" are 100% match

Do we have any oracle builtins to compare ?

Please let me know if anyone has any idea on above requirement.

Thanks,
Srinivas

[Updated on: Tue, 08 April 2008 21:36]

Report message to a moderator

Re: tough requirement string comparison [message #312460 is a reply to message #312448] Tue, 08 April 2008 23:44 Go to previous messageGo to next message
Barbara Boehmer
Messages: 8006
Registered: November 2002
Location: California, USA
Senior Member
Are you just going to be comparing two strings at a time or pairs of strings from columns in a table or compare one string to each value in a column or what? There are various comparison methods.

If your strings are stored in a table you can create Oracle text indexes on them and compare using contains clauses that returns a score.

There is a utl_match.edit_distance_similarity function that returns a percentage of similarity, but it may not be the formula that you are looking for and the order of words counts, so 'oracle infotech' and 'infotech oracle' would not be a 100% match.

How is your percentage defined? If you consider it as the number of matching words divided by the number of distinct words times 100, then you can create your own function something like the example below. It puts the words from each string into a collection then selects the intersection and the union of those two sets, in order to calculate the percentage.

SCOTT@orcl_11g> CREATE OR REPLACE TYPE varchar2_tt AS TABLE OF VARCHAR2 (2000);
  2  /

Type created.

SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION percentage_of_matching_words
  2    (p_string1   IN VARCHAR2,
  3  	p_string2   IN VARCHAR2)
  4    RETURN NUMBER
  5  AS
  6    v_string1	VARCHAR2 (2000) := p_string1 || ' ';
  7    v_string2	VARCHAR2 (2000) := p_string2 || ' ';
  8    v_words1 	varchar2_tt := varchar2_tt();
  9    v_words2 	varchar2_tt := varchar2_tt();
 10    v_position	NUMBER;
 11    v_percent	NUMBER;
 12  BEGIN
 13    LOOP
 14  	 EXIT WHEN v_string1 IS NULL;
 15  	 v_position := INSTR (v_string1, ' ' );
 16  	 v_words1.EXTEND;
 17  	 v_words1 (v_words1.COUNT) := LTRIM (RTRIM (SUBSTR (v_string1, 1, v_position - 1)));
 18  	 v_string1 := SUBSTR (v_string1, v_position + 1);
 19    END LOOP;
 20    LOOP
 21  	 EXIT WHEN v_string2 IS NULL;
 22  	 v_position := INSTR (v_string2, ' ' );
 23  	 v_words2.EXTEND;
 24  	 v_words2 (v_words2.COUNT) := LTRIM (RTRIM (SUBSTR (v_string1, 1, v_position - 1)));
 25  	 v_string2 := SUBSTR (v_string2, v_position + 1);
 26    END LOOP;
 27    WITH s1 AS (SELECT * FROM TABLE (CAST (in_list (p_string1) AS varchar2_tt))),
 28  	    s2 AS (SELECT * FROM TABLE (CAST (in_list (p_string2) AS varchar2_tt)))
 29    SELECT (i.cnt / u.cnt) * 100
 30    INTO   v_percent
 31    FROM   (SELECT COUNT (*) cnt FROM (SELECT * FROM s1 INTERSECT SELECT * FROM s2)) i,
 32  	      (SELECT COUNT (*) cnt FROM (SELECT * FROM s1 UNION SELECT * FROM s2)) u;
 33    RETURN v_percent;
 34  END percentage_of_matching_words;
 35  /

Function created.

SCOTT@orcl_11g> SHOW ERRORS
No errors.
SCOTT@orcl_11g> SELECT percentage_of_matching_words
  2  	      ('oracle infotech',
  3  	       'infotech oracle')
  4  FROM DUAL
  5  /

PERCENTAGE_OF_MATCHING_WORDS('ORACLEINFOTECH','INFOTECHORACLE')
---------------------------------------------------------------
                                                            100

SCOTT@orcl_11g> SELECT percentage_of_matching_words
  2  	      ('Ship the documents to the Feds',
  3  	       'Rip the documents to shreds')
  4  FROM   DUAL
  5  /

PERCENTAGE_OF_MATCHING_WORDS('SHIPTHEDOCUMENTSTOTHEFEDS','RIPTHEDOCUMENTSTOSHRED
--------------------------------------------------------------------------------
                                                                      42.8571429

SCOTT@orcl_11g> 


Re: tough requirement string comparison [message #312494 is a reply to message #312460] Wed, 09 April 2008 01:03 Go to previous messageGo to next message
Michel Cadot
Messages: 59811
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
Same thing in SQL:
SQL> with
  2    data as (
  3      select 0 id, 'oracle infotech' val from dual
  4      union all
  5      select 1, 'infotech oracle' from dual
  6      union all
  7      select 2, 'Ship the documents to the Feds' from dual
  8      union all
  9      select 3, 'Rip the documents to shreds' from dual
 10    ),
 11    lines as (
 12      select level line 
 13      from dual 
 14      connect by level <= 
 15        ( select max(length(val)-length(replace(val,' ','')))+1 from data )
 16    ),
 17    words as (
 18      select id, val,
 19             substr(val,
 20                    instr(' '||val||' ',' ',1,line),
 21                    instr(' '||val||' ',' ',1,line+1)-instr(' '||val||' ',' ',1,line)-1
 22                   ) word
 23      from data, lines
 24      where line <= length(val)-length(replace(val,' ',''))+1
 25    ),
 26    nb_words as (
 27      select id, count(*) nb_words
 28      from ( select trunc(id/2) id, word from words where mod(id,2) = 0
 29             union
 30             select trunc(id/2) id, word from words where mod(id,2) = 1 )
 31      group by id
 32    ),
 33    common_words as (
 34      select id, count(*) nb_common
 35      from ( select trunc(id/2) id, word from words where mod(id,2) = 0
 36             intersect
 37             select trunc(id/2) id, word from words where mod(id,2) = 1 )
 38      group by id
 39    )
 40  select d1.val val1, d2.val val2, 100*nb_common/nb_words pct_match
 41  from nb_words, common_words, data d1, data d2
 42  where common_words.id = nb_words.id
 43    and trunc(d1.id/2) = nb_words.id
 44    and mod(d1.id,2) = 0
 45    and mod(d2.id,2) = 1
 46    and trunc(d2.id/2) = nb_words.id
 47  order by nb_words.id
 48  /
VAL1                           VAL2                            PCT_MATCH
------------------------------ ------------------------------ ----------
oracle infotech                infotech oracle                       100
Ship the documents to the Feds Rip the documents to shreds    42.8571429

2 rows selected.

Regards
Michel
Re: tough requirement string comparison [message #313830 is a reply to message #312448] Mon, 14 April 2008 15:05 Go to previous messageGo to next message
Frank_Zhou
Messages: 5
Registered: February 2008
Location: Braintree , MA
Junior Member
An alternative SQL for comparing 2 input strings:

SQL> select max(old_str) str1, min(old_str) str2, count(distinct str)/max(str_cnt) *100 as mat_pct
  2  FROM
  3  (select old_str, str, count(distinct str) over ( ) str_cnt,
  4   count(distinct id) over(partition by str) counter
  5  from 
  6   (select id, old_str, doc.extract('/l/text()').getstringVal()  str  
  7    from (select 1 as id, 'Ship the documents to the Feds' old_str from dual
  8          union all
  9          select 2, 'Rip the documents to shreds' from dual) t,
 10     TABLE(xmlSequence(extract(XMLType('<doc><l>'||
 11          replace(t.old_str,' ','</l><l>')||'</l></doc>'),'/doc/l'))) doc
 12   )
 13  )
 14  where counter = 2;

STR1                             STR2                              MAT_PCT
------------------------------   ------------------------------ ----------
Ship the documents to the Feds   Rip the documents to shreds    42.8571429


Frank
Re: tough requirement string comparison [message #313949 is a reply to message #312448] Tue, 15 April 2008 05:30 Go to previous messageGo to next message
srinivas.k2005
Messages: 306
Registered: August 2006
Senior Member
Thanks For all your reply..
Re: tough requirement string comparison [message #501974 is a reply to message #312460] Fri, 01 April 2011 20:05 Go to previous messageGo to next message
Barbara Boehmer
Messages: 8006
Registered: November 2002
Location: California, USA
Senior Member
I know this is an old thread, but I received an email pointing out an error and some missing code. On line 24 of the percentage_of_matching_words function, v_string1 should be v_string2. Also, I initially used Tom Kyte's in_list function for testing, then added code to the function to do the same thing, but forgot to substitute it for the in_list function. So, I have provided the corrected code and demo below.

SCOTT@orcl_11gR2> CREATE OR REPLACE TYPE varchar2_tt AS TABLE OF VARCHAR2 (2000);
  2  /

Type created.

SCOTT@orcl_11gR2> CREATE OR REPLACE FUNCTION percentage_of_matching_words
  2    (p_string1   IN VARCHAR2,
  3  	p_string2   IN VARCHAR2)
  4    RETURN NUMBER
  5  AS
  6    v_string1     VARCHAR2 (2000) := p_string1 || ' ';
  7    v_string2     VARCHAR2 (2000) := p_string2 || ' ';
  8    v_words1      varchar2_tt := varchar2_tt();
  9    v_words2      varchar2_tt := varchar2_tt();
 10    v_position    NUMBER;
 11    v_percent     NUMBER;
 12  BEGIN
 13    LOOP
 14  	 EXIT WHEN v_string1 IS NULL;
 15  	 v_position := INSTR (v_string1, ' ' );
 16  	 v_words1.EXTEND;
 17  	 v_words1 (v_words1.COUNT) := LTRIM (RTRIM (SUBSTR (v_string1, 1, v_position - 1)));
 18  	 v_string1 := SUBSTR (v_string1, v_position + 1);
 19    END LOOP;
 20    LOOP
 21  	 EXIT WHEN v_string2 IS NULL;
 22  	 v_position := INSTR (v_string2, ' ' );
 23  	 v_words2.EXTEND;
 24  	 v_words2 (v_words2.COUNT) := LTRIM (RTRIM (SUBSTR (v_string2, 1, v_position - 1)));
 25  	 v_string2 := SUBSTR (v_string2, v_position + 1);
 26    END LOOP;
 27    WITH s1 AS (SELECT * FROM TABLE (CAST (v_words1 AS varchar2_tt))),
 28  	    s2 AS (SELECT * FROM TABLE (CAST (v_words2 AS varchar2_tt)))
 29    SELECT (i.cnt / u.cnt) * 100
 30    INTO   v_percent
 31    FROM   (SELECT COUNT (*) cnt FROM (SELECT * FROM s1 INTERSECT SELECT * FROM s2)) i,
 32  	      (SELECT COUNT (*) cnt FROM (SELECT * FROM s1 UNION SELECT * FROM s2)) u;
 33    RETURN v_percent;
 34  END percentage_of_matching_words;
 35  /

Function created.

SCOTT@orcl_11gR2> SELECT percentage_of_matching_words
  2  	   ('oracle infotech',
  3  	    'infotech oracle')
  4  FROM DUAL
  5  /

PERCENTAGE_OF_MATCHING_WORDS('ORACLEINFOTECH','INFOTECHORACLE')
---------------------------------------------------------------
                                                            100

1 row selected.

SCOTT@orcl_11gR2> SELECT percentage_of_matching_words
  2  	   ('Ship the documents to the Feds',
  3  	    'Rip the documents to shreds')
  4  FROM   DUAL
  5  /

PERCENTAGE_OF_MATCHING_WORDS('SHIPTHEDOCUMENTSTOTHEFEDS','RIPTHEDOCUMENTSTOSHRED
--------------------------------------------------------------------------------
                                                                      42.8571429

1 row selected.

SCOTT@orcl_11gR2>

Re: tough requirement string comparison [message #502307 is a reply to message #501974] Tue, 05 April 2011 13:57 Go to previous message
supercox
Messages: 4
Registered: April 2011
Location: NOVA
Junior Member
Barbara,

Thank you for fixing this and posting it back here.
I have implemented it with my project and it is working wonderfully!
We needed to change the percentage from the long format provided ( 85.71428571428571428571428571428571428571 ) to ( 85.71 ).
This was accomplished by changing v_percent from NUMBER to DECIMAL(12,2) in line 11.

Thank you
Roger
Previous Topic: GROUP BY on an expression (and not a field in the table). Is it possible?
Next Topic: send mail from procedure with attach from file system
Goto Forum:
  


Current Time: Thu Nov 27 11:31:26 CST 2014

Total time taken to generate the page: 0.08659 seconds