Path: dp-news.maxwell.syr.edu!spool.maxwell.syr.edu!drn.maxwell.syr.edu!news.maxwell.syr.edu!postnews.google.com!z14g2000cwz.googlegroups.com!not-for-mail
From: "Barbara Boehmer" <baboehme@hotmail.com>
Newsgroups: comp.databases.oracle.server,microsoft.public.sqlserver.server
Subject: Re: How to do a "fuzzy" or approximate matching of strings in a SQL where clause
Date: 8 Aug 2005 13:55:59 -0700
Organization: http://groups.google.com
Lines: 289
Message-ID: <1123534559.116221.197630@z14g2000cwz.googlegroups.com>
References: <1123514686.757953.315070@g43g2000cwa.googlegroups.com>
   <1123511868.e647dc33c74835f95c58a2bd3345f028@bubbanews>
   <1123522291.474278.110070@g14g2000cwa.googlegroups.com>
NNTP-Posting-Host: 66.174.90.201
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
X-Trace: posting.google.com 1123534566 493 127.0.0.1 (8 Aug 2005 20:56:06 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Mon, 8 Aug 2005 20:56:06 +0000 (UTC)
User-Agent: G2/0.2
Complaints-To: groups-abuse@google.com
Injection-Info: z14g2000cwz.googlegroups.com; posting-host=66.174.90.201;
   posting-account=nheuaQwAAABKxf1p2deDG3dphbeyn7_r
Xref: dp-news.maxwell.syr.edu comp.databases.oracle.server:249132 microsoft.public.sqlserver.server:340523

There are many options, most of which have already been mentioned by
others, which can be used individually, or in combination.  You can use
built-in functions like LIKE and INSTR or you can use Oracle Text,
which requires creation of an index and periodic synchronization of the
index, but is an excellent combination of efficiency and flexibility,
with various search options and scoring that can be used for ordering
or further limiting.  You can also use the Levenshtein distance (edit
distance) formula, which can also be used for ordering and will find
wrong spellings and things that the others do not, but tends to be slow
on large datasets.  The Levenshtein distance formula is best used for
ordering and further limiting result sets after other methods have been
applied.  What combination you choose depends on your exact needs.  I
have provided a small sampling of some of these methods below for your
perusal.

scott@ORA92> -- test data:
scott@ORA92> SELECT * FROM your_table
  2  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
Re: Re: SQL Server help   request - followup
FW:  forwarded for your attention:  SQL Server help request
Help request SQL Server
Help requested regarding SQL Server
Requesting for some Help with SQL Server
SQL Server Help Required
SQL Server Help needed
Some other request for help with something else

9 rows selected.

scott@ORA92>
scott@ORA92> -- like (built-in function):
scott@ORA92> SELECT * FROM your_table
  2  WHERE  UPPER (your_column) LIKE '%SQL SERVER%'
  3  AND    UPPER (your_column) LIKE '%HELP%'
  4  AND    UPPER (your_column) LIKE '%REQUEST%'
  5  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
Re: Re: SQL Server help   request - followup
FW:  forwarded for your attention:  SQL Server help request
Help request SQL Server
Help requested regarding SQL Server
Requesting for some Help with SQL Server

6 rows selected.

scott@ORA92>
scott@ORA92> -- instr (built-in function):
scott@ORA92> SELECT * FROM your_table
  2  WHERE  INSTR (UPPER (your_column), 'SQL SERVER') > 0
  3  AND    INSTR (UPPER (your_column), 'HELP') > 0
  4  AND    INSTR (UPPER (your_column), 'REQUEST') > 0
  5  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
Re: Re: SQL Server help   request - followup
FW:  forwarded for your attention:  SQL Server help request
Help request SQL Server
Help requested regarding SQL Server
Requesting for some Help with SQL Server

6 rows selected.

scott@ORA92>
scott@ORA92> -- context:
scott@ORA92> -- probably best combination of performance and
flexibility
scott@ORA92> -- the following only demonstrates a few options
scott@ORA92> -- requires creation of index:
scott@ORA92> CREATE INDEX your_index
  2  ON your_table (your_column)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  /

Index created.

scott@ORA92> -- requires periodic synchronization of index
scott@ORA92> -- which can be scheduled to run automatically:
scott@ORA92> EXECUTE ctx_ddl.sync_index ('YOUR_INDEX')

PL/SQL procedure successfully completed.

scott@ORA92> -- set define off so ampersand can be used:
scott@ORA92> SET DEFINE OFF
scott@ORA92> SELECT your_column, SCORE (1) FROM your_table
  2  WHERE  CONTAINS (your_column, 'SQL & SERVER & HELP & REQUEST', 1)
> 0
  3  ORDER  BY SCORE (1) DESC
  4  /

YOUR_COLUMN                                                    SCORE(1)
------------------------------------------------------------ ----------
Help request SQL Server                                               3
FW:  forwarded for your attention:  SQL Server help request           3
Re: Re: SQL Server help   request - followup                          3
SQL server help request                                               3

scott@ORA92> -- with fuzzy matching (old syntax, newer has more
options):
scott@ORA92> SELECT your_column, SCORE (1) FROM your_table
  2  WHERE  CONTAINS (your_column, '?SQL & ?SERVER & ?HELP & ?REQUEST',
1) > 0
  3  ORDER  BY SCORE (1) DESC
  4  /

YOUR_COLUMN                                                    SCORE(1)
------------------------------------------------------------ ----------
Help requested regarding SQL Server                                   3
Help request SQL Server                                               3
FW:  forwarded for your attention:  SQL Server help request           3
Re: Re: SQL Server help   request - followup                          3
SQL server help request                                               3

scott@ORA92>
scott@ORA92> -- Levenshtein distance:
scott@ORA92> -- (requires creation of user-defined function by just
running the code below)
scott@ORA92> -- finds wrong spellings and things that others don't, but
tends to be slow on large datasets
scott@ORA92> -- is best used in combination with other methods to order
/ further limit a small result set
scott@ORA92> CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
  2    (p_source_string   IN VARCHAR2,
  3  	p_target_string   IN VARCHAR2)
  4    RETURN		     NUMBER
  5    DETERMINISTIC
  6  AS
  7    v_length_of_source    NUMBER := NVL (LENGTH (p_source_string),
0);
  8    v_length_of_target    NUMBER := NVL (LENGTH (p_target_string),
0);
  9    TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
 10    column_to_left	     mytabtype;
 11    current_column	     mytabtype;
 12    v_cost		     NUMBER := 0;
 13  BEGIN
 14    IF v_length_of_source = 0 THEN
 15  	 RETURN v_length_of_target;
 16    ELSIF v_length_of_target = 0 THEN
 17  	 RETURN v_length_of_source;
 18    ELSE
 19  	 FOR j IN 0 .. v_length_of_target LOOP
 20  	   column_to_left(j) := j;
 21  	 END LOOP;
 22  	 FOR i IN 1.. v_length_of_source LOOP
 23  	   current_column(0) := i;
 24  	   FOR j IN 1 .. v_length_of_target LOOP
 25  	     IF SUBSTR (p_source_string, i, 1) =
 26  		SUBSTR (p_target_string, j, 1)
 27  	     THEN v_cost := 0;
 28  	     ELSE v_cost := 1;
 29  	     END IF;
 30  	     current_column(j) := LEAST (current_column(j-1) + 1,
 31  					 column_to_left(j) + 1,
 32  					 column_to_left(j-1) + v_cost);
 33  	   END LOOP;
 34  	   FOR j IN 0 .. v_length_of_target  LOOP
 35  	     column_to_left(j) := current_column(j);
 36  	   END LOOP;
 37  	 END LOOP;
 38    END IF;
 39    RETURN current_column(v_length_of_target);
 40  END ld;
 41  /

Function created.

scott@ORA92> SELECT your_column,
  2  	    ld (UPPER (your_column), 'SQL SERVER HELP REQUEST') lev_dist
  3  FROM   your_table
  4  ORDER  BY lev_dist
  5  /

YOUR_COLUMN                                                    LEV_DIST
------------------------------------------------------------ ----------
SQL server help request                                               0
SQL Server Help Required                                              4
SQL Server Help needed                                                5
Help request SQL Server                                              16
Re: Re: SQL Server help   request - followup                         21
Help requested regarding SQL Server                                  26
Requesting for some Help with SQL Server                             29
Some other request for help with something else                      32
FW:  forwarded for your attention:  SQL Server help request          36

9 rows selected.

scott@ORA92> -- limit by distance:
scott@ORA92> SELECT *
  2  FROM   your_table
  3  WHERE  ld (UPPER (your_column), 'SQL SERVER HELP REQUEST') <= 5
  4  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
SQL Server Help Required
SQL Server Help needed

scott@ORA92> -- limit by number of rows:
scott@ORA92> SELECT your_column
  2  FROM   (SELECT your_column,
  3  		    ld (UPPER (your_column), 'SQL SERVER HELP REQUEST') lev_dist
  4  	     FROM   your_table
  5  	     ORDER  BY lev_dist)
  6  WHERE  ROWNUM <= 8
  7  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
SQL Server Help Required
SQL Server Help needed
Help request SQL Server
Re: Re: SQL Server help   request - followup
Help requested regarding SQL Server
Requesting for some Help with SQL Server
Some other request for help with something else

8 rows selected.

scott@ORA92> -- limit by distance or rows:
scott@ORA92> SELECT your_column
  2  FROM   (SELECT your_column,
  3  		    ld (UPPER (your_column), 'SQL SERVER HELP REQUEST') lev_dist
  4  	     FROM   your_table
  5  	     ORDER  BY lev_dist)
  6  WHERE  ROWNUM <= 8
  7  OR     lev_dist <= 5
  8  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
SQL Server Help Required
SQL Server Help needed
Help request SQL Server
Re: Re: SQL Server help   request - followup
Help requested regarding SQL Server
Requesting for some Help with SQL Server
Some other request for help with something else

8 rows selected.

scott@ORA92> -- limit by distance and rows:
scott@ORA92> SELECT your_column
  2  FROM   (SELECT your_column,
  3  		    ld (UPPER (your_column), 'SQL SERVER HELP REQUEST') lev_dist
  4  	     FROM   your_table
  5  	     WHERE  ld (UPPER (your_column), 'SQL SERVER HELP REQUEST')
<= 5
  6  	     ORDER  BY lev_dist)
  7  WHERE  ROWNUM <= 8
  8  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
SQL Server Help Required
SQL Server Help needed

scott@ORA92>
scott@ORA92> -- example of combination of methods:
scott@ORA92> SELECT * FROM your_table
  2  WHERE  UPPER (your_column) LIKE '%SQL SERVER%'
  3  ORDER  BY ld (UPPER (your_column), 'SQL SERVER HELP REQUEST')
  4  /

YOUR_COLUMN
------------------------------------------------------------
SQL server help request
SQL Server Help Required
SQL Server Help needed
Help request SQL Server
Re: Re: SQL Server help   request - followup
Help requested regarding SQL Server
Requesting for some Help with SQL Server
FW:  forwarded for your attention:  SQL Server help request

8 rows selected.

