Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 


Re: Fuzzy search

From: Al Reid <areidjr_at_reidHyphenhome.com>
Date: Thu, 22 Jan 2004 21:11:01 GMT
Message-ID: <F%WPb.16288$LM4.10392@nwrdny03.gnilink.net>


<ctcgag_at_hotmail.com> wrote in message
news:20040122150457.752$Go_at_newsreader.com...

> "Al Reid" <areidjr_at_nospamhotmail.com> wrote:

> > > > > >
> > > > > > They want to be able to retrieve the record if they type in any of
> > > > > > the following:
> > > > > >
> > > > > > A. B. Corp
> > > > > > A.B. Corp
> > > > > > AB Corp
> > > > > > A.B Corp, etc.
> > >
> > > uh, that's pretty fuzzy. What happened to the "C" in "A B C Corp"? Do
> > > they want to also find "NBC", "ABC", and "CBS"? If they want "George
> > > Washington" but they accidentally spell it "Thomas Jefferson", do they
> > > want you to magically correct that, also?
> > >
> >
> > Sorry, my bad. I meant the entry in the database is 'A B Corp'
> > I guess I was a little frustrated when I posted this.
>
> Ah, that may be much less fuzzy, then.  How about an FBI (function-based
> index) using a canonicalization function which removes all non-letter
> characters (and converts the rest to upper case while it is at it)?  Of
> course, you'd still have to handle (or forbid) situations where the name
> (after transformation) is non-unique.  Then all the above would simply
> become "ABCORP".
>
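
A minimal sketch of the canonicalization idea quoted above, written in Python purely for illustration (it is not Oracle code, and in the database the same logic would have to live in a function that a function-based index can use). The names `canonicalize` and `build_index` and the sample rows are hypothetical, and the sketch assumes only the ASCII letters A-Z count as letters.

```python
import re
from collections import defaultdict

def canonicalize(name: str) -> str:
    """Drop every non-letter character and upper-case what is left.

    Assumption: only ASCII letters are treated as letters.
    """
    return re.sub(r"[^A-Za-z]", "", name).upper()

def build_index(names):
    """Map each canonical key to the list of stored names that produce it."""
    index = defaultdict(list)
    for name in names:
        index[canonicalize(name)].append(name)
    return index

# Made-up sample rows; the last one deliberately collides with the first.
stored = ["A B Corp", "Acme Inc", "AB Corp."]
index = build_index(stored)

# Every spelling from the thread reduces to the same key.
variants = ["A. B. Corp", "A.B. Corp", "AB Corp", "A.B Corp"]
keys = {canonicalize(v) for v in variants}

# Keys shared by more than one stored name are the non-unique cases
# that would have to be handled or forbidden.
collisions = {k: v for k, v in index.items() if len(v) > 1}
```

A lookup then canonicalizes the user's input and searches on the key, so the search itself is an exact match rather than a fuzzy one.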

> >
> > > > > > I currently use SPs to retrieve the records from a VB program.
> > > > > > Is there something I could add to the SP to provide this
> > > > > > functionality without severely affecting performance?
> > >
> > > Its effect on performance would depend on how large the customer table
> > > is. For some systems, doing an FTS (full table scan) of the customer
> > > table 10 times a minute would have no meaningful impact. For others, it
> > > would be fatal. Strictly speaking, it may not have to do an FTS (for
> > > example, if you always insist that at the least the first letter is not
> > > fuzzy), but I think that's a good estimate to use for performance impact.
> > >
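
The remark about keeping the first letter exact can be illustrated with a small Python sketch: if names are grouped by their first letter, a fuzzy comparison only has to look at one group instead of every row. The helper names and sample data are hypothetical, and an in-memory dictionary merely stands in for whatever index the database would use.

```python
from collections import defaultdict

def first_letter(name: str) -> str:
    """Return the first alphabetic character, upper-cased ('' if none)."""
    for ch in name:
        if ch.isalpha():
            return ch.upper()
    return ""

def bucket_by_first_letter(names):
    """Group stored names by their (non-fuzzy) first letter."""
    buckets = defaultdict(list)
    for name in names:
        buckets[first_letter(name)].append(name)
    return buckets

def candidates(buckets, query):
    """Only names sharing the query's first letter need a fuzzy comparison."""
    return buckets.get(first_letter(query), [])

stored = ["A B Corp", "Acme Inc", "Baker Ltd", "Zeta LLC"]  # made-up rows
buckets = bucket_by_first_letter(stored)
subset = candidates(buckets, "A.B. Corp")
```

How much this saves depends on how evenly names are spread across first letters, so treating the cost as a full scan remains the safer estimate.
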
> >
> > There are currently 626000 customers in the table.
>
> If using a canonicalization function is good enough for them,
> then the size doesn't really matter (except to the extent that name
> collisions occur).  If they want something fancier, like finding the
> minimum Levenshtein edit distance, then with a table that size you have
> your work cut out for you.
>
>
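
For the "something fancier" case, here is a hedged Python sketch of the standard dynamic-programming Levenshtein edit distance, plus a naive nearest-name search. The function names and sample rows are hypothetical. Note that `closest` compares the query against every stored name, which is exactly why a table of 626000 rows makes this approach expensive.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest(query, names):
    """Naive search: one distance computation per stored name."""
    return min(names, key=lambda n: levenshtein(query.upper(), n.upper()))

stored = ["A B Corp", "Acme Inc", "Zeta LLC"]  # made-up rows
best = closest("A.B Corp", stored)
```
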

Thanks, I will explore the canonicalization function and see where it leads.

>
> Xho
>
Received on Thu Jan 22 2004 - 15:11:01 CST
