Re: Searching addresses

From: Andy Bowles <andy.bowles_at_virgin.net>
Date: Thu, 07 Mar 2002 19:54:38 GMT
Message-ID: <%zPh8.604$kF6.63521_at_news-binary.blueyonder.co.uk>


> "spare_brain" <spare_brain_at_yahoo.com> wrote in message
> news:a5k366$mtg19_at_kcweb01.netnews.att.com...
> > Folks,
> >
> > I am trying to find examples of SQL snippets that shows an address
 search
 as
> > follows. Such searches are very common in map searching.
> >
> > For example, the "Address_Line" attribute has entries of the form
> >
> > 100 N. Columbia Pky
> > 100 N. Columbia Parkway
> > 100 North Columbia Pky
> > 100 North Columbia Parkway
> > 222 North Columbia Pky
> >
> > Now, my search query should be such that when the user enters "N Col
 Pky",
> > it should fetch all of these entries.
> >
> > Could someone provide some pointers, and/or other leads into finding
 similar
> > examples?
> >
> > Any help would be great and save a SQLed soul.....so Thanks!
> >
> > sb
> > -----------------------------
> > "..Life is like a database query, you never know what'cha gonna get.."
>

"Steve M" <steve.mcdaniels_at_vuinteractive.com> wrote in message news:a63bk3$6i6$1_at_spiney.sierra.com...
> tip:
> 1) BUY some good software which is U.S. Postal Service certified (CASS).
> 2) Standardize all addresses using this software prior to insertion into
 the
> database
> for example, the human eye can see that your 100 N COLUMBIA PKWY
> addresses are really all the same logical address, so you shouldn't
 have
> four records
>
> There are way too many variations on the way people submit their
 addresses,
> particularly with PO Box and PMBs
> Not all addresses are valid / complete -- good software can correct some
 of
> this for you.

Standardising the data outside the database is certainly best, either using a proprietory product or by writing something yourself which replaces any known abbrevations, standardises the format of initials and removes any unwanted punctuation. Writing the code for this is not particularly difficult; the hard part is thinking of all the different ways that people might abbreviate or otherwise mangle the words that appear in addresses, and making sure that your application doesn't itself make things worse (for instance, by changing "N Rockerfeller House" to "North Rockerfeller House"). Obviously a home-made application like this won't add missing address elements, whereas a proprietory product probably will.

Next best is to clean it up in situ. You could write a procedure that something like what I describe above, and run it from time to time.

Finally, you could try something like this:

create or replace function search_string(in_string in varchar2) return varchar2
is

out_string varchar2(500);
current varchar2(1);
next varchar2(1);
offset int;

begin
out_string := '%';

for offset in 1 .. length(in_string) loop   current := substr(in_string, offset, 1);   next := substr(in_string, offset + 1, 1);

  if current between 'A' and 'Z' or

     current between 'a' and 'z' then
    out_string := out_string||current||'%';   elsif current between '0' and '9' then     out_string := out_string||current;

    if next not between '0' and '9' then       out_string := out_string||'%';
    end if;
  end if;

end loop;

return out_string;

end;
/

Now search_string('100 N. Columbia Pky, W.Side') produces "%100%N%C%o%l%u%m%b%i%a%P%k%y%W%S%i%d%e%". Hence you could write queries containing something like this:

where external_address_field_1 like search_string(database_address_field_1)
   or external_address_field_1 like search_string(database_address_field_2)
   or database_address_field_1 like search_string(external_address_field_1)
   or database_address_field_2 like search_string(external_address_field_1)

This approach will work if either of the address lines is well formed, but not if each contains a different abbreviation. Obviously, it is likely to produce a fair number of false positives, and you will need to create indexes on each address field and on search_string(each address field).

Doing something like this may be necessary even when the data on your database is clean, because what you're searching for may well not be.

There are lots of possible improvements to the above method. For instance:

- Make the comparisons case-insensitive.
- Make search_string() replace common abbreviations.
- Make search_string() put wildcards between letters only when the word is
fairly short, or consists only of capitals or only of consonants.

Somewhere around here one should probably abandon the idea of using PL/SQL for this purpose.

Andy Bowles Received on Thu Mar 07 2002 - 20:54:38 CET

Original text of this message