Re: Searching addresses

From: Steve M <steve.mcdaniels_at_vuinteractive.com>
Date: Thu, 14 Mar 2002 13:57:54 -0800
Message-ID: <a6r6bm$pvo$1_at_spiney.sierra.com>


If you choose to write your own stuff, you can subscribe to the U.S. Postal Service.
For a small fee, they will supply you with CD media containing all valid steets, city, states, zips, etc
on a monthly basis (new streets are continually added/updated/deleted).

As an aside:
I did this, both on MS-SQL and Oracle (each on NT)

There are really only two parts:

  1. Address parsing and standardization (changing SOUTH to S, Avenue to AVE, etc) -- non-trivial
  2. Validating both street/city and house number/appartment number range to obtain ZIP+4 (trivial)

If your work is good enough, you can petition the USPS for CASS certification
(CASS certification is, as Martha would say: "a good thing")

Having gone through all of that, we now use First Logic's data hygiene, address
standardization (CASS), merge, and mover identification (18-month NCOA)

"Andy Bowles" <andy.bowles_at_virgin.net> wrote in message news:%zPh8.604$kF6.63521_at_news-binary.blueyonder.co.uk...
>
> > "spare_brain" <spare_brain_at_yahoo.com> wrote in message
> > news:a5k366$mtg19_at_kcweb01.netnews.att.com...
> > > Folks,
> > >
> > > I am trying to find examples of SQL snippets that shows an address
 search
 as
> > > follows. Such searches are very common in map searching.
> > >
> > > For example, the "Address_Line" attribute has entries of the form
> > >
> > > 100 N. Columbia Pky
> > > 100 N. Columbia Parkway
> > > 100 North Columbia Pky
> > > 100 North Columbia Parkway
> > > 222 North Columbia Pky
> > >
> > > Now, my search query should be such that when the user enters "N Col
 Pky",
> > > it should fetch all of these entries.
> > >
> > > Could someone provide some pointers, and/or other leads into finding
 similar
> > > examples?
> > >
> > > Any help would be great and save a SQLed soul.....so Thanks!
> > >
> > > sb
> > > -----------------------------
> > > "..Life is like a database query, you never know what'cha gonna get.."
> >
>
> "Steve M" <steve.mcdaniels_at_vuinteractive.com> wrote in message
> news:a63bk3$6i6$1_at_spiney.sierra.com...
> > tip:
> > 1) BUY some good software which is U.S. Postal Service certified
 (CASS).
> > 2) Standardize all addresses using this software prior to insertion into
 the
> > database
> > for example, the human eye can see that your 100 N COLUMBIA PKWY
> > addresses are really all the same logical address, so you shouldn't
 have
> > four records
> >
> > There are way too many variations on the way people submit their
 addresses,
> > particularly with PO Box and PMBs
> > Not all addresses are valid / complete -- good software can correct some
 of
> > this for you.
>
> Standardising the data outside the database is certainly best, either
 using
> a proprietory product or by writing something yourself which replaces any
> known abbrevations, standardises the format of initials and removes any
> unwanted punctuation. Writing the code for this is not particularly
> difficult; the hard part is thinking of all the different ways that people
> might abbreviate or otherwise mangle the words that appear in addresses,
 and
> making sure that your application doesn't itself make things worse (for
> instance, by changing "N Rockerfeller House" to "North Rockerfeller
 House").
> Obviously a home-made application like this won't add missing address
> elements, whereas a proprietory product probably will.
>
> Next best is to clean it up in situ. You could write a procedure that
> something like what I describe above, and run it from time to time.
>
> Finally, you could try something like this:
>
> create or replace function search_string(in_string in varchar2)
> return varchar2
> is
>
> out_string varchar2(500);
> current varchar2(1);
> next varchar2(1);
> offset int;
>
> begin
> out_string := '%';
>
> for offset in 1 .. length(in_string) loop
> current := substr(in_string, offset, 1);
> next := substr(in_string, offset + 1, 1);
>
> if current between 'A' and 'Z' or
> current between 'a' and 'z' then
> out_string := out_string||current||'%';
> elsif current between '0' and '9' then
> out_string := out_string||current;
>
> if next not between '0' and '9' then
> out_string := out_string||'%';
> end if;
> end if;
>
> end loop;
>
> return out_string;
>
> end;
> /
>
> Now search_string('100 N. Columbia Pky, W.Side') produces
> "%100%N%C%o%l%u%m%b%i%a%P%k%y%W%S%i%d%e%". Hence you could write queries
> containing something like this:
>
> where external_address_field_1 like

 search_string(database_address_field_1)
> or external_address_field_1 like

 search_string(database_address_field_2)
> or database_address_field_1 like

 search_string(external_address_field_1)
> or database_address_field_2 like

 search_string(external_address_field_1)
>
> This approach will work if either of the address lines is well formed, but
> not if each contains a different abbreviation. Obviously, it is likely to
> produce a fair number of false positives, and you will need to create
> indexes on each address field and on search_string(each address field).
>
> Doing something like this may be necessary even when the data on your
> database is clean, because what you're searching for may well not be.
>
> There are lots of possible improvements to the above method. For
 instance:
> - Make the comparisons case-insensitive.
> - Make search_string() replace common abbreviations.
> - Make search_string() put wildcards between letters only when the word is
> fairly short, or consists only of capitals or only of consonants.
>
> Somewhere around here one should probably abandon the idea of using PL/SQL
> for this purpose.
>
> Andy Bowles
>
>
Received on Thu Mar 14 2002 - 22:57:54 CET

Original text of this message