Re: Algorithm or ideas wanted for creative text parsing

From: rjamya <rjamya_at_gmail.com>
Date: Mon, 10 Apr 2006 13:34:43 -0400
Message-ID: <9177895d0604101034t7ec86cb3g63f538aa9e77b933@mail.gmail.com>

Thanks, SF and all.

Maybe here is what I can do:

If the domain is numeric, take it as it is.
If the TLD (i.e. the last piece) is 3 or more characters, take the last 2 pieces (this covers com, org, edu, name, info, museum etc.).
If the last piece is 2 characters (most likely a ccTLD), take the last 3 pieces (e.g. il, br, ca, uk).
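
The three rules above can be sketched roughly as follows. This is only an illustration in Python rather than SQL; the function name registered_domain and the lower-casing and trailing-dot handling are assumptions, not part of the original idea.

```python
def registered_domain(host: str) -> str:
    """Apply the three length-based rules to a host name (illustrative sketch)."""
    # Assumption: normalise case and drop a trailing dot before splitting.
    host = host.strip().rstrip(".").lower()
    pieces = host.split(".")

    # Rule 1: a purely numeric "domain" (e.g. an IPv4 address) is kept whole.
    if all(p.isdigit() for p in pieces):
        return host

    # Rule 3: a 2-character last piece is most likely a ccTLD, keep 3 pieces.
    # Rule 2: otherwise (3 or more characters), keep the last 2 pieces.
    keep = 3 if len(pieces[-1]) == 2 else 2

    # Slicing returns fewer pieces when the host is shorter than `keep`.
    return ".".join(pieces[-keep:])


if __name__ == "__main__":
    for h in ("www.example.com", "mail.example.co.uk", "192.168.1.10", "www.myname.ca"):
        print(h, "->", registered_domain(h))
```

Note that www.myname.ca comes back unchanged, because the 2-character rule always keeps three pieces; this is the .ca case raised in the quoted message below.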

Hmm, this looks promising. Am I missing anything?

Raj

On 4/10/06, Stephane Faroult <sfaroult_at_roughsea.com> wrote:
> Raj,
>
> I did something similar at one time and didn't find anything
> cleverer than storing somewhere how many "segments" are significant for
> one given substr(your_stuff, instr(your_stuff, '.', -1, 1) + 1).
> For instance, with a .com, .net or .edu you just need the previous
> piece; for a .uk or a .sg you need the two previous pieces. But it would
> be too easy if it were that simple, because for .ca you can have big
> companies that are myname.ca or smaller ones that are monnom.qc.ca. Same
> story with .us, often (but not always) preceded by a state code, or with
> .fr because you can have generic stuff (such as .gouv) preceding the
> termination.
>
> Brace yourself for CASE clauses of death in your statements ...
>
> HTH
> Stéphane Faroult
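
The stored-lookup idea in the quoted message might look something like the sketch below, written in Python for illustration only. The SECOND_LEVEL table and the function name significant_part are hypothetical, and the table entries are a tiny, incomplete sample; a real table would have to be maintained per TLD.

```python
# Hypothetical, incomplete sample: for each TLD, the second-level labels that
# count as part of the suffix (so one more piece is significant).
SECOND_LEVEL = {
    "uk": {"co", "org", "ac"},
    "ca": {"qc", "on"},
    "fr": {"gouv"},
    "us": {"ca", "ny"},
}


def significant_part(host: str) -> str:
    """Keep 2 pieces by default, 3 when the second-level label is a known one."""
    pieces = host.strip().rstrip(".").lower().split(".")

    # Numeric hosts (e.g. IPv4 addresses) are returned whole.
    if all(p.isdigit() for p in pieces):
        return ".".join(pieces)

    tld = pieces[-1]
    keep = 2
    # Only widen to 3 pieces when there is a label in front of the known suffix.
    if len(pieces) >= 3 and pieces[-2] in SECOND_LEVEL.get(tld, set()):
        keep = 3
    return ".".join(pieces[-keep:])


if __name__ == "__main__":
    for h in ("www.myname.ca", "www.monnom.qc.ca", "www.example.co.uk"):
        print(h, "->", significant_part(h))
```

With this table, myname.ca and monnom.qc.ca are both handled, at the cost of keeping the table up to date.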

--
http://www.freelists.org/webpage/oracle-l

Received on Mon Apr 10 2006 - 12:34:43 CDT