Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> Re: Algorithm or ideas wanted for creative text parsing

Re: Algorithm or ideas wanted for creative text parsing

From: rjamya <rjamya_at_gmail.com>
Date: Mon, 10 Apr 2006 15:46:19 -0400
Message-ID: <9177895d0604101246j18806ed6nbda6082e8051d6d5@mail.gmail.com>


So far I have done ....

CASE
WHEN :new.domain_name LIKE '%.imageshack.us' THEN 'imageshack.us' -- we need to collapse these
WHEN :new.domain_name LIKE '%.adtexh.de' THEN 'adtech.de' -- we need to collapse these
WHEN domain_name LIKE '%.echo.cx' THEN 'echo.cx' -- we need to collapse these

WHEN domain_name LIKE '%.exs.cx' THEN 'exs.cx'     -- we need to collapse these
WHEN domain_name LIKE '%.bigoo.ws' THEN 'bigoo.ws'     -- we need to
collapse these
-- check if the domain name is just numeric WHEN regexp_like(trim(domain_name),'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$') THEN domain_name
-- if it is .com/.name/.org etc get last two pieces WHEN length(szlastpiece) > 2 THEN substrdomain_name,pos1+1) -- if last piece is a country tld, then get the last 3 pieces ELSE substr(domain_name,pos2+1)
END szLast piece is the piece after the last '.' in the domain name and pos1, pos2 refer to positions of '.' (second from last and 3rd from last respectively).

Any gotcha's, missed rules are welcome. The results so far are pretty good ... a sample query on 4m+ rows so far shows reliable output. there will always be caveats, but I am happy with 99% "hit ratio" ... though any improvements are always welcome.

uh oh ... I said the H word ... I am marked now ... Raj

On 4/10/06, Gus Spier <gspier_at_chiliad.com> wrote:
> Raj,
> It looks to me like you're going to have to do some rule based ETL. Start
> by parsing your URIs on the dots into varrays and then examinining data ....
> if seg. first == 'www' and seg.last == "com" then harvest seg.length-1 ...
>
> if seg.last='uk' and seg.length-1='co' then harvest seg.length-2 ...
>
> et cetera ad endless nauseam.
>
> But I don't think you can build a script that will reliably trundle out
> there and correctly get what you want first try.
>
> Good luck
> Gus

--
----------------------------------------------
Got RAC?
--
http://www.freelists.org/webpage/oracle-l
Received on Mon Apr 10 2006 - 14:46:19 CDT

Original text of this message

HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US