Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 


Re: Algorithm or ideas wanted for creative text parsing

From: Giovanni Cuccu <giovanni.cuccu_at_gmail.com>
Date: Tue, 11 Apr 2006 09:17:39 +0200
Message-ID: <23e0d1170604110017w755df54by5059395f9470e1f1@mail.gmail.com>


Hi Raj,

   If the exception list grows, I think you could improve your procedure by putting the domain exceptions in a table and looping over that table to find the domain you need.
The domain table could have this structure:

create table DOMAINS (
   LIKE_PATTERN varchar2(102),
   DOMAIN_NAME varchar2(100),
   DOMAIN_ORDER number
);

with a function like

create or replace function getdomain(full_domain in varchar2) return varchar2 is
begin
  -- try the patterns in DOMAIN_ORDER; the first match wins
  for domain_match in (select * from DOMAINS order by DOMAIN_ORDER) loop
    if full_domain LIKE domain_match.like_pattern then
      return domain_match.domain_name;
    end if;
  end loop;
  return null;  -- no exception pattern matched
end getdomain;

With this you could manage your exceptions in one place and even try the more probable ones first by giving them a lower DOMAIN_ORDER (which could reduce resource usage). This is not production code, but I think it gives you an idea of my suggestion. Hope it helps,

   Giovanni
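
To make the table-driven lookup concrete, here is a minimal sketch of how DOMAINS might be seeded and queried. The patterns come from the CASE expression quoted below; the DOMAIN_ORDER values and the sample host names are assumptions made up for illustration.

```sql
-- Hypothetical seed rows: one row per exception, lower DOMAIN_ORDER is tried first
insert into DOMAINS (LIKE_PATTERN, DOMAIN_NAME, DOMAIN_ORDER) values ('%.imageshack.us', 'imageshack.us', 10);
insert into DOMAINS (LIKE_PATTERN, DOMAIN_NAME, DOMAIN_ORDER) values ('%.echo.cx', 'echo.cx', 20);
insert into DOMAINS (LIKE_PATTERN, DOMAIN_NAME, DOMAIN_ORDER) values ('%.exs.cx', 'exs.cx', 30);
insert into DOMAINS (LIKE_PATTERN, DOMAIN_NAME, DOMAIN_ORDER) values ('%.bigoo.ws', 'bigoo.ws', 40);
commit;

-- A host that matches an exception pattern collapses to the stored domain name
select getdomain('img1.imageshack.us') as collapsed_domain from dual;

-- getdomain returns null when nothing matches, so the caller can fall back
-- to the generic "last two or three pieces" rule
select nvl(getdomain('www.example.com'), 'no exception matched') as collapsed_domain from dual;
```

Adding a new exception then becomes an insert into DOMAINS rather than a change to the trigger code.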

On 4/10/06, rjamya <rjamya_at_gmail.com> wrote:
> So far I have done ....
>
> CASE
> WHEN :new.domain_name LIKE '%.imageshack.us' THEN 'imageshack.us' -- we need to collapse these
> WHEN :new.domain_name LIKE '%.adtech.de' THEN 'adtech.de' -- we need to collapse these
> WHEN domain_name LIKE '%.echo.cx' THEN 'echo.cx' -- we need to collapse these
> WHEN domain_name LIKE '%.exs.cx' THEN 'exs.cx' -- we need to collapse these
> WHEN domain_name LIKE '%.bigoo.ws' THEN 'bigoo.ws' -- we need to collapse these
> -- check if the domain name is just numeric
> WHEN regexp_like(trim(domain_name),'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$')
> THEN domain_name
> -- if it is .com/.name/.org etc get last two pieces
> WHEN length(szlastpiece) > 2 THEN substr(domain_name,pos1+1)
> -- if last piece is a country tld, then get the last 3 pieces
> ELSE substr(domain_name,pos2+1)
> END
>
> szlastpiece is the piece after the last '.' in the domain name, and
> pos1 and pos2 are the positions of the second-from-last and
> third-from-last '.' respectively.
>
> Any gotcha's, missed rules are welcome. The results so far are pretty
> good ... a sample query on 4m+ rows so far shows reliable output.
> there will always be caveats, but I am happy with 99% "hit ratio" ...
> though any improvements are always welcome.
>
> uh oh ... I said the H word ... I am marked now ...
> Raj
>
> On 4/10/06, Gus Spier <gspier_at_chiliad.com> wrote:
> > Raj,
> > It looks to me like you're going to have to do some rule based ETL. Start
> > by parsing your URIs on the dots into varrays and then examining the data ....
> > if seg. first == 'www' and seg.last == "com" then harvest seg.length-1 ...
> >
> > if seg.last='uk' and seg.length-1='co' then harvest seg.length-2 ...
> >
> > et cetera ad endless nauseam.
> >
> > But I don't think you can build a script that will reliably trundle out
> > there and correctly get what you want first try.
> >
> > Good luck
> > Gus
> --
> ----------------------------------------------
> Got RAC?
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>
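
For reference, the szlastpiece, pos1 and pos2 values described in Raj's message could be derived with instr searching backwards from the end of the string. This is only a sketch under assumptions: the variable sizes and the sample name img42.example.co.uk are hypothetical, and Raj's actual code is not shown in the thread.

```sql
declare
  domain_name  varchar2(255) := 'img42.example.co.uk';  -- hypothetical sample value
  szlastpiece  varchar2(255);
  pos1         pls_integer;
  pos2         pls_integer;
begin
  -- a negative start position makes instr search backwards from the end
  szlastpiece := substr(domain_name, instr(domain_name, '.', -1) + 1);
  pos1 := instr(domain_name, '.', -1, 2);  -- second '.' from the end, 0 if absent
  pos2 := instr(domain_name, '.', -1, 3);  -- third '.' from the end, 0 if absent

  if length(szlastpiece) > 2 then
    -- generic TLD: keep the last two pieces
    dbms_output.put_line(substr(domain_name, pos1 + 1));
  else
    -- two-letter country TLD: keep the last three pieces
    dbms_output.put_line(substr(domain_name, pos2 + 1));
  end if;
end;
/
```

Because instr returns 0 when the requested '.' does not exist, substr(domain_name, 0 + 1) then yields the whole name, so short names such as example.com pass through unchanged.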

--
--------------------------------------------------------------------
Another free oracle resource profiler
http://sourceforge.net/projects/oraresprof/
Now version 0.9
--
http://www.freelists.org/webpage/oracle-l
Received on Tue Apr 11 2006 - 02:17:39 CDT
