Re: Algorithm or ideas wanted for creative text parsing

From: Stephane Faroult <sfaroult_at_roughsea.com>
Date: Mon, 10 Apr 2006 19:37:15 +0200
Message-ID: <443A97CB.7050406@roughsea.com>


Raj,

    I did something similar at one time and didn't find anything cleverer than storing somewhere how many "segments" are significant for a given top-level suffix, that is, for a given substr(your_stuff, instr(your_stuff, '.', -1, 1) + 1). For instance, with a .com, .net or .edu you just need the previous piece; for a .uk or a .sg you need the two previous pieces. But it would be too easy if it were that simple: for .ca you can have big companies that are myname.ca and smaller ones that are monnom.qc.ca. Same story with .us, often (but not always) preceded by a state code, or with .fr, where generic labels (such as .gouv) can precede the suffix.

Brace yourself for CASE clauses of death in your statements ...
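
As a rough illustration of the "segments per suffix" idea above, here is a minimal sketch in Python (not SQL, just to show the logic). Everything named in it is an assumption made for the example: the table contents, the exception labels, and the function name registered_domain are hypothetical and cover only the suffixes mentioned in this thread. In the database, the same data could presumably live in a small lookup table joined on the substr(...) expression.

```python
# Hypothetical lookup: TLD -> number of labels to keep in front of it.
# Values are illustrative assumptions, not an authoritative registry.
SEGMENTS_BEFORE_TLD = {
    "com": 1, "net": 1, "edu": 1, "ru": 1,
    "us": 1,            # matches the fl.us / il.us / ga.us output requested
    "uk": 2, "sg": 2, "br": 2,
}

# Hypothetical exceptions: if the label just before the TLD is listed here,
# keep one extra label (monnom.qc.ca versus myname.ca).
SECOND_LEVEL_LABELS = {
    "ca": {"qc"},
    "fr": {"gouv"},
}


def registered_domain(fqdn, default_segments=1):
    """Return the trailing part of fqdn judged significant by the tables."""
    labels = fqdn.lower().rstrip(".").split(".")
    tld = labels[-1]
    keep = SEGMENTS_BEFORE_TLD.get(tld, default_segments)
    if len(labels) >= 2 and labels[-2] in SECOND_LEVEL_LABELS.get(tld, set()):
        keep += 1
    # Slicing past the start of the list simply returns the whole name.
    return ".".join(labels[-(keep + 1):])


sample = [
    "a836.v8519e.c8519.g.vm.akamaistream.net",
    "www.celhs.osceola.k12.fl.us",
    "www.the-simpsons.hpg.ig.com.br",
    "www.rails4days.pwp.blueyonder.co.uk",
    "ax.phobos.apple.com",
    "ax.phobos.apple.com",
]

# A set gives the DISTINCT behaviour asked for in the question.
distinct_domains = sorted({registered_domain(name) for name in sample})
print(distinct_domains)
```

Keeping the exceptions as data rather than as CASE branches means a new suffix only needs a new row, though the tables still have to be maintained by hand.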

HTH

Stéphane Faroult

rjamya wrote:

>Basically I am looking to isolate just the (distinct) domain name from
>fully qualified domain names that you'd normally see in web-surfing.
>
>I am working on a couple of techniques, but it gets complicated since
>TLDs differ in format and there is only so much you can do with
>substr().
>
>sample data ...
>
>a836.v8519e.c8519.g.vm.akamaistream.net
>a705.l1923962123.c19239.n.lm.akamaistream.net
>db.c7.bf.a0.top.list.ru
>a1657.l1923962104.c19239.n.lm.akamaistream.net
>a1181.v21080b.c21080.g.vm.akamaistream.net
>dl1.games.vip.scd.yahoo.com
>lcp.mud.us.music.yahoo.com
>www.celhs.osceola.k12.fl.us
>www.celhs.osceola.k12.fl.us
>www.celhs.osceola.k12.fl.us
>w.s0.gc.sj.ipixmedia.com
>w.s0.gc.sj.ipixmedia.com
>v.s0.gc.sj.ipixmedia.com
>us.1.p6.webhosting.yahoo.com
>p1.music.vip.sc5.yahoo.com
>lib1.store.vip.sc5.yahoo.com
>www.twingroves.district96.k12.il.us
>www.twingroves.district96.k12.il.us
>www.the-simpsons.hpg.ig.com.br
>www.schools.pinellas.k12.fl.us
>www.rails4days.pwp.blueyonder.co.uk
>www.rails4days.pwp.blueyonder.co.uk
>www.garrp.dhr.state.ga.us
>www.celhs.osceola.k12.fl.us
>www.williamrobertson.pwp.blueyonder.co.uk
>www.williamrobertson.pwp.blueyonder.co.uk
>lcp.mud.us.music.yahoo.com
>c.s0.gc.sj.ipixmedia.com
>c.s0.gc.sj.ipixmedia.com
>ax.phobos.apple.com
>ax.phobos.apple.com
>0982660.1206.feed.yellowpagecity.com
>0982660.1207.feed.yellowpagecity.com
>
>and by some magic the output should be ....
>
>akamaistream.net
>apple.com
>yahoo.com
>fl.us
>ipixmedia.com
>il.us
>ig.com.br
>blueyonder.co.uk
>ga.us
>yellowpagecity.com
>
>Any ideas, thoughts? I'd prefer to do this in SQL if possible, else
>I'd prefer plsql. The data is already in a 10.1.0.4 database.
>
>Thanks in advance
>Raj
>----------------------------------------------
>Got RAC?
>--
>http://www.freelists.org/webpage/oracle-l

Received on Mon Apr 10 2006 - 12:37:15 CDT