Re: SQLLDR question

From: Dinanath <sulakhe_at_mcs.anl.gov>
Date: 24 Nov 2004 12:25:23 -0800
Message-ID: <103eadf.0411241225.63beffc7_at_posting.google.com>


"Access" <idmwarpzone_NOSPAM__at_yahoo.com> wrote in message news:<41a3bc92$0$29504$ba620e4c_at_news.skynet.be>...
> "Dinanath" <sulakhe_at_mcs.anl.gov> wrote in message
> news:103eadf.0411231235.22fc0e2f_at_posting.google.com...
> > Hi,
> > I have file that contains PROTEIN sequences in it. I will show the
> > format below. It has about 2million records in it. I wanted to know if
> > someone can suggest a sqlldr control file that would help in uploading
> > the data that I need to upload. I don't want all the data from this
> > file but only some data. I tried to figure out on my own but i
> > couldn't do it..
> >
> > First look at the example of this file: Looks complex but is very
> > simple.
> >
> ===========================================================================
> > >gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain
> >
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSS
> TYY
> >
> AQSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLS
> DIS
> > GPVATGCLATGFCLPPRPSRGLINLEKL
> > >gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain
> >
> ABLHOOJIHOIHOIDFIHDFMGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWV
> RQT
> >
> PGKGLEWVSIISAGGSTYYAPSVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGT
> MIT
> > VTSATPSPPTVFPLMQACCSVDVTGPSATGCLATEF
> > >gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain
> >
> POYOGHOSIHLDSKHSDLFHMGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWV
> RQA
> > PGKGLEWVSTISYSVNAYYAQASLDHIADFKSFDLADKFLKJADSLFKAFD
> > >gi|....|emb|.........
> > ...............................
> > ..........so on.. upto 2million records.
> > ==========================================================
> >
> > There two things in this file for each record.
> > One is Header: that starts with ">"
> > Two is SEQUENCE: lines after header till the next header (letters in
> > CAPS).
> >
> > The next record again start with a ">". and so on..
> >
> > I am interested to pick only two fields out of each records:
> > 1. GI number: that is the number between ">gi|" and "|emb|" eg:2695851
> > (for firt record).
> > 2. Sequence: example, line numbers 2,3, and 4 of the first record.
> > MGILTA......NLEKL
> >
> > Can anyone write an sqlldr control file that can extract this info and
> > put in the following table:
> >
> > create table sequences
> > (
> > gi_number NUMBER NOT NULL,
> > sequence CLOB NOT NULL
> > );
> >
> > Please notice the second column is a clob. because the sequence might
> > go beyond 4000 characters sometimes so i can't fit it into varchar.
> >
> > I would appreciate any help and sorry if its not the right place to
> > post such messages. Please let me know where can post this message if
> > not here.
> >
> > Thank you very much,
> > Cheers,
> > Dina
>
> Which version of Oracle do you use ?
> Maybe you could try to load the file with the external table feature (Oracle
> 9i and above).
>
> See this page for examples :
> http://www.dbasupport.com/oracle/ora9i/External_Tables9i.shtml
>
>
> HTH
> Matthias Hoys

I am using Oracle 9i. And thanks for the link on external table, i think it would help me.

Thanks,
Dina Received on Wed Nov 24 2004 - 21:25:23 CET

Original text of this message