Re: SQLLDR question

From: Frank van Bortel <fvanbortel_at_netscape.net>
Date: Thu, 25 Nov 2004 14:41:11 +0100
Message-ID: <co232m$d3e$1@news1.zwoll1.ov.home.nl>

Dinanath wrote:
> Hi,
> I have a file that contains PROTEIN sequences. I will show the
> format below. It has about 2 million records in it. I wanted to know if
> someone can suggest a sqlldr control file that would help in uploading
> the data that I need to upload. I don't want all the data from this
> file but only some of it. I tried to figure it out on my own but I
> couldn't do it.
>
> First look at the example of this file: Looks complex but is very
> simple.
> ===========================================================================
> >gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYY
> AQSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDIS
> GPVATGCLATGFCLPPRPSRGLINLEKL
> >gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain
> ABLHOOJIHOIHOIDFIHDFMGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQT
> PGKGLEWVSIISAGGSTYYAPSVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMIT
> VTSATPSPPTVFPLMQACCSVDVTGPSATGCLATEF
> >gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain
> POYOGHOSIHLDSKHSDLFHMGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQA
> PGKGLEWVSTISYSVNAYYAQASLDHIADFKSFDLADKFLKJADSLFKAFD
> >gi|....|emb|.........
> ...............................
> ..........and so on, up to 2 million records.
> ==========================================================
>
> There are two things in this file for each record:
> 1. Header: a line that starts with ">".
> 2. Sequence: the lines after the header up to the next header (letters
> in CAPS).
>
> The next record again starts with a ">", and so on.
>
> I am interested in picking only two fields out of each record:
> 1. GI number: the number between ">gi|" and "|emb|", e.g. 2695847
> (for the first record).
> 2. Sequence: for example, lines 2, 3, and 4 of the first record:
> MGILTA......NLEKL
>
> Can anyone write an sqlldr control file that can extract this info and
> put in the following table:
>
> create table sequences
> (
> gi_number NUMBER NOT NULL,
> sequence CLOB NOT NULL
> );
>
> Please notice that the second column is a CLOB, because the sequence
> might sometimes go beyond 4000 characters, so I can't fit it into a
> VARCHAR.
>
> I would appreciate any help, and sorry if it's not the right place to
> post such messages. Please let me know where I can post this message if
> not here.
>
> Thank you very much,
> Cheers,
> Dina

Just now, you had a comma-delimited file... Do your own homework. Hints,
in addition to the previous answer:
- preprocess your file, using OS utilities like grep.
- look up the SKIP parameter
- look up delimited fields (TERMINATED BY "...")
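
To illustrate the preprocessing hint, here is a minimal sketch in Python
(a scripting language standing in for the OS utilities mentioned above).
It assumes every header looks like ">gi|<digits>|..." as in the quoted
sample, and that a "|" separator is acceptable because sequence lines
contain only capital letters. The names fasta_records and to_delimited
are made up for this example; they are not part of SQL*Loader or any
library.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Matches a header such as
# ">gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain"
# and captures the digits between ">gi|" and the next "|".
GI_HEADER = re.compile(r"^>gi\|(\d+)\|")


def fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (gi_number, sequence) pairs, joining wrapped sequence lines.

    Assumption: a record whose header does not match GI_HEADER is
    skipped together with its sequence lines.
    """
    gi = None
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if gi is not None:
                yield gi, "".join(parts)
            match = GI_HEADER.match(line)
            gi = match.group(1) if match else None
            parts = []
        elif gi is not None:
            parts.append(line)
    # Emit the last record, which has no following header to close it.
    if gi is not None:
        yield gi, "".join(parts)


def to_delimited(lines: Iterable[str], sep: str = "|") -> List[str]:
    """Return one 'gi<sep>sequence' string per record."""
    return [f"{gi}{sep}{seq}" for gi, seq in fasta_records(lines)]
```

Writing each string from to_delimited(open("proteins.fasta")) to its own
line (the file name is hypothetical) gives one record per line with two
delimited fields, which is the shape the hints above point towards. Field
and record length limits for the long sequence column are not covered
here and would still have to be checked in the SQL*Loader documentation.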

-- 

Regards,
Frank van Bortel

Received on Thu Nov 25 2004 - 07:41:11 CST