Date: 23 Nov 2004 12:35:22 -0800
[Quoted] [Quoted] I have file that contains PROTEIN sequences in it. I will show the [Quoted] format below. It has about 2million records in it. I wanted to know if someone can suggest a sqlldr control file that would help in uploading the data that I need to upload. I don't want all the data from this file but only some data. I tried to figure out on my own but i couldn't do it..
First look at the example of this file: Looks complex but is very simple.
>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYY AQSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDIS GPVATGCLATGFCLPPRPSRGLINLEKL
>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain
ABLHOOJIHOIHOIDFIHDFMGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQT PGKGLEWVSIISAGGSTYYAPSVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMIT VTSATPSPPTVFPLMQACCSVDVTGPSATGCLATEF
>gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain
..........so on.. upto 2million records.
There two things in this file for each record.
One is Header: that starts with ">"
Two is SEQUENCE: lines after header till the next header (letters in CAPS). The next record again start with a ">". and so on..
I am interested to pick only two fields out of each records:
1. GI number: that is the number between ">gi|" and "|emb|" eg:2695851
(for firt record).
2. Sequence: example, line numbers 2,3, and 4 of the first record. MGILTA......NLEKL Can anyone write an sqlldr control file that can extract this info and put in the following table:
create table sequences
gi_number NUMBER NOT NULL,
sequence CLOB NOT NULL
Please notice the second column is a clob. because the sequence might go beyond 4000 characters sometimes so i can't fit it into varchar.
I would appreciate any help and sorry if its not the right place to post such messages. Please let me know where can post this message if not here.
Thank you very much,
Dina Received on Tue Nov 23 2004 - 21:35:22 CET