Helo all, I wanted to parse aEMBL format like file to fasta. i cannot use bioperl because this is not complete EMBL format. so please suggest me how to get this done..
ID US74811111-0005
OO giensis
OS giensis
SN US74811111
PT I-003, a gene and methods for its use
PA NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR 20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE US200304985AN 20070765494
P1 Compositions and methods and seeds are provided.
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//
The output should be in fasta format which consists of lines starting with ID, PT, PA and Sequence. "//" the two slashes are dividing lines between two EMBL genes.
>US74811111-0005 ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
Like this i have 50,000 sequences in a single file which should be converted to fasta format
this is not printing the sequence at the end.. thanks for the reply
I've added a 'END' close in the script.
I've added a 'END' statement in the script.
works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example..
"LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"
works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example.. " LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"
Can you use this script to process multiple input files and output to multiple files such as awk -f file.awk *.txt >> *.fasta