Parse A Embl Like Format To Fasta Format
6
1
Entering edit mode
13.6 years ago
Empyrean ▴ 170

Helo all, I wanted to parse aEMBL format like file to fasta. i cannot use bioperl because this is not complete EMBL format. so please suggest me how to get this done..

ID  US74811111-0005    
OO  giensis    
OS  giensis    
SN  US74811111    
PT  I-003, a gene and methods for its use
PA  NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI  Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR  20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE  US200304985AN  20070765494
P1  Compositions and methods and seeds are provided. 
    MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//

The output should be in fasta format which consists of lines starting with ID, PT, PA and Sequence. "//" the two slashes are dividing lines between two EMBL genes.

>US74811111-0005 ;  I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE

Like this i have 50,000 sequences in a single file which should be converted to fasta format

perl awk format conversion • 7.3k views
ADD COMMENT
6
Entering edit mode
13.6 years ago

the following awk script should do the job:

/^ID/   {printf(">%s;",$0); next;}
/^(PT|PA)/  {printf(" %s;",$0); next;}
/^\/\// {printf("\n"); next;}
/^    / {printf("\n%s",substr($0,5)); next;}
    {
    /* ignore default */
    }
END   {
    printf("\n");
    }

> awk -f file.awk file.txt
>ID  US74811111-0005    ; PT  I-003, a gene and methods for its use; PA  NIX CORPORATION RESEARCH TRIANGLE PARK, NC;
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
ADD COMMENT
0
Entering edit mode

this is not printing the sequence at the end.. thanks for the reply

ADD REPLY
0
Entering edit mode

I've added a 'END' close in the script.

ADD REPLY
0
Entering edit mode

I've added a 'END' statement in the script.

ADD REPLY
0
Entering edit mode

works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example..

"LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"

ADD REPLY
0
Entering edit mode

works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example.. " LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"

ADD REPLY
0
Entering edit mode

Can you use this script to process multiple input files and output to multiple files such as awk -f file.awk *.txt >> *.fasta

ADD REPLY
2
Entering edit mode
13.6 years ago

This Python script should work? Takes input file as an argument e.g. embllike2fasta.py infile.txt:

import sys
dict = {}
infile = open(sys.argv[1], "r")
outfile = open("outfile.fas", "w")
while 1:
    line = infile.readline()
    if not line:
        break
    parts = line.split(None, 1)
    if len(parts) == 1 and parts[0] == "//":
        outfile.write(">" + dict["ID"] + " ; " + dict["PT"] + " ; " + dict["PA"] + "\n")
        outfile.write(dict["seq"] + "\n")
    elif len(parts) == 1:
        dict["seq"] = parts[0].strip()
    else:
        dict[parts[0]] = parts[1].strip()
outfile.close()
infile.close()

I haven't tested it, and just hacked it up in the browser on my iPad, so it might not work? Plus it's late :-S lol!

Update: Tested it on my machine this morning, made a few quick edits! Very hacky, but it works :D

ADD COMMENT
1
Entering edit mode
13.6 years ago
Alex Richter ▴ 210

How about this? It's remarkably hacky, but should do the trick.

#!/usr/bin/env perl
use strict;
use warnings;
$/ = "//\n";
while (<>) {
    chomp;
    my %record;
    foreach my $line (split /\n/) {
        my ($key, $val) = unpack("A4A*", $line);
        $record{$key} .= $val;
    }
    print ">$record{ID} $record{PT} $record{PA}\n$record{''}\n";
}

The important bits are lines 4:($/ = "//n"), which sets the system to read one record at a time instead of a line, and 9:(unpack("A4A*", $line)), which splits the string into the EMBL key and the value.

ADD COMMENT
0
Entering edit mode

I am getting this error when i execute this

Can't modify single ref constructor in scalar assignment at embl2fasta.pl line 4, near ""//n";"

ADD REPLY
0
Entering edit mode

I am getting this error when i execute this "Can't modify single ref constructor in scalar assignment at eml2fasa.pl line 4, near ""//n";" "

ADD REPLY
0
Entering edit mode

Hmm. That error normally only occurs when you attempt to assign a value to a reference (e.g. $/ = "//n"; ) If you can show me lines 3-5, I might have a better idea.

ADD REPLY
0
Entering edit mode
13.6 years ago
Ketil 4.1k

Or just:

egrep -e '^(ID|  )' | sed -e 's/^ID  />/g' -e 's/^    //g'

Oh, I see you wanted the contents of PT there too. I leave it as an excercise for the reader! :-)

ADD COMMENT
0
Entering edit mode

i dont know how to use sed :(

ADD REPLY
0
Entering edit mode

i ran it but that was not even close to what i need, its just adding ">" symbol in place of ID and rest of input file stays as is

ADD REPLY
0
Entering edit mode

Strange. The egrep in front should pick out only ID and sequence lines from the file, so I don't see how it could retain the rest of the file. The sed stuff replaces ID, but it should also remove initial spaces on the sequence data. But it's just a quick hack anyway, you're probably better off with some other solution.

ADD REPLY
0
Entering edit mode

I guess I should add that I only tested it on the example in the question, so if that isn't representative of the actual data, all bets are off, of course.

ADD REPLY
0
Entering edit mode
13.6 years ago
Gaffa ▴ 500

This one-liner should do it:

perl -ne 'if ($_ =~ /^ID  (.+)/) { print ">$1" } elsif ($_ =~ /^P[TA]  (.+)/) { print "; $1 " } elsif ($_ =~ /^\s{4}(.+)/) { print "\n$1\n" }'

Output:

>US74811111-0005    ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
ADD COMMENT
0
Entering edit mode

not working for my dataset.. :(

ADD REPLY

Login before adding your answer.

Traffic: 2404 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6