Question

Parse A Embl Like Format To Fasta Format

1

Entering edit mode

14.3 years ago

Empyrean ▴ 170

Helo all, I wanted to parse aEMBL format like file to fasta. i cannot use bioperl because this is not complete EMBL format. so please suggest me how to get this done..

ID  US74811111-0005    
OO  giensis    
OS  giensis    
SN  US74811111    
PT  I-003, a gene and methods for its use
PA  NIX CORPORATION RESEARCH TRIANGLE PARK, NC
PI  Carozzi; Nadine (Raleigh, NC); Hargiss; Tracy (Cary, NC); Koziel; Michael G. (Raleigh, NC); Duck; Nicholas B. (Apex, NC); Carr; Brian (Raleigh, NC);
PR  20030828 US20030498518P; 20040826 US20040926819; 20070620 US20070765494;
PE  US200304985AN  20070765494
P1  Compositions and methods and seeds are provided. 
    MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE
//

The output should be in fasta format which consists of lines starting with ID, PT, PA and Sequence. "//" the two slashes are dividing lines between two EMBL genes.

>US74811111-0005 ;  I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE

Like this i have 50,000 sequences in a single file which should be converted to fasta format

perl awk format conversion • 8.2k views

ADD COMMENT • link updated 14.3 years ago by Alex Richter ▴ 210 • written 14.3 years ago by Empyrean ▴ 170

score 6 · Answer 1 · 2011-05-12

6

Entering edit mode

14.3 years ago

Pierre Lindenbaum 166k

the following awk script should do the job:

/^ID/   {printf(">%s;",$0); next;}
/^(PT|PA)/  {printf(" %s;",$0); next;}
/^\/\// {printf("\n"); next;}
/^    / {printf("\n%s",substr($0,5)); next;}
    {
    /* ignore default */
    }
END   {
    printf("\n");
    }

> awk -f file.awk file.txt
>ID  US74811111-0005    ; PT  I-003, a gene and methods for its use; PA  NIX CORPORATION RESEARCH TRIANGLE PARK, NC;
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE

ADD COMMENT • link 14.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

this is not printing the sequence at the end.. thanks for the reply

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

I've added a 'END' close in the script.

ADD REPLY • link 14.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I've added a 'END' statement in the script.

ADD REPLY • link 14.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example..

"LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

works good but if i have sequence in multiple lines its printing only last line in sequence.. say for example.. " LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR AEKKWRDKREKLEWETNIVYKEAKESVDALFVNSQYDQLQADTNIAMIHAADKRVHSIREAYLPELSVIPGVNAAIFEEL EGRIFTAFSLYDARNVIKNGDFNNGLSCWNVKGHVDVEEQNNQRSVLVVPEWEAEVSQEVRVCPGRGYILRVTAYKEGYG EGCVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRE NPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE"

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

Can you use this script to process multiple input files and output to multiple files such as awk -f file.awk *.txt >> *.fasta

ADD REPLY • link 13.1 years ago by biomed.lu ▴ 20

score 2 · Answer 2 · 2011-05-14

This Python script should work? Takes input file as an argument e.g. embllike2fasta.py infile.txt:

import sys
dict = {}
infile = open(sys.argv[1], "r")
outfile = open("outfile.fas", "w")
while 1:
    line = infile.readline()
    if not line:
        break
    parts = line.split(None, 1)
    if len(parts) == 1 and parts[0] == "//":
        outfile.write(">" + dict["ID"] + " ; " + dict["PT"] + " ; " + dict["PA"] + "\n")
        outfile.write(dict["seq"] + "\n")
    elif len(parts) == 1:
        dict["seq"] = parts[0].strip()
    else:
        dict[parts[0]] = parts[1].strip()
outfile.close()
infile.close()

I haven't tested it, and just hacked it up in the browser on my iPad, so it might not work? Plus it's late :-S lol!

Update: Tested it on my machine this morning, made a few quick edits! Very hacky, but it works :D

score 1 · Answer 3 · 2011-05-13

1

Entering edit mode

14.3 years ago

Alex Richter ▴ 210

How about this? It's remarkably hacky, but should do the trick.

#!/usr/bin/env perl
use strict;
use warnings;
$/ = "//\n";
while (<>) {
    chomp;
    my %record;
    foreach my $line (split /\n/) {
        my ($key, $val) = unpack("A4A*", $line);
        $record{$key} .= $val;
    }
    print ">$record{ID} $record{PT} $record{PA}\n$record{''}\n";
}

The important bits are lines 4:($/ = "//n"), which sets the system to read one record at a time instead of a line, and 9:(unpack("A4A*", $line)), which splits the string into the EMBL key and the value.

ADD COMMENT • link 14.2 years ago by Alex Richter ▴ 210

0

Entering edit mode

I am getting this error when i execute this

Can't modify single ref constructor in scalar assignment at embl2fasta.pl line 4, near ""//n";"

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

I am getting this error when i execute this "Can't modify single ref constructor in scalar assignment at eml2fasa.pl line 4, near ""//n";" "

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

Hmm. That error normally only occurs when you attempt to assign a value to a reference (e.g. $/ = "//n"; ) If you can show me lines 3-5, I might have a better idea.

ADD REPLY • link 14.2 years ago by Alex Richter ▴ 210

score 0 · Answer 4 · 2011-05-13

0

Entering edit mode

14.3 years ago

Ketil 4.2k

Or just:

egrep -e '^(ID|  )' | sed -e 's/^ID  />/g' -e 's/^    //g'

Oh, I see you wanted the contents of PT there too. I leave it as an excercise for the reader! :-)

ADD COMMENT • link 14.3 years ago by Ketil 4.2k

0

Entering edit mode

i dont know how to use sed :(

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

i ran it but that was not even close to what i need, its just adding ">" symbol in place of ID and rest of input file stays as is

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

Strange. The egrep in front should pick out only ID and sequence lines from the file, so I don't see how it could retain the rest of the file. The sed stuff replaces ID, but it should also remove initial spaces on the sequence data. But it's just a quick hack anyway, you're probably better off with some other solution.

ADD REPLY • link 14.3 years ago by Ketil 4.2k

0

Entering edit mode

I guess I should add that I only tested it on the example in the question, so if that isn't representative of the actual data, all bets are off, of course.

ADD REPLY • link 14.3 years ago by Ketil 4.2k

score 0 · Answer 5 · 2011-05-14

0

Entering edit mode

14.3 years ago

Gaffa ▴ 500

This one-liner should do it:

perl -ne 'if ($_ =~ /^ID  (.+)/) { print ">$1" } elsif ($_ =~ /^P[TA]  (.+)/) { print "; $1 " } elsif ($_ =~ /^\s{4}(.+)/) { print "\n$1\n" }'

Output:

>US74811111-0005    ; I-003, a gene and methods for its use ; NIX CORPORATION RESEARCH TRIANGLE PARK, NC 
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIEQLINQRIEEFARNQAISRLEGLSNLYVTIHEIENNTDELKFSNCVEEEIYPNNTVTCNDYTVNQEEYGGAYTSRNRGYNEAPSVPADYASVYEEKSYTDGRRENPCEFNRGYRDYTPLPVGYVTKELEYFPETDKVWIEIGETEGTFIVDSVELLLMEE