Question

Challenge: Convert GenBank to Fasta without bioperl, without emboss, or any other dependencies

0

Entering edit mode

9.2 years ago

Lee Katz ★ 3.2k

Here is a possibly difficult challenge. Or maybe not :)

How would I convert a GenBank file to a Fasta file with only the basic Linux tools at my disposal? These would be Perl 5.10, Bash, Awk, Python 2.6, etc

genbank fasta • 6.3k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.2 years ago by Lee Katz ★ 3.2k

Ram · Answer 1 · 2015-10-28

2

Entering edit mode

9.2 years ago

Pierre Lindenbaum 164k

a quick google: http://www.goeker.org/mg/scripts/gbk2fas.sed

ADD COMMENT • link 9.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you! I could not actually find this script with my google-fu!

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Doesn't work though :-/. Looks like it was hard coded to work with small genbank files, not larger ones like chromosomes.

ADD REPLY • link 9.2 years ago by SES 8.6k

Ram · Answer 2 · 2015-10-30

1

Entering edit mode

9.2 years ago

SES 8.6k

Challenge accepted. Here is the most minimal example I could come up with:

It runs with any Perl (I tested 5.6 and 5.20) and it's pretty fast too. Well, it is as fast as the sed script and much faster than bioperl (7 sec. for human chr 1 vs. >90 sec. with Bio::SeqIO). I should note that the sed script doesn't parse all GenBank files correctly (like long sequences), and it won't warn if there is no sequence for the record. I guess the lesson there is that it usually isn't worth saving a few seconds of run time if you have to debug later. For a one-off, I might do this but it was mostly for fun while I wait for a job to finish. For a pipeline that is important, and where you need more than just the sequence, I would probably use bio* something to handle all the corner cases.

Usage:

perl biostars63769.pl < sequence.gb > sequence.fas

And, this was against the rules, but I have to say, bioperl is tough to beat (though the above will obviously be way faster):

perl -MBio::SeqIO -e '$seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'genbank'); $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta'); while ($seqobj = $seqin->next_seq) { $seqout->write_seq($seqobj) }' < sequence.gb > sequence.fas

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by SES 8.6k

0

Entering edit mode

Looks good! But just visually, it looks like it doesn't handle multifasta files. Probably just a small fix?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.1 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Maybe this is a good fix?

ADD REPLY • link 9.1 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Do you mean multiple sequences per genbank file/record? It looks like it would be easy to add, as you did, but this is one reason I would probably use a Bio* library in practice because messing with genbank files can get complicated given that there are so many variations. This change is not complicated but it is hard to write a simple parser that works on all genbank files.

ADD REPLY • link 9.1 years ago by SES 8.6k

0

Entering edit mode

Exactly. Most of my genbank files have multiple entries.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.1 years ago by Lee Katz ★ 3.2k