Here is a possibly difficult challenge. Or maybe not :)
How would I convert a GenBank file to a Fasta file with only the basic Linux tools at my disposal? These would be Perl 5.10, Bash, Awk, Python 2.6, etc
Here is a possibly difficult challenge. Or maybe not :)
How would I convert a GenBank file to a Fasta file with only the basic Linux tools at my disposal? These would be Perl 5.10, Bash, Awk, Python 2.6, etc
a quick google: http://www.goeker.org/mg/scripts/gbk2fas.sed
Challenge accepted. Here is the most minimal example I could come up with:
It runs with any Perl (I tested 5.6 and 5.20) and it's pretty fast too. Well, it is as fast as the sed script and much faster than bioperl (7 sec. for human chr 1 vs. >90 sec. with Bio::SeqIO
). I should note that the sed script doesn't parse all GenBank files correctly (like long sequences), and it won't warn if there is no sequence for the record. I guess the lesson there is that it usually isn't worth saving a few seconds of run time if you have to debug later. For a one-off, I might do this but it was mostly for fun while I wait for a job to finish. For a pipeline that is important, and where you need more than just the sequence, I would probably use bio* something to handle all the corner cases.
Usage:
perl biostars63769.pl < sequence.gb > sequence.fas
And, this was against the rules, but I have to say, bioperl is tough to beat (though the above will obviously be way faster):
perl -MBio::SeqIO -e '$seqin = Bio::SeqIO->new(-fh => \*STDIN, -format => 'genbank'); $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta'); while ($seqobj = $seqin->next_seq) { $seqout->write_seq($seqobj) }' < sequence.gb > sequence.fas
Do you mean multiple sequences per genbank file/record? It looks like it would be easy to add, as you did, but this is one reason I would probably use a Bio* library in practice because messing with genbank files can get complicated given that there are so many variations. This change is not complicated but it is hard to write a simple parser that works on all genbank files.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you! I could not actually find this script with my google-fu!
Doesn't work though :-/. Looks like it was hard coded to work with small genbank files, not larger ones like chromosomes.