I have obtained several .gb BLAST results that I want to split in different .gb or .fasta files accordingly to genus.
I searched for different ways to do it in the web, but didn't find a way to do it when one has something like 50 genus in the blast.
(to do it for a short number or genus I could use splitgb.pl from http://www.bioperl.org/wiki/HOWTO:SeqIO but it will not be feasible for a large number of genus).
Is there a script of way to do it in a efficient way ?
I've been looking at BioPython also, but what would be the record name in SeqIO for species of genus? I could find examples for size and other characteristics of the sequence but not for taxonomy.
It seems to me that you could just loop through everyone of your 50 genera and do a splitgb.pl for each. Something like this (this code has not been fully tested):
use Bio::SeqIO;
my $usage = "splitgb.pl infile\n";
my $infile = shift or die $usage;
my @genera=("Homo sapiens","Sus scrofa","Mus musculus");
my $inseq = Bio::SeqIO->new(-file => "<$infile",-format => 'Genbank');
foreach my $genus (@genera){
my $outfile = Bio::SeqIO->new(-file => '>$genus\.gb',-format => 'Genbank');
while (my $seqin = $inseq->next_seq) {
if ($seqin->species->binomial =~ m/$genus/) {
$outfile->write_seq($seqin);
}
}
}
You could optimize this so that you only need to loop through $inseq once.
It can be done easily using biopython SeqIQ.
I've been looking at BioPython also, but what would be the record name in SeqIO for species of genus? I could find examples for size and other characteristics of the sequence but not for taxonomy.
You have to learn python properly. Then you can parse any type of file. Look here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc83