Question

Parsing Blast Results For Different Genus

0

Entering edit mode

11.9 years ago

afonsomduarte ▴ 20

Dear All,

I have obtained several .gb BLAST results that I want to split in different .gb or .fasta files accordingly to genus.

I searched for different ways to do it in the web, but didn't find a way to do it when one has something like 50 genus in the blast.

(to do it for a short number or genus I could use splitgb.pl from http://www.bioperl.org/wiki/HOWTO:SeqIO but it will not be feasible for a large number of genus).

Is there a script of way to do it in a efficient way ?

Best

blast parsing sequence bioperl biopython • 2.7k views

ADD COMMENT • link updated 11.5 years ago by Joseph Hughes ★ 3.0k • written 11.9 years ago by afonsomduarte ▴ 20

0

Entering edit mode

It can be done easily using biopython SeqIQ.

ADD REPLY • link 11.8 years ago by Pappu ★ 2.1k

0

Entering edit mode

I've been looking at BioPython also, but what would be the record name in SeqIO for species of genus? I could find examples for size and other characteristics of the sequence but not for taxonomy.

ADD REPLY • link 11.8 years ago by afonsomduarte ▴ 20

0

Entering edit mode

You have to learn python properly. Then you can parse any type of file. Look here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc83

ADD REPLY • link 11.8 years ago by Pappu ★ 2.1k

score 0 · Answer 1 · 2013-06-04

It seems to me that you could just loop through everyone of your 50 genera and do a splitgb.pl for each. Something like this (this code has not been fully tested):

use Bio::SeqIO;
my $usage  = "splitgb.pl infile\n";
my $infile = shift or die $usage;
my @genera=("Homo sapiens","Sus scrofa","Mus musculus");
my $inseq = Bio::SeqIO->new(-file   => "<$infile",-format => 'Genbank');
foreach my $genus (@genera){
  my $outfile = Bio::SeqIO->new(-file   => '>$genus\.gb',-format => 'Genbank');
  while (my $seqin = $inseq->next_seq) {
     if ($seqin->species->binomial =~ m/$genus/) {
       $outfile->write_seq($seqin);
     } 
   }
}

You could optimize this so that you only need to loop through $inseq once.