I wrote a Perl programme to exact alignment from blast tab limited out put. it works perfectly for small inputs but while trying with huge data set (more than 4 gb) about 4-5 hour taken to process a data set. I want to extract more than 4000 data sets and its not possible to do with this programme. kindly tel alternation for this.
Perl programme
use warnings;
print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>);
print "\n";
print "Enter Blast database file name:\t";
chomp($database = <STDIN>);
print "\n";
open IN,"$blast" or die "Can not open file $blast: $!";
@ids = ();
@seq_start = ();
@seq_end = ();
while(<IN>){
@feilds = split("\t",$_);
push(@ids,$feilds[1]);
push(@seq_start,$feilds[6]);
push(@seq_end,$feilds[7]);
}
close IN;
open OUT,">Result.fasta" or die "Can not open file $database: $!";
for($i=0;$i<=$#ids;$i++){
($sequence) = &block($ids[$i]);
($idline,$sequence) = split("\n",$sequence);
if($seq_start[$i] <= 100){
$pos_Start = 0;
}
else{
$pos_Start = $seq_start[$i]-101;
}
$pos_end = $seq_end[$i]+100;
if($pos_end >= length($sequence)){
$pos_end = length($sequence);
}
$seqlen = $pos_end - $pos_Start;
$Nucleotides = substr($sequence,$pos_Start,$seqlen);
$Nucleotides =~ s/(.{1,60})/$1\n/gs;
print OUT "$idline\n";
print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";
sub block{
$id1 =shift;
print "$id1\n";
$start = ();
open IN3,"$database" or die "Can not open file $database: $!";
$blockseq = "";
while(<IN3>){
if (($_ =~ /^>/)&&($start)){
last;
}
if (($_ !~ /^>/)&&($start)){
chomp;
$blockseq .= $_;
}
if (/^>$id1/){
$start = $.-1;
$blockseq .= $_;
}
}
close IN3;
return($blockseq);
}
Another note, be careful with claims like it works perfectly for small inputs. Does it really? You are feeding blast coordinates to substr, blast coordinates are 1 based while substr coordinates are 0 based.
it looks to me you have re-invented a broken version of blastdbcmd http://www.ncbi.nlm.nih.gov/books/NBK279689/ or samtools faidx http://www.htslib.org/doc/samtools.html
This is so inefficient that it's hard to fix, and so much over complicated code that you should better start from scratch after learning some perl basics such as use of hashes. Simply define the criteria for sequence extraction first.