Question

Minimize blast extraction time

0

Entering edit mode

9.0 years ago

Eva_Maria ▴ 190

I wrote a Perl programme to exact alignment from blast tab limited out put. it works perfectly for small inputs but while trying with huge data set (more than 4 gb) about 4-5 hour taken to process a data set. I want to extract more than 4000 data sets and its not possible to do with this programme. kindly tel alternation for this.

Perl programme

use warnings;

print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>);     
print "\n";

print "Enter Blast database file name:\t";
chomp($database = <STDIN>);   
print "\n";

open IN,"$blast" or die "Can not open file $blast: $!";

@ids = ();
@seq_start = ();
@seq_end = ();

while(<IN>){

@feilds = split("\t",$_);
push(@ids,$feilds[1]);
push(@seq_start,$feilds[6]);
push(@seq_end,$feilds[7]);
}
close IN;

open OUT,">Result.fasta" or die "Can not open file $database: $!";


for($i=0;$i<=$#ids;$i++){

($sequence)    = &block($ids[$i]);

($idline,$sequence) = split("\n",$sequence);

    if($seq_start[$i] <= 100){
    $pos_Start = 0;
    }
    else{
    $pos_Start = $seq_start[$i]-101;
    }

    $pos_end = $seq_end[$i]+100;
    if($pos_end >= length($sequence)){
    $pos_end = length($sequence);
    }    

$seqlen = $pos_end - $pos_Start;

$Nucleotides = substr($sequence,$pos_Start,$seqlen);

$Nucleotides =~ s/(.{1,60})/$1\n/gs;

print OUT "$idline\n";
print OUT "$Nucleotides\n";    
}
print "\nExtraction Completed...";

sub block{

$id1 =shift;
print "$id1\n";
$start = ();

open IN3,"$database" or die "Can not open file $database: $!";

$blockseq = "";
while(<IN3>){

    if (($_ =~ /^>/)&&($start)){

    last;
    }

    if (($_ !~ /^>/)&&($start)){

    chomp;
    $blockseq .= $_;
    }

    if (/^>$id1/){

    $start = $.-1;
    $blockseq .= $_;
    }
}
close IN3;

return($blockseq);
}

Perl • 1.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.0 years ago by Eva_Maria ▴ 190

2

Entering edit mode

Another note, be careful with claims like it works perfectly for small inputs. Does it really? You are feeding blast coordinates to substr, blast coordinates are 1 based while substr coordinates are 0 based.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Michael 55k

1

Entering edit mode

it looks to me you have re-invented a broken version of blastdbcmd http://www.ncbi.nlm.nih.gov/books/NBK279689/ or samtools faidx http://www.htslib.org/doc/samtools.html

ADD REPLY • link 9.0 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

This is so inefficient that it's hard to fix, and so much over complicated code that you should better start from scratch after learning some perl basics such as use of hashes. Simply define the criteria for sequence extraction first.

ADD REPLY • link 9.0 years ago by Michael 55k

Ram · Answer 1 · 2016-01-04

I think the comments above are unduly harsh when it's easy to not understand what's wrong when working a new skill.

I understand the frustration of what you are doing, trying to apply newly learned perl knowledge to your work and then wanting it to do more than it can. The best way to fix this is to improve your perl skill in general. Quick fixes won't improve speed to the degree that you want, and learning better practices will greatly help you next time.

What I recommend is working through a tutorial on using hashes and determine how you can use them to hold your data. This will greatly improve searching. Also, look at the bioperl documentation as there are ready made commands for loading in different blast formatted datasets and it will reduce the amount of parsing that you need to do.

These are good steps for improving your perl skills, but if you are just looking to get it done, there are tools out there that can do it. Creating your own script is good for future you, but won't get this done quickly as you'll undoubtedly get things wrong along the way.