Minimize blast extraction time
1
0
Entering edit mode
8.9 years ago
Eva_Maria ▴ 190

I wrote a Perl programme to exact alignment from blast tab limited out put. it works perfectly for small inputs but while trying with huge data set (more than 4 gb) about 4-5 hour taken to process a data set. I want to extract more than 4000 data sets and its not possible to do with this programme. kindly tel alternation for this.

Perl programme

use warnings;

print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>);     
print "\n";

print "Enter Blast database file name:\t";
chomp($database = <STDIN>);   
print "\n";

open IN,"$blast" or die "Can not open file $blast: $!";

@ids = ();
@seq_start = ();
@seq_end = ();

while(<IN>){

@feilds = split("\t",$_);
push(@ids,$feilds[1]);
push(@seq_start,$feilds[6]);
push(@seq_end,$feilds[7]);
}
close IN;

open OUT,">Result.fasta" or die "Can not open file $database: $!";


for($i=0;$i<=$#ids;$i++){

($sequence)    = &block($ids[$i]);

($idline,$sequence) = split("\n",$sequence);

    if($seq_start[$i] <= 100){
    $pos_Start = 0;
    }
    else{
    $pos_Start = $seq_start[$i]-101;
    }

    $pos_end = $seq_end[$i]+100;
    if($pos_end >= length($sequence)){
    $pos_end = length($sequence);
    }    

$seqlen = $pos_end - $pos_Start;

$Nucleotides = substr($sequence,$pos_Start,$seqlen);

$Nucleotides =~ s/(.{1,60})/$1\n/gs;

print OUT "$idline\n";
print OUT "$Nucleotides\n";    
}
print "\nExtraction Completed...";

sub block{

$id1 =shift;
print "$id1\n";
$start = ();

open IN3,"$database" or die "Can not open file $database: $!";

$blockseq = "";
while(<IN3>){

    if (($_ =~ /^>/)&&($start)){

    last;
    }

    if (($_ !~ /^>/)&&($start)){

    chomp;
    $blockseq .= $_;
    }

    if (/^>$id1/){

    $start = $.-1;
    $blockseq .= $_;
    }
}
close IN3;

return($blockseq);
}
Perl • 1.9k views
ADD COMMENT
2
Entering edit mode

Another note, be careful with claims like it works perfectly for small inputs. Does it really? You are feeding blast coordinates to substr, blast coordinates are 1 based while substr coordinates are 0 based.

ADD REPLY
1
Entering edit mode

it looks to me you have re-invented a broken version of blastdbcmd http://www.ncbi.nlm.nih.gov/books/NBK279689/ or samtools faidx http://www.htslib.org/doc/samtools.html

ADD REPLY
1
Entering edit mode

This is so inefficient that it's hard to fix, and so much over complicated code that you should better start from scratch after learning some perl basics such as use of hashes. Simply define the criteria for sequence extraction first.

ADD REPLY
1
Entering edit mode
8.9 years ago
Daniel ★ 4.0k

I think the comments above are unduly harsh when it's easy to not understand what's wrong when working a new skill.

I understand the frustration of what you are doing, trying to apply newly learned perl knowledge to your work and then wanting it to do more than it can. The best way to fix this is to improve your perl skill in general. Quick fixes won't improve speed to the degree that you want, and learning better practices will greatly help you next time.

What I recommend is working through a tutorial on using hashes and determine how you can use them to hold your data. This will greatly improve searching. Also, look at the bioperl documentation as there are ready made commands for loading in different blast formatted datasets and it will reduce the amount of parsing that you need to do.

These are good steps for improving your perl skills, but if you are just looking to get it done, there are tools out there that can do it. Creating your own script is good for future you, but won't get this done quickly as you'll undoubtedly get things wrong along the way.

ADD COMMENT

Login before adding your answer.

Traffic: 2918 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6