Question

Perl script running slow with big files

2

Entering edit mode

10.4 years ago

cvu ▴ 180

Hi all,

I've written one perl script, which compares two vcf files and write output in third file. it is running good with small files, but when i am inputting big files in Gbs, my system becomes slow and it would hang.

I'm using dbSNP vcf file as input which is around 9GB.

Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl.

Any help would be appreciated !!

Thank you !!!!

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use List::MoreUtils;


open (FILE, "<", "dbSNP_in.vcf") or die "failed to open: $!\n";
my @array=(<FILE>);
my @CHR;
my @location;
my @rs;
my @ref_n;
my @alt_n;
foreach (@array)
{
    chomp;
    my ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info) = split(/\t/, $_);

push @CHR, $chrom;
push @location, $pos;
push @rs, $id;
push @ref_n, $ref;
push @alt_n, $alt;
}

open (FILE1, "<", "trial_rs.vcf") or die "failed to open: $!\n";
my @array1=(<FILE1>);

open (OUT,">trial_output.vcf");
my @columns;

foreach (@array1)
{
    chomp;
    @columns=split(/\t/, $_);
    my $i;
    for ($i=0; $i<@array; $i++)
    {
        if (($columns[0] eq $CHR[$i]) and ($columns[1] eq $location[$i]) and ($columns[3] eq $ref_n[$i]) and ($columns[4] eq $alt_n[$i]))
        {
            $columns[2]=$rs[$i];
        }

    }

print OUT join("\t", @columns), "\n";
}

perl SNP snp alignment Assembly • 13k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by cvu ▴ 180

0

Entering edit mode

"Please Suggest me something that will make my perl script run faster. This is my first perl script and i'm new to perl."

It seems obvious that one needs to post your script for this purpose, doesn't it?

ADD REPLY • link 10.4 years ago by Michael 55k

0

Entering edit mode

This is the script

... moved ...

ADD REPLY • link updated 10.4 years ago by Michael 55k • written 10.4 years ago by cvu ▴ 180

2

Entering edit mode

10.4 years ago

Vivek ★ 2.7k

I'd suggest using Tabix to index the larger file and use the tabix perl module to query it for each locus in the smaller VCF file. That's likely the most optimal solution in terms of time and memory.

http://samtools.sourceforge.net/tabix.shtml

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.4 years ago by Vivek ★ 2.7k

1

Entering edit mode

10.4 years ago

ole.tange ★ 4.5k

There are better answers that address your specific issue. But they do not answer the general problem: What do you do if your perl script is slow?

You profile your code. NYTProf + kcachegrind are good tool to figure out which parts of your code needs attention:

perl -d:NYTProf testme.pl; 
nytprofcg; 
kcachegrind nytprof.callgrind

When you cannot make that faster, see if you can somehow parallelize the code. Can you divide your input into chunks, compute each chunk individually and merge the output when you are done? This technique is called map-reduce.

For smaller jobs (requiring fewer than 1000 CPUs) GNU Parallel can often be of help in doing the chunking of input and running of jobs. You just need to merge the final output.

If you still need more speed, you will need to look at a compiled language such as C++. Again you will use the profiler to figure out which parts of your code need the speedup. And instead of rewriting everything into C++ you can address the small part of the code that is the bottleneck. The good part about this is that much of error handling and corner cases often can be left to Perl.

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by ole.tange ★ 4.5k

Ram · Accepted Answer · 2014-07-16

8

Entering edit mode

10.4 years ago

Michael 55k

The problem is this:

my @array=(<FILE>); ## slurp a whole Terabyte into RAM

Do not read a file like this, unless it is very small or your memory is huge, instead use:

while (<FILE>) { # read each line, then forget about it
    chomp;
    split /t/;
    ....
}

In addition your code is extremely inefficient:

You are iterating over all entries of a huge array for each line of a huge file. Hashes in Perl provide constant time access to elements given the hash key. Use a hash of hashes to store your filter table.

You seem to filter a large file by a small file, therefore:

process the small file first (using while construct), parse the small file into a hash using the location as key of a nested hash
process the large file as above and look up each of the entries in the hash created before

As a result you will only need as much memory as is required to store the small file.

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Michael 55k

0

Entering edit mode

I guess I should have refreshed the page before answering, since my answer just echoes what you already wrote!

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

Ok i'll try this way.

Thanks a lot Michael Dondrup !!!!!!

ADD REPLY • link 10.4 years ago by cvu ▴ 180

Ram · Accepted Answer · 2014-07-18

3

Entering edit mode

10.4 years ago

Kenosis ★ 1.3k

Consider the following refactoring:

use strict;
use warnings;
use autodie;

my ( @array, %hash );

open my $FILE, "<", "dbSNP_in.vcf";

while (<$FILE>) {
    $hash{"@array[0, 1, 3, 4]"} = $array[2] if @array = split /\t/, $_, 6;
}

close $FILE;

open my $OUT,   ">", "trial_output.vcf";
open my $FILE1, "<", "trial_rs.vcf";

while (<$FILE1>) {
    my @columns = split /\t/;

    $columns[2] = $hash{"@columns[0, 1, 3, 4]"}
      if exists $hash{"@columns[0, 1, 3, 4]"};

    print $OUT join "\t", @columns;
}

close $OUT;
close $FILE1;

autodie is used to trap all I/O errors.
chomp is not needed within either loop.
splitting only the amount needed is faster than splitting all.
An interpolated array slice (@array[0, 1, 3, 4], which produces a string comprised of elements 0, 1, 3, 4) is used as a hash key with a corresponding value of array element 2--for use later.
An interpolated array slice, as a hash key, is tested for existence, and $columns[2] is set to that key's corresponding value if that key exists. Doing this, instead of using four eq statements, should speed up the file processing.

Hope this helps!

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Kenosis ★ 1.3k

0

Entering edit mode

Thanks it helped me a lot !!

ADD REPLY • link 10.4 years ago by cvu ▴ 180

0

Entering edit mode

Are you sure chomp is not needed? Way back in time I learned that it is always good and safe to chomp. What sometimes happens to me when I forget to use chomp is that I get the newline character in the last column, this can yield hard to debug errors if you use a string from the last column.

ADD REPLY • link 10.4 years ago by Michael 55k

0

Entering edit mode

My apologies for not explaining my claim...

In the first while loop, only the first five fields are used; the remaining are discarded, including the last field which contains the record separator that chomp would remove. In the second while loop, the record separator is retained and used later when the fields are joined and printed.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by Kenosis ★ 1.3k