LD Mapping Using a ~30 GB File with a Perl Script
12.1 years ago
J.F.Jiang ▴ 930

Hi all,

Maybe I should not post this kind of question here, but I am used to asking questions here, so I hope you can give me the right answer.

I have generated a ~30 GB whole-chromosome LD file with PLINK, and I want to find all SNPs in LD with a query SNP list.

For this, I use a Perl script with the structure below:

open(FILE, "$path/$filename") or die "cannot open $filename: $!";
while (<FILE>)
{
    # do the LD mapping on each line
}

Since the raw LD file is so large, the Perl script ends up occupying most of the memory on my machine.

The relevant part of the script:

foreach $tmp1 (@contents)
{
    if ($tmp1 =~ /xxxxxx/)
    {
        print "processing FILE $tmp1......\n";
        open(FILE2, "$tmp1") or die "can not open FILE2!";

        while (<FILE2>)
        {
            @line = split(/\s+/, $_);
            # index the pair by both SNP names
            $hash1{$line[3]} = $line[3]."\n".$line[6]."\n";
            $hash2{$line[6]} = $line[3]."\n".$line[6]."\n";
        }
        close FILE2;
    }
}

So how can I optimize my script to run faster?

Thanks to all!

ld perl • 3.5k views

Actually, that open/while structure doesn't read the whole file into memory; something else in your code must be filling it up. Can you post the complete script?


So the problem is that you are creating a gigantic hash. Do you need it all in memory? What are you doing after loading the data?


Yes, I must build this hash, because I want to find ALL SNPs in LD within this computed LD data. If I do not build the hash table, it will be a disaster for the permutation process.
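
One way to keep memory bounded, if only pairs involving the query SNPs are needed, is to load the (small) query list into a hash first and keep only matching rows while streaming the big LD file. A minimal sketch, assuming the same whitespace-split column layout as the script above; the file names and the column indices 3 and 6 are placeholders mirroring that script:

use strict;
use warnings;

# Load the (small) query SNP list into a lookup hash first.
my %query;
open(my $qfh, '<', "query_snps.txt") or die "cannot open query list: $!";   # placeholder file name
while (<$qfh>) {
    chomp;
    $query{$_} = 1;
}
close $qfh;

# Stream the big LD file and keep only pairs that involve a query SNP.
my %ld;
open(my $ldfh, '<', "plink.ld") or die "cannot open LD file: $!";           # placeholder file name
while (<$ldfh>) {
    my @line = split(/\s+/, $_);
    next unless $query{$line[3]} or $query{$line[6]};
    $ld{$line[3]} .= $line[6] . "\n";   # partners of SNP_A
    $ld{$line[6]} .= $line[3] . "\n";   # partners of SNP_B
}
close $ldfh;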

12.1 years ago

As you imply, it is likely that your script is running slowly because you are reading a huge file into a (memory-inefficient) Perl hash. You are probably consuming all of the RAM on your machine, and the script then slows to a crawl once it has to use swap. One solution you might consider is BerkeleyDB. I believe the BerkeleyDB Perl module is available by default in modern Perl installations. Your hash can be created in much the same way as now, except that it will live in an efficient file-based database.

use BerkeleyDB;

# Make %hash1 an on-disk database stored in database.dbm; create the file if needed.
tie my %hash1, 'BerkeleyDB::Hash',
    -Filename => "database.dbm",
    -Flags    => DB_CREATE
    or die "Couldn't tie database: $BerkeleyDB::Error";

# Read through the input file and fill the hash as normal
# (FILE2 is the LD file, opened as in your original script).
while (<FILE2>) {
  my @line = split(/\s+/, $_);
  $hash1{$line[3]} = $line[3]."\n".$line[6]."\n";
}
close(FILE2);

# Retrieve information from the hash as normal.
for my $key (keys %hash1) {
  print "$key -> $hash1{$key}\n";
}

# Once finished, you can delete the BerkeleyDB data and the corresponding file on disk.
%hash1 = ();
untie %hash1;
unlink("database.dbm");

Thank you for your reply. BerkeleyDB is not installed on our server, though, and I do not have the rights to install the module. But together with Zev's answer, I think building such a huge hash in a file-backed database is probably the best way to go.
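
If the BerkeleyDB module cannot be installed, a similar on-disk hash can often be tied with DB_File, which is bundled with many standard Perl installations (it still requires the Berkeley DB C library on the system, so check whether it loads first). A minimal sketch; the database file name is a placeholder:

use strict;
use warnings;
use Fcntl;     # O_CREAT, O_RDWR
use DB_File;

# Tie %hash1 to an on-disk hash database instead of keeping it all in RAM.
tie my %hash1, 'DB_File', "ld_hash.db", O_CREAT|O_RDWR, 0644, $DB_HASH
    or die "cannot tie ld_hash.db: $!";

# Fill and read %hash1 exactly as with an ordinary hash.
$hash1{"rs123"} = "rs123\nrs456\n";   # example entry
print $hash1{"rs123"};

untie %hash1;   # flush and release the database when done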

12.1 years ago

I ran into this same problem; it is not trivial. I solved it by using the Perl Data Language (PDL): I built matrices, rather than a hash, containing the chromosome, position and allelic information. PDL is implemented with C calls, so it is very fast.

I have used it on 1KG data and on a population of pigeons.

You are welcome to use my code if you can format your data into the CDR format, which is part of the VAAST pipeline.

It is also threaded for speed.
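
The general idea looks roughly like the sketch below. It assumes whitespace-delimited PLINK .ld output with numeric position columns (BP_A, BP_B, R2 at 0-based indices 1, 4 and 6); the file name, column indices and query position are placeholders, and this is not the CDR/VAAST code mentioned above. Any text header line should be removed or skipped before reading.

use strict;
use warnings;
use PDL;
use PDL::IO::Misc;   # rcols

# Read the numeric columns of the LD file into compact PDL vectors.
my ($bp_a, $bp_b, $r2) = rcols("plink.ld", 1, 4, 6);

# Find every row where the query position appears on either side of the pair.
my $query_pos = 123456;   # placeholder query position
my $idx = which(($bp_a == $query_pos) | ($bp_b == $query_pos));

# Positions and r^2 values of all pairs involving the query position.
print "BP_A: ", $bp_a->index($idx), "\n";
print "BP_B: ", $bp_b->index($idx), "\n";
print "R2:   ", $r2->index($idx), "\n";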


Thank you for your reply; good hint!

