LD Mapping Using A ~30G File With A Perl Script
12.1 years ago
J.F.Jiang ▴ 930

Hi all,

Maybe I should not post this kind of question here, but I am used to asking questions here, so I hope you can point me in the right direction.

I have generated a ~30G whole-chromosome LD file using PLINK and want to find all SNPs in LD with a query SNP list.

For this process, I used a Perl script structured like this:

open(FILE, "$path/$filename") or die "Cannot open $path/$filename: $!";
while (<FILE>)
{
    # do LD mapping on this line
}
close(FILE);

Since the raw LD file is so big, the script ends up occupying most of the memory on my machine.

script:

foreach my $tmp1 (@contents)
{
    if ($tmp1 =~ /xxxxxx/)
    {
        print "processing FILE $tmp1......\n";
        open(FILE2, "$tmp1") or die "can not open FILE2!";

        while (<FILE2>)
        {
            my @line = split(/\s+/, $_);
            # store the pair under both SNP IDs so either one can be looked up later
            $hash1{$line[3]} = $line[3]."\n".$line[6]."\n";
            $hash2{$line[6]} = $line[3]."\n".$line[6]."\n";
        }
        close FILE2;
    }
}
So how can I optimize my script to run faster?

Thanks to all!

ld perl

Actually, that open/while structure doesn't read the whole file into memory; something else in your code is filling it up. Can you post the complete script?
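
For illustration, a minimal sketch (not your code) of the difference: reading line by line keeps only the current line in memory, while slurping the whole file into an array is what exhausts RAM.

# line-by-line: only one line held in memory at a time
open(FILE, "$path/$filename") or die "Cannot open file: $!";
while (my $line = <FILE>) {
    # process $line here
}
close(FILE);

# slurping: the entire ~30G file is loaded into @contents at once
open(FILE, "$path/$filename") or die "Cannot open file: $!";
my @contents = <FILE>;
close(FILE);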


So the problem is that you are creating a gigantic hash. Do you need it all in memory? What are you doing after loading the data?


Yes, I must build this hash, because I want to find ALL SNPs in LD within the computed LD data. If I do not build the hash table, it will be a disaster for the permutation process.
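
For context, this is roughly how I use the hashes afterwards (a sketch; @query_snps stands in for my query SNP list):

# report every stored LD pair that involves a query SNP (either column)
foreach my $snp (@query_snps) {
    print $hash1{$snp} if exists $hash1{$snp};
    print $hash2{$snp} if exists $hash2{$snp};
}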

12.1 years ago

As you imply, it is likely that your script is running slowly because you are reading a huge file into a (memory-inefficient) Perl hash. You are probably consuming all the RAM on your machine, and your script then slows to a crawl once it has to use swap. One solution you might consider is BerkeleyDB. I believe the BerkeleyDB Perl module is available by default in modern Perl installations. Your hash can be created in much the same way as now, except that it will live in an efficient file-based database.

use BerkeleyDB;

# Make %hash1 an on-disk database stored in database.dbm. Create file if needed
tie my %hash1, 'BerkeleyDB::Hash',
    -Filename => "database.dbm",
    -Flags    => DB_CREATE
    or die "Couldn't tie database: $BerkeleyDB::Error";

#Read through input file and set contents to hash as normal
while(<FILE2>){
  my @line = split(/\s+/,$_);
  $hash1{$line[3]} = $line[3]."\n".$line[6]."\n";
}
close(FILE2);

#Retrieve information from hash as normal
for my $key (keys %hash1) {
  print "$key -> $hash1{$key}\n";  # iterate values
}

#Once finished with it, you can delete the BerkeleyDB data structure and corresponding file on file system
%hash1 = ();
unlink("database.dbm");

Thank you for your reply. BerkeleyDB is not installed on our server, though, and I do not have the rights to install the module. But together with Zev's answer, I think building such a huge hash in a file-backed store is probably the best way to go.
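
If I cannot install BerkeleyDB, perhaps the bundled DB_File module offers the same tied interface (a sketch under that assumption; database.dbm is just an example filename):

use DB_File;
use Fcntl;   # O_RDWR, O_CREAT

# tie %hash1 to an on-disk DBM file instead of holding it in RAM
tie my %hash1, 'DB_File', "database.dbm", O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie database.dbm: $!";

while (<FILE2>) {
    my @line = split(/\s+/, $_);
    $hash1{$line[3]} = $line[3]."\n".$line[6]."\n";
}
close(FILE2);

untie %hash1;   # flush and close the on-disk hash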

12.1 years ago

I ran into this same problem; it is not trivial. I solved it by using the Perl Data Language (PDL). I built matrices, rather than a hash, containing the chromosome, position, and allelic information. PDL calls into C, so it is very fast.

I have used it on 1KG data and on a population of pigeons.

You are welcome to use my code if you can format your data into the CDR format, which is part of the VAAST pipeline.

It is also threaded for speed.
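
To give a flavour of the approach (a rough sketch only, not the actual code; the file name, column indices, and threshold are placeholders): the relevant columns can be loaded into PDL vectors and queried with vectorised comparisons instead of a Perl hash.

use PDL;
use PDL::IO::Misc;   # rcols

# load, e.g., the two position columns and the r^2 column of the LD table
# (column indices are placeholders for the real layout)
my ($pos_a, $pos_b, $r2) = rcols("plink.ld", 1, 4, 6);

# find all pairs that involve a query position with r^2 above a threshold
my $query = 123456;                                  # hypothetical query position
my $idx   = which(($pos_a == $query) & ($r2 > 0.8));

print $pos_b->index($idx), "\n";                     # LD partners of the query SNP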


Thank you for your reply, good hint!
