Calculating Ld In Entire Chromosome From Ensembl Database ?
2
2
Entering edit mode
13.7 years ago
Tg ▴ 320

Hi, I tried to calculate LD from ensembl-variation database with perl-api. I want a LD from entire chromosome for SNP with at most 200k for others.

The output is something like this

Output.chr1
rsid,rsid,r_square
10045830,10036350,0.8
10045830,10076494,0.4

Here's my current code http://pastebin.com/F0xNJwA6. I partition chromosome into chunk and calculate SNP on each chunk.

The code run ok, but a little bit of problem 1. I got some error on "segmentation fault". My guess from google is storable in perl. 2. It use alot of memory. Around 6-7GB.

Is there any better way to do this ? I'm pretty new to perl and these api.

ensembl linkage api • 3.5k views
ADD COMMENT
5
Entering edit mode
13.7 years ago
brentp 24k

One thing you could try is to decrease your bin-size. Another would be to avoid de-referencing when possible::

@{$ldFeatureContainer->get_all_r_square_values}

is going to make a copy. And that's probably the most memory intensive data-structure in your program, so not making a copy will help you in this case. So instead, maybe something like:

for ($i=0; $i <= $#{$ldFeatureContainer->get_all_r_square_values}; $i++) { 
    my $r->square = $ldFeatureContainer->get_all_r_square_values->[$i];
    ...
}

will keep the big Container as a reference.

ADD COMMENT
1
Entering edit mode

thanks for a tip about de-reference, I didn't knew that. I will try that.

ADD REPLY
0
Entering edit mode

Ok, I tried that, but dereference a hash seems to be like 100x faster.

ADD REPLY
2
Entering edit mode
13.7 years ago

Decreasing bin size is important (+1 to Brent) and relevant. LD falls off in the human genome to negligible values (r^2 < 0.1) at between 30 - 100 kbp from the test marker. If you have bins of say 200 kbp, you should see that these are divided into more than one LD block. If not, then increase the size of that bin and rerun - or fuse to the end of the neighboring bin.

Keep in mind that LD strength is rather variable across the human genome. A map of recombination hotspots (see papers by Gil McVean, or http://www.stats.ox.ac.uk/~mcvean/OXSTAT/GeneticMap_b36/hotspots_b36.txt http://ftp.hapmap.org/recombination/2006-10_rel21_phaseI+II/hotspots/) may be a good place to start dividing your bins.

ADD COMMENT
0
Entering edit mode

The reason I choose 200k is that I want to compare it with LD from hapmap. They use 200k from test marker. But thanks for the tips.

ADD REPLY

Login before adding your answer.

Traffic: 2049 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6