Question

Calculating LD Across an Entire Chromosome From the Ensembl Database?

2

13.9 years ago

Tg ▴ 320

Hi, I tried to calculate LD from the ensembl-variation database with the Perl API. I want LD across an entire chromosome, for each SNP against the other SNPs at most 200 kb away.

The output is something like this:

Output.chr1
rsid,rsid,r_square
10045830,10036350,0.8
10045830,10076494,0.4

Here's my current code: http://pastebin.com/F0xNJwA6. I partition the chromosome into chunks and calculate LD for the SNPs in each chunk.

The code runs OK, but there are a couple of problems:

1. I get "segmentation fault" errors. My guess from Googling is that they come from Storable in Perl.
2. It uses a lot of memory, around 6-7 GB.

Is there a better way to do this? I'm pretty new to Perl and these APIs.

ensembl linkage api • 3.7k views

Updated 13.9 years ago by Larry_Parnell 16k • written 13.9 years ago by Tg ▴ 320

score 5 · Answer 1 · 2011-05-04

13.9 years ago

brentp 24k

One thing you could try is to decrease your bin size. Another would be to avoid dereferencing when possible:

@{$ldFeatureContainer->get_all_r_square_values}

can make a copy of the whole list (for instance, when it is assigned to a new array). That's probably the most memory-intensive data structure in your program, so avoiding the copy should help in this case. So instead, maybe something like:

# Index into the array reference directly instead of expanding it with @{...}
for (my $i = 0; $i <= $#{$ldFeatureContainer->get_all_r_square_values}; $i++) {
    my $r_square = $ldFeatureContainer->get_all_r_square_values->[$i];
    ...
}

will keep the big Container as a reference.
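Note that the loop above calls get_all_r_square_values in the loop condition and again in the body. If the accessor does real work on each call (an assumption about the API, not something verified here), that alone could make the loop slow. A minimal sketch of the same idea that calls the accessor once and keeps the returned reference follows. MockContainer is a hypothetical stand-in used only so the example runs without Ensembl; it is not the real class, and its plain numeric values are placeholders for whatever the real accessor returns.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real container; it only counts accessor calls.
package MockContainer;

sub new {
    my ($class, @values) = @_;
    return bless { values => [@values], calls => 0 }, $class;
}

sub get_all_r_square_values {
    my ($self) = @_;
    $self->{calls}++;
    return $self->{values};    # returns an array reference, not a list
}

package main;

my $container = MockContainer->new(0.8, 0.4, 0.1);

# Call the accessor once and keep the reference it returns.
my $values = $container->get_all_r_square_values;

my @strong;
for my $i (0 .. $#{$values}) {
    my $r_square = $values->[$i];    # element access through the reference
    push @strong, $r_square if $r_square >= 0.4;
}

print "accessor calls: $container->{calls}, kept: @strong\n";
```

With this shape the accessor runs once per container, however many values it holds.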


1

Thanks for the tip about dereferencing, I didn't know that. I will try it.

Reply by Tg ▴ 320, 13.9 years ago

0

OK, I tried that, but dereferencing the hash seems to be about 100x faster.

Reply by Tg ▴ 320, 13.9 years ago

Ram · Answer 2 · 2011-05-04

Decreasing bin size is important (+1 to Brent) and relevant. LD in the human genome falls off to negligible values (r^2 < 0.1) at roughly 30-100 kbp from the test marker. If you have bins of, say, 200 kbp, you should see that each is divided into more than one LD block. If one is not, increase the size of that bin and rerun, or fuse it to the end of the neighboring bin.
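One way to keep bins small without losing SNP pairs that straddle a bin boundary is to let consecutive bins overlap by the maximum pair distance. The overlap idea is an assumption added here, not something from the original code. The helper make_bins and the sizes used (a 1 Mbp region, 400 kbp bins, 200 kbp overlap) are hypothetical and purely illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: split positions 1..$length into bins of $bin_size
# bases, where each bin overlaps the previous one by $overlap bases.
sub make_bins {
    my ($length, $bin_size, $overlap) = @_;
    die "overlap must be smaller than bin size\n" if $overlap >= $bin_size;

    my @bins;
    my $start = 1;
    while ($start <= $length) {
        my $end = $start + $bin_size - 1;
        $end = $length if $end > $length;    # clip the last bin
        push @bins, [$start, $end];
        last if $end == $length;
        $start = $end - $overlap + 1;        # step back by the overlap
    }
    return @bins;
}

my @bins = make_bins(1_000_000, 400_000, 200_000);
print join(', ', map { "$_->[0]-$_->[1]" } @bins), "\n";
```

With these numbers, any two positions at most 200 kbp apart fall together in at least one bin. Pairs inside an overlap show up in two bins, so the combined output would need de-duplicating.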

Keep in mind that LD strength is rather variable across the human genome. A map of recombination hotspots (see papers by Gil McVean, or http://www.stats.ox.ac.uk/~mcvean/OXSTAT/GeneticMap_b36/hotspots_b36.txt and http://ftp.hapmap.org/recombination/2006-10_rel21_phaseI+II/hotspots/) may be a good place to start when deciding where to divide your bins.