LD-decay in a r2 vs distance(cm) plot
3
7
Entering edit mode
6.7 years ago
adele ▴ 80

I would like to create a LD vs distance(cm) plot in R using an output from PLINK. I have been trying with different --ld-window-kb numbers but I can not plot them in R. Furthermore, I have a question about average r2. In some papers ld_decay is plotted average r2 vs distance and some papers used r2 vs distance. Can everybody explain more?
thanks for your attention

R snp • 26k views
ADD COMMENT
0
Entering edit mode

Could you give an example of those paper?

ADD REPLY
0
Entering edit mode

What is the numbers in column 1 in summary file? Thank you,

ADD REPLY
26
Entering edit mode
6.7 years ago
firestar ★ 1.6k

I've struggled with this too. This was my approach. Suggestions to improve this is welcome.

When you ask plink for LD r2 using this command,

plink --bfile "snp" --r2 --out "snp"

behind the scenes, --ld-window-r2 is set to 0.2 which means snps with r2 value below 0.2 are ignored. Also pairwise comparisons are limited to 1000 kb windows --ld-window-kb. Both of these are unsuitable for LD decay plot. We want pairwise comparisons for all r2 values and pairwise comparisons across all snps across the whole chromosome.

But you mostly likely have too many SNPs to do that. So first thing, thin it down evenly. I used mapthin. It works with the plink bim files.

mapthin -b 20 snp.bim snp-thin.bim

Total number of SNPs in original file: 18731824
Number of SNPs in thinned file: 26952 (0.143883%)
Mean base pair position (in file) between SNPs: 50185.6 bpp

This means keep 20 snps every 10^6 bases and create a new bim. The genome length of my organism is 1 bil bases. If your org has even larger genome, you might want to lower this number. Now that you have a thinned set, you can run plink to compute r2 values between all pairs. Set --ld-window-kb to the length of the longest chromosome in kb. I am not sure what this is --ld-window, but I am told to set it to some stupidly large number.

plink --bfile "snp-thin" --r2 --ld-window-r2 0 --ld-window 999999 --ld-window-kb 8000 --out "snp-thin"

This exports the file snp-thin.ld with all pairwise comparisons and r2 values. You should get something like this

 CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
     1          359         Var-1-359      1        10377       Var-1-10377            1
     1          359         Var-1-359      1        30391       Var-1-30391            1
     1          359         Var-1-359      1        40409       Var-1-40409            1
...

This could be a fairly big file depending on how many thinned snps you had. It was around 5GB for me. All we want is the distance between each pair and the r2 value. We do that to create a summary file.

cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > snp-thin.ld.summary

Summary file looks like

1       0.157895
1       0.489362
1       1
1       1
1       1
...

Now plotting in R. Read the table into R, add some column names.

library(dplyr)
library(stringr)
library(ggplot2)

dfr <- read.delim("snp-thin.ld.summary",sep="",header=F,check.names=F,stringsAsFactors=F)
colnames(dfr) <- c("dist","rsq")

This data is going to be quite noisy. I found it best to categorise distances into intervals of a fixed length and compute mean r2 within each interval. Below, I group into 10 Kb intervals.

dfr$distc <- cut(dfr$dist,breaks=seq(from=min(dfr$dist)-1,to=max(dfr$dist)+1,by=100000))

Then compute mean and/or median r2 within the blocks

dfr1 <- dfr %>% group_by(distc) %>% summarise(mean=mean(rsq),median=median(rsq))

A helper step to get mid points of our distance intervals for plotting.

dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
                        end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
                        mid=start+((end-start)/2))

Now, plot.

  ggplot()+
  geom_point(data=dfr1,aes(x=start,y=mean),size=0.4,colour="grey20")+
  geom_line(data=dfr1,aes(x=start,y=mean),size=0.3,alpha=0.5,colour="grey40")+
  labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+
  scale_x_continuous(breaks=c(0,2*10^6,4*10^6,6*10^6,8*10^6),labels=c("0","2","4","6","8"))+
  theme_bw()

This should produce something like this:

enter image description here

You could also do this on a smaller scale by limiting pairwise comparisons to smaller window size thereby getting more finer on the lower end. Perhaps this is what you want. Otherwise, you look at this and think this is not as interesting as you thought it would be. Then, what you really want to see is the LD block length distribution. That continues as a new answer as I am out of space.

ADD COMMENT
1
Entering edit mode

For fast LD computation directly from a VCF file, see here.

ADD REPLY
0
Entering edit mode

Thanks for your comment. That was helpful.

ADD REPLY
0
Entering edit mode

Thanks for sharing that! It helped me a lot! Let me a question please? 1) You created mid column in the command:

dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
                        end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
                        mid=start+((end-start)/2))

But it was not used during the plotting. Am i right? If so, why we need this column?

Thanks, Denis

ADD REPLY
1
Entering edit mode

You are right. I don't use it here. The mid variable is just used to find the mid point of the interval windows. When plotting, you could use mid instead of start. That was my plan, but then it turns out, it doesn't make much difference for this particular plot.

ADD REPLY
0
Entering edit mode

Hi, This post is very helpful. I had a doubt with the following code:

cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > snp-thin.ld.summary

I was wondering whether the "cat" is a function in plink/mapthin or it refers to separate software? Thanks in advance. Manju

ADD REPLY
0
Entering edit mode

Hi! cat is linux shell command. Please check: http://www.linfo.org/cat.html

ADD REPLY
0
Entering edit mode

Thnaks for sharing this tutorial. I have a question about mapthin: I used .bim files to thin my data, but when I run PLINK to do LD, I get an error message ( Invalid .bed file size (expected 22158 bytes)). That means my .bed file is steal unchanged (contains all genotypes).

My question is : How to extract genotypes for SNPs contained in thinned bim file from bed file? to get a bed file with genotypes for thinned markers.

Thanks, Abdel

ADD REPLY
0
Entering edit mode

This post is great!

I am getting into the same issue, did you find a solution? Would you mind to share?

Thanks,

Beatriz

ADD REPLY
1
Entering edit mode

I had the same issue as you, and I think I figured out a way to re-integrate the mapthin output.

Assuming that, as described above, you ran mapthin as "mapthin -b 20 snp.bim snp-thin.bim"

You can then run the following to reformat the sites in your original bfile according to the mapthin output: plink --bfile snp --extract snp-thin.bim --recode --make-bed --out snp-thin2

You'll then get updated files and should be able to use --bfile snp-thin2 for plink analyses.

I hope this helps!

ADD REPLY
0
Entering edit mode

I ran something like that:

vcftools --gzvcf output.bcf.flts.vcf.gz --plink --out output_in_plink1
plink --noweb --file output_in_plink1 --make-bed --out plinkedout
plink --bfile plinkedout --extract plinkedout_thin.bim --recode --make-bed --out plinkedout_thin
plink --bfile "plinkedout_thin" --r2 --ld-window-r2 0 --ld-window 999999 --ld-window-kb 8000 --out "snp-thin"
cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > MYLD.ld.summary

And ran the R script.

But I had that as a result... where the y scale is so small... not sure what is happening here...

ADD REPLY
0
Entering edit mode

Hi, Thank you for such a nice tutorial. can you please look into this:

after dfr1 <- dfr %>% group_by(distc) %>% summarise(mean=mean(rsq),median=median(rsq))

i get a warning message: Warning message: Factor distc contains implicit NA, consider using forcats::fct_explicit_na

and in the end the graph is empty. Can you please tell me what can be the problem and how to go forward from here? Thank you in advance.

ADD REPLY
0
Entering edit mode

Hello,

Thank you so much for this tutorial - it is very helpful.

I was wondering if there is a way to run plink --r2 on a population level? I.e. One line on the decay plot per set on individuals?

Thank you in advance. Eve

ADD REPLY
8
Entering edit mode
6.7 years ago
firestar ★ 1.6k

For LD block length distribution, I ran this on the thinned SNP set.

plink --bfile "snp-thin"  --blocks --blocks-max-kb 200 --out "snp-thin"

This produces snp-thin.blocks.det. Then some R.

dfr <- read.delim("snp-thin.blocks.det",sep="",header=T,check.names=F,stringsAsFactors=F)
colnames(dfr) <- tolower(colnames(dfr))

# ld block density
p <- ggplot(dfr,aes(x=kb))+
  geom_density(size=0.5,colour="grey40")+
  labs(x="LD block length (Kb)",y="Density")+
  theme_bw()

ggsave("snp-thin-ld-blocks.png",p,height=8,width=8,units="cm",dpi=250)

enter image description here

Now you see LD block length distribution for your set of samples.

ADD COMMENT
1
Entering edit mode

I had so much fun learning with you. Your answers were very insightful and interactive, so thank you for sharing your wisdom with me.

ADD REPLY
2
Entering edit mode
6.7 years ago
hewm2008 ▴ 50

I thinks this software(PopLDdecay) can help you to draw LDdecay Figure directly, https://github.com/BGI-shenzhen/PopLDdecay

The simplest usage

./bin/PopLDdecay    -InVCF  SNP.vcf.gz  -OutStat LDdecay   [#other para:   -SubPop    GroupA_sample.list]
  perl  bin/Plot_OnePop.pl  -inFile   LDdecay.stat.gz  -output  Fig

Use Data of this web site to test follow software ( ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ ) only with two based site in chr22 (minimal SNP database) of the 1000 Genomes Project. ALL the pair-wise SNP R^2 is the same.

| 板本号        |     平均内存      |  核心程序计算时间       | 输入格式转化和结果统计时间        | 输出结果|
| version       |    Average memory |  Core calculation CPUs  | Format conver&Statistics Time CPU| result size|

|Plink 1.07     |    1.4G           |     680min              | 5 min+45min     |         54G
|Plink 2.0      |    18.807G        |      25min              | 5 min+45min     |         54G
|Haploview 4.2  |    95.760G        |     3904min             | 5 min+45min     |         54G
|PopLDdcay 3.30 |    2.5G           |     200min              |  0min           |         4.1M

if all chr (1-22) were calculated, it will take too much hard disk space to stored the temporary result.

ADD COMMENT
0
Entering edit mode

Dear Sir, I sent you an email (hewm2008@gmail.com,hewm2008@qq.com) with an issue on PopLDdecay. Could you please have a look? Manju

ADD REPLY
0
Entering edit mode

I have already sent back the mail. any question are welcome.

ADD REPLY

Login before adding your answer.

Traffic: 2535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6