I've struggled with this too. This was my approach. Suggestions to improve this is welcome.
When you ask plink for LD r2 using this command,
plink --bfile "snp" --r2 --out "snp"
behind the scenes, --ld-window-r2
is set to 0.2 which means snps with r2 value below 0.2 are ignored. Also pairwise comparisons are limited to 1000 kb windows --ld-window-kb
. Both of these are unsuitable for LD decay plot. We want pairwise comparisons for all r2 values and pairwise comparisons across all snps across the whole chromosome.
But you mostly likely have too many SNPs to do that. So first thing, thin it down evenly. I used mapthin. It works with the plink bim files.
mapthin -b 20 snp.bim snp-thin.bim
Total number of SNPs in original file: 18731824
Number of SNPs in thinned file: 26952 (0.143883%)
Mean base pair position (in file) between SNPs: 50185.6 bpp
This means keep 20 snps every 10^6 bases and create a new bim. The genome length of my organism is 1 bil bases. If your org has even larger genome, you might want to lower this number. Now that you have a thinned set, you can run plink to compute r2 values between all pairs. Set --ld-window-kb
to the length of the longest chromosome in kb. I am not sure what this is --ld-window
, but I am told to set it to some stupidly large number.
plink --bfile "snp-thin" --r2 --ld-window-r2 0 --ld-window 999999 --ld-window-kb 8000 --out "snp-thin"
This exports the file snp-thin.ld
with all pairwise comparisons and r2 values. You should get something like this
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 359 Var-1-359 1 10377 Var-1-10377 1
1 359 Var-1-359 1 30391 Var-1-30391 1
1 359 Var-1-359 1 40409 Var-1-40409 1
...
This could be a fairly big file depending on how many thinned snps you had. It was around 5GB for me. All we want is the distance between each pair and the r2 value. We do that to create a summary file.
cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > snp-thin.ld.summary
Summary file looks like
1 0.157895
1 0.489362
1 1
1 1
1 1
...
Now plotting in R. Read the table into R, add some column names.
library(dplyr)
library(stringr)
library(ggplot2)
dfr <- read.delim("snp-thin.ld.summary",sep="",header=F,check.names=F,stringsAsFactors=F)
colnames(dfr) <- c("dist","rsq")
This data is going to be quite noisy. I found it best to categorise distances into intervals of a fixed length and compute mean r2 within each interval. Below, I group into 10 Kb intervals.
dfr$distc <- cut(dfr$dist,breaks=seq(from=min(dfr$dist)-1,to=max(dfr$dist)+1,by=100000))
Then compute mean and/or median r2 within the blocks
dfr1 <- dfr %>% group_by(distc) %>% summarise(mean=mean(rsq),median=median(rsq))
A helper step to get mid points of our distance intervals for plotting.
dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
mid=start+((end-start)/2))
Now, plot.
ggplot()+
geom_point(data=dfr1,aes(x=start,y=mean),size=0.4,colour="grey20")+
geom_line(data=dfr1,aes(x=start,y=mean),size=0.3,alpha=0.5,colour="grey40")+
labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+
scale_x_continuous(breaks=c(0,2*10^6,4*10^6,6*10^6,8*10^6),labels=c("0","2","4","6","8"))+
theme_bw()
This should produce something like this:
You could also do this on a smaller scale by limiting pairwise comparisons to smaller window size thereby getting more finer on the lower end. Perhaps this is what you want. Otherwise, you look at this and think this is not as interesting as you thought it would be. Then, what you really want to see is the LD block length distribution. That continues as a new answer as I am out of space.
Could you give an example of those paper?
In Additional file 1: Figure S1. https://gsejournal.biomedcentral.com/articles/10.1186/s12711-016-0250-9 and https://www.researchgate.net/figure/Decay-in-linkage-disequilibrium-as-a-function-of-increasing-physical-distance-within-the_fig3_262044877
What is the numbers in column 1 in summary file? Thank you,