Hi biogirl,
I've found plink simple to use, and very fast for calculating r2. If you want to see where the r2 levels-out, so there's no point in going up to 1Mb. In fruit flies, the r2 seems to level-out after about 200bp although results have been published at much further distances. That having been said, places like the HLA region have long-range LD, so if you're studying that ... go 1MB. You can try on different settings of course. Averaging LD across genomes might miss interesting effects, so something like an LD-manhattan plot might be interesting.
I've posted some of my code using plink for calculating r2, and R for plotting the results. Ideally, the bash script would do the calculations in plink, and then automatically plot the correct graph with R, dependent on user command line input.
Calculate r2 across all samples and regions, 10Kb window, maf > 0.1.
mkdir results
plink1.9 --allow-extra-chr --allow-no-sex \
--bfile ${gname} \
--ld-window-kb 10 \
--r2 \
--maf 0.1 \ ## Because it runs faster, and low-frequency variant statistics 'can' be inaccurate.
--out results/${gname}_ld
Start the R script. Safe 'vanilla' settings, log using tee.
R --vanilla < plot_popgen.R | tee mylog_plot_popgen.txt
R code for formatting and plotting LD decay with distance separated by chromosome.
Packages:
library(tidyr) ## re-arranging data frames.
library(dplyr) ## fast and simple coding.
library(car) ## recoding character variables.
library(ggplot2) ## fancy plots.
Custom theme for plots because the native ggplot ones are not perfect:
theme_ld = function() {theme_bw(base_size = 10) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(color = "grey", size = 0.5),
axis.line.x = element_line(size = .5, colour = "black"),
axis.line.y = element_line(size = .5, colour = "black"),
strip.background = element_rect(colour = "white", fill = "white"))}
attr(theme_ld(), "complete")
Read in ye olde data. Takes a minute or so to load 24 million rows with 8G ram.
ld = read.table("my_data.ld", header = TRUE)
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 2 28813 hc_hnmve 2 34345 hc_iwmqn 0.604879
2 2 28813 hc_hnmve 2 35460 hc_tslfx 0.910921
3 2 28813 hc_hnmve 2 36330 hc_bgaeq 0.932208
4 2 28813 hc_hnmve 2 36631 hc_ohnky 0.955198
5 2 34345 hc_iwmqn 2 35460 hc_tslfx 0.600348
Format ye olde data using dplyr ("mutate" means add or change a column).
lddat = ld %>%
sample_frac(.1) %>% ## Take a small, random subset because big data is big.
mutate(kb = (BP_B - BP_A) / 1000 ) %>% ## Get the distance between each marker pair in kilobases.
select(CHR_A, BP_A, BP_B, kb, R2) %>% ## Select only the useful columns.
filter(CHR_A != 6) %>% ## Remove uninteresting chromosomes.
mutate(CHR_A = recode(CHR_A, "1 = 'chrX'; 2 = 'chr2L'; 3 = 'chr2R'; 4 = 'chr3L'; 5 = 'chr3R'")) %>% ## Recode chromosome names zzz.
mutate(mean_pos = ((BP_A + BP_B) /2) / 1e+06) %>%
rename(chromosome = CHR_A) %>%
arrange(chromosome, BP_A) %>% ## Sort.
mutate(logr2 = log(R2)) ## Transform r2 to make it comparable with data published by another group.
And the amazing reformatted data looks like:
chromosome BP_A BP_B kb R2 mean_pos logr2
1 23 184371 184917 0.546 0.352941 0.1846440 -1.04145437
2 23 184917 192656 7.739 0.339545 0.1887865 -1.08014879
3 23 186859 191605 4.746 0.313607 0.1892320 -1.15961467
4 23 188456 191605 3.149 0.942754 0.1900305 -0.05894990
5 23 192656 194397 1.741 0.962043 0.1935265 -0.03869613
6 23 264446 268467 4.021 0.961862 0.2664565 -0.03888429
Plot ye olde data as a scatter plot with overlaid moving average, using ggplot. 'Select' just reduces the input data size. Colour (and group) by chromosome.
ld_genome =
ggplot(lddat %>% select(chromosome, mean_pos, R2), aes(mean_pos, R2, colour = "chromosome")) + ## Select data, specify variables.
geom_point(size = 0.1, alpha = .1) +
stat_smooth(se = TRUE, colour = "blue") +
theme_ld()
Amazing answer, thank you!