Do you happen to have your data in VCF format prior to the format conversion? If so, PLINK 1.9 has been an excellent all-in-one tool for GWAS that I've found success using.
Otherwise, I'd think allele frequencies are the way to go in calculating pairwise LD. I'm not sure I fully understand the specifics of the format your data is in, but the way I'd tackle the problem would be:
a) Find a way to read in your data that preserves the frequencies needed for each site. If we're to stick to the idea of allele frequencies, perhaps their final iteration could be a list of vectors of size 4 (I'm assuming biallelic SNPs) containing chromosome number + site + frequencies of the major and minor allele (i.e. c('chr1', 24, 0.58, 0.42)
for position 24 of chr1 with a major allele freq of 0.58 and a MAF of 0.42). Or perhaps a data frame with each row representing a site and each column representing one of those frequency values? I'm not particularly an expert at what the most efficient method might be here, but as long as you figure out something along those lines.
b) Once you've settled on a format, create another function that would calculate parameters needed for LD calculations. The intricacies of this obviously boil down to how your data is organized, but a simple way of conceptualizing D, for instance, might be
dcalc <- function(p11, p22, p12, p21) {
LHS <- p11 * p22
RHS <- p12 * p21
out <- LHS - RHS
return(out)
}
Which would itself necessitate a function to actually parse your data and calculate the parameters I've denoted as p11 etc. This function could take in pairs of records and return said parameters for use in the actual pairwise LD calculation functions.
c) Once these are finalized, load in your data and use nested loops to perform pairwise comparisons. This is ghoulish amounts of slow, I know, and especially so in the loop-averse world of R, but I'm trying to stick to the most accessible way to do this. Let's say you have your data in two separate lists denoting two regions you want to calculate pairwise LD between. Here's an extremely rough idea of what that might look like:
for (a in 1:length(region1)) {
for (b in 1:length(region2)) {
freqs <- freqscalc(region1[a], region2[b]) # assuming this fxn calculates p11 etc
print(dcalc(freqs[1], freqs[2], freqs[3], freqs[4]))
}
}
Obviously you wouldn't want to use print
here in something like RStudio at the risk of polluting your console, so it might be worthwhile to make an executable R script out of this and run it in the shell in order to use Unix redirects for saving your output to a file.
Something along these lines can be used to calculate r2 and D' as well - all it necessitates is actually coding up the functions.
Finally, LD decay could be calculated by having your functions also return the distance between sites they're calculating LD for, and outputting that in addition to the actual pairwise LD statistic between those sites. These values could then be saved to a data frame object, for instance, with each row containing a) a value for distance and b) the amount of LD. Once you've obtained that for all desired comparisons, you could use that data frame to simply plot LD ~ distance.
Hope this helps!