Q1: When I compute the LD details between a pair of SNPs with the following command, why does PLINK report two solutions, as shown below?
plink --bfile 1000G.genotypes --ld rs10005588 rs12505231
Solution #1:
R-sq = 0.000943758 D' = 1
Haplotype    Frequency    Expectation under LE
---------    ---------    --------------------
GA           0            0.000888
AA           0.030815     0.029927
GC           0.028827     0.027939
AC           0.940358     0.941246
In phase alleles are GC/AA
Solution #2:
R-sq = 0.426699 D' = 0.676064
Haplotype    Frequency    Expectation under LE
---------    ---------    --------------------
GA           0.019777     0.000888
AA           0.011038     0.029927
GC           0.009050     0.027939
AC           0.960135     0.941246
In phase alleles are GA/AC
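A quick consistency check on the two solutions above may help here. Both tables imply the same allele frequencies (G = 0.028827 at the first SNP, A = 0.030815 at the second), which is why the "Expectation under LE" column is identical; only the split into haplotypes differs. The sketch below recomputes R-sq and D' from the listed haplotype frequencies using the standard textbook definitions. The function name `ld_stats` and the dictionary layout are my own choices for illustration, and this is not PLINK's code.

```python
def ld_stats(freqs):
    """Return (r2, D') from haplotype frequencies.

    Keys are two-letter haplotypes: first letter is the allele at SNP 1,
    second letter is the allele at SNP 2 (as in the PLINK output above).
    """
    # Allele frequencies are the marginals of the haplotype table.
    p_g = freqs["GA"] + freqs["GC"]  # G at the first SNP
    p_a = freqs["GA"] + freqs["AA"]  # A at the second SNP

    # D = observed GA frequency minus its expectation under linkage equilibrium.
    d = freqs["GA"] - p_g * p_a
    r2 = d * d / (p_g * (1 - p_g) * p_a * (1 - p_a))

    # D' rescales |D| by the largest value it could take given the marginals.
    if d >= 0:
        d_max = min(p_g * (1 - p_a), (1 - p_g) * p_a)
    else:
        d_max = min(p_g * p_a, (1 - p_g) * (1 - p_a))
    return r2, abs(d) / d_max


solution_1 = {"GA": 0.0, "AA": 0.030815, "GC": 0.028827, "AC": 0.940358}
solution_2 = {"GA": 0.019777, "AA": 0.011038, "GC": 0.009050, "AC": 0.960135}

for name, sol in (("Solution #1", solution_1), ("Solution #2", solution_2)):
    r2, d_prime = ld_stats(sol)
    print(f"{name}: R-sq = {r2:.6g}  D' = {d_prime:.6g}")
```

The printed values should agree with the R-sq and D' lines in the PLINK output, up to the rounding of the six-decimal frequencies.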
Q2: According to the PLINK documentation, the "Frequency" column gives the "observed" frequency of each haplotype, so why do the frequencies differ between the two solutions?
"To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r2 and D'. When there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r/--r2), along with HWE exact test statistics. by PLINK1.9"
Thanks for your comments. But I don't quite understand why PLINK doesn't directly count the real haplotypes in the samples, since the 1000G genotypes are phased.
PLINK 2.0 --ld does count the real haplotypes.
PLINK 1.x is incapable of that because its core file format can’t represent phase.
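To illustrate why phase matters, here is a toy sketch in Python. The three samples are made up for illustration, the helper names `count_haplotypes` and `unphase` are hypothetical, and none of this is PLINK's code. With phased data, haplotypes can simply be counted. Once phase is dropped, a sample that is heterozygous at both SNPs is compatible with either GA/AC or GC/AA, so the haplotype frequencies have to be estimated instead.

```python
from collections import Counter

# Hypothetical phased samples: each sample is a pair of haplotypes (one per
# chromosome copy); each haplotype is (allele at SNP 1, allele at SNP 2).
phased = [
    (("G", "A"), ("A", "C")),  # double heterozygote, phase GA/AC
    (("G", "C"), ("A", "A")),  # double heterozygote, phase GC/AA
    (("A", "C"), ("A", "C")),  # homozygous at both SNPs
]


def count_haplotypes(samples):
    """Count haplotypes directly; this is only possible when phase is known."""
    return Counter(hap for pair in samples for hap in pair)


def unphase(sample):
    """Drop phase: keep only the unordered genotype at each SNP."""
    (a1, b1), (a2, b2) = sample
    return (tuple(sorted((a1, a2))), tuple(sorted((b1, b2))))


# Direct counts from the phased data.
print(count_haplotypes(phased))

# The first two samples carry different haplotypes, yet their unphased
# genotypes are identical, so an unphased format cannot tell them apart.
print(unphase(phased[0]) == unphase(phased[1]))
```

The second print shows the ambiguity: both double heterozygotes collapse to the same unphased genotype.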
Since your "1000G.genotypes" fileset is in PLINK 1.x's format (.bed), even running PLINK 2.0 --ld on it won't work; the phase information was lost during the original conversion to .bed. Use the files at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 with PLINK 2.0 instead.
Thanks for your prompt and deep insights!