Question

Inspect the LD relation between a single pair of SNPs in detail using Plink --ld

0

Entering edit mode

5.6 years ago

Tao ▴ 540

When I calculate the LD details between a pair of SNPs using following command as an example, why there might be two solutions shown as following (Q1)?

plink --bfile 1000G.genotypes --ld rs10005588 rs12505231

>

Solution #1:
   R-sq = 0.000943758    D' = 1

   Haplotype     Frequency    Expectation under LE
   ---------     ---------    --------------------
          GA      0                       0.000888
          AA      0.030815                0.029927
          GC      0.028827                0.027939
          AC      0.940358                0.941246

   In phase alleles are GC/AA

Solution #2:
   R-sq = 0.426699       D' = 0.676064

   Haplotype     Frequency    Expectation under LE
   ---------     ---------    --------------------
          GA      0.019777                0.000888
          AA      0.011038                0.029927
          GC      0.009050                0.027939
          AC      0.960135                0.941246

   In phase alleles are GA/AC

Q2: As described in PLINK, the "Frequency" column represents "observed" frequencies of each haplotype, so why the "Frequencies" in the two solutions are different?

"To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r2 and D'. When there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r/--r2), along with HWE exact test statistics. by PLINK1.9"

plink --ld LD • 1.8k views

ADD COMMENT • link updated 5.6 years ago by chrchang523 11k • written 5.6 years ago by Tao ▴ 540

score 1 · Answer 1 · 2019-11-25

Q1. Humans have two distinct copies of every autosome (setting aside trisomy 21, etc. for now).

Suppose someone is heterozygous for both variants: they have both an A and a G at rs10005588, and they have both an A and a C at rs12505231. Then there are two haplotype possibilities:

One copy of chromosome 4 has both As, and the other copy has a G at rs10005588 and a C at rs12505231.
One copy of chromosome 4 has an A at rs10005588 and a C at rs12505231, and the other copy has a G at rs10005588 and an A at rs12505231.

Many analyses do not care about the difference between these two possibilities--they can ignore 'phase'. However, the --ld computation actually needs a separate frequency estimate for each haplotype. This is where the CubeX reference (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-428 ) comes into play: the solutions of that cubic equation are the local maxima of a commonly used haplotype-frequency likelihood function. PLINK 1.9 --ld reports all the local maxima.

With that said, you're probably most interested in the global maximum; in retrospect, PLINK 1.9 --ld should have labeled that, too. PLINK 2.0 --ld corrects this oversight.

Q2. As mentioned above, PLINK usually doesn't know the actual haplotypes for samples which are heterozygous for both variants. (Exception: you are using PLINK 2 .pgen files constructed from a phased VCF. Note that the Eagle (https://data.broadinstitute.org/alkesgroup/Eagle/ ), SHAPEIT4 (https://odelaneau.github.io/shapeit4/ ), and Beagle (https://faculty.washington.edu/browning/beagle/beagle.html ) software packages can produce pretty good phased VCFs when you don't have any phase information from the original genotyping/sequencing process.) The frequency column displays the haplotype-frequency guesses that PLINK is using.