I have a variant dataset for 95 plant genomes based on 30x coverage whole-genome resequencing. I use the --indep-pairwise command in PLINK 1.9 to filter variants based on LD within a window of 20,000 variants and run PCA's. My PCA plots show clear differentiation between groups of individuals so I want to assess population structure and admixture in ADMIXTURE.
Running ADMIXTURE 1.3 for various numbers of populations (k = 1:9) the populations recovered match those found in the PCA. However, which k has the lowest cross-validation error is very much dependent on the LD threshold used for filtering. Below is a plot showing cross-validation errors for k=1:9 based on three differently filtered datasets. As you can see, with the largest dataset based on an LD threshold of R^2 = 0.5 (n=114,953 variants), the optimal k = 1. The smaller dataset based on r^2 = 0.3 (n=14,574) has optimal k = 2. The smallest dataset based on R^2 < 0.1 (n=2,034) has optimal k = 4. Also, overall cross-validation errors go up with reducing LD thresholds.
I am not a population geneticist but I find this highly surprising. Why would including more variants in LD reduce both the cross-validation errors and the population structure? I would expect that linked variants would boost any population-level signal rather than reduce it. If anyone can explain these patterns to me I would be much obliged.
More practically, how can I determine the optimal filtering and k for my dataset?
Many thanks in advance.
Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)
LD < 0.5 (and also LD<0.3) are probably too liberal to provide sensible results. You can refer to the manual where authors suggest to prune for LD (https://dalexander.github.io/admixture/admixture-manual.pdf). Also, the selection of the optimal number of K can be performed by using the cross-validation procedure in admixture. This is also explained in the user manual.
Thanks for the suggestions but my graph was already showing cross-validation errors after pruning for LD (with different thresholds). My point was that optimal k changes depending on the threshold used making a choice for threshold and k relatively arbitrary.