Question

Should genotype PCA in eQTL analysis be performed using all SNPs, or can it be done with a subset of SNPs?

17 months ago

maximal_life ▴ 20

Hello, I'm a beginner in eQTL analysis. I'm using a published variant dataset, and I want to detect eQTLs from a list of variants located in enhancer elements. As expected, these variants are a very small subset of all SNPs.

For adjustment, I'm trying to generate genotype PCs with PLINK v1.9. But I wonder whether it is okay to do the PCA with all genotypes when the regression uses only a subset of them. Should I do the PCA with all genotypes, so that I adjust for overall population structure, or with only the genotypes I will test, so that I avoid excessive, unnecessary adjustment?

Thanks all in advance

genotype eQTL SNP PCA • 633 views

updated 17 months ago by LauferVA 4.5k • written 17 months ago by maximal_life ▴ 20

score 0 · Answer 1 · 2023-06-27

First, a related point.

Generally, best practice is to perform PCA of study data in the context of a large, global dataset having characteristics similar to one's own. For example, if you were doing a PCA on some Omni 5M chips, and there is a huge study of diverse populations done on the Omni 5M, this might be a good background dataset. The reason is that PC loadings have been shown to be more accurately calibrated when the totality of the available population variation is represented.

Now, to address the question.

You should do the PCA on the highest quality set of genotyped variants that you can run with reasonable speed. Increasing the number of variants used eventually has diminishing returns, and in addition increases wall time more than linearly. So, at some point it is not worth including additional variants.

In general, you want to include several variants per linkage block. Much of the information available at a locus can be captured by genotyping just a few positions between two nearby LD hotspots. Variants on the other side of a hotspot carry no linkage information about that block, so you'll need to add a few more for each linkage block as you go. This practice maximizes information gain from a relatively small number of inputs.
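
One common way to approximate "a few variants per linkage block" is LD pruning followed by PCA on the pruned set. The sketch below only builds the two PLINK 1.9 command lines (using the `--indep-pairwise`, `--extract`, and `--pca` flags) and prints them. The file prefixes (`study`, `study_pruned`, `study_pca`) are hypothetical, and the pruning settings (50-variant window, step of 5, r² threshold of 0.2) are commonly used values that I am assuming here, not something taken from your data.

```python
import shutil
import subprocess


def ld_prune_cmd(bfile, out, window=50, step=5, r2=0.2):
    """Build a PLINK 1.9 LD-pruning command.

    PLINK writes the retained variant IDs to <out>.prune.in.
    """
    return ["plink", "--bfile", bfile,
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", out]


def pca_cmd(bfile, prune_in, out, n_pcs=10):
    """Build a PLINK 1.9 PCA command restricted to the pruned variants.

    PLINK writes the PCs to <out>.eigenvec and <out>.eigenval.
    """
    return ["plink", "--bfile", bfile,
            "--extract", prune_in,
            "--pca", str(n_pcs),
            "--out", out]


def run_all(cmds, dry_run=True):
    """Print each command; execute it only if dry_run is False and plink is on PATH."""
    have_plink = shutil.which("plink") is not None
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run and have_plink:
            subprocess.run(cmd, check=True)


# Hypothetical file prefixes; replace with your own .bed/.bim/.fam prefix.
cmds = [
    ld_prune_cmd("study", "study_pruned"),
    pca_cmd("study", "study_pruned.prune.in", "study_pca"),
]
run_all(cmds)  # dry run: prints the two commands without executing them
```

The PCs in `study_pca.eigenvec` would then go in as covariates when you test your enhancer variants. Note that the PCA here is run on genome-wide pruned variants, not on the enhancer subset itself.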

Finally, both PCA and imputation should be run on variants that have similar quality characteristics in cases and controls. This is particularly crucial in meta-analyses, where technical or artifactual differences between genotyping chips can affect imputation probabilities across batches, or in cases but not in controls.
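
One possible check for "similar quality in cases and controls" is differential missingness. PLINK 1.9 has a `--test-missing` option that writes a per-variant table to `<out>.missing`. The sketch below assumes that table has a header row containing `SNP` and `P` columns (please verify against your own output), and the helper name `differentially_missing`, the threshold of 1e-5, and the example rows are all made up for illustration.

```python
def differentially_missing(lines, p_threshold=1e-5):
    """Return IDs of variants whose missingness differs between cases and controls.

    `lines` is an iterable of whitespace-delimited rows, header first.
    Assumed columns (to be checked against real output): SNP and P.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    header, body = rows[0], rows[1:]
    snp_col, p_col = header.index("SNP"), header.index("P")
    flagged = []
    for row in body:
        try:
            p = float(row[p_col])
        except ValueError:
            continue  # skip non-numeric P values such as "NA"
        if p < p_threshold:
            flagged.append(row[snp_col])
    return flagged


# Made-up example rows, not real data.
example = """ CHR  SNP  F_MISS_A  F_MISS_U  P
1 rs1 0.001 0.002 0.71
1 rs2 0.080 0.001 3.2e-09
2 rs3 0.010 0.012 NA
""".splitlines()

print(differentially_missing(example))  # ['rs2']
```

The flagged IDs could be written to a file and passed to PLINK's `--exclude` before pruning and PCA, so that batch-driven variants do not end up shaping the PCs.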

In summary, there are several good reasons NOT to use all variants to do the PCA, including speed, variable genotyping accuracy, and the potential to introduce bias.