Question

QTL Analysis Using Principal Components as Covariates

2


11.1 years ago

John ▴ 70

Hi all,

What is the reason for including principal components (PCs) as covariates in QTL analysis, and how does one decide how many PCs to include? For example, the following is a short excerpt from a paper. I understand that including imputation status adjusts for potential imputation bias. But what do the PCs correct for? Thank you in advance!

The details of sample sets, data filtering and normalization are discussed above. Briefly, we did transcriptome QTL mapping separately for European (n=373) and Yoruba (n=89) populations. We used genetic variants with MAF>5% in either EUR or YRI <1MB from transcription start site, with covariates of imputation status (0|1), PCs 1-3 for Europeans and PCs 1-2 for Yoruba.

RNA-Seq eqtl • 5.7k views

ADD COMMENT • link updated 6.7 years ago by GouthamAtla 12k • written 11.1 years ago by John ▴ 70

1


Check out this paper: Price et al. (2006), "Principal components analysis corrects for stratification in genome-wide association studies", Nature Genetics 38:904-909.

ADD REPLY • link 11.1 years ago by matted 7.8k

0


Thanks for the paper!

ADD REPLY • link 11.1 years ago by John ▴ 70

score 1 · Answer 1 · 2013-11-19

It would help if you provided the citation, but most likely the authors are trying to minimize the effect of cryptic (i.e. hidden and unplanned) genetic structure, usually called population stratification, as a confounder in their eQTL study. If you sample people from a single population and test for association with the genotypes from that population, you'd like the only things affecting the dependent variable (here, expression) to be the genotype and the other "official" covariates. However, a population-based sample can contain subgroups of subjects with systematic differences in their genetic background.

Let's say you have patients from the North and from the South, and patients within a geographic group are genetically more similar to each other than to patients in the other group. Sometimes these differences will co-vary with your dependent variable and mislead you about the effect of a given genotype. One way this happens: the two subgroups have different minor allele frequencies at a given locus or set of loci, and within each subgroup the genotype has no association with the dependent variable. If the dependent variable nevertheless differs between the subgroups for some other reason, the pooled analysis will make it look as if the genotype is associated with the variable, when the real association is with subgroup membership.

Another case is where you think all your patients share a single ethnic background (and therefore a broadly similar genetic background), but a minority of them carry a significant genetic contribution from some other ancestry. Usually you'd like to remove those effects as best you can, so that you test only the effect of genotype on your dependent variable.
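To make the North/South scenario concrete, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration (the allele frequencies 0.2 and 0.7, the expression baselines, the sample size, and the helper names `slope`, `center_within` and `simulate`); it is not the method from the paper in the question. Expression depends only on the subgroup, never on the genotype, yet a pooled regression still shows a clear slope. Adjusting for the subgroup (done here by centering within each group, which is equivalent to adding a group indicator as a covariate) should bring the slope back to roughly zero.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def center_within(values, groups):
    """Subtract each group's mean (same as adjusting for a group indicator)."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def simulate(n_per_group=2000, seed=1):
    """Two subgroups with hypothetical allele frequencies and baselines."""
    rng = random.Random(seed)
    groups, genotype, expression = [], [], []
    for label, freq, baseline in (("north", 0.2, 0.0), ("south", 0.7, 1.0)):
        for _ in range(n_per_group):
            # Genotype = number of minor alleles (0, 1 or 2).
            g = (rng.random() < freq) + (rng.random() < freq)
            # Expression depends on the subgroup only, not on the genotype.
            e = baseline + rng.gauss(0.0, 1.0)
            groups.append(label)
            genotype.append(g)
            expression.append(e)
    return groups, genotype, expression

groups, genotype, expression = simulate()
naive = slope(genotype, expression)
adjusted = slope(center_within(genotype, groups),
                 center_within(expression, groups))
print(f"naive slope (pooled):        {naive:.3f}")
print(f"slope adjusted for subgroup: {adjusted:.3f}")
```

In real data the subgroup label is usually not known, so this kind of direct adjustment is only possible in a simulation like this one.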

The PCA in this case is an attempt to capture the largest axes of variation in the genome-wide genotype data, which in practice tend to reflect ancestry. Including the top PCs as covariates adjusts for that variation and thus reduces the effect of cryptic structure. You would probably choose the "correct" number of PCs empirically; I don't know whether there is an established rule for this, or whether it's simply part of the practice of genetic epidemiology to look for the PCs that appear to affect the analysis and adjust for them.
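One way to do the empirical check is sketched below with NumPy on simulated data. All numbers (two subgroups of 150, 400 SNPs, the drift in allele frequencies, the |t| > 3 cutoff) and the helper name `n_hits` are assumptions chosen for illustration, not an established rule and not the pipeline from the paper. No SNP has a true effect, so every hit is a false positive; the idea is to watch how the number of hits changes as more genotype PCs are added as covariates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_snps = 150, 400

# Hypothetical allele frequencies: each subgroup drifts away from a
# shared ancestral frequency, so the groups differ at many SNPs.
ancestral = rng.uniform(0.1, 0.9, n_snps)
freqs = np.clip(ancestral + rng.normal(0, 0.1, (2, n_snps)), 0.05, 0.95)
group = np.repeat([0, 1], n_per_group)
genotypes = rng.binomial(2, freqs[group]).astype(float)  # samples x SNPs

# Expression depends on the subgroup only; no SNP has a real effect.
expression = 1.0 * group + rng.normal(0, 1, group.size)

# PCs = left singular vectors of the standardized genotype matrix.
sd = genotypes.std(axis=0)
keep = sd > 0  # drop monomorphic SNPs, which cannot be standardized
standardized = (genotypes[:, keep] - genotypes[:, keep].mean(axis=0)) / sd[keep]
u, s, _ = np.linalg.svd(standardized, full_matrices=False)

def n_hits(k, threshold=3.0):
    """Count SNPs with |t| > threshold after adjusting for the top k PCs."""
    covars = np.column_stack([np.ones(group.size), u[:, :k]])
    # Residualize expression and genotypes on the covariates; regressing
    # residual on residual gives the covariate-adjusted association.
    proj = covars @ np.linalg.pinv(covars)
    y = expression - proj @ expression
    x = standardized - proj @ standardized
    r = (x * y[:, None]).sum(axis=0) / np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    dof = group.size - covars.shape[1] - 1
    t = r * np.sqrt(dof / (1 - r ** 2))
    return int((np.abs(t) > threshold).sum())

hits = {k: n_hits(k) for k in range(4)}
for k, count in hits.items():
    print(f"{k} PCs as covariates: {count} SNPs with |t| > 3")
```

With only two simulated subgroups, one PC should absorb most of the structure, and adding more should change little. In real data the picture is rarely that clean, so the count (or a QQ plot of the test statistics) is typically inspected over a range of PC numbers.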