As implied by the paper behind the software, Price et al. (2006), one would use the eigenvectors ("ancestries of individuals") from EIGENSTRAT directly as covariates in a subsequent linear or logistic regression. However, these eigenvectors are orthonormal, so each has unit length and hence the same variance. In other words, the variation along each axis (eigenvector) is the same, which does not seem right to me: the variation along an axis should be proportional to its associated eigenvalue (lambda). So I think the correct thing is to multiply eigenvec_k by the square root of lambda_k and feed the result into the regression model as a covariate. Moreover, it can be shown that eigenvec_k * sqrt(lambda_k) is just the kth score vector for the individuals if one runs PCA on the n x p genotype matrix rather than on its transpose, p x n (n = sample size; p = number of SNPs); the latter is what the Price paper uses.
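To make the claimed identity concrete, here is a minimal numerical sketch. It assumes simulated 0/1/2 genotypes, centers each SNP across individuals, and skips the per-SNP variance normalization and the overall scaling constant that EIGENSTRAT applies, so the eigenvalues here differ from the software's by a constant factor. All variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                                      # hypothetical sample size and SNP count
G = rng.integers(0, 3, size=(n, p)).astype(float)   # simulated 0/1/2 genotypes
X = G - G.mean(axis=0)                              # center each SNP across individuals

# Price-style route: eigendecomposition of the n x n individual-by-individual matrix
lam, vec = np.linalg.eigh(X @ X.T)
order = np.argsort(lam)[::-1]                       # eigh returns ascending order
lam, vec = lam[order], vec[:, order]

# PCA route on the n x p matrix: the score vectors are the columns of U * S
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * S

k = 3
scaled = vec[:, :k] * np.sqrt(lam[:k])              # eigenvec_k * sqrt(lambda_k)

# Eigenvectors are only defined up to sign, so compare absolute values
match = all(np.allclose(np.abs(scaled[:, j]), np.abs(scores[:, j])) for j in range(k))

raw_var = vec[:, :k].var(axis=0)                    # identical for every eigenvector
scaled_var = scaled.var(axis=0)                     # proportional to lambda_k
```

In this sketch `match` comes out true, every entry of `raw_var` equals 1/n, and `scaled_var` equals `lam[:k] / n`, which is the point made above about the two sets of axes.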
The whole point of running EIGENSTRAT is to adjust for population structure when testing a SNP's effect, and rescaling a covariate by a nonzero constant does not change the fitted model, so the significance of a SNP does not depend on the multiplication by sqrt(lambda_k) mentioned above. Even so, I think we should use the right PC axes. I would be very grateful for any corrections and comments on this topic.
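The invariance of the SNP's significance can be checked with a small sketch. It assumes ordinary least squares on simulated data, with stand-in orthonormal columns in place of real EIGENSTRAT output and made-up eigenvalues; the helper `snp_t_stat` is hypothetical and only the linear case is shown.

```python
import numpy as np

def snp_t_stat(y, snp, covars):
    """OLS t-statistic of the SNP coefficient, with an intercept and the given covariates."""
    D = np.column_stack([np.ones(len(y)), snp, covars])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ beta
    sigma2 = resid @ resid / (len(y) - D.shape[1])   # residual variance estimate
    cov = sigma2 * np.linalg.inv(D.T @ D)            # covariance of the coefficients
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
n = 100
pcs = np.linalg.qr(rng.normal(size=(n, 2)))[0]       # stand-in orthonormal "eigenvectors"
lam = np.array([9.0, 4.0])                           # hypothetical eigenvalues
snp = rng.integers(0, 3, size=n).astype(float)       # simulated genotype at one SNP
y = 0.5 * snp + pcs @ np.array([3.0, -2.0]) + rng.normal(size=n)

t_raw = snp_t_stat(y, snp, pcs)                      # eigenvectors as covariates
t_scaled = snp_t_stat(y, snp, pcs * np.sqrt(lam))    # eigenvec_k * sqrt(lambda_k) as covariates
```

The two statistics agree up to floating-point error, because rescaling the covariate columns leaves their span unchanged; only the coefficients on the PCs themselves are rescaled.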