Question

PCA of a genotype matrix

4


7.9 years ago

oselm ▴ 50

Hi all,

I have a question concerning the methods to perform a Principal Component Analysis of genotype matrix (genotypes coded as 0,1,2) to study the structure of a population.

I have been exposed to two different ways of performing this analysis, and I don't understand the differences between them.

1st method) Eigenvector of the Individual-correlation matrix

The genotype matrix (M individuals x N SNPs) is used to calculate a correlation matrix between individuals (MxM). Then, the eigenvectors of this matrix are calculated. These eigenvectors are used to describe the population structure.
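A minimal sketch of this first method, assuming NumPy and a small random genotype matrix (the names `G`, `C_ind`, `pc1` and the toy sizes are hypothetical, not taken from any particular R package):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 40  # toy numbers of individuals and SNPs (assumed)
G = rng.integers(0, 3, size=(M, N)).astype(float)  # genotypes coded 0/1/2

# Correlation between individuals: np.corrcoef treats each row as a variable.
C_ind = np.corrcoef(G)  # shape (M, M)

# The matrix is symmetric, so eigh applies; it returns ascending eigenvalues.
evals, evecs = np.linalg.eigh(C_ind)
order = np.argsort(evals)[::-1]  # reorder from largest to smallest
evals, evecs = evals[order], evecs[:, order]

# Each eigenvector holds one value per individual; the leading ones are
# the coordinates usually plotted to look at population structure.
pc1, pc2 = evecs[:, 0], evecs[:, 1]
```

Only an MxM matrix is decomposed here, which is why this route stays cheap when there are many more SNPs than individuals.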

2nd method) Multiplying the Genotype Matrix by the eigenvectors of the SNP-correlation matrix

The genotype matrix is used to calculate a correlation matrix between SNPs (NxN). Then, the eigenvector matrix of this matrix is calculated. The eigenvectors describe how the correlation between SNPs is structured. Then, I multiply the genotype matrix (scaled and centered by SNP) by the eigenvectors. The multiplication is therefore between a (MxN) matrix and a (NxN) matrix and will produce a MxN matrix. Each eigenvector gives a specific weight to each SNP, and these weights are multiplied by the genotypes of each individual. The results of this multiplication are used to describe the population structure.
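The second method can be sketched the same way, again assuming NumPy and toy random data. The variable names and the filter that drops monomorphic SNPs are assumptions added so the scaling step is well defined:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 40  # toy numbers of individuals and SNPs (assumed)
G = rng.integers(0, 3, size=(M, N)).astype(float)  # genotypes coded 0/1/2

# Assumption: SNPs with zero variance are dropped, since they cannot be scaled.
G = G[:, G.std(axis=0) > 0]

# Center and scale each SNP (column).
X = (G - G.mean(axis=0)) / G.std(axis=0, ddof=1)

# Correlation between SNPs: columns are the variables here.
C_snp = np.corrcoef(G, rowvar=False)  # shape (N, N) after filtering

evals, evecs = np.linalg.eigh(C_snp)
order = np.argsort(evals)[::-1]  # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

# Project individuals onto the SNP eigenvectors: one row per individual,
# one column per component.
scores = X @ evecs
pc1, pc2 = scores[:, 0], scores[:, 1]
```

The expensive part is the NxN eigendecomposition, which grows quickly with the number of SNPs.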

The second method is computationally intensive (correlation matrix NxN can be super heavy to compute) and the R-packages to perform this analysis use the first method. Is there a difference in terms of outcomes? Can you explain them to me?

Thank you

OS

PCA SNP • 6.7k views

updated 7.9 years ago by anp375 ▴ 190 • written 7.9 years ago by oselm ▴ 50

0


7.9 years ago

anp375 ▴ 190

From my understanding:

In the second method, you essentially need an extra step to determine population structure, because the eigenvectors describe how each genotype differs according to its distribution among people. You are producing M coordinates in N dimensions (plotting M points on an N-dimensional graph), weighted by the total difference in genotypes among the M people, to show population structure on a per-variant basis.

In the first method, the eigenvectors already describe how each person differs according to differences in genotypes. To do something similar to the second method, you would have to multiply the eigenvectors by the transpose of the genotype matrix, giving N points in M dimensions, weighted by the total difference in people among the N genotypes.


score 6 · Accepted Answer · 2016-12-21

This is a very good question, and even after working through the linear algebra behind this many times, it can still be difficult to interpret. I'm assuming you're already somewhat familiar with the meaning of eigenvectors, in that they form an orthonormal basis whose directions successively maximize the variance.

Method 1 produces a basis for the row space such that the eigenvectors represent linear combinations of individuals for which the variance is maximized. Examining these eigenvectors tells you how much each individual contributes to maximizing the variance. In this particular case, the eigenvectors scaled by their eigenvalues would give a weighted combination of the individuals' genotypes, which gives you a coordinate system in which the population is most stratified. As you stated, this describes the population structure because you have essentially turned individuals into features and used them to build a coordinate system where you can plot SNPs. I believe this method is more commonly used in fields like quantitative ecology, where we want to see how different or stratified the features (SNPs in your case) are from one another based on a given population.

Method 2 produces a basis for the column space (this is more typical for biological data analysis) such that the eigenvectors represent linear combinations of SNPs for which the variance is maximized. In this case, your coordinate system is composed of combinations of SNP features in which you can plot individuals. What you describe next is the multiplication of the original data matrix by the eigen-basis for the column space. This is commonly referred to in linear algebra as "projecting" the data into the new coordinate system (orthogonal basis). Imagine that you started with individuals plotted in your standard coordinate system (SNP1 is the x axis, SNP2 is the y axis, and so on). You then do singular value decomposition and project the data onto the new eigen-basis. All you have done is rotate the axes so that the data points have the greatest spread. Now, your x axis is some weighted combination of SNP1, SNP2 ... SNPN such that the data, when projected onto that axis, have maximal variance in that direction. The y axis would be an axis orthogonal to the x axis with the next highest variance, and so on. In most biological data sets, the majority of the variance is described by very few eigenvectors, which is why PCA/SVD is used for dimensionality reduction.
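To make the role of singular value decomposition concrete, here is a small numerical check, written as a sketch with NumPy on toy random data (all variable names are hypothetical). It assumes both decompositions start from the same SNP-centered and SNP-scaled matrix `X`. Note that `X @ X.T` is then a cross-product matrix between individuals, which is not identical to the Pearson correlation between individuals described in the question, so the relation below is exact only under that assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 50  # toy numbers of individuals and SNPs (assumed)
G = rng.integers(0, 3, size=(M, N)).astype(float)
G = G[:, G.std(axis=0) > 0]  # assumption: drop SNPs with zero variance
X = (G - G.mean(axis=0)) / G.std(axis=0, ddof=1)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Projection of individuals onto the SNP-space directions (rows of Vt).
scores = X @ Vt.T

# Eigendecomposition of the M x M individual-by-individual matrix.
evals, evecs = np.linalg.eigh(X @ X.T)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

k = 3  # compare the leading components only
# The projections are the left singular vectors scaled by the singular values.
scores_match = np.allclose(scores[:, :k], U[:, :k] * s[:k])
# The M x M eigenvectors agree with the left singular vectors up to sign.
vectors_match = all(
    np.isclose(abs(evecs[:, j] @ U[:, j]), 1.0) for j in range(k)
)
# The M x M eigenvalues are the squared singular values.
values_match = np.allclose(evals[:k], s[:k] ** 2)
```

Under this assumption, a plot of the first two columns of `scores` and a plot of the first two columns of `evecs` show the same configuration of individuals, differing only in axis scaling and possibly sign.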

Here you can find some powerpoint slides that I think are fairly good at describing the basic math and interpretation behind SVD.