A question on the interpretation of a PCA plot
4.0 years ago
Aspire ▴ 370

Suppose one asks whether either of the first two PCs strongly separates the conditions in the data, and the answer is negative (or a weak separation at best). But when looking at a two-dimensional plot of those two PCs, one realizes that a diagonal line could be drawn that separates the conditions more strongly.

Take, for example, this PCA plot: [image]

There is some separation along PC1, and some separation along PC2. But a much stronger separation (between the blue and red conditions on one hand, and the orange, pink and cyan conditions on the other) is obtained by drawing a diagonal line (the green line in the plot).

What should be the conclusion? How should one assess the quality of the separation between conditions in such a case: according to the green line, or strictly according to the projections on PC1 and PC2?

pca • 3.0k views
4.0 years ago

You want to look at each PC separately, because each PC represents some collection of genes that separate some subset of samples from another subset of samples. In your case it seems as though PC2 is really what is separating the two main groups you want separated. The problem you may run into is that PC1 explains a large chunk of the variance in the data but shows no clear separation between your desired groups. Whether this is due to some batch effect, natural variance in the samples collected, or something else is for you to think about.

You might also want to explore more than just 2 PCs, because there is still 40% of variance in your data that is not explained by the first two PCs. Furthermore, you can look at the loadings for each PC, which are the genes that are contributing to the separation of samples in each PC. These may help you further explore your data.
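As a rough illustration of the two suggestions above (checking how much variance each PC explains, and inspecting the loadings), here is a minimal sketch. It assumes NumPy is available and uses a synthetic samples-by-genes matrix, not the data from the question; the matrix size, the effect size and the variable names are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expression matrix: 12 samples x 50 genes (NOT real data).
X = rng.normal(size=(12, 50))
X[:6, :5] += 3.0  # hypothetical condition effect on the first 5 genes

# PCA via SVD of the gene-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per PC
cumulative = np.cumsum(explained)  # running total across PCs
scores = U * S                     # sample coordinates on each PC
loadings = Vt                      # row k holds the gene weights of PC k+1

# Genes with the largest absolute weight on PC1.
top_genes_pc1 = np.argsort(-np.abs(loadings[0]))[:5]

print("explained variance, first 4 PCs:", np.round(explained[:4], 3))
print("cumulative, first 4 PCs:", np.round(cumulative[:4], 3))
print("top PC1 genes (column indices):", top_genes_pc1)
```

If the cumulative fraction after two PCs is only around 60%, as in the plot discussed here, plotting `scores[:, 2]` and `scores[:, 3]` as well is a cheap next step.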

In a true PCA, each PC is a linear combination of all genes. The genes with significant loading are probably different but there's going to be some contribution of the same genes in both PC1 and 2. One could take the definition (loading) of PC1 and PC2 and work out the difference in the groups (looks like PC1 minus half PC2) and then learn which genes are different between groups. But then we've just backed into an awkward differential expression analysis. I should ask why they're using PCA at all, but this looks like a homework problem.
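To make the "PC1 minus half PC2" idea concrete: a diagonal direction in the PC1/PC2 plane is itself a linear combination of genes, so its gene weights can be read off directly from the loadings. The sketch below uses hypothetical four-gene loading vectors and a hypothetical centered sample; only the coefficients (1, -0.5) come from the estimate above.

```python
# Hypothetical orthonormal loadings for 4 genes; real ones come from your PCA.
pc1 = [0.5, 0.5, 0.5, 0.5]
pc2 = [0.5, -0.5, 0.5, -0.5]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def combined_direction(a, b, v1, v2):
    """Gene-space weights of the direction a*v1 + b*v2."""
    return [a * x + b * y for x, y in zip(v1, v2)]


# "PC1 minus half PC2", the diagonal estimated by eye from the plot.
w = combined_direction(1.0, -0.5, pc1, pc2)

# One centered sample (hypothetical expression values).
sample = [2.0, -1.0, 0.5, 1.5]

# Projecting onto w gives the same number as combining the two PC scores.
score_diag = dot(sample, w)
score_from_pcs = 1.0 * dot(sample, pc1) - 0.5 * dot(sample, pc2)

print("gene weights of the diagonal:", w)
print("score along diagonal:", score_diag, "from PC scores:", score_from_pcs)
```

The genes with the largest absolute weights in `w` are the ones driving the diagonal separation, which is why this ends up resembling a roundabout differential expression analysis.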

I oversimplified for the sake of brevity, but it is indeed true that every gene is going to appear in the loadings of each PC.

I'd like to post two images for clarification. In the first one, the samples are separated nicely on PC1. [image]

In the second one, the separation relies on a combination of PC1 and PC2. [image]

Are you implying that the separation in the first image is actually much better than in the second one?

This is for the sake of better understanding PCA.

Define "better separation". If this were a two-dimensional measurement (length and width), then both would be equivalent. But in PCA, PC1 is the direction with the largest variance, and PC2 is the direction with the next-largest variance among those orthogonal to PC1. To get such segregation is very suspicious and indicates a problem with your dataset. PC1 will carry the variation of the dominant factor, which could be cell type, sample handling, or technician error. In gene expression you should have divergent units: a length measured in millimeters will have orders of magnitude more numerical variance than a width measured in kilometers, thus putting the mm length onto PC1. Generate some more datasets in 2D and 3D and take a look at what PCA does for each!
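Here is one way to try the millimeters-versus-kilometers experiment in 2D. It is a small sketch using only the Python standard library; the sample size, the random seed and the helper names are arbitrary choices. It uses the closed-form major-axis angle of a 2x2 covariance matrix rather than a general PCA routine.

```python
import math
import random

random.seed(0)

# Synthetic samples: both dimensions have the same physical spread (about 1 m),
# but length is recorded in millimeters and width in kilometers.
n = 200
length_mm = [random.gauss(0.0, 1.0) * 1000.0 for _ in range(n)]
width_km = [random.gauss(0.0, 1.0) / 1000.0 for _ in range(n)]


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def first_pc(x, y):
    """Unit vector along the largest-variance axis of the 2x2 covariance."""
    sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # major-axis angle
    return (math.cos(theta), math.sin(theta))


def standardize(x):
    """Center to mean 0 and scale to unit sample variance."""
    m = sum(x) / len(x)
    s = math.sqrt(cov(x, x))
    return [(a - m) / s for a in x]


pc1_raw = first_pc(length_mm, width_km)
pc1_scaled = first_pc(standardize(length_mm), standardize(width_km))

print("PC1 on raw units (length, width):", pc1_raw)
print("PC1 after standardizing both:", pc1_scaled)
```

On the raw values PC1 points almost entirely along the millimeter axis, purely because of the units. After standardizing, both variables get equal weight, and the direction of PC1 is determined only by how they are correlated.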
