Question

Gene Lists Using Principal Component Analysis In Microarray Gene Expression

11

13.2 years ago

Tonig ▴ 440

Dear all,

I'm a total newbie at PCA, so here is my question:

I'm working with a list of genes from a microarray gene expression analysis; let's say I have the genes in rows and the sample names in columns. I ran a PCA in R using princomp in order to reduce the dimensionality of the genes (approx. 400). I know that I must choose the components that explain the largest share of the total variance, that is, the first two. The problem arises when I have to choose the genes that contribute most to the variance in each component: can I use the scores for each gene? And should I choose these genes from the first component only, or from both components?

Thanks

microarray pca r • 22k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 13.2 years ago by Tonig ▴ 440

6

Just a note that even though PC1 captures the largest share of the variance, it is not always the most interesting biologically. Sometimes PC1 captures non-biologically-interesting features like technical artifacts, batch effects, and the like. Some caution is required in interpretation....
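One way to act on this caution is to check whether a component tracks a known technical variable. The following is only a sketch on simulated data: the matrix `expr`, the `batch` factor and the size of the batch shift are all made up for illustration.

```r
# Simulated example: does PC1 simply reflect a (hypothetical) batch variable?
set.seed(3)
n.samples <- 20
n.genes <- 50
batch <- factor(rep(c("A", "B"), each = n.samples / 2))  # hypothetical batch labels

# Samples in rows, genes in columns
expr <- matrix(rnorm(n.samples * n.genes), nrow = n.samples,
               dimnames = list(paste0("sample", 1:n.samples),
                               paste0("gene", 1:n.genes)))
# Add an artificial batch shift to every gene in batch B
expr[batch == "B", ] <- expr[batch == "B", ] + 2

pca <- prcomp(expr, center = TRUE)
pc1 <- pca$x[, 1]                           # PC1 scores, one per sample

# Fraction of the variance in the PC1 scores explained by batch
r2 <- summary(lm(pc1 ~ batch))$r.squared
print(r2)
```

An R-squared close to 1 here would suggest that PC1 mostly reflects the batch variable rather than biology; with real data you would repeat the check for each recorded technical covariate.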

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 13.2 years ago by Sean Davis 27k

Ram · Answer 1 · 2011-10-12

15

13.2 years ago

Janne Marie Laursen ▴ 170

As far as I understand, you want to find the genes (p) that are the source of the majority of the variance between your samples (n).

You will have to look at entries in your loadings vectors.

pca.object <- princomp(data.matrix)  # data.matrix is an n x p matrix (samples x genes); princomp() needs n > p
pca.object$loadings  # Your loadings are here

Then you can look at which genes have the most extreme loadings.

Each loading lies between -1 and 1, and the larger the absolute value of a gene's loading, the more that gene contributes to the principal component in question.

On the other hand, the scores of e.g. PC1 will tell you how the samples differ according to the genes that have high loadings on PC1.
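The difference between loadings (per gene) and scores (per sample) can be sketched on simulated data. Everything here is an assumption for illustration: the matrix `expr`, the `group` vector, the three "signal" genes and the cutoff of three top genes. The simulation uses more samples than genes so that princomp runs.

```r
# Simulated data: 50 samples (rows) x 20 genes (columns)
set.seed(1)
n.samples <- 50
n.genes <- 20
group <- rep(c(0, 1), each = n.samples / 2)   # two hypothetical sample groups

expr <- matrix(rnorm(n.samples * n.genes), nrow = n.samples,
               dimnames = list(paste0("sample", 1:n.samples),
                               paste0("gene", 1:n.genes)))
# Make gene1..gene3 differ between the two groups
expr[, 1:3] <- expr[, 1:3] + 4 * group

pca.object <- princomp(expr)

# Loadings: one value per gene and component
pc1.loadings <- unclass(pca.object$loadings)[, 1]
# Rank genes by the absolute value of their PC1 loading
top.genes <- names(sort(abs(pc1.loadings), decreasing = TRUE))[1:3]
print(top.genes)

# Scores: one value per sample and component
pc1.scores <- pca.object$scores[, 1]
# Difference in mean PC1 score between the two groups
score.gap <- abs(diff(tapply(pc1.scores, group, mean)))
print(score.gap)
```

The same ranking can be repeated for the second column of the loadings to get the genes driving PC2.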

Just a side-note: Have you considered scaling your data-matrix?

EDIT:

You want to see which genes matter most for the differences between the samples, and therefore your samples should be in the rows and your genes in the columns. Since your matrix currently has genes in rows, you should transpose it first, e.g. with t().

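And by the way, in R, use the prcomp function instead of princomp (for numerical stability: prcomp works from a singular value decomposition of the data rather than an eigendecomposition of the covariance matrix, and it also runs when there are more genes than samples). prcomp also has the input option of centering and scaling, which you would like if the magnitude of the numbers in your matrix are not of a comparable size.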

pca.object <- prcomp(data.matrix, center=TRUE, scale.=TRUE)  # PCA with centering and scaling (the argument is spelled scale.)
pca.object$rotation  # The loadings are here
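Putting the pieces together, here is a sketch of the whole workflow on simulated data: start from a genes-by-samples matrix, put samples in rows, run prcomp, check how much variance each component explains, and list the genes with the largest absolute loadings on PC1 and PC2. The matrix `expr`, the helper `top.genes` and the cutoff `k = 20` are assumptions for illustration, not part of any package.

```r
# Simulated genes x samples matrix (100 genes, 60 samples)
set.seed(42)
n.genes <- 100
n.samples <- 60
expr <- matrix(rnorm(n.genes * n.samples), nrow = n.genes,
               dimnames = list(paste0("gene", 1:n.genes),
                               paste0("sample", 1:n.samples)))
# Hypothetical signal: gene1..gene20 are shifted in the second half of the samples
expr[1:20, 31:60] <- expr[1:20, 31:60] + 6

x <- t(expr)                                   # samples in rows, genes in columns
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained[1:5], 3))

# Hypothetical helper: names of the k genes with the largest |loading| on one PC
top.genes <- function(pca, pc, k = 20) {
  loadings <- pca$rotation[, pc]
  names(sort(abs(loadings), decreasing = TRUE))[seq_len(k)]
}

top.pc1 <- top.genes(pca, 1)
top.pc2 <- top.genes(pca, 2)
selected <- union(top.pc1, top.pc2)            # genes picked from either component
```

Whether to take genes from PC1 only or from both components depends on how much variance PC2 explains and on what it separates. In this simulation only PC1 carries signal, so the PC2 list is essentially noise.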

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 13.2 years ago by Janne Marie Laursen ▴ 170

0

Many thanks Janne! I'll try it that way. However, I don't know if I'm doing it the right way: must I transpose the data (genes in columns and samples in rows), or can I keep it as it is?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 13.2 years ago by Tonig ▴ 440

0

See my edit above :)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 13.2 years ago by Janne Marie Laursen ▴ 170

0

Janne Marie,

Can you please elaborate on what you mean by:

prcomp also has the input option of centering and scaling, which you would like if the magnitude of the numbers in your matrix are not of a comparable size.

I don't understand when the magnitudes of the numbers would not be of comparable size. I am working with gene expression counts (RNA-seq), and each sample's gene counts have been normalized to the library size/sequencing depth. Should I still perform this "centering" and "scaling"?

Thank you.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by gaelgarcia ▴ 270

2

For RNA-seq applications, you may need to apply a variance stabilizing transformation. A simple one is to log-transform the counts. However, more robust ones exist. See, for example, the Bioconductor DESeq2 package vignette, which has a section on visualizing RNA-seq data.
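To illustrate why the transformation matters, here is a sketch using simulated negative binomial counts and a plain log2 transform with a pseudocount of 1. The simulated means, the dispersion (`size = 10`) and the pseudocount are assumptions chosen for illustration; with real data, DESeq2's vst() or rlog() would take the place of the log2 step.

```r
# Simulated count matrix: 200 genes (rows) x 12 samples (columns)
set.seed(7)
n.genes <- 200
n.samples <- 12
mu <- 10^seq(1, 4, length.out = n.genes)       # mean counts from 10 to 10000
counts <- matrix(rnbinom(n.genes * n.samples, mu = rep(mu, n.samples), size = 10),
                 nrow = n.genes,
                 dimnames = list(paste0("gene", 1:n.genes),
                                 paste0("sample", 1:n.samples)))

# Simple variance-stabilizing step: log2 with a pseudocount
log.counts <- log2(counts + 1)

# Per-gene standard deviations before and after the transform
sd.raw <- apply(counts, 1, sd)
sd.log <- apply(log.counts, 1, sd)

# Ratio between the most and least variable gene
spread.raw <- max(sd.raw) / min(sd.raw)
spread.log <- max(sd.log) / min(sd.log)
print(c(raw = spread.raw, log = spread.log))

# PCA on the transformed data, samples in rows
pca <- prcomp(t(log.counts), center = TRUE, scale. = FALSE)
```

On raw counts the highly expressed genes have far larger variances and would dominate the first components; after the log transform the per-gene variances are much closer to each other.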

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by Sean Davis 27k

0

Thank you Sean. I am variance-stabilizing my RNA-seq data with DESeq2's rlog function. Is it still necessary to center and/or scale the normalized, variance-stabilized counts?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by gaelgarcia ▴ 270