To resolve the unanswered state of this question: in agreement with most of the comments already given, the answer is that, at least in this case, it doesn't make much sense to use PCA for gene ranking. This is because you have a case vs. control setting, which means you have a "2-dimensional" problem, whereas the applications of PCA described in the paper are directed towards time series or other higher-dimensional measurements.
Therefore you will get at most 2 principal components, and if you wanted to remove one, e.g. for dimension reduction or noise reduction, you would have only one left. That is not good for doing a statistical test where you wish to compare two conditions.
Of course, one could rank the genes by their factor loadings (projection of the data on the first principal axis), but that doesn't seem to have any advantage in a case-control setting. A statistical test has the advantage of providing an estimate of significance (i.e. p-values) and allows you to estimate power, etc. PCA is a totally different technique and doesn't provide these estimates. Unless you can better define the use case and explain why a non-standard method should be applied, I would stick with an established method.
You didn't say whether you have replicates, but I guess so; therefore, if you wanted to use PCA, you would need to decide at which point in your analysis to summarize the replicates. At that point, however, you are going to lose information about within-group variance. A statistical test such as ANOVA needs the within-group variance and compares it to the between-group variance. Therefore, it is important to keep the within-group variance until the statistical test.
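To make the point about within-group variance concrete, here is a minimal sketch in plain Python. The expression values and the helper name `one_way_f` are made up for illustration and are not taken from the question. Both hypothetical genes have the same difference between group means, so they look identical once replicates are collapsed, yet their one-way ANOVA F statistics differ by orders of magnitude.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of replicate groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of conditions
    n = len(all_values)    # total number of samples
    # Between-group sum of squares: distance of group means from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: replicate scatter around each group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical log-expression values, three replicates per condition.
tight_gene = {"control": [5.0, 5.1, 4.9], "case": [6.0, 6.1, 5.9]}
noisy_gene = {"control": [3.0, 5.0, 7.0], "case": [4.0, 6.0, 8.0]}

# Collapsing replicates to their means hides the difference between the genes.
diff_tight = mean(tight_gene["case"]) - mean(tight_gene["control"])
diff_noisy = mean(noisy_gene["case"]) - mean(noisy_gene["control"])

# Keeping the replicates lets the test weigh the difference against the noise.
f_tight = one_way_f(list(tight_gene.values()))
f_noisy = one_way_f(list(noisy_gene.values()))
print(diff_tight, diff_noisy, f_tight, f_noisy)
```

For real count data you would still use a dedicated tool such as DESeq or edgeR; this only illustrates why the replicate-level values have to survive until the test.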
What do you mean by "gene ranking"? What are the criteria for ranking?
Well, what I actually meant was "gene prioritization" based on expression values, not "ranking" as such. I have edited my post.
Have you done a more traditional differential expression analysis using DESeq or edgeR, for example? This will rank genes based on expression value differences between cases and controls.
Yes, I already have the DEGs from DESeq. I was just a bit curious whether somebody has tried any of the PCA-based approaches and what the caveats are in doing such an analysis.
So you want to use PCA for differential expression ranking? I am interested in how this works. Can you link any papers on this approach? Are they just using PCA as some kind of smoothing function?
Here's an old but nice one on time-course analysis using PCA.
So, given 2 datasets, A and B, they perform PCA on dataset A, project B onto A's components, and use the newly projected coordinates to get differential expression. I am not sure what test they are using for the differential expression, though. Some kind of ANOVA? I guess the advantages of this are: 1) it takes the time-course relationship into account; 2) using only the dominant components acts as a kind of smoothing function, as it de-noises the dataset.
You can have a look at this paper for other PCA-based applications. I found sparse PCA and supervised PCA to be quite interesting.
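Here is a rough sketch of how I read that procedure, in plain Python with no dependencies. It is my interpretation, not the paper's actual method: the data, the function names, and the choice to keep only the first component are all assumptions, and no statistical test is applied. The leading axis is found by power iteration; in practice you would more likely use an SVD from NumPy.

```python
def column_means(rows):
    """Mean of each time point across genes."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def center(rows, means):
    """Subtract the given per-time-point means from every gene profile."""
    return [[v - m for v, m in zip(row, means)] for row in rows]

def first_component(centered, iterations=200):
    """Leading principal axis via power iteration on the covariance matrix."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Fixed start vector for reproducibility; assumes it is not orthogonal
    # to the leading axis and that the covariance matrix is not all zeros.
    vec = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec

def scores(centered, axis):
    """Coordinate of each gene profile along the given axis."""
    return [sum(v * a for v, a in zip(row, axis)) for row in centered]

# Hypothetical time courses: rows are genes, columns are time points.
data_a = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [3.0, 2.0, 1.0], [2.0, 3.0, 4.0]]
data_b = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 2.0, 1.0], [2.0, 3.0, 4.0]]

means_a = column_means(data_a)
axis_a = first_component(center(data_a, means_a))

# Project both datasets into A's coordinate system (A's means, A's axis).
scores_a = scores(center(data_a, means_a), axis_a)
scores_b = scores(center(data_b, means_a), axis_a)

# Per-gene shift along A's dominant component; only gene 1 differs here.
shifts = [b - a for a, b in zip(scores_a, scores_b)]
print(shifts)
```

The shift per gene is only a descriptive quantity. To call anything differentially expressed you would still need replicates and a proper test on top of these coordinates.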
Thanks for the papers. I am actually working with time-course RNA-seq data right now, so this is of interest to me. BTW, I posted some brief code on how to do PCA and visualize it with Python in matplotlib a couple of days ago: http://blog.nextgenetics.net/?e=42
Dk, nice post. However, I think it would be nice to explain the actual concept behind PCA and its purpose (why in time series?) in addition to just the code. I love theory! :)
I've actually been working on a post to explain PCA; I just haven't gotten around to finishing it. It's a surprisingly simple concept if you ignore all the crazy maths, which I suck at anyway. :) It is essentially just changing the coordinate system's axes (x, y, z, ...) into a series of orthogonal (perpendicular) best-fit lines.
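A tiny 2-D sketch of that idea, in plain Python: the points and the function names `pca_2d` and `to_pc_coords` are made up for illustration. It finds the best-fit direction through the centered points from the closed-form angle for a symmetric 2x2 covariance matrix, takes the perpendicular direction as the second axis, and re-expresses each point in those new coordinates.

```python
import math

def pca_2d(points):
    """Return (center, axis1, axis2); axis1 is the best-fit direction."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis1 = (math.cos(theta), math.sin(theta))
    axis2 = (-math.sin(theta), math.cos(theta))  # perpendicular to axis1
    return (cx, cy), axis1, axis2

def to_pc_coords(point, center, axis1, axis2):
    """Express a point in the rotated (PC1, PC2) coordinate system."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (dx * axis1[0] + dy * axis1[1], dx * axis2[0] + dy * axis2[1])

# Hypothetical points scattered around a diagonal line.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8)]
center_pt, pc1, pc2 = pca_2d(points)
rotated = [to_pc_coords(p, center_pt, pc1, pc2) for p in points]

# Most of the spread now lies along the first new axis.
var_pc1 = sum(u * u for u, _ in rotated) / (len(rotated) - 1)
var_pc2 = sum(v * v for _, v in rotated) / (len(rotated) - 1)
print(pc1, pc2, var_pc1, var_pc2)
```

Nothing is lost in the rotation itself: distances from the center are unchanged, and the dimension reduction only happens when you decide to drop the low-variance axis.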
Sudeep, I get "content not found"
Sorry about that. I was logged in from my institute account with direct access to the manuscript. I have now edited the link; please try again.
Unfortunately, I couldn't find any interesting papers for read count data. As I said in the post, all the papers I saw used PCA just to cluster samples. For microarrays I found a couple of papers, like the one posted by Arun in reply, but I am not sure how the statistics work out for read count data.
That makes sense, because PCA is a tool for either clustering or dimensionality reduction, as far as I understand. So using PCA to compare replicates of a gene across two conditions doesn't make much sense to me.