I want to plot my data (a time-series dataset) using PCA and am wondering how many "most variable" genes I should take. For instance, in DESeq2, the default for plotPCA is ntop=500, but why not 1,000 or simply all genes?
I saw on this page that
It's a tradeoff between computational efficiency and getting the most information out of your data. (...) In most situations I wouldn't expect much difference between a PCA plot of the top 500 genes and all of them.
However, in my case, it does change the appearance of the PCA plot and the relative contribution of the axes. For instance,
- for ntop=500, PC1=62% and PC2=7%
- for ntop=1,000, PC1=54% and PC2=10%
- for ntop=10,000, PC1=31% and PC2=18%
(I'm sorry, I cannot upload the actual graphs).
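For readers who want to reproduce this kind of comparison, here is a minimal sketch of the procedure as I understand it: rank genes by their variance across samples, keep the ntop most variable ones, run PCA on that subset, and report each component's share of the variance. It is written in Python with NumPy rather than R, the function name `pca_percent_var` is hypothetical, and the data are synthetic and for illustration only, so the numbers it prints say nothing about my dataset. Note that the percentages are computed relative to the variance of the selected genes only, which is one reason they shift when ntop changes.

```python
import numpy as np


def pca_percent_var(expr, ntop):
    """Fraction of variance per PC, using only the ntop most variable genes.

    expr is a genes x samples matrix of already-transformed expression values
    (hypothetical input; no normalisation is done here).
    """
    # Variance of each gene across samples
    gene_var = expr.var(axis=1, ddof=1)
    # Indices of the ntop most variable genes (capped at the number of genes)
    ntop = min(ntop, expr.shape[0])
    select = np.argsort(gene_var)[::-1][:ntop]
    # PCA expects samples as rows; centre each gene at zero
    sub = expr[select].T
    sub = sub - sub.mean(axis=0)
    # Squared singular values are proportional to the variance of each PC
    singular_values = np.linalg.svd(sub, compute_uv=False)
    pc_var = singular_values ** 2
    return pc_var / pc_var.sum()


# Synthetic data (an assumption for illustration): 2,000 genes, 12 samples,
# where only the first 100 genes share a strong trend across samples.
rng = np.random.default_rng(0)
n_genes, n_samples = 2000, 12
expr = rng.normal(size=(n_genes, n_samples))
expr[:100] += np.linspace(-3, 3, n_samples)

for ntop in (100, 500, 2000):
    pv = pca_percent_var(expr, ntop)
    print(f"ntop={ntop}: PC1={100 * pv[0]:.1f}%, PC2={100 * pv[1]:.1f}%")
```

In this toy setup, adding more genes mostly adds unstructured variance, so the share attributed to PC1 goes down as ntop grows.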
Which one should I "trust" more? Should I take all genes or rather a subset of the most variable ones, and if so, how many? Many thanks!
If you sum PC1 and PC2, ntop=500 gives 69%, while the others are lower (64% for ntop=1,000 and 49% for ntop=10,000). I would take 500, because the first two components then explain more of the variance overall.
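The sums can be checked with a few lines of Python. This is only the arithmetic on the percentages reported in the question; the variable names are made up for the example.

```python
# PC1 and PC2 percentages reported in the question, keyed by ntop
reported = {500: (62, 7), 1000: (54, 10), 10000: (31, 18)}

# Cumulative share of variance shown on a 2D PCA plot for each ntop
cumulative = {ntop: pc1 + pc2 for ntop, (pc1, pc2) in reported.items()}

for ntop, total in cumulative.items():
    print(f"ntop={ntop}: PC1 + PC2 = {total}%")

# The ntop value with the largest PC1 + PC2 total
best = max(cumulative, key=cumulative.get)
print("largest total at ntop =", best)
```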