I am trying to figure out flaws in the following reasoning:
- Seurat
RunPCA
operates on scaled data. - Variance per gene on scaled data is equal to one by definition.
- So total variance in scaled data is equal to count of genes.
- Total variance in the data is also equal to total variance in all principle components (PC).
- It is common practice to calculate % of explained variance per PC by dividing its variance (sd^2) by the sum of 10-30 top PC variances.
I have checked a few scRNA-Seq objects and the variance sum of top 10-30 PCs is appreciably lower from gene count. Variance steeply decreases for the first PCs but total variance in the remaining PCs (say 1000 components) is far from being negligible. So % of explained variance per PC using common practice seems grossly over-estimated
Thanks, that may well be the flaw. Seurat tutorials mention taking a limited number of components (50 by default) and in this case the sum of variances is not representative of total variance. Conversely
prcomp
computes all PCs by default and the sum is total variance.The reason why one chosen a subset of PCs is, as you mention correctly, because the % of explained variance drops rapidly after the first few PCs. It is a trade-of between using more PCs and potentially more genes (=information) while likely adding more noise. Therefore one often limits the analysis to the first 30, 50...PCs or plots PCs vs explained variance and then eveballs a cutoff based on the curve. The OSCA workflow discussed many single-cell related topics in way more detail than the Seurat documentation, devinitely a good read: https://osca.bioconductor.org/
A quick glance at chapter 9.2 of osca sort of suggests sending us back to square one. if top PCs capture coordinated biological processes while the long tail of other PCs is mainly random noise, the denominator of
% explained
should probably be the sum of those top PCs.PCAtools (my package) will also compute and return all PCs for you. See:
There are also methods in PCAtools to choose an optimum number of PCs (these were implemented by Aaron Lun).
There is no logic to Seurat automatically taking 30 PCs.
Not true. It depends on the scaling (there are many ways to scale data).