Question

Relative variance in PCA of scaled data of scRNA-Seq

0

Entering edit mode

5.4 years ago

jomo018 ▴ 730

I am trying to figure out flaws in the following reasoning:

Seurat RunPCA operates on scaled data.
Variance per gene on scaled data is equal to one by definition.
So total variance in scaled data is equal to count of genes.
Total variance in the data is also equal to total variance in all principle components (PC).
It is common practice to calculate % of explained variance per PC by dividing its variance (sd^2) by the sum of 10-30 top PC variances.

I have checked a few scRNA-Seq objects and the variance sum of top 10-30 PCs is appreciably lower from gene count. Variance steeply decreases for the first PCs but total variance in the remaining PCs (say 1000 components) is far from being negligible. So % of explained variance per PC using common practice seems grossly over-estimated

variance scRNA-Seq Seurat PCA • 9.0k views

ADD COMMENT • link 5.4 years ago by jomo018 ▴ 730

score 1 · Answer 1 · 2020-06-23

1

Entering edit mode

5.4 years ago

ATpoint 89k

It is common practice to calculate % of explained variance per PC by dividing its variance (sd^2) by the sum of 10-30 top PC variances.

Total variance is the sum of all PCs, not the first 10-30. Any reference for your statements?

ADD COMMENT • link 5.4 years ago by ATpoint 89k

0

Entering edit mode

Thanks, that may well be the flaw. Seurat tutorials mention taking a limited number of components (50 by default) and in this case the sum of variances is not representative of total variance. Conversely prcomp computes all PCs by default and the sum is total variance.

ADD REPLY • link 5.4 years ago by jomo018 ▴ 730

0

Entering edit mode

The reason why one chosen a subset of PCs is, as you mention correctly, because the % of explained variance drops rapidly after the first few PCs. It is a trade-of between using more PCs and potentially more genes (=information) while likely adding more noise. Therefore one often limits the analysis to the first 30, 50...PCs or plots PCs vs explained variance and then eveballs a cutoff based on the curve. The OSCA workflow discussed many single-cell related topics in way more detail than the Seurat documentation, devinitely a good read: https://osca.bioconductor.org/

ADD REPLY • link 5.4 years ago by ATpoint 89k

0

Entering edit mode

A quick glance at chapter 9.2 of osca sort of suggests sending us back to square one. if top PCs capture coordinated biological processes while the long tail of other PCs is mainly random noise, the denominator of % explained should probably be the sum of those top PCs.

ADD REPLY • link 5.4 years ago by jomo018 ▴ 730

0

Entering edit mode

PCAtools (my package) will also compute and return all PCs for you. See:

There are also methods in PCAtools to choose an optimum number of PCs (these were implemented by Aaron Lun).

There is no logic to Seurat automatically taking 30 PCs.

Variance per gene on scaled data is equal to one by definition

Not true. It depends on the scaling (there are many ways to scale data).

ADD REPLY • link 5.4 years ago by Kevin Blighe 89k