Hi, In Seurat I would like to understand the algorithm behind
FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
My understanding : This function compute a score for each gene to select the 2000 bests for the next step, the PCA. For a gene, the more variability in the counts matrix for each cells the better. The Seurat Algorithm is described in the following paper : https://www.biorxiv.org/content/biorxiv/early/2018/11/02/460147.full.pdf page 12 : Feature Selection for better dataset
What I understand : 1) For each gene they compute the mean and variance of not normalized UMI and apply log10 for var and mean 2) If we plot var in function of mean we should have something similar to a line so they perform a polynomial regression (order 2) 3) they recompute an other variance using mean and the last regression
if you follow the tutorial https://satijalab.org/seurat/archive/v3.2/pbmc3k_tutorial.html just before FindVariableFeatures you can display first line of count matrix :
data_counts <- GetAssayData(object = pbmc, slot = "counts")
print(head(data.frame(data_counts ), 1))
which gives u a line with AL627309.1 and nine counts with 1 in each count
just after you can display the results of FindVariableFeatures:
print(head(HVFInfo(object = pbmc),23))
which gives u
mean variance variance.standardized
AL627309.1 0.0034116755 0.0034013253 0.9330441
AP006222.2 0.0011372252 0.0011363627 0.9924937
RP11-206L10.2 0.0018953753 0.0018925002 0.9627290
How can have a mean and a variance of 0.0034 with an array of [1,1,1,1,1,1,1,1,1] ? mean([1,1,1,1,1,1,1,1,1]) = 1 log10(1) = 0
i have seen a part of the source code here : https://rdrr.io/cran/Seurat/src/R/preprocessing.R but it doesn't help me
it is a technical question but if someone have an idea of how mean and var is computed it will be a pleasure to understand.
Kind regards
Matt