Dear all,
I have just started reading the documentation on Seurat for scRNA-seq (among a few other packages), and I would appreciate your answers and insights on the following:
After NormalizeData(), why is ScaleData() needed?
Do FindVariableGenes(), RunPCA() and FindClusters() work on the normalized data or on the scaled data?
Does ScaleData() work on the raw data or on the normalized data?
Does RunCCA() work on the normalized data or on the scaled data?
An example of R code is at: https://satijalab.org/seurat/immune_alignment.html
Thanks a lot!
-- bogdan
"Data is scaled to regress out "uninteresting" sources of variation such as technical noise."
I believe this is not quite right. Data is scaled so that each feature (a gene, in this context) contributes similarly to the downstream steps. Regressing out unwanted signal, which by the way should be used with caution (1), is optional and is not the primary objective of data scaling.
1) A blog post on regression on scRNA-seq datasets
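To illustrate the point about each gene contributing similarly, here is a minimal base R sketch. The matrix and gene names are made up for illustration, and it uses the base scale() function rather than Seurat's ScaleData(), so treat it as a demonstration of the idea and not of Seurat's internals.

```r
# Toy normalized expression matrix: genes in rows, cells in columns.
# All values are invented for illustration.
expr <- rbind(
  geneA = c(0.1, 0.2, 0.1, 0.3, 0.2, 0.3),  # low expression, narrow range
  geneB = c(2.0, 6.0, 1.0, 7.0, 3.0, 8.0)   # high expression, wide range
)

# Variance of each gene across cells before scaling:
# geneB's variance is far larger, so it would dominate a PCA.
var_before <- apply(expr, 1, var)

# scale() standardizes columns, so transpose to standardize each gene
# (row), then transpose back to genes x cells.
scaled <- t(scale(t(expr)))

# After scaling, every gene has mean 0 and variance 1 across cells.
var_after <- apply(scaled, 1, var)

print(var_before)
print(var_after)
```

Before scaling, geneB alone accounts for almost all of the total variance. After scaling, both genes carry equal weight.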
Yes, your interpretation is correct. The main purpose of scaling is to make expression values comparable across genes. Regression is a secondary (and optional) feature of ScaleData(). From
?ScaleData
Indeed, in past versions, regression was a separate function from ScaleData().
I guess I (and the Seurat tutorial) did not explicitly mention the primary objective. Yes, scaling adjusts the range of expression values across all the genes, which will likely impact the downstream analysis far more than any additional regression. When I originally wrote the answer, I was thinking specifically in the context of the Seurat workflow in addition to the default scale function.
Hi Igor, thank you for your reply. If I may add a question, please:
Does ScaleData() work on RAW_DATA or on NORMALIZED_DATA?
Thank you!
ScaleData() works on the normalized data: you go from raw to normalized to scaled.
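The raw, normalized, scaled order can be sketched in base R as follows. This is an approximation under stated assumptions: the counts are invented, the normalization mimics Seurat's default log-normalization (divide by each cell's total, multiply by a scale factor of 10000, take log1p), and the scaling step is a plain per-gene z-score. Seurat's own functions have further options that this sketch ignores.

```r
# Toy raw count matrix: genes in rows, cells in columns (made-up values).
counts <- matrix(
  c(10, 0, 5,
    20, 2, 8,
     0, 1, 7,
     5, 3, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Step 1 (raw -> normalized): divide each cell (column) by its total
# counts, multiply by a scale factor, then apply log(1 + x).
scale_factor <- 10000
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)

# Step 2 (normalized -> scaled): z-score each gene (row) across cells.
# scale() standardizes columns, hence the double transpose.
scaled <- t(scale(t(normalized)))

print(round(normalized, 2))
print(round(scaled, 2))
```

Note that the scaling step takes the normalized matrix as input, not the raw counts.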
I think it would be more accurate to say "which standardizes the range of expression values for each gene." I think ScaleData() adjusts the expression values gene by gene. For each gene, it builds a regression model using that gene's expression level across all cells, then centers the residuals at zero and divides them by their standard deviation. The phrase "across all the genes" is not accurate.
Am I right?
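The gene-by-gene procedure described in this comment can be sketched in base R. This follows the comment's description and is not Seurat's actual code: the covariate n_umi, the helper regress_and_scale, and the simulated data are all hypothetical, and a simple linear model is assumed.

```r
set.seed(1)
n_cells <- 50

# Hypothetical per-cell covariate to regress out (e.g. total UMI count).
n_umi <- runif(n_cells, 1000, 5000)

# Simulated normalized expression: geneA depends on the covariate,
# geneB does not. Genes in rows, cells in columns.
normalized <- rbind(
  geneA = 0.001 * n_umi + rnorm(n_cells, sd = 0.5),
  geneB = rnorm(n_cells, mean = 2, sd = 1)
)

# For one gene: fit a linear model against the covariate, then center
# the residuals at zero and divide by their standard deviation.
regress_and_scale <- function(y, covariate) {
  fit <- lm(y ~ covariate)
  res <- residuals(fit)
  (res - mean(res)) / sd(res)
}

# Apply the helper to each gene (row) independently; apply() returns
# cells x genes, so transpose back to genes x cells.
scaled <- t(apply(normalized, 1, regress_and_scale, covariate = n_umi))

# geneA was correlated with the covariate; its scaled residuals are not.
print(cor(normalized["geneA", ], n_umi))
print(cor(scaled["geneA", ], n_umi))
```

Each gene gets its own model and its own standard deviation, which is why the scaling is "for each gene" and no value is computed across genes.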
I edited the statement to make it clearer. I originally meant that all genes (as opposed to all cells) are scaled, but I can see how it could be interpreted differently.