There was a similar post, but I'd like to dig into the scaling part in more depth here, as the current practice seems counter-intuitive to me.
This is from Seurat's basic integration vignette:
library(Seurat)
library(SeuratData)
# install dataset
InstallData("ifnb")
# load dataset
ifnb <- LoadData("ifnb")
# split the RNA measurements into two layers: one for control cells, one for stimulated cells
ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
# run standard analysis workflow
ifnb <- NormalizeData(ifnb)
ifnb <- FindVariableFeatures(ifnb)
ifnb <- ScaleData(ifnb)
Let's pause here for a moment and think about why the scaling was done on the whole merged matrix of normalized counts. Shouldn't it have been done on a per-sample basis? Wouldn't we want to 'equalize' the ranges of gene expression levels between samples before merging them?
# scale each sample's cells separately instead of the merged matrix
ifnb <- ScaleData(ifnb, split.by = 'orig.ident')
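To spell out what I mean by 'equalizing', here is a toy sketch in base R (my own illustration, not Seurat's implementation; ScaleData also does things like clipping and optional regression of covariates) of per-gene z-scoring over the merged data versus within each sample:
# toy sketch: one gene measured in 10 cells, 5 per sample, with a
# systematic offset between the two samples
set.seed(1)
expr <- c(rnorm(5, mean = 2), rnorm(5, mean = 6))
samp <- rep(c("CTRL", "STIM"), each = 5)
# z-scoring across the merged vector keeps the between-sample offset
# (CTRL cells end up mostly negative, STIM cells mostly positive)
scaled_merged <- as.numeric(scale(expr))
# z-scoring within each sample removes that offset before anything downstream
scaled_split <- as.numeric(unlist(tapply(expr, samp, scale)))
round(rbind(merged = scaled_merged, per_sample = scaled_split), 2)
My assumption is that split.by scales each group of cells separately, i.e. the second behaviour, which is exactly the 'equalizing' I have in mind.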
But when I do the scaling this way, the visualisation turns out different from what they get. Compare the un-integrated analysis in their case vs. mine:
And the integrated analysis:
And before you say that this is not a big difference: this is only a toy example; in my own datasets the changes are, let's say, quite drastic.
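For completeness, here is roughly what I'm assuming on both sides of the comparison downstream of the scaling step (reconstructed from memory of the vignette; the exact dims/resolution values and reduction names may not match what either of us actually ran):
# un-integrated analysis: PCA/UMAP directly on the (scaled) merged object
# (parameter values below are my guess at the vignette's choices)
ifnb <- RunPCA(ifnb)
ifnb <- FindNeighbors(ifnb, dims = 1:30, reduction = "pca")
ifnb <- FindClusters(ifnb, resolution = 2, cluster.name = "unintegrated_clusters")
ifnb <- RunUMAP(ifnb, dims = 1:30, reduction = "pca", reduction.name = "umap.unintegrated")
DimPlot(ifnb, reduction = "umap.unintegrated", group.by = c("stim", "seurat_clusters"))
# integrated analysis: CCA integration across the two layers, then UMAP on that reduction
ifnb <- IntegrateLayers(object = ifnb, method = CCAIntegration,
                        orig.reduction = "pca", new.reduction = "integrated.cca")
ifnb <- FindNeighbors(ifnb, reduction = "integrated.cca", dims = 1:30)
ifnb <- FindClusters(ifnb, resolution = 1)
ifnb <- RunUMAP(ifnb, dims = 1:30, reduction = "integrated.cca", reduction.name = "umap.cca")
DimPlot(ifnb, reduction = "umap.cca", group.by = c("stim", "seurat_clusters"))
The only difference between 'their' runs and mine is the ScaleData call, so whatever changes in the embeddings should come from the scaling choice alone.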