Question

How to understand the multi-sample processing of Seurat?

1

4.7 years ago

xiaoguang ▴ 160

One question about Seurat has been bothering me for a long time, and I would like to hear your opinions. With multiple samples, Seurat integrates the samples before clustering in order to remove batch effects, which makes sense. But when searching for markers, it uses the raw data matrix so that the integration does not erase real differences. However, when a heatmap is used to display the markers, the integrated data matrix is plotted, so many genes that were found to be differentially expressed show no visible difference in the plot. The same issue arises for differential expression analysis and downstream functional analyses such as GSVA or receptor-ligand pair analysis: should these use the raw or the integrated data? If the integrated data are used, will that affect the identification of differences? And if the original matrix is used, will there be a batch effect?

RNA-Seq scRNAseq seurat • 4.8k views

updated 4.7 years ago by Friederike 9.0k • written 4.7 years ago by xiaoguang ▴ 160

1

Just as a caution: in my experience, Seurat MultiCCA over-integrates the data. I have been advised to use fastMNN, which I also find to be a better alternative.

4.7 years ago by piyushjo ▴ 710

0

The fastMNN method is described in detail in the Bioconductor scRNA-seq book.

4.7 years ago by Friederike 9.0k

0

There are a lot of methods, and their performance varies with the dataset. See a previous discussion for some examples: about batch correction in scRNA-seq

4.7 years ago by igor 13k

score 3 · Answer 1 · 2020-03-28

I recommend reading the section on data integration, and on how to use the resulting corrected values, written by the Bioconductor folks: https://osca.bioconductor.org/integrating-datasets.html#using-corrected-values

In brief, data integration matters mostly for dimensionality reduction and cluster identification, where you want to identify cells that are very similar to each other. For example, you want to avoid neurons from sample A being sorted into a different cluster than the same type of neurons from sample B because of a batch effect. To this end, you focus on those features (genes, eigenvectors, ...) that capture the essence of a cell's biology: you will probably reduce the data set used for UMAP and clustering to 20-30 principal components, which are usually sufficient to align the cells across technically distinct samples.
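
To make the idea of aligning cells across batches concrete, here is a toy sketch in plain Python. It is not what Seurat's integration or fastMNN actually do; it only applies the simplest possible correction (subtracting each batch's per-gene mean), and that is only valid under the assumption that both batches contain the same mix of cell types. All numbers, gene positions, and function names are made up for illustration.

```python
from statistics import mean

# Hypothetical log-expression values for two genes.
# Each cell: (batch, cell_type, [gene1, gene2]); batch "B" carries a +2 technical shift.
cells = [
    ("A", "neuron", [5.0, 1.0]),
    ("A", "glia",   [1.0, 5.0]),
    ("B", "neuron", [7.0, 3.0]),
    ("B", "glia",   [3.0, 7.0]),
]

def center_per_batch(cells):
    """Subtract each batch's per-gene mean (assumes batches share composition)."""
    n_genes = len(cells[0][2])
    batch_means = {}
    for batch in {c[0] for c in cells}:
        rows = [c[2] for c in cells if c[0] == batch]
        batch_means[batch] = [mean(r[g] for r in rows) for g in range(n_genes)]
    return [
        (b, t, [x - m for x, m in zip(v, batch_means[b])])
        for b, t, v in cells
    ]

def distance(u, v):
    """Euclidean distance between two expression vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

corrected = center_per_batch(cells)

# Distance between the neuron in batch A and the neuron in batch B.
before = distance(cells[0][2], cells[2][2])
after = distance(corrected[0][2], corrected[2][2])
print(f"neuron A vs neuron B: before={before:.2f}, after={after:.2f}")
```

In this toy case the two neurons end up at the same position after centering, so a clustering step would group them together. Note that the corrected values are no longer on the original expression scale, which is one reason to be careful about interpreting them gene by gene.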

As soon as you go to the gene level, though, it is better to work with the raw data, especially if you have replicated samples, which allow you to apply the typical bulk RNA-seq tests while accounting for batch effects by including the batch as a covariate. If you want to draw conclusions at the single-gene level, you need to go back to the actual read counts per gene and apply the proper tests to those.
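
One common way to get from single cells to something a bulk RNA-seq test can handle is to sum the raw counts of all cells from the same sample and cluster ("pseudo-bulk"). The sketch below shows only that aggregation step in plain Python; the sample names, gene names, counts, and the `pseudobulk` helper are hypothetical, and the cluster labels are assumed to come from a clustering done on the integrated data.

```python
from collections import defaultdict

# Hypothetical raw counts: one record per cell.
cells = [
    {"sample": "ctrl_1", "cluster": "neuron", "counts": {"GeneX": 3, "GeneY": 0}},
    {"sample": "ctrl_1", "cluster": "neuron", "counts": {"GeneX": 2, "GeneY": 1}},
    {"sample": "ctrl_1", "cluster": "glia",   "counts": {"GeneX": 0, "GeneY": 4}},
    {"sample": "trt_1",  "cluster": "neuron", "counts": {"GeneX": 7, "GeneY": 1}},
    {"sample": "trt_1",  "cluster": "neuron", "counts": {"GeneX": 6, "GeneY": 0}},
]

def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample, cluster) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample"], cell["cluster"])
        for gene, n in cell["counts"].items():
            totals[key][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in totals.items()}

profiles = pseudobulk(cells)
for key in sorted(profiles):
    print(key, profiles[key])
```

The per-sample, per-cluster count profiles produced this way are the kind of input that bulk tools such as edgeR or DESeq2 expect, and the batch can then enter the model as a covariate in the design (for example `~ batch + condition`).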