Question

about batch correction in scRNA-seq

4

5.9 years ago

Bogdan ★ 1.4k

Dear all,

Regarding batch correction methods for scRNA-seq, do you have any preferences and/or comments? Among the possible choices:

-- mnnCorrect, as outlined in the simpleSingleCell workflows:

https://bioconductor.org/packages/release/workflows/html/simpleSingleCell.html

-- ZINB-WaVE:

https://bioconductor.org/packages/release/bioc/html/zinbwave.html

-- Harmony:

https://www.biorxiv.org/content/10.1101/461954v2

-- SCTransform:

https://satijalab.org/seurat/v3.0/integration.html

thanks a lot,

bogdan

scRNA scRNA-seq batch-effect • 8.2k views

ADD COMMENT • link updated 16 months ago by Ram 45k • written 5.9 years ago by Bogdan ★ 1.4k

0

5.3 years ago

Bogdan ★ 1.4k

Dear all,

Thank you all for your suggestions! If I may, I would like to ask for another suggestion regarding scRNA-seq analysis:

Suppose we have 2 scRNA-seq samples that do not align too well by using either CCA (in Seurat 2) or Seurat 3 methods (with batch correction in Harmony, LIGER, Conos, etc., as we have discussed above). In that case, the functions that compute the CONSERVED MARKERS (FindConservedMarkers) or DIFFERENTIAL MARKERS (FindMarkers) likely fail on the cell clusters that DO NOT ALIGN.

How could I still compute the CONSERVED or DIFFERENTIAL MARKERS on the cell clusters that DO align (to some extent)? If anyone has experience with this, please share it. Many thanks for your suggestions; be safe, stay healthy,

-- bogdan

PS: I've posted a similar question on the Seurat GitHub page, and I have not heard from Seurat's authors about it for a while.

https://github.com/satijalab/seurat/issues/2849

ADD COMMENT • link 5.3 years ago by Bogdan ★ 1.4k

0

I think most batch correction algorithms over-process/over-normalize the data. They implicitly make assumptions, e.g. that scRNA-seq data form a neighbor graph, which many real-life datasets may not satisfy. And people should accept the fact that not all samples can be merged.

ADD REPLY • link 5.3 years ago by shoujun.gu ▴ 350

0

we have 2 scRNA-seq samples that do not align too well

How do you determine if they align well?

functions that compute the CONSERVED MARKERS (FindConservedMarkers) or DIFFERENTIAL MARKERS (FindMarkers) likely fail on the cell clusters that DO NOT ALIGN

Why are they likely to fail? Why not try to see if they actually fail?

ADD REPLY • link 5.3 years ago by igor 13k

0

Hi Igor, thank you for your notes. They were very helpful and pointed me in the right direction, many thanks!

Regarding the alignment of cells, we evaluate it mainly by visual examination of t-SNE or UMAP plots, and by the number of cells from the different samples in each cluster (i.e. the ratio).
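The per-cluster ratio described above can be tabulated in base R. This is only a minimal sketch: the `meta` data frame, its column names, and the `min_cells` threshold are hypothetical stand-ins for the cell-level metadata of an integrated object, not values taken from this thread.

```r
# Hypothetical per-cell metadata; in practice these columns would come from
# the cell-level metadata of the integrated object.
meta <- data.frame(
  cluster = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2),
  sample  = c("CTRL", "CTRL", "STIM", "STIM",
              "CTRL", "CTRL", "CTRL", "STIM",
              "STIM", "STIM", "STIM", "STIM")
)

# Number of cells per cluster (rows) and sample (columns)
counts <- table(meta$cluster, meta$sample)

# Fraction of each cluster contributed by each sample (each row sums to 1)
fractions <- prop.table(counts, margin = 1)

# Keep only clusters in which every sample contributes at least `min_cells`
# cells; the threshold is an arbitrary illustration, not a recommendation.
min_cells <- 1
aligned_clusters <- rownames(counts)[apply(counts >= min_cells, 1, all)]

print(round(fractions, 2))
print(aligned_clusters)
```

Clusters that pass such a filter could then be looped over for marker detection, while clusters dominated by a single sample are skipped.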

Regarding your second question, you were right: it was an oversight on my side. I had tried to print more differential genes than were available in the list:

    # Identity labels for cluster i in each of the two conditions
    IDENT1 <- paste0(i, "_", CTRL)
    IDENT2 <- paste0(i, "_", STIM)

    # Differential markers between the two conditions within cluster i
    LIST.CLUSTERS.and.DIFFERENTIAL.MARKERS[[i + 1]] <- FindMarkers(samples.combined,
                                                                   ident.1 = IDENT1,
                                                                   ident.2 = IDENT2,
                                                                   print.bar = FALSE, only.pos = FALSE)

    x <- as.data.frame(as.matrix(LIST.CLUSTERS.and.DIFFERENTIAL.MARKERS[[i + 1]]))
    x$gene <- row.names(x)

    write.table(x, file = paste(NAME,
                                "figure8.samples.combined.here.DIFFERENTIAL.MARKERS.cluster", i, "LIST.txt", sep = "."),
                sep = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE)

    # Number of differential genes actually returned for this cluster
    x_count_genes <- nrow(x)
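One way this kind of oversight can show up is when a fixed number of top genes is requested from a table that has fewer rows. The sketch below uses a hypothetical marker table (invented gene names and values) in place of the real FindMarkers output, and caps the request at the number of rows available.

```r
# Hypothetical marker table standing in for the FindMarkers result `x`
x <- data.frame(
  gene      = c("GeneA", "GeneB", "GeneC"),
  avg_logFC = c(1.8, -1.2, 0.9),
  p_val_adj = c(1e-10, 1e-6, 0.03)
)

n_requested <- 10  # more rows than the table contains

# Indexing with 1:n_requested would pad the result with NA rows;
# capping the request at nrow(x) avoids that.
n_available <- min(n_requested, nrow(x))
top_genes <- x[seq_len(n_available), ]

# head() applies the same cap internally
top_genes_head <- head(x, n_requested)
```

Checking `nrow(x)` before indexing also makes it easy to skip clusters that return no differential genes at all.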

ADD REPLY • link 5.3 years ago by Bogdan ★ 1.4k

score 6 · Accepted Answer · 2019-10-04

From experience, SCTransform does not perform well unless the majority of the cells are of the same type. It will force true unique populations together with a heavy hand, whereas MNN is much more orthogonal in its changes. Seurat even has a wrapper around fastMNN.

Haven't tried the other options though, so can't speak to them.

score 5 · Accepted Answer · 2019-10-05

5

5.9 years ago

igor 13k

The results seem to be very experiment-specific. For example, in today's SCRIBE pre-print, all the methods (except the one introduced) perform poorly:

[Image: method comparison figure from the pre-print]

One thing to notice is that they all fail in different ways, so the problems don't seem to be due to some artifact in the data itself. For example, MNN mixes NF and TH, but Seurat splits PEP.

ADD COMMENT • link 5.9 years ago by igor 13k

1

Yeah, it'd be great if someone did a nice comparison of methods given how many there are. Like the dynverse did for trajectory analysis.

ADD REPLY • link 5.9 years ago by jared.andrews07 ★ 19k

2

There is finally a fairly comprehensive comparison, both in terms of the number of methods as well as the number of datasets: A benchmark of batch-effect correction methods for single-cell RNA sequencing data:

We tested 14 state-of-the-art batch correction algorithms designed to handle single-cell transcriptomic data. We found that each batch-effect removal method has its advantages and limitations, with no clearly superior method. Based on our results, we found LIGER, Harmony, and Seurat 3 to be the top batch mixing methods.

[Image: fig2]

ADD REPLY • link 5.6 years ago by igor 13k

0

I hope I am wrong, but I am not sure anything like dynverse will ever happen again.

ADD REPLY • link 5.9 years ago by igor 13k

0

I doubt it too, but it's an incredible resource and the people behind it deserve a hell of a lot of credit.

ADD REPLY • link 5.9 years ago by jared.andrews07 ★ 19k

GenoMax · Accepted Answer · 2019-10-04

4

5.9 years ago

shoujun.gu ▴ 350

So far, MNN is the best (but still very limited) algorithm for general batch effect correction. But based on a recent paper (https://www.nature.com/articles/s41587-019-0113-3), in some situations it shows only a minor improvement over doing nothing. It all depends on how good your data are.

ADD COMMENT • link updated 5.9 years ago by GenoMax 153k • written 5.9 years ago by shoujun.gu ▴ 350

0

Thank you Shoujun, I am very glad that we can call Scanorama from R using the reticulate package (https://github.com/brianhie/scanorama).

ADD REPLY • link 5.9 years ago by Bogdan ★ 1.4k

0

It also depends on what you do and how you want to analyze your data. If you have data with conditions, e.g. a perturbation versus wildtype, then your data can be as good as they come and you will still (if the perturbation has a strong effect) see clustering driven almost completely by the presence or absence of the perturbation. If you want a unified clustering landscape, assuming that the cell types are actually very similar in both cases, you will need to use batch correction to enforce clustering by cell type rather than by condition. What I want to say is that it is not only the quality of the data. Batches can also be treatment conditions, or individual donors (like patients). I recommend checking fastMNN from the batchelor package (Bioconductor); it worked well for my perturbation data.
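As a rough illustration of treating the condition as the batch, a minimal fastMNN call might look like the sketch below. The simulated counts, the `condition` labels, and the choice of `d = 20` are assumptions made for the example only, and random data like this contain no real biology to recover.

```r
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

set.seed(1)

# Simulated counts: 200 genes x 300 cells (purely illustrative)
counts <- matrix(rpois(200 * 300, lambda = 2), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("cell", 1:300)))

sce <- SingleCellExperiment(assays = list(counts = counts))

# Hypothetical condition labels, used here as the "batch"
sce$condition <- rep(c("wildtype", "perturbed"), each = 150)

# fastMNN expects log-normalized expression values
sce <- logNormCounts(sce)

# Correct across conditions; d is the number of PCs kept (assumed value)
corrected <- fastMNN(sce, batch = sce$condition, d = 20)

# Corrected low-dimensional coordinates, one row per cell
mnn_coords <- reducedDim(corrected, "corrected")
dim(mnn_coords)
```

The corrected coordinates would typically feed into clustering and UMAP/t-SNE, while differential expression between conditions is usually still done on the uncorrected expression values.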

ADD REPLY • link 4.6 years ago by ATpoint 89k