Question

Seurat "FindMarkers" and "FindAllMarkers" vs. Scanpy "rank_genes_groups"

4.9 years ago

krorange ▴ 20

I am processing the same dataset with both Seurat and Scanpy. In Seurat I got 3 clusters, and cluster 2 looks like the target cell type; in Scanpy I got 2 clusters, and cluster 1 looks like the target. I am trying to find the marker genes that show up in both target clusters, but I have two questions:

1. My understanding is that if I don't specify ident.2 in Seurat's "FindMarkers", Seurat finds marker genes by comparing the ident.1 cluster against all the other cells. Is that right? If so, what is the difference between FindMarkers(object = dataa, ident.1 = 2, min.pct = 0.25), FindMarkers(object = dataa, ident.1 = 2, ident.2 = c(0, 1), min.pct = 0.25) and FindAllMarkers(object = dataa, min.pct = 0.25)?
2. Since I'm comparing the Seurat result with Scanpy's "sc.tl.rank_genes_groups", which of the calls in question 1 should I compare it with?

I'm really confused, so it would be helpful if someone could explain these to me.

Thank you so much!

scRNA Seurat R single-cell Scanpy • 15k views

ADD COMMENT • link updated 4.3 years ago by mn.duong ▴ 50 • written 4.9 years ago by krorange ▴ 20

score 5 · Answer 1 · 2021-02-10

4.3 years ago

mn.duong ▴ 50

Actually, there is a big difference between Scanpy and Seurat in how the fold change of differentially expressed genes (DEGs) is calculated.

In Scanpy, according to the source code of rank_genes_groups (https://github.com/theislab/scanpy/blob/master/scanpy/tools/_rank_genes_groups.py), the fold change is calculated by taking the exponential of the mean of the log values (line 416).

By contrast, in Seurat's FindMarkers function, the fold change is calculated by taking the mean of the exponential of the log values (https://github.com/satijalab/seurat/blob/master/R/differential_expression.R, line 922), which makes more sense.
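To make the difference concrete, here is a minimal sketch in plain Python. It is not the code of either package: the function names, the toy values, the small `eps` and `pseudocount` terms, and the use of log2 in both functions are assumptions chosen only to show the order of operations. The input values are assumed to be log1p-transformed expression for one gene.

```python
import math

def log2fc_exp_of_mean(group, rest, eps=1e-9):
    """Average the log1p values first, then undo the log (exp of the mean of logs)."""
    mean_g = sum(group) / len(group)
    mean_r = sum(rest) / len(rest)
    # eps is an assumed small constant that only guards against division by zero
    return math.log2((math.expm1(mean_g) + eps) / (math.expm1(mean_r) + eps))

def log2fc_mean_of_exp(group, rest, pseudocount=1.0):
    """Undo the log for every cell first, then average (mean of the exp of logs)."""
    mean_g = sum(math.expm1(x) for x in group) / len(group)
    mean_r = sum(math.expm1(x) for x in rest) / len(rest)
    # pseudocount is an assumed constant that keeps the ratio finite
    return math.log2((mean_g + pseudocount) / (mean_r + pseudocount))

# Toy gene: strongly expressed in one of five cells of the group,
# weakly but uniformly expressed in the rest.
group = [0.0, 0.0, 0.0, 0.0, math.log1p(99.0)]
rest = [math.log1p(1.0)] * 5

fc_exp_of_mean = log2fc_exp_of_mean(group, rest)
fc_mean_of_exp = log2fc_mean_of_exp(group, rest)
print(fc_exp_of_mean, fc_mean_of_exp)
```

With this toy gene the mean-of-exp value comes out much larger than the exp-of-mean value, because a single high-expressing cell dominates the arithmetic mean but is damped when the logs are averaged first.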

My colleague used Scanpy and I used Seurat to analyze the same dataset, and we got quite different fold change values. After some troubleshooting, we traced the cause to the difference explained above. I hope the authors of Scanpy can modify the code, or add a statement clarifying the formula they use.

ADD COMMENT • link 4.3 years ago by mn.duong ▴ 50

You can always open an issue on their GitHub asking for clarification, but please browse the existing issues first to make sure this has not been covered before. Just as a comment: when comparing methods, you need to ensure that exactly the same normalization and clustering have been performed. I doubt that this is the case for your results and those of your colleague, so it is expected that the results differ, regardless of method.

ADD REPLY • link 4.3 years ago by ATpoint 88k

Well, to compare the Scanpy and Seurat methods, we started from the same simple dataset and performed the different steps in parallel, including filtering and normalization (clustering was not performed because we compared all cells from 2 conditions). We used the same parameters and double-checked that both methods gave the same results at each step. The only difference between the two methods was in the DEG analysis, with the cause explained above.

After rechecking several discussions, it appears that the Scanpy authors use geometric means on purpose, to be less sensitive to outliers, while Seurat uses arithmetic means (which are more sensitive when a given gene is expressed in only a small subset of cells in one group). The Scanpy docs say of the log fold change: "Note: this is an approximation calculated from mean-log values.", but that may not be clear enough.

I just wanted to clarify the comment of Tris above ("looks like the approaches are not that different, and Scanpy's rank_genes_groups is similar to Seurat FindMarkers"). In light of these findings, the 2 methods are unlikely to give exactly the same results (gene lists and fold changes), depending on the dispersion of the gene expression. Both methods are valid anyway. People use one or the other according to their preference for Python or R, and since they rarely use both approaches at the same time, they need to be aware of the differences they may find when they do use both.

ADD REPLY • link 4.3 years ago by mn.duong ▴ 50

score 1 · Answer 2 · 2020-08-16

Part of the answer for Seurat can be found in this thread: FindConservedMarkers vs FindMarkers vs FindAllMarkers Seurat

For Scanpy, the docstring in the source code (https://github.com/theislab/scanpy/blob/master/scanpy/tools/_rank_genes_groups.py) describes the available tests:

method
    The default method is `'t-test'`,
    `'t-test_overestim_var'` overestimates variance of each group,
    `'wilcoxon'` uses Wilcoxon rank-sum,
    `'logreg'` uses logistic regression. See [Ntranos18]_,
    `here <https://github.com/theislab/scanpy/issues/95>`__ and `here
    <http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters>`__,
    for why this is meaningful.

Looks like the approaches are not that different, and Scanpy's rank_genes_groups is similar to Seurat FindMarkers.
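On the first question, it may help to look only at which cells end up on each side of the comparison. The sketch below is plain Python, not Seurat or Scanpy code; the helper names and the toy cluster labels are made up for illustration. It assumes the usual behaviour that FindMarkers without ident.2 compares ident.1 against all other cells, that FindAllMarkers repeats that one-vs-rest comparison for every cluster, and that rank_genes_groups with its default reference='rest' does the same for every group.

```python
def one_vs_rest(labels, target):
    """Cell indices for `target` versus every other cluster (ident.2 left unset)."""
    group = [i for i, lab in enumerate(labels) if lab == target]
    reference = [i for i, lab in enumerate(labels) if lab != target]
    return group, reference

def one_vs_selected(labels, target, others):
    """Cell indices for `target` versus only the clusters listed in `others`."""
    group = [i for i, lab in enumerate(labels) if lab == target]
    reference = [i for i, lab in enumerate(labels) if lab in others]
    return group, reference

def all_vs_rest(labels):
    """One-vs-rest split for every cluster in turn."""
    return {c: one_vs_rest(labels, c) for c in sorted(set(labels))}

# Toy clustering with three clusters, as in the Seurat result of the question.
labels = [0, 0, 1, 1, 2, 2, 2]

print(one_vs_rest(labels, 2))
print(one_vs_selected(labels, 2, {0, 1}))
print(all_vs_rest(labels)[2])
```

With exactly three clusters, all three splits for cluster 2 contain the same cells, so the first two FindMarkers calls in the question compare the same groups, and the cluster 2 part of FindAllMarkers uses that comparison as well. The splits would differ only if ident.2 left out some of the remaining clusters.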