I am processing the same dataset with both Seurat and Scanpy. In Seurat I got 3 clusters, and cluster 2 looks like the target cell type; in Scanpy I got 2 clusters, and cluster 1 looks like the target. I am trying to find the marker genes that show up in both target clusters, but I have two questions:
My understanding is that if I don't specify ident.2 in Seurat's "FindMarkers", Seurat finds the marker genes by comparing the ident.1 cluster to all the remaining cells. Am I right? If so, what is the difference between FindMarkers(object = dataa, ident.1 = 2, min.pct = 0.25), FindMarkers(object = dataa, ident.1 = 2, ident.2 = c(0, 1), min.pct = 0.25) and FindAllMarkers(object = dataa, min.pct = 0.25)?
Since I'm comparing the Seurat result with Scanpy's "sc.tl.rank_genes_groups", which of the calls in question 1 should I compare it with?
I'm really confused, so it would be helpful if someone could explain these to me.
Thank you so much!
You can always open an issue on their GitHub asking for clarification, but please browse the existing issues first to make sure this has not been covered before. Just as a comment: you need to ensure that exactly the same normalization and clustering have been performed when comparing methods. I doubt that these will be the same between your results and those of your colleague, so it is expected that the results change, regardless of method.
Well, to compare the Scanpy and Seurat methods, we started from the same simple dataset and performed the different steps in parallel, including filtering and normalization (clustering was not performed because we compared all cells from 2 conditions). We used the same parameters and double-checked that both methods gave the same results at each step. The only difference between the two methods was found in the DEG analysis, with the cause explained above.
After rechecking different discussions: the Scanpy authors use geometric means on purpose, to be less sensitive to outliers, while Seurat uses arithmetic means (which are more sensitive when a given gene is expressed in only a small subset of cells in one group). For the LogFoldChange, the Scanpy docs say "Note: this is an approximation calculated from mean-log values.", but this may not be clear enough.
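To make the difference concrete, here is a minimal sketch in plain Python. It is not code from either package: the two formulas are my reading of how each tool derives a log2 fold change, the pseudocount of 1 and the 1e-9 offset are assumptions (check your installed versions), and the expression values are made up for illustration.

```python
import math

def log2fc_arithmetic(group, rest, pseudocount=1.0):
    """log2 ratio of arithmetic means (the Seurat-style calculation, as assumed here)."""
    mean_group = sum(group) / len(group)
    mean_rest = sum(rest) / len(rest)
    return math.log2((mean_group + pseudocount) / (mean_rest + pseudocount))

def log2fc_mean_of_logs(group, rest, eps=1e-9):
    """log2 ratio built from mean log1p values (the Scanpy-style calculation, as assumed here)."""
    mean_log_group = sum(math.log1p(x) for x in group) / len(group)
    mean_log_rest = sum(math.log1p(x) for x in rest) / len(rest)
    # Back-transform the mean of logs; this behaves like a geometric mean of (1 + x).
    return math.log2((math.expm1(mean_log_group) + eps) / (math.expm1(mean_log_rest) + eps))

# Hypothetical gene: one high-expressing cell in the group, uniform low expression elsewhere.
group = [0.0] * 9 + [100.0]
rest = [1.0] * 10

fc_arith = log2fc_arithmetic(group, rest)
fc_logs = log2fc_mean_of_logs(group, rest)
print(f"arithmetic means: {fc_arith:.2f}")
print(f"mean of logs:     {fc_logs:.2f}")
```

With these made-up values the arithmetic-mean version reports the gene as up in the group, while the mean-of-logs version reports it as slightly down, because the single high cell dominates the arithmetic mean but not the mean of the logs.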
I just wanted to clarify Tris's comment above ("looks like the approaches are not that different, and Scanpy's rank_genes_group is similar to Seurat FindMarkers"). In light of these findings, the 2 methods will likely not give exactly the same results (gene lists and fold changes), depending on the dispersion of the gene expression. Both methods are valid anyway. People use one or the other according to their preference for Python or R, but since it is rare to use both approaches at the same time, they need to be aware of the differences they may find when they do.