Hi,
I am analyzing two scRNA-seq samples. For the ambient RNA removal step, I tested two tools: SoupX and DecontX. However, the two tools predict very different ambient RNA fractions for some cells. I know both are sensitive to the clustering information provided as input, so I ran both using the naive clustering from a preliminary Seurat run with 19 clusters.
SoupX estimates a mean contamination fraction (rho) of about 0.01 per cell, which then varies slightly from cell to cell:
summary(df_contamination$soupX_contamination)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.002320 0.009352 0.011108 0.010817 0.012374 0.037122
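For reference, a per-cell fraction like the one summarized above can be obtained by comparing the raw and corrected count matrices. The post does not say how `soupX_contamination` was computed, so the following is only a minimal base R sketch under that assumption; `raw` and `corrected` are toy genes-by-cells matrices (hypothetical names) standing in for the real input and the matrix returned by the correction step.

```r
# Toy data: `raw` and `corrected` are hypothetical stand-ins, not real output.
set.seed(42)
n_genes <- 200
n_cells <- 30
raw <- matrix(
  rpois(n_genes * n_cells, lambda = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", 1:n_genes), paste0("cell", 1:n_cells))
)

# Toy "correction": remove roughly 1% of the counts by binomial thinning.
removed <- matrix(rbinom(length(raw), size = raw, prob = 0.01), nrow = n_genes)
corrected <- raw - removed

# Fraction of each cell's UMIs that the correction removed.
per_cell_contamination <- 1 - colSums(corrected) / colSums(raw)
summary(per_cell_contamination)
```

Computing the fraction the same way for both tools puts the two summaries on the same scale before comparing them.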
However, decontX predicts a much wider range of contamination, with certain cells reaching 95%:
summary(df_contamination$decontX_contaminationSeurat)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0005773 0.0258687 0.0560367 0.0977059 0.1146198 0.9553540
And this is the comparison between methods:
soupX vs decontX all clusters:
However, looking cluster by cluster, some clusters have a much higher predicted ambient fraction than others:
soupX vs decontX clusters 0 to 11:
soupX vs decontX clusters 12 to 18:
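The per-cluster comparison can also be tabulated rather than only plotted. This is a minimal base R sketch on toy data: the two contamination columns use the names from the post, while the `cluster` column and all values are assumptions made for illustration.

```r
# Toy stand-in for df_contamination; the `cluster` column is an assumption.
set.seed(1)
df_contamination <- data.frame(
  cluster = factor(rep(0:2, each = 50)),
  soupX_contamination = runif(150, 0.002, 0.04),
  decontX_contaminationSeurat = rbeta(150, 1, 9)
)

# Median estimate per cluster for each method.
per_cluster <- aggregate(
  cbind(soupX_contamination, decontX_contaminationSeurat) ~ cluster,
  data = df_contamination,
  FUN = median
)

# How many times higher the decontX median is than the SoupX median.
per_cluster$ratio <- per_cluster$decontX_contaminationSeurat /
  per_cluster$soupX_contamination
per_cluster <- per_cluster[order(-per_cluster$ratio), ]
print(per_cluster)

# Within each cluster, do the two methods at least rank cells similarly?
rank_agreement <- sapply(
  split(df_contamination, df_contamination$cluster),
  function(d) cor(d$soupX_contamination, d$decontX_contaminationSeurat,
                  method = "spearman")
)
print(rank_agreement)
```

Clusters at the top of the table are the ones where the two methods disagree most, and a low rank correlation within a cluster suggests the disagreement is more than a difference in scale.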
I know that both methods use over/under-representation of markers in the soup vs markers in the cluster to ascertain which genes belong to which fraction.
However, my biggest concern is that this tissue is aortic aneurysm. We expect the cell dissociation process to be much harsher on some cell types than others, probably on the ones we are most interested in. Genes from those cell types could then be overrepresented in the soup, and the tools could overcorrect for them.
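One way to check this concern directly is to compare the expression profile of empty droplets with that of cell-containing droplets and see whether the markers of interest are enriched in the soup. The sketch below uses base R on toy data only; the droplet matrix, the 10-UMI cutoff for calling a droplet empty, and the marker names are all assumptions for illustration, not values from this dataset.

```r
# Toy droplet matrix (genes x droplets); every value here is made up.
set.seed(7)
n_genes <- 100
genes <- paste0("gene", seq_len(n_genes))

# Simulated soup that is enriched for gene1..gene5 (the "fragile" cell type).
soup_prob <- rep(1, n_genes)
soup_prob[1:5] <- 20
soup_prob <- soup_prob / sum(soup_prob)

empty <- rmultinom(500, size = 5, prob = soup_prob)            # empty droplets
cells <- rmultinom(50, size = 2000, prob = rep(1 / n_genes, n_genes))
droplets <- cbind(empty, cells)
rownames(droplets) <- genes

# Call droplets below an assumed UMI cutoff "empty".
n_umi <- colSums(droplets)
is_empty <- n_umi < 10

# Gene fractions in the soup and in cell-containing droplets.
soup_profile <- rowSums(droplets[, is_empty, drop = FALSE]) /
  sum(droplets[, is_empty])
cell_profile <- rowSums(droplets[, !is_empty, drop = FALSE]) /
  sum(droplets[, !is_empty])

# Genes with a ratio well above 1 are overrepresented in the soup.
enrichment <- sort(soup_profile / cell_profile, decreasing = TRUE)
markers_of_interest <- c("gene1", "gene2", "gene50")  # hypothetical markers
print(enrichment[markers_of_interest])
print(head(enrichment, 10))
```

If the markers of the fragile cell types sit near the top of this ranking in the real data, that would support the worry that their signal is at risk of being overcorrected.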
Still, decontX seems to be able to filter out celltype-specific markers derived from the literature in different clusters:
decontX celltype markers:
Which of the two methods is more reliable? Which would work better in this case? Should I simply skip the ambient RNA removal step?
Very good question and analysis. I have personally only tried SoupX, so I can't weigh in on the comparison. This might not really be the issue here, but I suggest you check your clustering. I don't have any input on how much good clustering matters for proper ambient mRNA decontamination (maybe someone else can weigh in on that), but I know that when interpreting scRNA-seq data, naive clustering with Seurat without any consideration of the optimal clustering resolution can be fraught with peril. I tried clustree but found it very clunky and non-transparent. I personally switched to cNMF as the clustering tool, and so far it has been giving me quite nice results... It just works.
The point is that you need to feed the ambient algorithm a clustering list as input. This way, it checks for genes present in the soup vs genes highly expressed in the cluster. And those genes are kept (as they should belong to the cell fraction in that population) while the rest of the ambient genes are supposed to belong to the soup contamination and are "removed" from that cluster.
I suppose this helps when the soup is composed mainly of a subset of cell types, so that the signal from the actual cells of those types is not removed.
But none of this explains the wild differences in the estimated ambient fraction between the two methods, even when the clustering information is identical.