I have a task to overlay “good and known” single-cell annotations onto a newly generated PC matrix (or its corresponding kNN graph) of the same dataset (the cells retain their annotations). The cell types may differ in the new clustering, but they may also largely overlap with the known labels.
Is there an automated way (or utility) to estimate cell label fitness in a clustering (with some confidence stats)?
This is something which is coming up fairly frequently for me, and I'm all ears for other solutions.
From first principles, one can define a notion of "consistency" between a labeling L, an embedding X, and a clustering algorithm f(X; P) as the existence of parameter values P* such that f(X; P*)_i = L_i for every cell i. In other words, if there is some k and resolution (for Leiden clustering) that recapitulate the "good and known" labels, then the embedding is "consistent" with those labels (for Leiden clustering).
In practice, I vary k and the resolution and compute the adjusted Rand index (ARI) of the resulting clusters against the original labels.
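For concreteness, a minimal sketch of that sweep in Python, assuming a scanpy AnnData object; the names adata, X_pca, and known_labels, as well as the parameter grids, are placeholders:

```python
import scanpy as sc
from sklearn.metrics import adjusted_rand_score

results = []
for k in (10, 15, 30, 50):
    # Rebuild the kNN graph on the PCs for this choice of k.
    sc.pp.neighbors(adata, n_neighbors=k, use_rep="X_pca")
    for res in (0.2, 0.5, 1.0, 2.0):
        key = f"leiden_k{k}_r{res}"
        sc.tl.leiden(adata, resolution=res, key_added=key)
        ari = adjusted_rand_score(adata.obs["known_labels"], adata.obs[key])
        results.append((k, res, ari))

# The embedding is "consistent" with the labels (for Leiden) if some
# (k, resolution) pair reaches a high ARI.
best = max(results, key=lambda t: t[2])
print("best (k, resolution, ARI):", best)
```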
The notion of "confidence stats" is ill-defined in this scenario. What sampling process are you attempting to characterize? The data-generating process for this dataset? The approximation error of the algorithms? You can "blindly" apply a bootstrap or jackknife, and this will produce confidence estimates, but it is not clear what random variable they are estimates of...
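For illustration only, such a "blind" bootstrap over cells could look like the sketch below; the function name, the cell-resampling scheme, and the 95% percentile interval are all assumptions, not a recommendation, and the caveat above still applies:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def bootstrap_ari(labels_true, labels_pred, n_boot=1000, seed=0):
    """Resample cells with replacement and recompute the ARI each time."""
    rng = np.random.default_rng(seed)
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    aris = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample cells with replacement
        aris[b] = adjusted_rand_score(labels_true[idx], labels_pred[idx])
    # Naive 95% percentile interval; what it actually estimates is unclear.
    return np.percentile(aris, [2.5, 97.5])
```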
Thank you - I see the point. You touch on the underlying theory to explain why and how the new clustering differs (e.g., in ARI). But the task at hand is simpler and more "technical": to reconcile the given labels with the given clustering, without reference to how the clusters were generated or to the baseline expression data. In simple terms, if a cluster is 96% label X, 3% label Y, and 1% a mixture of other labels, then X is propagated to the new annotation with 96% support; if the cluster is a complex mixture of labels, no annotation is propagated.
I wonder whether there is an automated tool for that, or whether we need to code something ourselves.
Please do not add answers unless you're answering the top-level question. Instead, use Add Comment or Add Reply as appropriate. I've moved your post to the right location this time; please be more careful in the future.
Doing what you want takes only a few lines of code. You can use table() in R or .value_counts() in pandas, normalize by cluster size, and exclude labels whose within-cluster fraction falls below 80% (or whatever threshold you set).
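One possible sketch in pandas, assuming a DataFrame df with one row per cell; the column names cluster and known_label and the 0.80 threshold are placeholders:

```python
import pandas as pd

def propagate_labels(df, threshold=0.80):
    """Return, per cluster, the dominant known label and its support."""
    # Fraction of each known label within each cluster.
    frac = (
        df.groupby("cluster")["known_label"]
          .value_counts(normalize=True)
          .rename("support")
          .reset_index()
    )
    # Keep the most frequent known label per cluster.
    top = (
        frac.sort_values(["cluster", "support"], ascending=[True, False])
            .drop_duplicates(subset="cluster", keep="first")
    )
    # Propagate only labels that clear the threshold; clusters that are a
    # complex mixture (no dominant label) are simply left unannotated.
    return top[top["support"] >= threshold].set_index("cluster")
```

Calling this on, e.g., your per-cell metadata table (with the cluster and label columns renamed accordingly) gives one propagated label plus its support per cluster; clusters absent from the result received no annotation.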