Hi,
I have been trying different methods to identify immune sub-populations.
For cell identification in general, the most commonly used method in papers seems to be SingleR; others cluster the cells and define the clusters based on the most highly variable genes (HVGs).
Since there is no gold standard to date, I have tried SingleR with different references (fine labels), ProjecTILs, scPred, and identifying markers for different clusters - but none of these methods agree with each other! For major labels this is less of a problem.
So, for example, how can I (or any paper, for that matter) justify that a cell is a CD4 effector memory T cell if none of the labels agree across all these methods? Do people just base their cluster or cell annotation on the method that best fits their bias? It would be great to hear some opinions on how to move forward with cell subtype identification in my data. Thank you!
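One way to make the disagreement described above concrete is to measure it per pair of methods and per cell. The following is only a minimal sketch in plain Python, not part of any of the packages mentioned: it assumes the per-cell labels have been exported from R and already mapped onto one shared vocabulary (different references name the same subtype differently, so that harmonization step is assumed, not shown). The label names, the toy data, and the helper names `pairwise_agreement` and `consensus` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-cell labels from three methods (toy data, same cell order,
# labels assumed to be harmonized to a shared vocabulary beforehand).
labels = {
    "SingleR":    ["CD4_EM", "CD4_EM", "CD8_EM", "CD4_Naive", "Treg"],
    "ProjecTILs": ["CD4_EM", "CD8_EM", "CD8_EM", "CD4_Naive", "Treg"],
    "scPred":     ["CD4_CM", "CD4_EM", "CD8_EM", "CD4_Naive", "CD4_EM"],
}

def pairwise_agreement(labels):
    """Fraction of cells on which each pair of methods gives the same label."""
    out = {}
    for a, b in combinations(labels, 2):
        same = sum(x == y for x, y in zip(labels[a], labels[b]))
        out[(a, b)] = same / len(labels[a])
    return out

def consensus(labels):
    """Per cell: the most frequent label and the fraction of methods voting for it.

    Ties are broken by method order, so a low fraction should be read as
    "no consensus" and not as a confident call.
    """
    result = []
    for calls in zip(*labels.values()):
        top, n = Counter(calls).most_common(1)[0]
        result.append((top, n / len(calls)))
    return result

agreement = pairwise_agreement(labels)
votes = consensus(labels)

for pair, frac in agreement.items():
    print(pair, round(frac, 2))
for i, (label, frac) in enumerate(votes):
    print(f"cell {i}: {label} (support {frac:.2f})")
```

Cells with low support are the ones where the choice of method decides the label, so they are probably the ones worth checking by hand against markers.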
I think the annotation tools are great starting points - I have been using them to look at the label distributions along with the dimensional reduction - here is an example (below) I made using the ProjecTILs method. Then I look at the relative assignments by cluster (relative proportions) and finally, as @jared.andrews07 mentions, use manual annotation with canonical markers. For me the annotation step is the most time-consuming part of most projects.
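The "relative proportions by cluster" step can be sketched roughly as follows. This is an illustrative plain-Python version under stated assumptions, not the code used for the example above: cluster IDs and predicted labels are assumed to be two per-cell lists in the same order, and the helper names and the 0.7 purity cutoff are hypothetical choices.

```python
from collections import Counter, defaultdict

def label_proportions_by_cluster(clusters, predicted):
    """For each cluster, the fraction of its cells assigned to each predicted label."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, predicted):
        counts[cluster][label] += 1
    return {
        cluster: {label: n / sum(cnt.values()) for label, n in cnt.items()}
        for cluster, cnt in counts.items()
    }

def mixed_clusters(proportions, min_purity=0.7):
    """Clusters whose dominant label covers less than min_purity of the cells.

    The 0.7 default is an arbitrary illustrative cutoff, not a recommendation.
    """
    return [c for c, props in proportions.items() if max(props.values()) < min_purity]

# Toy data: cluster IDs and per-cell labels from a reference-based tool.
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
predicted = ["CD4_EM", "CD4_EM", "CD4_EM", "CD8_EM",
             "CD8_EM", "CD8_EM", "CD4_EM", "CD4_EM"]

proportions = label_proportions_by_cluster(clusters, predicted)
flagged = mixed_clusters(proportions)

print(proportions)
print("clusters needing manual review:", flagged)
```

A cluster that splits evenly between two labels is a candidate for manual annotation with canonical markers before any label is assigned to it.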
Thank you @jared.andrews07 and @theHumanBorch for taking the time to reply! Especially since I am currently writing my PhD thesis and heavily citing SingleR and scRepertoire!
One of my worries with marker-based annotation, even if I have a strong marker list for my cells of interest, is that I may be forcing labels onto cells that clustered together based on state and not subtype (e.g. CD4 Effector Memory clustering with CD8 Effector Memory, or Anergic CD4 with Anergic CD8). With reference-based annotation, on the other hand, the reference itself introduces a bias, so different people looking at the same dataset but trusting different references will come up with different subtypes.
Indeed, this will hopefully improve over time, and so far I've also been using a combination of methods, similar to what @theHumanBorch is showing, but I find it hard to tell which methods or even references should be given more weight when trying to narrow down these fine-grained labels.
Not a whole lot to do for it - though if you can tie canonical markers (e.g. from flow/IHC that the field typically accepts as truth) to reference data, that can help you pinpoint which are decent.
Like Nick said, there's going to be a manual component, though I'd be inclined to give ProjecTILs a try based on what Nick showed above - I struggled to get labeling that clear for T cell subtypes with most immune references (even if they had labels for such populations).
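The idea of tying canonical markers to reference labels could be checked along these lines. This is a rough sketch under loud assumptions, not an established method: the marker rules (CD3E/CD4/CD8A), the toy expression values, the detection threshold of 0, and the helper name `marker_support` are all illustrative. Dropout in single-cell data means a truly CD4+ cell can show zero CD4 counts, so low support is a prompt to look closer, not proof that a label is wrong.

```python
def marker_support(expr, labels, rules, threshold=0.0):
    """Fraction of cells per label that match that label's canonical marker rule.

    expr:   gene -> list of per-cell expression values (same cell order as labels)
    labels: per-cell labels from one reference-based method
    rules:  label -> (genes expected detected, genes expected absent)
    """
    support = {}
    for label, (positive, negative) in rules.items():
        cells = [i for i, lab in enumerate(labels) if lab == label]
        if not cells:
            continue  # this reference assigned no cells to the label
        matching = sum(
            all(expr[g][i] > threshold for g in positive)
            and all(expr[g][i] <= threshold for g in negative)
            for i in cells
        )
        support[label] = matching / len(cells)
    return support

# Toy expression values for six cells (hypothetical numbers).
expr = {
    "CD3E": [2, 3, 1, 2, 0, 4],
    "CD4":  [1, 2, 0, 0, 0, 1],
    "CD8A": [0, 0, 3, 2, 1, 2],
}
reference_labels = ["CD4_EM", "CD4_EM", "CD8_EM", "CD8_EM", "CD8_EM", "CD4_EM"]

# Illustrative rules only; a real list would come from flow/IHC conventions in the field.
rules = {
    "CD4_EM": (["CD3E", "CD4"], ["CD8A"]),
    "CD8_EM": (["CD3E", "CD8A"], ["CD4"]),
}

support = marker_support(expr, reference_labels, rules)
print(support)
```

Running the same check on the labels from each reference gives one comparable number per label and per reference, which may help decide which reference to trust for a given subtype.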