Hi everyone,
I am working with data where we have a short list of genes (fewer than 50 in total, split into 10 or so groups) that we want to use to cluster our data at a broad level before doing subcluster analysis on each broad cluster. I've been toying with the idea of doing my UMAPs and graph-based clustering using the PCs that capture the most variation in my short list of genes. This was straightforward to implement and I am now evaluating the quality of the results. Meanwhile, I was wondering if anyone can point me to a reference where this approach has been taken before? Or share personal experiences? It's not something I've seen, but intuitively it makes sense to me if you want supervised clusters in accordance with curated genes. I like using PCs because the selected ones should also capture coexpressed genes that are not in my original list but are related to the same variation. I am concerned about using non-consecutive PCs, though, as that seems particularly unconventional.
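For anyone trying to picture the idea, here is a minimal sketch of one possible reading of it, not necessarily what the poster implemented: rank PCs by the share of their squared loadings that falls on the curated genes. The function name `rank_pcs_by_gene_list`, the synthetic data, and the gene labels are all hypothetical, and plain NumPy SVD stands in for whatever PCA routine is actually used.

```python
import numpy as np

def rank_pcs_by_gene_list(X, gene_names, curated, n_pcs=10):
    """Rank PCs by the share of squared loading weight on curated genes.

    X is cells x genes. Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)  # center each gene
    # Rows of Vt are unit-length loading vectors, ordered by explained variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vt = Vt[:n_pcs]
    mask = np.isin(gene_names, curated)
    # Each row of Vt has squared norm 1, so this share lies in [0, 1].
    share = (Vt[:, mask] ** 2).sum(axis=1)
    order = np.argsort(share)[::-1]      # PC indices, most list-driven first
    scores = U[:, :n_pcs] * S[:n_pcs]    # cell coordinates on each PC
    return order, share, scores

# Synthetic demo (assumption): 200 cells, 30 genes, and the first 3 genes
# differ strongly between two groups of cells.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
group = np.repeat([0, 1], 100)
X[:, :3] += 5.0 * group[:, None]
gene_names = np.array([f"gene{i}" for i in range(30)])
curated = ["gene0", "gene1", "gene2"]

order, share, scores = rank_pcs_by_gene_list(X, gene_names, curated, n_pcs=10)
selected = order[:3]                 # may well be non-consecutive PC indices
embedding_input = scores[:, selected]
```

The `selected` indices are exactly where the non-consecutive PC concern shows up: nothing forces the top-ranked PCs to be PC1, PC2, PC3.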
Thank you for your comment! I didn't know a UMAP embedding can be learnt from a subset like that. The UMAP documentation seems to have a section discussing this, so I'll look it over (https://umap-learn.readthedocs.io/en/latest/supervised.html).
Concerning bias, I'm not sure why it's an issue in the first place if I'm intentionally trying to separate my cells based on a restricted set of genes? It just strikes me as sensible if one first wants a broad grouping (e.g. EX neurons, IN neurons, Glia) before going into those groups for fine-grained analysis.
In my opinion, the issue is that cells may not actually separate based on a restricted set of genes. But if you pick a restricted set of PCs derived from a restricted set of genes, chances are that you can separate them any way you want, because you get to pick exactly the variables that produce your intended clustering. That's why I suggested UMAP on a restricted set of genes, because that at least removes the second type of bias.
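As a rough sketch of that suggestion, assuming a cells x genes matrix held as a NumPy array: subset the matrix to the curated genes and scale it, with no PC picking in between. The helper names and the gene symbols below are purely illustrative.

```python
import numpy as np

def restrict_to_genes(X, gene_names, curated):
    """Keep only the columns of X (cells x genes) whose gene is in `curated`."""
    gene_names = np.asarray(gene_names)
    mask = np.isin(gene_names, curated)
    kept = gene_names[mask]
    # Report curated genes that are absent from the matrix instead of failing silently.
    missing = sorted(set(curated) - set(kept))
    return X[:, mask], kept, missing

def zscore(X):
    """Scale each gene to mean 0 and unit variance (constant genes are left at 0)."""
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0
    return (X - X.mean(axis=0)) / sd

# Toy data (assumption): 8 cells, 5 genes with illustrative symbols.
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(8, 5)).astype(float)
genes = ["GAD1", "SLC17A7", "GFAP", "ACTB", "MBP"]
curated = ["GAD1", "GFAP", "NOTAGENE"]

X_sub, kept, missing = restrict_to_genes(X, genes, curated)
Z = zscore(X_sub)
# Z is what would be handed to the embedding step, e.g. (if umap-learn is
# installed) umap.UMAP().fit_transform(Z), and to graph-based clustering.
```

The point of the design is that the only choice made is the gene list itself; whether the cells separate on it is then left to the data.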
Ahh, I see. Yes, that makes sense. Thanks for taking the time to explain.