Question

Selecting specific non-consecutive PCs for scRNAseq analysis

0

2.5 years ago

eturkes • 0

Hi everyone,

I am working with data where we have a short list of genes (fewer than 50 in total, split into 10 or so groups) that we want to use to cluster our data at a broad level before doing subcluster analysis on each broad cluster. I've been toying with the idea of doing my UMAPs and graph-based clustering using the PCs that drive the largest amount of variation for my short list of genes. This was straightforward to implement and I am now evaluating the quality of the results. Meanwhile, I was wondering if anyone can point me to a reference where this approach has been taken before, or share personal experiences? It's not something I've seen, but intuitively it makes sense to me if you want supervised clusters in accordance with curated genes. I like using PCs because the selected ones should also capture coexpressed genes that are not in my original list but are related to the same variation. I am concerned about using non-consecutive PCs, though, as that seems particularly unconventional.
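For readers who want something concrete, here is a minimal sketch of one way such a PC selection could be done. It is not necessarily what was implemented for this post. It assumes a cells x genes matrix that is already normalised and log-transformed, and it scores each PC by how much variance it contributes to the curated genes (squared loadings on those genes times the PC's variance). The function name `rank_pcs_by_gene_set`, the toy data, and the gene names are all hypothetical.

```python
import numpy as np

def rank_pcs_by_gene_set(expr, gene_names, curated_genes, n_pcs=20, n_select=5):
    """Pick the PCs that contribute the most variance to a curated gene set.

    expr: cells x genes matrix (assumed normalised and log-transformed).
    Returns (selected PC indices, per-PC variance in the gene set, cell scores).
    """
    centered = expr - expr.mean(axis=0)
    # SVD of the centred matrix: rows of vt are unit-norm PC loading vectors.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_pcs]                      # n_pcs x genes
    cell_scores = u[:, :n_pcs] * s[:n_pcs]     # cells x n_pcs
    pc_variance = s[:n_pcs] ** 2 / (expr.shape[0] - 1)

    in_set = np.isin(gene_names, curated_genes)
    # Variance that each PC contributes to the curated genes only.
    set_variance = pc_variance * (loadings[:, in_set] ** 2).sum(axis=1)

    # Highest-scoring PCs, returned in ascending index order; they need not
    # be consecutive.
    selected = np.sort(np.argsort(set_variance)[::-1][:n_select])
    return selected, set_variance, cell_scores

# Toy data (hypothetical): 200 cells, 100 genes, signal planted in 5 genes.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 100
gene_names = np.array([f"g{i}" for i in range(n_genes)])
curated = ["g0", "g1", "g2", "g3", "g4"]
expr = rng.normal(size=(n_cells, n_genes))
group = rng.integers(0, 2, size=n_cells)
expr[:, :5] += 3.0 * group[:, None]

selected, set_variance, cell_scores = rank_pcs_by_gene_set(
    expr, gene_names, curated, n_pcs=20, n_select=5
)
reduced = cell_scores[:, selected]   # input for UMAP / graph-based clustering
print("selected PCs (0-based):", selected)
```

The `reduced` matrix is what would be handed to the neighbour graph and UMAP steps in place of the usual first N PCs.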

reduction dimensionality principal scRNAseq components • 871 views

score 1 · Answer 1 · 2022-06-06

1

2.5 years ago

Mensur Dlakic ★ 28k

I have no particular expertise in what you are trying to do.

This strikes me as a very biased approach, and I'd be surprised if it has any future in general use. Picking your own genes to analyze is fine, and you can probably get a UMAP embedding just on that subset. The rest of the genes can then be embedded based on what was learned from the small subset. That will still carry a bias, but it would be a bias that is implied and understood. I think doing it in a contrived way, where one gets to pick and choose which PCs are used, is a different kind of bias.

0

Thank you for your comment! I didn't know a UMAP embedding can be learnt from a subset like that. The UMAP documentation seems to have a section discussing this, so I'll look it over (https://umap-learn.readthedocs.io/en/latest/supervised.html).

Concerning bias, I'm not sure why it's an issue in the first place if I'm intentionally trying to separate my cells based on a restricted set of genes? It just strikes me as sensible if one first wants a broad grouping (e.g. EX neurons, IN neurons, glia) before going into those groups for fine-grained analysis.

ADD REPLY • link 2.5 years ago by eturkes • 0

1

"Concerning bias, I'm not sure why it's an issue in the first place if I'm intentionally trying to separate my cells based on a restricted set of genes?"

In my opinion, the issue is that cells may not be able to separate based on a restricted set of genes. But if you pick a restricted set of PCs derived from a restricted set of genes, chances are that you can separate them any way you want, because you get to pick which variables work exactly for your intended clustering. That's why I suggested UMAP on a restricted set of genes, because that at least removes the second type of bias.
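To make the concern about hand-picking PCs concrete, here is a small illustrative sketch. It deliberately exaggerates the situation: the data are pure noise, the "desired" grouping is arbitrary, and PCs are chosen directly by how well they separate that grouping, which is stronger than choosing them by gene loadings. All names and numbers are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes, n_pcs, k = 300, 40, 30, 3

expr = rng.normal(size=(n_cells, n_genes))   # pure noise, no real structure
labels = rng.integers(0, 2, size=n_cells)    # arbitrary "intended" grouping

centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u[:, :n_pcs] * s[:n_pcs]            # cells x n_pcs

# Per-PC separation: |difference in group means| divided by the PC's SD.
diff = scores[labels == 1].mean(axis=0) - scores[labels == 0].mean(axis=0)
separation = np.abs(diff) / scores.std(axis=0)

top_k = np.arange(k)                                   # conventional: first k PCs
picked = np.sort(np.argsort(separation)[::-1][:k])     # cherry-picked PCs

sep_top = separation[top_k].sum()
sep_picked = separation[picked].sum()
print("first k PCs:", top_k, "total separation:", round(float(sep_top), 3))
print("picked PCs: ", picked, "total separation:", round(float(sep_picked), 3))
```

By construction the cherry-picked PCs never separate the arbitrary groups worse than the first k PCs do, even though the data contain no structure at all. How large the gap is depends on the number of candidate PCs and the random seed.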