Question

General question about clustering in scRNAseq

3.5 years ago

Rox ★ 1.4k

I have recently finished an online scRNAseq course. I was a complete beginner in the field, really enjoyed the course, and learnt a lot. Now that I have an overview of single-cell analysis, I have a flood of (maybe dumb) questions that did not occur to me during the course.

I imagine the general scRNAseq workflow like this:

QC + normalization
Dimensionality Reduction (PCA, t-SNE or UMAP)
Clustering
(Optional, if you have several datasets) Data integration
(Optional, if you want to study cell behavior over time) Trajectory inference
Differential Expression analysis.

Correct me if I am wrong.

Among all my questions, the biggest one is this: why cluster at all? Most of our practicals used datasets where every cell was already assigned to a cell type, so while trying different clustering methods and parameters we more or less already had the ground truth from the cell labels. Is clustering meant to help you assign your cells to a cell type? Do we try to find the best clustering parameters that mimic the cell types we already know? And what if you work on a "non-model" species, with no cell atlas and no possibility of data integration? (I imagine most single-cell work is done on mice and humans.)

Then, just to be sure: do we cluster the cells on all the dimensions of the reduced space, or on just a few? We often use only two in our plots, for obvious reasons. But I think Seurat objects are really tricky for a newbie to approach, and it is difficult to explore them and see what is what.

Another thing that bothers me is using clusters for DEA. We were warned that p-values are flawed when doing DE between clusters: the cells were clustered on their expression profiles in the first place, so of course comparing two clusters gives "significant" p-values. So I understand that a very low p-value from a cluster comparison should not be called highly significant, but then how do you know what is significant? Do you also use fold change, for example?

Sorry if I was unclear, I think my thoughts are not organized well on the topic yet :) Thank you for your input.

scRNAseq Seurat clustering • 1.8k views

updated 3.5 years ago by Pratik ★ 1.1k • written 3.5 years ago by Rox ★ 1.4k

score 8 · Answer 1 · 2021-10-22

Hi Rox

I'm just going to answer this one in part: Why clustering exactly?

Do we try to find the best clustering parameters that mimic the cell types we already know?

I usually try to do this first, as a sanity check. For example, if I am working on scRNA-seq data from the pancreas, I will try to get all insulin-producing beta cells (+INS) into their own cluster, and all glucagon-producing alpha cells (+GCG) into their own cluster. I continue this for the "main" cell types of that organ/tissue. In Seurat, I do this with the FeaturePlot() function to visualize where a gene is expressed in UMAP space, and compare that to my clusters by visualizing the clustering in UMAP space as well. In Scanpy, I will use, I think, sc.pl.umap and color by gene and also by the clustering result (leiden, louvain, etc.).

So all in all it may look something like this (I did cluster a little bit more stringently to break apart the beta cells into three clusters rather than one cluster):

[Image: endocrine]

You can see that the ghrelin-producing epsilon cells did not separate into their own cluster with these clustering parameters. I could have used even more stringent parameters to separate them, but all the other clusters would have been affected by the stringency as well (maybe instead of 28 clusters, I would have had 40+ clusters). Sometimes it comes down to how much you want to break apart the clusters, what's realistic, what's representative of the biology, and what your purpose is. I could also have subset the islet of Langerhans cells and clustered them on their own, or subclustered within the whole dataset, but I didn't because this was sufficient for my purposes. My working goal was to characterize the mesenchymal population.
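To make that trade-off concrete, here is a minimal toy sketch in plain Python. It is not Louvain or Leiden: it just splits sorted 1-D values wherever the gap between neighbours exceeds a cutoff, with the cutoff standing in for the resolution parameter. The values and the `gap_clusters` helper are made up for illustration.

```python
def gap_clusters(values, cutoff):
    """Group sorted 1-D values; start a new cluster when the gap exceeds cutoff."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= cutoff:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # gap too large: new cluster
    return clusters

# Hypothetical 1-D "positions" of cells; the last two (86, 87) play the rare population.
cells = [10, 12, 15, 17, 20, 50, 51, 52, 80, 81, 82, 86, 87]

coarse = gap_clusters(cells, cutoff=10)  # loose parameters
fine = gap_clusters(cells, cutoff=2)     # stringent parameters, applied to everything

# Subclustering: re-cluster only the last coarse cluster, keep the rest as-is.
mixed = coarse[:-1] + gap_clusters(coarse[-1], cutoff=2)

print(len(coarse), len(fine), len(mixed))
```

With the loose cutoff the rare cells stay merged with their neighbours. Tightening the cutoff globally separates them but also breaks the first group into three, while re-clustering only the last coarse group separates them and leaves everything else alone.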

Once I have done this sanity check, I go on to characterizing the other cell types by differential gene expression analysis (which is just a fancy way of asking which genes are expressed differently in each cluster compared to all the other clusters, to a group of clusters, or to one other cluster). This helps me name the other cell types. This part can get challenging; however, it is possible to name cell types (and sub-cell types, for that matter) systematically using known markers (and novel markers that you find co-expressed with the known markers; negative as well as positive markers count).
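As a rough illustration of the "one cluster compared to all the others" comparison, a sketch in plain Python could rank genes by the log2 fold change of their mean expression. The counts, cluster labels, and helper names are hypothetical, and this skips the normalization and statistical testing that real tools (for example FindAllMarkers in Seurat or sc.tl.rank_genes_groups in Scanpy) perform.

```python
import math

# Hypothetical toy data: one (cluster label, raw counts per gene) pair per cell.
cells = [
    ("A", {"INS": 31, "GCG": 0, "COL1A1": 0}),
    ("A", {"INS": 31, "GCG": 0, "COL1A1": 0}),
    ("B", {"INS": 0, "GCG": 15, "COL1A1": 0}),
    ("B", {"INS": 0, "GCG": 15, "COL1A1": 0}),
    ("C", {"INS": 0, "GCG": 0, "COL1A1": 7}),
    ("C", {"INS": 0, "GCG": 0, "COL1A1": 7}),
]

def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

def log2_fold_changes(cells, cluster, pseudocount=1.0):
    """log2((mean in cluster + pseudocount) / (mean in all other cells + pseudocount)) per gene."""
    genes = sorted({g for _, counts in cells for g in counts})
    result = {}
    for g in genes:
        inside = [c.get(g, 0) for label, c in cells if label == cluster]
        outside = [c.get(g, 0) for label, c in cells if label != cluster]
        result[g] = math.log2((mean(inside) + pseudocount) / (mean(outside) + pseudocount))
    return result

def top_marker(cells, cluster):
    """Gene with the largest one-vs-rest log2 fold change for this cluster."""
    lfc = log2_fold_changes(cells, cluster)
    return max(lfc, key=lfc.get)

for cluster in ["A", "B", "C"]:
    print(cluster, top_marker(cells, cluster))
```

A strongly positive value suggests a positive marker for that cluster, and a strongly negative one a negative marker, which is the kind of evidence used to put a cell type name on a cluster.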

That can be tedious, but there are packages that will name (or annotate) clusters with cell type names, so you don't have to go through that trouble.

But it kind of depends on your purpose too. If you just want to see whether a gene is expressed in a certain dataset, all you really have to do is load the dataset in your favorite package (e.g. Seurat), run it through some basic pipeline you have created without paying too much mind to the clustering parameters (how many PCs, resolution, etc.), and use FeaturePlot() to see if the gene(s) is/are expressed, just by qualitatively looking for high expression. If it is expressed, you can go back and name the cell types, and say, for example, "Gene(s) W was expressed in cell type X in Y tissue/organ in Z disease".
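The visual check with FeaturePlot() has a simple numeric counterpart: for each labelled group, how many cells express the gene and how strongly. Below is a minimal sketch in plain Python; the counts, labels, and the `expression_by_cluster` helper are hypothetical and only stand in for what a real package would compute on a full dataset.

```python
from collections import defaultdict

# Hypothetical toy data: one (cell type label, raw counts per gene) pair per cell.
cells = [
    ("beta", {"INS": 40}),
    ("beta", {"INS": 35}),
    ("beta", {"INS": 50}),
    ("alpha", {"INS": 0}),
    ("alpha", {"INS": 2}),
    ("alpha", {"INS": 0}),
    ("mesenchymal", {"INS": 0}),
    ("mesenchymal", {"INS": 0}),
]

def expression_by_cluster(cells, gene):
    """Return {label: (fraction of cells with count > 0, mean count)} for one gene."""
    counts = defaultdict(list)
    for label, profile in cells:
        counts[label].append(profile.get(gene, 0))
    return {
        label: (sum(c > 0 for c in values) / len(values), sum(values) / len(values))
        for label, values in counts.items()
    }

summary = expression_by_cluster(cells, "INS")
# Pick the group with the highest mean count as the "Gene W in cell type X" candidate.
best = max(summary, key=lambda label: summary[label][1])
print(best, summary[best])
```

Looking at both numbers helps: a stray low count in one alpha cell gives a non-zero fraction there, but the mean makes it clear where the gene is really expressed.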

This is kind of my current project right now: mining datasets for a pathway's gene expression in certain disease types, so it's somewhat fresh.

Hope this helps!