13 months ago
Chris
Hi Biostars,
When I read articles or tutorials about clustering in single-cell data, I notice that the clusters are clearly separated. However, in my analysis, the clusters for each cell type are not clearly separated, even though I tried different clustering algorithms (Louvain, Leiden) and sctransform for normalization. Would you please comment on this issue? Thank you so much!
Are you doing the clustering in UMAP space? If so, avoid distance-based clustering because distances are only meaningful locally, i.e. distances between clusters are meaningless. I typically use (H)DBSCAN with UMAP. Things you can try to improve the outcome:
1- play with the UMAP parameters: low min_dist and higher n_neighbors tend to make for more concentrated clusters
2- do the clustering with more than 2 UMAP components
Finally, there's nothing wrong with clusters not being perfectly separated. This could be an indication that some cells are, for example, in an intermediate state/transition between types.
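To illustrate the density-based suggestion above, here is a minimal sketch comparing DBSCAN with k-means on curved 2D point clouds. It rests on assumptions: the coordinates are synthetic `make_moons` data standing in for an embedding (not real UMAP output), it uses scikit-learn's `DBSCAN` rather than HDBSCAN, and the `eps` and `min_samples` values are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic 2D coordinates standing in for an embedding (NOT real UMAP output).
coords, truth = make_moons(n_samples=400, noise=0.05, random_state=0)

# Density-based: links points that are chained together through close neighbours.
# eps and min_samples are illustrative values, not recommendations.
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(coords)

# Centroid-based: assigns each point to the nearest of k cluster centres.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)

# Agreement with the known groups (1.0 means a perfect match).
ari_dbscan = adjusted_rand_score(truth, db_labels)
ari_kmeans = adjusted_rand_score(truth, km_labels)

# DBSCAN labels points it cannot assign to any cluster as -1 (noise).
n_noise = int(np.sum(db_labels == -1))

print(f"DBSCAN ARI:  {ari_dbscan:.2f} (noise points: {n_noise})")
print(f"k-means ARI: {ari_kmeans:.2f}")
```

On real data you would replace `coords` with the embedding coordinates. Points labelled -1 are left unassigned, which may be reasonable for cells in a transitional state.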
Wait a minute, we were always told not to trust distance-distorting embedding methods like UMAP or t-SNE when making any serious assumptions about the data, and that obviously includes clustering the cells. That is why Seurat's tutorial relies on PCA for clustering the cells (even if only on the first 30 components by default, that still normally counts as no distortion of the data).
On the other hand, I myself was often tempted to use UMAP as the basis for clustering, especially when cells clustered by PCA end up at different extremities of the UMAP plot. The close distances between cells, which are emphasized by UMAP, most likely do make sense biologically.
UMAP is used to reveal clusters that aren't typically revealed by linear methods such as PCA. Now that you see what you believe are meaningful clusters (e.g. they segregate with some relevant characteristic like phenotype or cell type), why not recover these clusters? Although one needs to be careful, you can use UMAP as a dimensionality reduction step before clustering. See the UMAP docs section on clustering and the SE discussion linked there.
Exactly, I agree, that was also my line of thinking. But I've never seen anyone doing that, so I was a bit hesitant.
Why do you think clusters of cells not clearly separated is a problem?
Just because in the articles and tutorials I read, the clusters are usually clearly separated, so I worry that I did something wrong.
How distinct are the cells in your dataset? In the demo, the cell types differ significantly, leading to clear separation. To select the right parameters, refer to demo p.199.
I'd like to clarify a point made by Jean-Karim. Unless I'm mistaken, UMAP serves as a visualization tool for dimension projection. It doesn't compute distances for clustering methods based on its embeddings. Typically, clustering relies on a predefined number of PCs. UMAP's primary function is to simplify high-dimensional data to 2D or 3D and then label those data points with relevant information, such as clusters in this context. Moreover, HDBSCAN utilizes distances for cluster formation.
I've experimented with Agglomerative Clustering (which bears similarities to HDBSCAN), k-means, Gaussian Mixture Model, Louvain, and Leiden. When clustering cells into the same number of groups, the outcomes are fairly consistent across methods.
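One way to check this kind of consistency yourself is to compute the adjusted Rand index (ARI) between the label sets from different methods. The sketch below is only an illustration under stated assumptions: the data are synthetic, well-separated `make_blobs` points rather than real expression data, only the three methods available in scikit-learn are compared (Louvain and Leiden are left out), and the choices of 10 PCs and 4 clusters are arbitrary.

```python
from itertools import combinations

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a cells x genes matrix (NOT real single-cell data).
X, _ = make_blobs(n_samples=600, n_features=50, centers=4,
                  cluster_std=1.0, random_state=0)

# Reduce to a fixed number of principal components before clustering.
pcs = PCA(n_components=10, random_state=0).fit_transform(X)

k = 4  # same number of groups for every method
labelings = {
    "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pcs),
    "agglomerative": AgglomerativeClustering(n_clusters=k).fit_predict(pcs),
    "gmm": GaussianMixture(n_components=k, n_init=5,
                           random_state=0).fit_predict(pcs),
}

# ARI ignores how clusters are numbered; 1.0 means identical partitions.
pairwise_ari = {
    (a, b): adjusted_rand_score(labelings[a], labelings[b])
    for a, b in combinations(labelings, 2)
}

for (a, b), score in pairwise_ari.items():
    print(f"{a} vs {b}: ARI = {score:.2f}")
```

On real data with closely related cell types, lower pairwise ARI values would suggest that the partition depends on the method rather than on clear structure in the data.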
If your cell types closely resemble each other, and you wish to differentiate cell populations, consider exploring various techniques to recognize the non-linear patterns in your data. While PCA is standard for dimension reduction and excels at identifying linear relationships, gene regulatory networks aren't strictly linear. An option would be the scVI toolkit, which incorporates deep neural networks to capture non-linearity, a strength inherent to DNNs.
Note: vanilla UMAP doesn't require PCA as a prerequisite; it's done in Seurat's tutorials just to speed up the embedding calculation.
Thank you for your reply! I integrated the wild type and knockout samples and then clustered. How can I know how distinct the cells in my dataset are? I don't know why the t-SNE or UMAP in Loupe Browser looks very different from the one I get using Seurat.
t-SNE and UMAP have subtle differences, with UMAP often perceived as better at preserving the global structure. Both methods rely on a stochastic process, so results can vary each time unless the same random state is set.
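The effect of the random state can be checked with a small sketch like the one below. It uses scikit-learn's `TSNE` on synthetic data as an assumption for illustration (umap-learn's `UMAP` class accepts a `random_state` argument in the same way); the `embed` helper is a hypothetical name, and `method="exact"` is chosen only because the dataset is tiny.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small synthetic dataset (NOT real single-cell data).
X, _ = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)


def embed(data, seed):
    """Run t-SNE with a fixed seed and return the 2D coordinates."""
    tsne = TSNE(n_components=2, init="random", method="exact",
                random_state=seed)
    return tsne.fit_transform(data)


emb_a = embed(X, seed=0)
emb_b = embed(X, seed=0)  # same seed as emb_a
emb_c = embed(X, seed=1)  # different seed

same_seed_match = bool(np.allclose(emb_a, emb_b))
diff_seed_match = bool(np.allclose(emb_a, emb_c))

print("same seed gives same coordinates:     ", same_seed_match)
print("different seed gives same coordinates:", diff_seed_match)
```

Two tools that use different seeds, parameters, or preprocessing can therefore produce layouts that look quite different even when the underlying data are the same.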
Regarding cell distinctiveness, are your cells from a homogeneous cell type? Many online tutorials utilize the PBMC dataset, which comprises distinct cell types like T cells, B cells, and so on. If your visualizations aren't showing clear separations, it could be because the cells are closely related. Observing a singular spherical cluster might indicate that the features are too sparse. Just to double-check, I assume you did this already. Have you performed PCA and utilized it for UMAP and clustering? You might consider sharing your plot, as it could be beneficial to receive suggestions or insights from others.
My cells are from a heterogeneous tissue. Yes, I did PCA. What other plots would you like to see?
If your cells are from a tissue, I would imagine some separation though. Have you tried different UMAP parameters, e.g., spread and min_dist?
No, I haven't. I just ran it with the defaults. bk11 said it is not a problem if cells are not clearly separated. Is it possible that errors in my code are keeping the clusters from separating clearly?
Without access to your code, it's challenging to determine the exact issue. Nonetheless, I recommend revisiting the page I initially shared and experimenting with UMAP's hyperparameters to see if that enhances the separation. As bk11 mentioned, a lack of clear distinction might not indicate an actual problem; adjusting the UMAP hyperparameters could improve the visualization.
Thank you for sharing. The material is about 300 pages.
I specified the page 199 for you.
I tried spread and min_dist but the UMAP didn't look better. Thank you.
Hi, I am facing the same problem. Can you tell me how to solve it?
Try a few resolutions. If the clusters are still not clearly separated, I think it is the nature of your data.
An interesting pointer to the scVI toolkit, thanks! It seems to be getting a lot of traction lately.