Question

Which embedding for clustering of cells in single-cell RNAseq?

18 months ago

e.r.zakiev ▴ 250

There are multiple posts (1, 2, 3, 4) on this website which tangentially touch on this question, but I haven't found any that ask this directly: would you flirt with the idea of using UMAP, t-SNE, Diffusion Map, ForceAtlas2, ICA, or any other low-dimensional embedding as the basis for cell clustering (whatever the downstream clustering method: k-means, sNN, Louvain, etc.)?

More often than not in my analyses, with the standard Seurat pipeline of cluster definition via PCA -> kNN -> Louvain, the downstream UMAP embedding places cells from the same PCA-derived cluster at two opposite extremities of the 2D UMAP representation. This happens even though the UMAP makes sense when you look at known marker genes (for example, there are patterns of gradually decreasing expression, or patches of locally concentrated expressing cells).

We are always told that we should use PCA for cell clustering because it doesn't distort the Euclidean distances between the cells, unlike all the methods mentioned above (except for ICA). But what if I LIKED my UMAP embedding more than the PCA? What if I thought that it had done a better job at emphasizing closer distances between the cells (see the last sentence of the previous paragraph), which probably correspond to cell states/cell types?

tSNE ICA clustering PCA UMAP • 929 views

score 1 · Answer 1 · 2023-10-18

I've been facing a similar challenge, and here are my thoughts on the points you brought up. I'm really interested in discussing this further and hearing what others think.

Low-dimensional embedding for visualization purposes: You mentioned that UMAP sometimes places cells from the same PCA-derived cluster at opposite ends of the UMAP plot. The reverse can happen too: when we plot cells in 2D they might look like they belong together, but once we consider higher dimensions such as the 3rd, 4th, or 5th, they can be quite distant from each other. That's why we often need many principal components (like 50-100) to get a good grasp of the original data layout. UMAP or t-SNE comes in handy over a simple 2D PCA plot because they compress neighborhood relationships from many dimensions into two, although they preserve local structure more faithfully than global distances, so positions of far-apart groups on the plot should be read with caution.
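To make the "close in 2D, distant in higher dimensions" point concrete, here is a minimal sketch on synthetic data (not real cells). The matrix, the group sizes, and the helper `centroid_distance` are all made up for illustration; PCA is done with a plain SVD rather than any particular single-cell package.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_features = 200, 10

# Synthetic data: two groups of "cells" (purely illustrative values).
X = rng.normal(size=(2 * n_per_group, n_features))
X[:, :2] *= 10.0               # two high-variance axes shared by both groups
X[:n_per_group, 2] += 3.0      # the groups differ only along a third axis
X[n_per_group:, 2] -= 3.0
group = np.repeat([0, 1], n_per_group)

# PCA via SVD of the centered matrix; columns of `scores` are PC1, PC2, ...
Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
scores = U * S


def centroid_distance(n_pcs):
    """Euclidean distance between the two group centroids in the first n_pcs PCs."""
    a = scores[group == 0][:, :n_pcs].mean(axis=0)
    b = scores[group == 1][:, :n_pcs].mean(axis=0)
    return float(np.linalg.norm(a - b))


d_2pc = centroid_distance(2)
d_5pc = centroid_distance(5)
print(f"centroid distance using 2 PCs: {d_2pc:.2f}")
print(f"centroid distance using 5 PCs: {d_5pc:.2f}")
```

In this toy setup the two groups overlap on a PC1/PC2 scatter plot because their separation sits on a lower-variance axis, and it only becomes visible once later components are included.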

Low-dimensional embedding for clustering: The goal here is to use the embedding to measure the "distance" between cells, which tells us how similar they are in transcriptional state. I ran a small test with various popular clustering methods on my own data, and the results were quite consistent across methods. Among the clustering methods, Leiden and Louvain might be more biologically relevant in this scenario. However, I think the representation used for calculating "distance" is the crucial choice. A concern with PCA is that it only captures linear relationships, whereas gene interactions in regulatory networks can be nonlinear. scVI, for instance, uses a variational autoencoder (VAE) to capture these nonlinear relationships, and distances between cells can then be measured in its latent space.
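One way to see how much the choice of representation changes the "distance" picture is to compare each cell's nearest neighbours in two representations, since graph-based clustering starts from that neighbour graph. Below is a rough sketch using only NumPy; `knn_indices` and `mean_knn_jaccard` are hypothetical helper names, the inputs are random stand-ins for something like PCA scores and a 2D embedding, and the dense distance matrix is only practical for small datasets.

```python
import numpy as np


def knn_indices(rep, k):
    """Indices of the k nearest neighbours (Euclidean) of each row, excluding the row itself."""
    sq = np.sum(rep ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * rep @ rep.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                        # a cell is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]


def mean_knn_jaccard(rep_a, rep_b, k=10):
    """Average Jaccard overlap between each cell's kNN sets in two representations."""
    nn_a, nn_b = knn_indices(rep_a, k), knn_indices(rep_b, k)
    overlaps = []
    for a, b in zip(nn_a, nn_b):
        sa, sb = set(a.tolist()), set(b.tolist())
        overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps))


# Stand-in data: rows are cells, columns are embedding dimensions (illustrative only).
rng = np.random.default_rng(1)
rep_many_dims = rng.normal(size=(100, 20))   # e.g. many principal components
rep_unrelated = rng.normal(size=(100, 2))    # e.g. a 2D layout unrelated to the first

same_overlap = mean_knn_jaccard(rep_many_dims, rep_many_dims)
unrelated_overlap = mean_knn_jaccard(rep_many_dims, rep_unrelated)
print(f"overlap with itself:             {same_overlap:.2f}")
print(f"overlap with unrelated embedding: {unrelated_overlap:.2f}")
```

A value near 1 means the two representations would hand nearly the same neighbour graph to Louvain or Leiden; a low value means the clusters can differ substantially even with the same clustering algorithm.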

So, circling back, I would experiment with different dimension reduction techniques, both PCA-based and VAE-based, and check whether the resulting clusters align well with the biological interpretation, for example with known marker genes.
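If you want a number for how well two sets of cluster labels agree (say, clusters from a PCA-based run versus a VAE-based run), the adjusted Rand index is a common choice. Here is a small standard-library sketch; the function name and the toy labels are illustrative only, and this measures agreement between clusterings, not biological correctness.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two label assignments over the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # nothing to disagree about

    # Count co-occurring label pairs and the cluster sizes in each clustering.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                          # degenerate case, e.g. one cluster each
    return (sum_joint - expected) / (max_index - expected)


# Toy labels: same partition with different names, then a crossed partition.
ari_same = adjusted_rand_index([0, 0, 1, 1, 2, 2], ["x", "x", "y", "y", "z", "z"])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_same, ari_crossed)
```

Identical partitions score 1 regardless of how the clusters are named, and values around 0 or below indicate agreement no better than chance.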