Hi all,
What do you think about the feature selection step in single-cell RNA-seq analysis? I am familiar with Seurat and the 10X platform (Cell Ranger). I've noticed that Seurat's default is to use 2000 HVGs (Highly Variable Genes) for dimensionality reduction (PCA/tSNE/UMAP), whereas Cell Ranger's default is to use all features (genes).
In terms of computational efficiency, feature selection reduces the number of dimensions and thus speeds up the calculations. However, it also rests on a prior assumption: that the variability of the selected genes reflects biological differences between the cells rather than technical noise (see 6.3.1.1 Highly Variable Genes). Personally, I do see the advantages of feature selection, which is also a basic concept in machine learning.
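To make the idea concrete, here is a minimal Python sketch of variance-based feature selection. It is not Seurat's actual procedure (Seurat's default `vst` method models the mean-variance trend before ranking); it simply ranks genes by their variance-to-mean ratio. The function name `top_variable_genes` and the toy count matrix are made up for illustration.

```python
from statistics import mean, pvariance

def top_variable_genes(counts, gene_names, n_top):
    """Rank genes by variance / mean (Fano factor) and return the top n_top names."""
    scores = []
    for j, name in enumerate(gene_names):
        column = [row[j] for row in counts]
        mu = mean(column)
        # Genes that are never detected get the lowest possible score.
        score = pvariance(column) / mu if mu > 0 else 0.0
        scores.append((score, name))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scores[:n_top]]

# Toy matrix: 4 cells (rows) x 3 genes (columns), hypothetical counts.
counts = [
    [10, 5, 0],
    [0, 5, 1],
    [12, 5, 0],
    [0, 5, 1],
]
genes = ["geneA", "geneB", "geneC"]
selected = top_variable_genes(counts, genes, n_top=2)
print(selected)
```

In this toy case geneB is constant across cells, so it is the one that gets dropped. Whether high variability means biology or noise is exactly the assumption mentioned above; the code cannot tell the two apart.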
However, there are several questions that have been bothering me for a while:
(1) How many features should we choose? How do you test this?
(2) What if the clustering results based on all features and on 2000 HVGs are very different? I once saw a dataset with two samples (WT/MUT) where using 2000 HVGs showed no "batch effect" (cells from the two samples were mixed, with no separation by sample), whereas using all genes showed a strong "batch effect" (cells clearly clustered by WT/MUT). This could definitely affect the decision of whether or not to apply batch correction (Seurat integration, MNN, etc.).
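One way to put a number on "mixed" versus "separated" in question (2) is to ask, for each cell, what fraction of its nearest neighbours come from the same sample, and to compare that between the all-genes and the HVG embeddings. Below is a rough Python sketch of that idea; the function name, the choice of k, and the toy 2-D coordinates are my own assumptions, not something Seurat or Cell Ranger reports.

```python
import math

def same_sample_neighbor_fraction(embedding, labels, k):
    """Mean fraction of each cell's k nearest neighbours that share its sample label."""
    fractions = []
    for i, point in enumerate(embedding):
        # Euclidean distance from cell i to every other cell, sorted ascending.
        others = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embedding)
            if j != i
        )
        neighbours = [j for _, j in others[:k]]
        same = sum(labels[j] == labels[i] for j in neighbours)
        fractions.append(same / k)
    return sum(fractions) / len(fractions)

labels = ["WT", "WT", "MUT", "MUT"]
# Hypothetical 2-D embeddings of the same four cells under two feature sets.
separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
mixed = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

score_separated = same_sample_neighbor_fraction(separated, labels, k=1)
score_mixed = same_sample_neighbor_fraction(mixed, labels, k=1)
print(score_separated, score_mixed)
```

A score close to 1 means cells sit mostly next to cells from their own sample; a score close to that sample's share of all cells suggests good mixing. Note that a high score by itself does not say whether the separation is technical or a real WT/MUT difference.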
Any thoughts, opinions, or suggestions would be much appreciated.
Thank you!
In my opinion: 1) feature selection is not necessary; 2) you should always validate your clustering with known markers to check whether the clustering is reasonable, and adjust your parameters accordingly; 3) PCA is just a linear transformation. It won't give you accurate results for complicated datasets.
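As a rough illustration of point 2), one simple check is to average a known marker's expression within each cluster and see whether it is confined to the clusters where you expect it. This is only a sketch in plain Python; the function name, the marker values, and the cluster labels are hypothetical.

```python
from collections import defaultdict

def mean_expression_by_cluster(expression, clusters):
    """Average one gene's expression within each cluster."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for value, cluster in zip(expression, clusters):
        totals[cluster] += value
        sizes[cluster] += 1
    return {cluster: totals[cluster] / sizes[cluster] for cluster in totals}

# Hypothetical normalised expression of one marker gene in six cells,
# together with the cluster each cell was assigned to.
marker_expression = [4.0, 5.0, 0.0, 1.0, 0.0, 0.0]
cluster_ids = ["c1", "c1", "c2", "c2", "c3", "c3"]

means = mean_expression_by_cluster(marker_expression, cluster_ids)
print(means)
```

If a marker that should define one cell type is spread evenly over several clusters, that is a hint to revisit the clustering parameters or the feature set.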
Hi, thanks for the suggestions!