Is feature selection necessary before dimensional reduction in single-cell analysis?
5.2 years ago
Phoe ▴ 20

Hi all,

What do you think about the feature selection step in single-cell RNA-seq analysis? I am familiar with Seurat and the 10X platform (Cell Ranger). I've noticed that Seurat by default uses 2,000 HVGs (highly variable genes) for dimensionality reduction (PCA/t-SNE/UMAP), whereas Cell Ranger by default uses all features (genes).

In terms of computational efficiency, feature selection reduces the number of dimensions and thus speeds up the calculations. However, it also rests on a prior assumption: that the high variability of the selected genes reflects biological differences between cells rather than technical noise (see 6.3.1.1 Highly Variable Genes). Personally, I do see the advantages of this feature selection approach, which is also a basic concept in machine learning.
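
To make the idea concrete, here is a minimal sketch of variance-based feature selection on a cells x genes count matrix. It is not Seurat's own procedure; the function name `select_hvgs`, the normalization to 10,000 counts per cell, and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def select_hvgs(counts, n_top=2000):
    """Return sorted indices of the n_top genes with the highest variance
    after library-size normalization and log1p (a simplified criterion)."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0              # avoid division by zero for empty cells
    norm = np.log1p(counts / lib_size * 1e4)   # scale each cell to 10,000 counts
    gene_var = norm.var(axis=0)
    n_top = min(n_top, counts.shape[1])
    return np.sort(np.argsort(gene_var)[::-1][:n_top])

# Toy data (assumed): 300 cells x 500 genes; the first 20 genes
# are expressed 10x higher in the first 150 cells.
rng = np.random.default_rng(0)
means = np.full((300, 500), 5.0)
means[:150, :20] = 50.0
counts = rng.poisson(means)

hvg_idx = select_hvgs(counts, n_top=50)
n_found = int(np.isin(np.arange(20), hvg_idx).sum())
print(f"{n_found} of the 20 group-specific genes are among the selected HVGs")
```

Note that a plain variance ranking like this cannot tell biological from technical variability, which is exactly the assumption mentioned above.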

However, several questions have haunted me for a while:

(1) How many features should we choose? How do you test this?

(2) What if the clustering results based on all features and on 2,000 HVGs are very different? I once saw a dataset with two samples (WT/MUT) that showed no "batch effect" with 2,000 HVGs (cells were mixed, with no correlation to sample) but a strong "batch effect" with all genes (cells clearly clustered by WT/MUT). This could definitely affect the decision of whether or not to do batch correction (Seurat integration, MNN, etc.).
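
One way to put a number on the situation in (2) is to ask, for each cell, what fraction of its nearest neighbors in PCA space come from the same sample. The sketch below does this for HVGs and for all genes. Everything here is an assumption for illustration: the helper names, the choice of 10 PCs and 15 neighbors, and a toy matrix deliberately built so that the sample difference is spread thinly over many low-variance genes.

```python
import numpy as np

def pca_scores(x, n_pcs=10):
    """Project centered data onto its top principal components (via SVD)."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]

def same_sample_fraction(embedding, sample, k=15):
    """Average fraction of each cell's k nearest neighbors from its own sample."""
    sq = (embedding ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)               # a cell is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]
    return float((sample[nn] == sample[:, None]).mean())

# Toy log-expression matrix (assumed): 200 cells x 1000 genes.
rng = np.random.default_rng(1)
n_cells, n_genes = 200, 1000
sample = np.repeat([0, 1], n_cells // 2)       # 0 = WT, 1 = MUT
cell_type = np.arange(n_cells) % 2             # two cell types, present in both samples
x = rng.normal(size=(n_cells, n_genes))
x[cell_type == 1, :50] += 3.0                  # strong cell-type signal in 50 genes
x[sample == 1, 50:] += 0.6                     # weak sample shift spread over 950 genes

hvg = np.sort(np.argsort(x.var(axis=0))[::-1][:50])
frac_hvg = same_sample_fraction(pca_scores(x[:, hvg]), sample)
frac_all = same_sample_fraction(pca_scores(x), sample)
print(f"same-sample neighbor fraction: HVGs={frac_hvg:.2f}, all genes={frac_all:.2f}")
```

With two equally sized samples, a value near 0.5 means the samples are well mixed and a value near 1 means cells mostly sit next to cells from their own sample. The metric only describes the mixing; it does not say whether the difference is biological or technical.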

Any thoughts, opinions, or suggestions would be much appreciated.

Thank you!

Tags: scRNA-seq, feature selection, single cell

In my opinion: 1) feature selection is not necessary; 2) you should always validate your clustering with known markers to check whether it is reasonable, and adjust your parameters accordingly; 3) PCA is just a linear transformation, so it won't give you accurate results for complicated datasets.
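
As a rough sketch of point 2), one could tabulate the mean expression of known markers per cluster and check that each cluster lights up for the markers it should. The function name, the gene names `MARKER_A` and `MARKER_B`, and the tiny matrix are all hypothetical.

```python
import numpy as np

def marker_means_by_cluster(expr, clusters, gene_names, markers):
    """Return {cluster: {marker: mean expression}} for the requested marker genes."""
    name_to_col = {g: i for i, g in enumerate(gene_names)}
    out = {}
    for c in np.unique(clusters):
        rows = clusters == c
        out[int(c)] = {m: float(expr[rows, name_to_col[m]].mean()) for m in markers}
    return out

# Hypothetical normalized expression: 4 cells x 2 genes, two clusters.
expr = np.array([[5.0, 0.1],
                 [4.0, 0.0],
                 [0.2, 3.0],
                 [0.0, 4.0]])
clusters = np.array([0, 0, 1, 1])
gene_names = ["MARKER_A", "MARKER_B"]

summary = marker_means_by_cluster(expr, clusters, gene_names, ["MARKER_A", "MARKER_B"])
for cluster_id, means in summary.items():
    print(cluster_id, means)
```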

Hi, thanks for the suggestions!

5.2 years ago
Mensur Dlakic ★ 28k

A simple response is NO.

However, depending on the number of starting features and the exact method used for dimensionality reduction, it may be helpful to select features beforehand. Some of the arguments I will present here briefly have been made already in this post.

Dimensionality reduction methods differ both in how they handle the number of features and in the linear or non-linear nature of the new features. For example, PCA is reproducible (deterministic) and very fast, even for a large number of features. Once you fit a PCA model, it can be applied to new data that are in the same format as the original dataset. t-SNE is not perfectly reproducible and is relatively slow with a large number of features (anything over 30-50) or samples (>50,000), but it produces more informative embeddings and more intuitive clusters when there are non-linear relationships between features. UMAP scales well to large numbers of features and samples, and can capture non-linear relationships between features.
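
To illustrate what "fit once, apply to new data" means for PCA, here is a small NumPy sketch. The class name `SimplePCA` and the random data are assumptions for illustration; in practice you would likely use an existing PCA implementation.

```python
import numpy as np

class SimplePCA:
    """Minimal PCA via SVD; keeps the mean and components so it can be reused."""

    def fit(self, x, n_components):
        self.mean_ = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean_, full_matrices=False)
        self.components_ = vt[:n_components]   # rows are principal axes
        return self

    def transform(self, x):
        # New data must have the same features (columns) as the training data.
        return (x - self.mean_) @ self.components_.T

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 30))
new = rng.normal(size=(20, 30))

model_a = SimplePCA().fit(train, n_components=5)
model_b = SimplePCA().fit(train, n_components=5)   # refit on the same data

new_scores = model_a.transform(new)
same_result = np.allclose(new_scores, model_b.transform(new))
print(new_scores.shape, same_result)
```

Refitting on the same matrix gives the same projection, and the stored mean and components are all that is needed to project cells the model has never seen.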

A more elaborate answer to your original question: 1) PCA is fast and does not require feature selection, but it will not produce informative plots in some complex cases; 2) as a pre-processing step for t-SNE, it is advisable to run PCA on the data and keep 30-50 principal components as output if the number of original features is >>50; t-SNE will still be relatively slow, and it is non-parametric (models can't be saved and applied to new datasets); 3) UMAP is slower than PCA but much faster than t-SNE, and it works well for large datasets; its models can also be saved. Even more to the point: PCA is the fastest but does not always give a clear cluster structure; t-SNE is the slowest but often gives the most visually pleasing result; UMAP is somewhere between the two in terms of both speed and visualization.
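
Point 2) could look roughly like this with scikit-learn. This is only a sketch under assumptions: the random matrix stands in for a normalized expression matrix, and the choice of 50 components and perplexity 30 is illustrative rather than a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for a normalized cells x genes matrix (assumed toy data).
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1000))
x[:150, :30] += 3.0                            # two groups of cells

# Step 1: compress 1000 features to 50 principal components.
pcs = PCA(n_components=50, random_state=0).fit_transform(x)

# Step 2: run t-SNE on the PCs instead of on the raw features.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(pcs)

print(pcs.shape, emb.shape)
```

scikit-learn's `TSNE` only offers `fit_transform`, with no separate `transform` for new cells, which matches the non-parametric caveat above.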

Thank you. I think 2) is worth noting, but since we are working with single-cell data, the number of features (here, genes) is frequently >>50, right? Also, in your experience, if cells from different clusters (e.g., cluster 1 and cluster 2) overlap in the t-SNE plot but are clearly separated in the UMAP, with both computed from the same PCs, how would you explain this? What else would you check? Does it mean that the t-SNE plot can't explain much of this complicated data?

5.2 years ago
igor 13k

"I've noticed that Seurat by default uses 2,000 HVGs (highly variable genes) for dimensionality reduction (PCA/t-SNE/UMAP), whereas Cell Ranger by default uses all features (genes)."

Since you mentioned Seurat specifically, according to the Seurat developers (GitHub):

"we typically do not notice large differences in the analysis depending on the exact number of genes selected- ranging from 2k genes to even the full transcriptome"

Thanks! I've also seen this, although the user there was asking about Seurat's integration method. The developers replied:

"Indeed, the number of features used does affect the output of all single-cell analyses (including clustering, integration, pseudotime, etc.). Unfortunately we can't advise on the exact value to choose, but agree that the sensitivity of some analyses to this parameter can be frustrating. Our best suggestion is to use the SCTransform workflow, which weights genes in downstream analysis based on their biological variation. As a result, adding more genes into the analysis makes less of a difference, because they have lower weights. As a result, we find that the results exhibit less sensitivity based on the number of features included."

In that same issue, although the results differ somewhat with different numbers of genes, it's not clear which version is better. It's possible that both representations are equally inaccurate.
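
If you want to measure how much the result moves with the number of genes, one option is to compare each cell's nearest neighbors in PCA space across gene-set sizes. The sketch below is an assumption-laden illustration (hypothetical helper names, toy data, 10 PCs, 15 neighbors), and a high overlap only shows that two choices agree, not that either one is right.

```python
import numpy as np

def knn_indices(x, n_pcs=10, k=15):
    """Indices of each cell's k nearest neighbors in the space of the top PCs."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    z = u[:, :n_pcs] * s[:n_pcs]
    sq = (z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    np.fill_diagonal(d2, np.inf)               # exclude the cell itself
    return np.argsort(d2, axis=1)[:, :k]

def mean_jaccard(nn_a, nn_b):
    """Mean Jaccard overlap between two sets of per-cell neighbor lists."""
    scores = []
    for a, b in zip(nn_a, nn_b):
        a, b = set(a.tolist()), set(b.tolist())
        scores.append(len(a & b) / len(a | b))
    return float(np.mean(scores))

# Toy log-expression matrix (assumed): 200 cells x 1000 genes, two cell groups.
rng = np.random.default_rng(2)
x = rng.normal(size=(200, 1000))
x[:100, :50] += 2.0

order = np.argsort(x.var(axis=0))[::-1]        # genes ranked by variance
reference = knn_indices(x)                     # neighbors using all genes
overlap = {}
for n_top in (50, 200, 500, 1000):
    cols = np.sort(order[:n_top])
    overlap[n_top] = mean_jaccard(knn_indices(x[:, cols]), reference)

for n_top, score in overlap.items():
    print(f"top {n_top} genes: neighbor overlap with all genes = {score:.2f}")
```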

Found another description similar to what Igor provided.

"Generally, we find that 2-3K genes tend to work well for most datasets that we analyze (and that's what we use in all vignettes)."

https://github.com/satijalab/seurat/issues/1989

5.2 years ago
Phoe ▴ 20

Hi all, I found this review very informative; it covers some critical points in unsupervised clustering.
