I am looking for opinions (based on hands-on experience) on your favourite feature selection (followed by dimensionality reduction) method for 10X-based scRNA-seq data. The motivation is that I recently stumbled over the GLM-PCA approach from Rafael Irizarry's lab (links at the bottom of the post), which made me dive into the literature. As expected, there are plenty of methods out there, each claiming superior performance. Since GLM-PCA operates on raw counts, it frees the user from choosing one of the many normalization strategies, such as those implemented in scran or the choices provided by Seurat. This is admittedly not a precise question (hence the Forum post), and I hope to start a discussion here about your current best practices that users inexperienced in the single-cell world (including myself) can take inspiration from.
Hi, thanks for your interest in GLM-PCA (I'm one of the authors). First of all, GLM-PCA is a dimension reduction method meant to be as similar to PCA as possible, but using a count-based likelihood (or loss function) instead of the implicit normal likelihood of PCA. Since you seem to be mostly interested in feature selection (i.e. identifying highly informative genes), I encourage you to check out our R package scry (soon to be submitted to Bioconductor), which includes feature selection based on deviance as an alternative to the more traditional "highly variable genes" approach. As you mention, it operates on raw UMI counts, so there is no need for normalization, and a recent comparison by an independent research group showed that it performs well against competing methods. The scry package also includes a null residuals transformation (similar to the sctransform method from Hafemeister et al.) whose output can be fed directly to traditional PCA instead of normalized counts. The null residuals are basically a rough approximation to GLM-PCA that is much faster to compute. Alternatively, if you have another normalization/dimension reduction scheme in mind, you can just use the deviance feature selection to choose, say, the top 2,000 genes and then do whatever you like with those. As a side note, we are actively working to improve the scalability and numerical stability of the GLM-PCA optimization routine, so stay tuned for those updates in the future.
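For readers who want to see what deviance-based feature selection amounts to, here is a minimal NumPy sketch. It is not the scry implementation: the function names binomial_deviance and top_deviance_genes are made up for illustration, the input is assumed to be a dense cells-by-genes matrix of raw UMI counts, and the null model is assumed to be a binomial in which each gene takes the same fraction of every cell's total counts. Genes that deviate most from that constant-expression null get the highest deviance.

```python
import numpy as np


def binomial_deviance(counts):
    """Per-gene binomial deviance against a constant-expression null.

    counts: dense (cells x genes) array of raw UMI counts (assumed layout).
    Returns one deviance value per gene; larger means more informative.
    """
    y = np.asarray(counts, dtype=float)
    n = y.sum(axis=1, keepdims=True)   # total UMIs per cell, shape (cells, 1)
    pi = y.sum(axis=0) / n.sum()       # null relative abundance per gene
    mu = n * pi                        # expected counts under the null
    rest = n - y                       # counts in the cell not from this gene
    # Terms with a zero count contribute 0 (limit of x*log(x) as x -> 0),
    # so the warnings raised by log(0) and 0/0 are silenced and masked out.
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(y > 0, y * np.log(y / mu), 0.0)
        t2 = np.where(rest > 0, rest * np.log(rest / (n - mu)), 0.0)
    return 2.0 * (t1 + t2).sum(axis=0)


def top_deviance_genes(counts, k=2000):
    """Column indices of the k genes with the highest deviance."""
    dev = binomial_deviance(counts)
    k = min(k, dev.size)
    return np.argsort(dev)[::-1][:k]
```

The selected columns of the raw count matrix can then be passed to whatever normalization or dimension reduction you prefer. For real data sets a sparse-matrix version would be needed, since this sketch densifies the whole matrix.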
I have seen the benefits of GLM-PCA, and I believe there are at least some scenarios where it does perform better. However, does it actually uncover new biological insights? Many single-cell methods make significant improvements on some metrics and look impressive on paper, but very few would actually change the conclusions that were based on classic techniques.
Personal anecdote: I tried not normalizing the data at all and expected completely nonsensical results. However, the major populations still clearly segregated.
That is an interesting observation indeed. Have you tried it with n > 1 to see if it is widely applicable?
My expectation is that you'd see fairly significant sample-to-sample effects with zero normalization, but would be interested in seeing if that's actually true.
That may be true. I normally see sample-to-sample effects regardless of normalization (without some sort of batch-correction methods like CCA/MNN/etc).
I think it also depends on the sample. Normal PBMCs are fairly consistent between samples without batch correction through standard pipelines, assuming they're processed fairly close to each other in time by the same person. Disease samples are a different story though.
Thanks will.townes for the pointer to the scry package. Will try.