I'm working on a project involving analyzing scRNA-seq data. A large part of the project involves clustering cells, identifying DE genes between clusters, pathway analysis of the DE genes, etc. To do the analysis, I am planning to use Seurat. By default, Seurat uses the graph-based Louvain algorithm to cluster cells. So that would seem to indicate that it is important that the 2D embedding generated by t-SNE or UMAP is as accurate as possible so that the clusters are also maximally accurate.
Prior to doing t-SNE or UMAP, Seurat's vignettes recommend doing PCA to perform an initial reduction in the dimensionality of the input dataset while still preserving most of the important data structure. Seurat is definitely not the only pipeline to do this; it seems to me that most analysis pipelines use PCA prior to t-SNE / UMAP basically like Seurat does. However, it also seems to me that ICA is generally better at dividing cells based on the activation of gene modules than PCA. This seems to me to make sense in principle - i.e. gene modules behave more like independent gene combinations (as modeled by ICA) than orthogonal gene combinations (as modeled by PCA) - and also in practice - i.e. I've read a few papers presenting empirical evidence that ICA is better than PCA for differentiating cells based on gene module activation. Assuming this is correct, would it make more sense to use ICA rather than PCA to do the pre-t-SNE / UMAP dimensionality reduction? Or is there a compelling reason that most people seem to use PCA for this that I am simply unaware of?
I think the main reason dimensionality reduction is performed before t-SNE is the poor performance of t-SNE on high-dimensional data (this could be due to the difficulty of finding the right parameters in such situations). UMAP seems better in this respect, but this is anecdotal. This paper compares t-SNE and UMAP on single cell data. I would also add that UMAP can do metric learning: it can learn a projection that best separates annotated samples, which can then be used to project unannotated samples into the same space. This can be quite useful if one has annotated samples.
I understand why dimensionality reduction is done prior to t-SNE, I'm just curious if there's a reason that people chose to use PCA to perform this initial round of dimensionality reduction rather than ICA.
The choice is typically guided by some assumptions about the data and by the goal of the transformation. PCA assumes that the only relevant components to explain the variability in the data are the uncorrelated ones. This generally works well for multivariate Gaussian distributions because in that case uncorrelated also means statistically independent. ICA assumes the data to be generated by statistically independent non-Gaussian sources (it's a form of blind source separation, like NMF) and tries to identify them by minimizing the statistical dependence between the components. Why use ICA for non-Gaussian sources? Because in this case uncorrelated doesn't imply statistically independent, so PCA wouldn't necessarily recover the desired components. The downside of ICA is that there's no ranking of the components (i.e. unlike in PCA, there's no relationship between an ICA with k-1 components and one with k components).
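To make the "uncorrelated is not the same as independent" point concrete, here is a minimal NumPy sketch. The sample size, the choice of uniform sources, and the helper names (`mix_orthogonally`, `corr`) are my own assumptions for illustration and have nothing to do with Seurat. It mixes two independent unit-variance sources with an orthogonal matrix, then compares the correlation of the mixed variables with the correlation of their squares; a clearly non-zero correlation between the squares signals statistical dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # assumed sample size, large enough for stable estimates


def mix_orthogonally(sources):
    """Mix two unit-variance sources with a fixed orthogonal 2x2 matrix."""
    c = np.sqrt(0.5)
    mixing = np.array([[c, c], [c, -c]])
    return sources @ mixing.T


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Independent unit-variance sources: uniform (non-Gaussian) and normal (Gaussian).
source_sets = {
    "uniform": rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2)),
    "gaussian": rng.standard_normal(size=(n, 2)),
}

results = {}
for name, sources in source_sets.items():
    mixed = mix_orthogonally(sources)
    results[name] = {
        # Linear correlation: the only thing PCA looks at.
        "linear": corr(mixed[:, 0], mixed[:, 1]),
        # Correlation of squares: a simple probe for higher-order dependence.
        "squares": corr(mixed[:, 0] ** 2, mixed[:, 1] ** 2),
    }
    print(name, results[name])
```

With Gaussian sources both numbers should be close to zero. With uniform sources the mixed variables are still uncorrelated, but their squares are noticeably negatively correlated, and that higher-order structure is what ICA exploits and PCA cannot see.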
Ok, makes sense, thank you!
I am not sure if I understand this well. Could you explain this sentence a little bit further?
It seems to me that one of the major differences between PCA and ICA is their assumption about the Gaussian distribution of the components and the resulting definition of independence. Could you elaborate in terms of RNA-Seq? For example, which assumption makes more sense for RNA-Seq?
PCA finds orthogonal directions along which the variance is maximal. Orthogonality means that the resulting components are uncorrelated. If the data is Gaussian, lack of correlation implies statistical independence, but PCA itself doesn't make any assumption about the distribution of the data. ICA tries to model the data as a sum of statistically independent non-Gaussian components by finding directions that maximize the "non-Gaussianity" of the data. Therefore the recovered directions don't have to be orthogonal. Which approach to use depends on what you're trying to do and what's important in your context. People sometimes find that PCA components are hard to interpret (they are linear combinations of the original variables), and sometimes NMF or ICA can give more interpretable factors. Maybe this review on the use of ICA on omics data will be helpful.
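If it helps, here is a toy two-dimensional sketch of the difference, using NumPy only. Everything in it is an assumption made for illustration: the uniform sources, the mixing matrix, and the brute-force search over rotation angles, which is a simplified stand-in for what a real ICA implementation such as FastICA does. PCA is computed from the eigendecomposition of the covariance, and the "ICA" step rotates the whitened data to the angle where the components have the largest absolute excess kurtosis, i.e. are least Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Two independent, unit-variance, non-Gaussian (uniform) sources.
sources = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])  # arbitrary non-orthogonal mixing
observed = sources @ mixing.T

# PCA: center, project on covariance eigenvectors, then whiten the scores.
centered = observed - observed.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pca_scores = centered @ eigvecs
whitened = pca_scores / np.sqrt(eigvals)


def excess_kurtosis(x):
    """Column-wise excess kurtosis (0 for a Gaussian)."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    return (z ** 4).mean(axis=0) - 3.0


def rotate(data, angle):
    """Rotate 2-D data by the given angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return data @ np.array([[c, -s], [s, c]])


# Toy ICA: choose the rotation whose components are the least Gaussian.
angles = np.linspace(0.0, np.pi / 2, 181)
contrast = [np.abs(excess_kurtosis(rotate(whitened, a))).sum() for a in angles]
ica_components = rotate(whitened, angles[int(np.argmax(contrast))])


def best_match(components, truth):
    """For each true source, the largest |correlation| with any component."""
    k = truth.shape[1]
    cross = np.corrcoef(components.T, truth.T)[:k, k:]
    return np.abs(cross).max(axis=0)


pca_match = best_match(pca_scores, sources)
ica_match = best_match(ica_components, sources)
print("PCA best |corr| with sources:", pca_match)
print("ICA best |corr| with sources:", ica_match)
```

In this setup the principal components are uncorrelated but each one is still a blend of both sources, while the rotated components line up with the original sources up to sign and order.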
Thanks for explaining. Really helpful. I understand that an orthogonal transformation of Gaussian-distributed data ensures uncorrelatedness as well as independence. Last question: taking RNA-Seq as an example, does this Gaussian distribution refer to the expression of numerous genes in a specific biological process (pathway) or to the expression of a specific gene across samples?
I am trying to get a rough idea of whether PCA or ICA makes more sense to perform molecular subtyping of tumor samples based on expression data (RNA-Seq).
Typically in RNA-seq, the values are some form of read counts, whose variance tends to depend on the mean. This in turn means that PCA will be dominated by the genes/transcripts with the highest expression levels if no suitable normalization is applied. You can read more about this in this tutorial. When trying to decide between PCA and ICA, think of ICA as a method for unmixing (statistically independent) signals, whereas PCA doesn't recover the original signals but projects the data so as to maximize variance. Since the only information used in PCA is the covariance, to retrieve useful clusters in PCA space the covariance needs to contain information on similarity, and the features need to have reasonably linear relationships for the principal components to be sensible.
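Here is a small simulated sketch of the mean-variance issue, using NumPy only. The Poisson counts, the gene means, the group sizes, and the `log1p` plus per-gene z-scoring used as the "normalization" are all assumptions I picked for illustration; they are not the method from the tutorial, and real pipelines would also account for library size. One gene is very highly expressed but carries no group information, while ten lowly expressed marker genes differ between two groups of cells.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes, n_markers = 300, 50, 10
group = np.repeat([0, 1], n_cells // 2)  # two hypothetical cell groups

# Assumed Poisson means: gene 0 is highly expressed but uninformative,
# genes 1..10 are lowly expressed markers that differ between the groups.
means = np.full((n_cells, n_genes), 5.0)
means[:, 0] = 2000.0
means[group == 1, 1:1 + n_markers] = 15.0
counts = rng.poisson(means).astype(float)


def pca(matrix):
    """Return (scores, loadings) from an SVD of the column-centered matrix."""
    centered = matrix - matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u * s, vt


def abs_corr_with_group(scores):
    """Absolute correlation between the PC1 scores and the group label."""
    return abs(float(np.corrcoef(scores[:, 0], group)[0, 1]))


# PCA on raw counts: variance scales with the mean, so gene 0 dominates.
raw_scores, raw_loadings = pca(counts)
raw_gene0_loading = abs(float(raw_loadings[0, 0]))
raw_group_corr = abs_corr_with_group(raw_scores)

# PCA after a simple log transform and per-gene standardization.
logged = np.log1p(counts)
scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0)
norm_scores, norm_loadings = pca(scaled)
norm_group_corr = abs_corr_with_group(norm_scores)

print("raw PC1 |loading| on gene 0:", raw_gene0_loading)
print("raw PC1 |corr| with group:  ", raw_group_corr)
print("norm PC1 |corr| with group: ", norm_group_corr)
```

On the raw counts, PC1 is essentially the highly expressed gene and says little about the groups. After the transform, PC1 tracks the group label, because the covariance now reflects the marker genes rather than sheer expression level.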