I am under the impression that, in general, PCA is used for RNA-Seq but tSNE is used for scRNA-seq.
Can anyone share some comments on why this is the case? Is it because of some intrinsic difference between mRNA and scRNA?
This has more to do with the goals of the techniques. PCA in bulk RNA-seq is intended for QC, to see if there are outliers. tSNE is used in scRNA-seq to find cell types or other groups. You could use tSNE in place of PCA in bulk RNA-seq, but since it has parameters to tweak and is more computationally expensive, there's no great benefit in doing so.
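To make the QC use of PCA concrete, here is a minimal sketch using only NumPy. The expression matrix is simulated, the sample and gene counts are arbitrary, and the planted outlier (sample 7) and the median/MAD scoring are assumptions chosen for illustration, not a standard from any particular pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log-expression matrix: 8 samples x 200 genes (hypothetical data).
n_samples, n_genes = 8, 200
expr = rng.normal(loc=5.0, scale=1.0, size=(n_samples, n_genes))
expr[7, :50] += 6.0  # sample 7 is a deliberately planted outlier

# PCA via SVD of the gene-centered matrix.
centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                     # sample coordinates on each PC
explained = s**2 / np.sum(s**2)    # fraction of variance per PC

# Score each sample on PC1 with a robust z-score (median and MAD),
# then report the sample that deviates most.
pc1 = scores[:, 0]
median = np.median(pc1)
mad = np.median(np.abs(pc1 - median))
robust_z = (pc1 - median) / (1.4826 * mad)
outlier = int(np.argmax(np.abs(robust_z)))

print("most extreme sample on PC1:", outlier)
print("variance explained by PC1:", round(float(explained[0]), 2))
```

In practice you would plot PC1 against PC2 and colour the points by experimental group; a sample sitting far from its group is the thing to investigate.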
My impression is that one of the main reasons is simply that tSNE hasn't been around as long as PCA (Reference). Plus, PCA tends to work well on bulk RNA-seq data, and mathematically its application for detecting outliers, as Devon pointed out, makes sense, since it calculates the vectors along which the variation is maximal. In bulk RNA-seq, you typically want the variation to stem from experimental factors rather than from individual samples, so PCA is a good check for that. In scRNA-seq you typically don't have many different samples; instead you have thousands of different cells, usually stemming from only one or a handful of samples.
There's a decent non-mathematical description of the relative features of PCA and tSNE (as well as diffusion maps) in this review.
I think tSNE captures the local relationships between points, much like treating your data as a network in which your cells are nodes. PCA calculates the "true" distances between points after we consider the variation in your dataset. Mahalanobis distance can do similar things.
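The link between PCA and Mahalanobis distance can be checked numerically. The sketch below, on simulated data with an arbitrary mixing matrix, shows that Euclidean distance between PC scores rescaled to unit variance ("whitened") matches the Mahalanobis distance. Note the assumption: this holds when all PCs are kept and rescaled; plain, unscaled PC scores simply reproduce the ordinary Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 500 points with 3 correlated features.
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
x = rng.normal(size=(500, 3)) @ mixing
centered = x - x.mean(axis=0)

# Sample covariance and its eigendecomposition (the PCA axes).
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Whitened PC scores: rotate onto the PCs, then scale each PC to unit variance.
whitened = (centered @ eigvecs) / np.sqrt(eigvals)

def mahalanobis(a, b, cov):
    """Mahalanobis distance between two points for a given covariance."""
    diff = a - b
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

a, b = centered[0], centered[1]
d_mahal = mahalanobis(a, b, cov)
d_whitened = float(np.linalg.norm(whitened[0] - whitened[1]))
d_euclid = float(np.linalg.norm(a - b))

print("Mahalanobis:", d_mahal)
print("Euclidean on whitened PCs:", d_whitened)
print("Plain Euclidean:", d_euclid)
```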
Applying PCA before tSNE in effect projects your data into a low-dimensional subspace where the distances between points are less affected by noise, so the local relationships that tSNE recovers between points are more reliable.
I think this is largely because of how the clusters are distributed in sample space. For instance, in cancer research we usually have overlapping Gaussian clusters, often quite a simple structure. Both PCA and tSNE work fine to show these structures in my experience. Sometimes, but rarely, the structure of a dataset is more complex, approaching single-cell RNA-seq complexity, and tSNE works better in these situations.
In single-cell RNA-seq we often have far more complex structures, usually consisting of many globular clusters (cell types) of different sizes and variances arranged in complex patterns in sample space. tSNE can capture complex non-linear structures well; PCA can't.
Edit: you may want to look into the UMAP algorithm.
I think the main advantage of tSNE is that it spaces out the data points, but a disadvantage is that you need to be careful about over-interpreting swirly shapes within the tSNE plot (and PCA is often OK with the smaller number of samples in an RNA-Seq project). However, Devon is also correct that tSNE is often used for separating cell types (and I would expect that to be increasingly true as larger numbers of cells/samples are considered).
It may also be helpful to review Difference between tSNE and PCA analysis, which I used to confirm that PCA is used as an upstream step for tSNE.
Already some fine answers, but just a quick comment. These two techniques are fundamentally different in terms of their mathematics and their goals. Performing tSNE on bulk RNA-seq does not make much sense to me because, being 'bulk', it's usually just a heterogeneous mixture of cells that have been extracted from a biopsy, and we cannot know (from looking at the data) which expression signals come from which cells.
tSNE is used on scRNA-seq because this type of sequencing gives us expression values on a cell-wise basis, so tSNE is one of many methods that look for relationships between these cells and attempt to assign groups of cells to cell populations that way. A similar thing is performed in CyTOF analysis.
For bulk RNA-seq, an equivalent to tSNE applied to scRNA-seq data would be the method known as cell deconvolution, i.e., looking at the bulk data and trying to determine the proportion of, for example, immune cell-types in the data.
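As a rough illustration of the deconvolution idea, a bulk profile can be modelled as a signature matrix (genes x cell types) multiplied by a vector of cell-type proportions. The sketch below is a simplification under stated assumptions: the signature matrix, proportions and noise level are simulated, and plain least squares followed by clipping and renormalizing stands in for the constrained or robust regression that dedicated deconvolution tools use.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical signature matrix: 100 marker genes x 3 cell types.
signature = rng.gamma(shape=2.0, scale=50.0, size=(100, 3))

# Simulate one bulk sample as a mixture with known proportions plus small noise.
true_props = np.array([0.6, 0.3, 0.1])
bulk = signature @ true_props + rng.normal(scale=1.0, size=100)

# Ordinary least squares, then clip negatives and renormalize to sum to 1.
coef, *_ = np.linalg.lstsq(signature, bulk, rcond=None)
est_props = np.clip(coef, 0.0, None)
est_props = est_props / est_props.sum()

print("true proportions:     ", true_props)
print("estimated proportions:", np.round(est_props, 3))
```

With real data the signature matrix would come from reference profiles of purified or single-cell populations, and the fit is far noisier than in this toy case.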
It should be pointed out that scRNA-seq pipelines allow you to perform PCA on your data prior to performing tSNE.
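A minimal sketch of that PCA step, using NumPy only. The cell counts, gene counts, three simulated cell types and the choice of 20 components are all assumptions for illustration; `pca_reduce` is a hypothetical helper, not a function from any scRNA-seq package.

```python
import numpy as np

def pca_reduce(matrix, n_components):
    """Project the rows of `matrix` (cells x genes) onto the top principal components."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(3)

# Hypothetical normalized expression: 300 cells x 1000 genes, 3 cell types.
labels = np.repeat(np.arange(3), 100)
centers = rng.normal(scale=2.0, size=(3, 1000))
cells = centers[labels] + rng.normal(size=(300, 1000))

# Keep the top 20 PCs; this reduced matrix is what would be handed to tSNE.
reduced = pca_reduce(cells, n_components=20)
print(reduced.shape)
```

The reduced matrix would then be passed to a tSNE implementation (for example scikit-learn's `TSNE`) in place of the full gene matrix, which is both faster and less noisy.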
Can I say that PCA focuses on sample distance (variance) and can reveal outliers, whereas tSNE focuses on local structure at the expense of long-distance information and may cluster some samples together when they are actually far apart?
I'd rather say that the main caveat is visually making out patterns in the resulting plot and assigning meaning to them without further corroboration. This is a great website to familiarize yourself empirically with tSNE results and the effects of both the parameters and the underlying data structures.
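To see why tSNE favours local structure and why its parameters matter, it helps to look at the Gaussian affinities it builds in the input space. The sketch below computes the conditional probabilities p(j|i) for one cell from made-up distances and two hand-picked bandwidths. This is an assumption for illustration: a real tSNE run chooses the bandwidth per point by search so that the perplexity matches the value the user sets.

```python
import numpy as np

def conditional_probs(distances, sigma):
    """tSNE-style p(j|i) for one point i, given its distances to the other points."""
    logits = -np.square(distances) / (2.0 * sigma**2)
    logits -= logits.max()              # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def perplexity(probs):
    """Perplexity 2**H of a probability vector (effective number of neighbors)."""
    nonzero = probs[probs > 0]
    entropy = -np.sum(nonzero * np.log2(nonzero))
    return float(2.0 ** entropy)

# Distances from one cell to five others: three close neighbors, two distant cells.
distances = np.array([1.0, 1.2, 1.5, 8.0, 10.0])

narrow = conditional_probs(distances, sigma=1.0)
wide = conditional_probs(distances, sigma=5.0)

print("weight on distant cells, sigma=1:", float(narrow[3:].sum()))
print("weight on distant cells, sigma=5:", float(wide[3:].sum()))
print("perplexity, sigma=1 vs sigma=5:", perplexity(narrow), perplexity(wide))
```

With the narrow bandwidth the distant cells receive essentially no weight, so how far away they are has almost no influence on the embedding. That is why distances between well-separated clusters in a tSNE plot should not be over-interpreted.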