I would strongly recommend to transform your data to log scale before running PCA. You commonly select genes for PCA based on the most variable genes (so variance). The thing is that there is a strong relationship between the mean and the variance if not working on the log scale. That means the higher the counts, the higher the variance, so selecting by variance without log-transformation will just give you genes with high expression level, but what you want is the variance driven by the differences between the samples, so PCA separation actually tells you about the differences between the samples without the confounding effect of the expression level.
Lets have an example with TPM data, tpm
is a numeric matrix with TPMs in this case just two n=2 from two different cell types:
library(UpSetR)
library(matrixStats)
#/ Plot the mean vs the variance of this two-column tpm matrix with and without log2-ing the TPMs:
mean.nolog <- rowMeans2(tpm)
rv.nolog <- rowVars(tpm)
mean.log <- rowMeans2(log2(tpm+1))
rv.log <- rowVars(log2(tpm+1))
#/ Select top-500 most variable genes besed on rowwise variance:
mostVar.nolog <- head(order(rv.nolog, decreasing = TRUE), 500)
mostVar.log <- head(order(rv.log, decreasing = TRUE), 500)
#/ Pearson correlation between mean and variance:
cor.nolog <- round(cor(mean.nolog, rv.nolog), 2)
cor.log <- round(cor(mean.log, rv.log), 2)
#/ Plot:
par(mfrow=c(2,1), bty="n")
#/ put "nolog" results on log scale for visualization (not the TPMs itself):
plot(log2(mean.nolog+1), log2(rv.nolog+1), pch=20, main = paste("Pearson:", cor.nolog))
plot(mean.log, rv.log, pch=20, main = paste("PearsonCor:", cor.log))
dev.off()
#/ Visualize the very limited overlap between the HVGs using upset:
print(upset(fromList(list(noLog = mostVar.nolog, Log = mostVar.log))))
See the below plots, the mean/variance dependency confounds the selection of genes by expression level, which is usually not what you want, therefore log2-transformation is desirable. This effect is called variance stablilization. There are dedicated approaches to do this, such as the vst
function from DESeq2 when starting from raw gene level counts, but usually the log is a sufficient proxy for this and commonly used. As you'll see in the upset plot below the overlap between the top variable genes is very small between the two methods, I would use the ones based on the log-transformed counts.
From there you can go ahead with PCA, either using code inspired from the DESeq2 plotPCA
function or check PCAtools
at Bioconductor.
No need to perform log transformed to TPM data.