Hi guys,
I have an RNA-seq matrix with Transcripts-Per-Million (TPM) values on the cells. It contains 87 samples and 25307 transcripts (after removing 0 counts and low-expressed genes). Values ranges from 0 to 77499.55.
I wanted to perform a PCA and remove the first 4 principal components to correct my matrix for global sources of gene expression variation that I am not interested in. This is usually done to increase power in gene expression analysis (for instance here --> https://www-nature-com.sare.upf.edu/articles/ng.3173#Sec19). However, after doing so, my matrix with TPM counts contains negative values. I think it is expected and mathematically correct, but I cannot compare expression values between genes if some of them are negative and some of them positive. I wonder how they did this in the paper above-mentioned (not explained)... I tried to do it with two different functions:
1) prcomp (stats R package). I read here (https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com) how to do it, so I followed the steps.
mu <-colMeans(t(RNAseq.matrix.filt))
res.pca <- prcomp(t(RNAseq.matrix.filt), scale = FALSE)
nComp <- 4
RNAseq.matrix.pcs.removed <- res.pca$x[,-c(1:nComp)] %*% t(res.pca$rotation[,-c(1:nComp)])
RNAseq.matrix.pcs.removed <- scale(RNAseq.matrix.pcs.removed , center = -mu, scale = FALSE)
range(RNAseq.matrix.filt)
[1] 0.00 77499.55
range(RNAseq.matrix.pcs.removed)
[1] -6500.927 39084.074
RNAseq.matrix.pcs.removed[1:5,1:5]
uc010zzl uc001jiu uc003wfr uc003wft uc003wfu
TCGA.SC.A6LQ.01A.11R.A34F.07 9.571990 53.51113 3.2032865 1.7377768 4.106314
TCGA.UT.A88G.01B.11R.A40A.07 14.428840 40.20722 6.2185200 0.7639769 4.799665
TCGA.SC.A6LN.01A.11R.A34F.07 8.124673 111.84805 0.3601852 -0.3603312 1.061221
TCGA.UD.AAC4.01A.11R.A40A.07 15.793696 95.13709 1.8790241 1.9879911 8.361486
TCGA.MQ.A6BL.01A.11R.A34F.07 10.283230 92.35297 2.4699114 1.5041390 3.269851
2) reconst (FactoMineR R package). I created a new function using the code of "reconst" so I could remove the first 4 PC's, instead of keeping only them:
remove.PCs <- function (res, nComp = NULL){
if (inherits(res, "PCA")) {
coord.var = as.matrix(res$var$coord)[, -c(1:nComp), drop = F]
coord.ind = as.matrix(res$ind$coord)[, -c(1:nComp), drop = F]
hatX = coord.ind %*% t(sweep(coord.var, 2, sqrt(res$eig[-c(1:nComp),1]), FUN = "/"))
}
if (inherits(res, "PCA")) {
hatX = sweep(hatX, 2, res$call$ecart.type, FUN = "*")
hatX = sweep(hatX, 2, res$call$centre, FUN = "+")
}
return(hatX)
}
res.pca <- PCA(t(RNAseq.matrix.filt), scale.unit = FALSE, nComp = 100)
RNAseq.matrix.pcs.removed <- remove.PCs(res.pca, nComp = 4)
range(RNAseq.matrix.pcs.removed)
[1] -6500.927 39084.074
RNAseq.matrix.pcs.removed[1:5,1:5]
uc010zzl uc001jiu uc003wfr uc003wft uc003wfu
TCGA.SC.A6LQ.01A.11R.A34F.07 9.571990 53.51113 3.2032865 1.7377768 4.106314
TCGA.UT.A88G.01B.11R.A40A.07 14.428840 40.20722 6.2185200 0.7639769 4.799665
TCGA.SC.A6LN.01A.11R.A34F.07 8.124673 111.84805 0.3601852 -0.3603312 1.061221
TCGA.UD.AAC4.01A.11R.A40A.07 15.793696 95.13709 1.8790241 1.9879911 8.361486
TCGA.MQ.A6BL.01A.11R.A34F.07 10.283230 92.35297 2.4699114 1.5041390 3.269851
Both matrices using these two functions are the same, but I wonder if having negative values is correct, and in that case, what people often do with them? Any help?
Thank you!
Guille