Question

Correcting RNA-seq matrix by removing principal components gives negative values in the matrix

0

Entering edit mode

4.1 years ago

guillepalou4 ▴ 20

Hi guys,

I have an RNA-seq matrix with Transcripts-Per-Million (TPM) values on the cells. It contains 87 samples and 25307 transcripts (after removing 0 counts and low-expressed genes). Values ranges from 0 to 77499.55.

I wanted to perform a PCA and remove the first 4 principal components to correct my matrix for global sources of gene expression variation that I am not interested in. This is usually done to increase power in gene expression analysis (for instance here --> https://www-nature-com.sare.upf.edu/articles/ng.3173#Sec19). However, after doing so, my matrix with TPM counts contains negative values. I think it is expected and mathematically correct, but I cannot compare expression values between genes if some of them are negative and some of them positive. I wonder how they did this in the paper above-mentioned (not explained)... I tried to do it with two different functions:

1) prcomp (stats R package). I read here (https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com) how to do it, so I followed the steps.

mu <-colMeans(t(RNAseq.matrix.filt))
res.pca <- prcomp(t(RNAseq.matrix.filt), scale = FALSE)
nComp <- 4
RNAseq.matrix.pcs.removed <- res.pca$x[,-c(1:nComp)] %*% t(res.pca$rotation[,-c(1:nComp)])
RNAseq.matrix.pcs.removed  <- scale(RNAseq.matrix.pcs.removed , center = -mu, scale = FALSE)
range(RNAseq.matrix.filt)
[1]     0.00 77499.55
range(RNAseq.matrix.pcs.removed)
[1] -6500.927 39084.074
RNAseq.matrix.pcs.removed[1:5,1:5]

                              uc010zzl  uc001jiu  uc003wfr   uc003wft uc003wfu
TCGA.SC.A6LQ.01A.11R.A34F.07  9.571990  53.51113 3.2032865  1.7377768 4.106314
TCGA.UT.A88G.01B.11R.A40A.07 14.428840  40.20722 6.2185200  0.7639769 4.799665
TCGA.SC.A6LN.01A.11R.A34F.07  8.124673 111.84805 0.3601852 -0.3603312 1.061221
TCGA.UD.AAC4.01A.11R.A40A.07 15.793696  95.13709 1.8790241  1.9879911 8.361486
TCGA.MQ.A6BL.01A.11R.A34F.07 10.283230  92.35297 2.4699114  1.5041390 3.269851

2) reconst (FactoMineR R package). I created a new function using the code of "reconst" so I could remove the first 4 PC's, instead of keeping only them:

    remove.PCs <- function (res, nComp = NULL){

      if (inherits(res, "PCA")) {
        coord.var = as.matrix(res$var$coord)[, -c(1:nComp), drop = F]
        coord.ind = as.matrix(res$ind$coord)[, -c(1:nComp), drop = F]
        hatX = coord.ind %*% t(sweep(coord.var, 2, sqrt(res$eig[-c(1:nComp),1]), FUN = "/"))
      }
      if (inherits(res, "PCA")) {
        hatX = sweep(hatX, 2, res$call$ecart.type, FUN = "*")
        hatX = sweep(hatX, 2, res$call$centre, FUN = "+")
      }
      return(hatX)
    }

res.pca <- PCA(t(RNAseq.matrix.filt), scale.unit = FALSE, nComp  = 100)
RNAseq.matrix.pcs.removed <- remove.PCs(res.pca, nComp  = 4)
range(RNAseq.matrix.pcs.removed)
[1] -6500.927 39084.074
RNAseq.matrix.pcs.removed[1:5,1:5]

                              uc010zzl  uc001jiu  uc003wfr   uc003wft uc003wfu
TCGA.SC.A6LQ.01A.11R.A34F.07  9.571990  53.51113 3.2032865  1.7377768 4.106314
TCGA.UT.A88G.01B.11R.A40A.07 14.428840  40.20722 6.2185200  0.7639769 4.799665
TCGA.SC.A6LN.01A.11R.A34F.07  8.124673 111.84805 0.3601852 -0.3603312 1.061221
TCGA.UD.AAC4.01A.11R.A40A.07 15.793696  95.13709 1.8790241  1.9879911 8.361486
TCGA.MQ.A6BL.01A.11R.A34F.07 10.283230  92.35297 2.4699114  1.5041390 3.269851

Both matrices using these two functions are the same, but I wonder if having negative values is correct, and in that case, what people often do with them? Any help?

Thank you!

Guille

RNA-Seq principal components TPM PCA • 1.3k views

ADD COMMENT • link 4.1 years ago by guillepalou4 ▴ 20