Correcting RNA-seq matrix by removing principal components gives negative values in the matrix
0
0
Entering edit mode
4.1 years ago
guillepalou4 ▴ 20

Hi guys,

I have an RNA-seq matrix with Transcripts-Per-Million (TPM) values on the cells. It contains 87 samples and 25307 transcripts (after removing 0 counts and low-expressed genes). Values ranges from 0 to 77499.55.

I wanted to perform a PCA and remove the first 4 principal components to correct my matrix for global sources of gene expression variation that I am not interested in. This is usually done to increase power in gene expression analysis (for instance here --> https://www-nature-com.sare.upf.edu/articles/ng.3173#Sec19). However, after doing so, my matrix with TPM counts contains negative values. I think it is expected and mathematically correct, but I cannot compare expression values between genes if some of them are negative and some of them positive. I wonder how they did this in the paper above-mentioned (not explained)... I tried to do it with two different functions:

1) prcomp (stats R package). I read here (https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com) how to do it, so I followed the steps.

mu <-colMeans(t(RNAseq.matrix.filt))
res.pca <- prcomp(t(RNAseq.matrix.filt), scale = FALSE)
nComp <- 4
RNAseq.matrix.pcs.removed <- res.pca$x[,-c(1:nComp)] %*% t(res.pca$rotation[,-c(1:nComp)])
RNAseq.matrix.pcs.removed  <- scale(RNAseq.matrix.pcs.removed , center = -mu, scale = FALSE)
range(RNAseq.matrix.filt)
[1]     0.00 77499.55
range(RNAseq.matrix.pcs.removed)
[1] -6500.927 39084.074
RNAseq.matrix.pcs.removed[1:5,1:5]

                              uc010zzl  uc001jiu  uc003wfr   uc003wft uc003wfu
TCGA.SC.A6LQ.01A.11R.A34F.07  9.571990  53.51113 3.2032865  1.7377768 4.106314
TCGA.UT.A88G.01B.11R.A40A.07 14.428840  40.20722 6.2185200  0.7639769 4.799665
TCGA.SC.A6LN.01A.11R.A34F.07  8.124673 111.84805 0.3601852 -0.3603312 1.061221
TCGA.UD.AAC4.01A.11R.A40A.07 15.793696  95.13709 1.8790241  1.9879911 8.361486
TCGA.MQ.A6BL.01A.11R.A34F.07 10.283230  92.35297 2.4699114  1.5041390 3.269851

2) reconst (FactoMineR R package). I created a new function using the code of "reconst" so I could remove the first 4 PC's, instead of keeping only them:

    remove.PCs <- function (res, nComp = NULL){

      if (inherits(res, "PCA")) {
        coord.var = as.matrix(res$var$coord)[, -c(1:nComp), drop = F]
        coord.ind = as.matrix(res$ind$coord)[, -c(1:nComp), drop = F]
        hatX = coord.ind %*% t(sweep(coord.var, 2, sqrt(res$eig[-c(1:nComp),1]), FUN = "/"))
      }
      if (inherits(res, "PCA")) {
        hatX = sweep(hatX, 2, res$call$ecart.type, FUN = "*")
        hatX = sweep(hatX, 2, res$call$centre, FUN = "+")
      }
      return(hatX)
    }

res.pca <- PCA(t(RNAseq.matrix.filt), scale.unit = FALSE, nComp  = 100)
RNAseq.matrix.pcs.removed <- remove.PCs(res.pca, nComp  = 4)
range(RNAseq.matrix.pcs.removed)
[1] -6500.927 39084.074
RNAseq.matrix.pcs.removed[1:5,1:5]

                              uc010zzl  uc001jiu  uc003wfr   uc003wft uc003wfu
TCGA.SC.A6LQ.01A.11R.A34F.07  9.571990  53.51113 3.2032865  1.7377768 4.106314
TCGA.UT.A88G.01B.11R.A40A.07 14.428840  40.20722 6.2185200  0.7639769 4.799665
TCGA.SC.A6LN.01A.11R.A34F.07  8.124673 111.84805 0.3601852 -0.3603312 1.061221
TCGA.UD.AAC4.01A.11R.A40A.07 15.793696  95.13709 1.8790241  1.9879911 8.361486
TCGA.MQ.A6BL.01A.11R.A34F.07 10.283230  92.35297 2.4699114  1.5041390 3.269851

Both matrices using these two functions are the same, but I wonder if having negative values is correct, and in that case, what people often do with them? Any help?

Thank you!

Guille

RNA-Seq principal components TPM PCA • 1.3k views
ADD COMMENT

Login before adding your answer.

Traffic: 2201 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6