PCA and loadings: how to discriminate betwen real and background/noise signal?
1
1
Entering edit mode
4.3 years ago
camillab. ▴ 160

Hi,

I have a dataset with 7770 genes and 37 samples (these samples come from different datasets and at the moment I cannot use the raw data so each dataset has been converted into Z scores).

I run the PCA (center= T, scale=F, no log transformed) on them and I observed that my samples do not cluster together (or some of them don't, in theory you should be able to see the plot for the first 2 dimensions, I know overall the variation explained by PC1 and PC2 is only 50% so I should plot also the PC3)! so I wanted to know if there are any specific genes to drive my samples far apart from each other so I looked the the "loadings" (with a plotloadings function in PCAtools). I calculated the Horn's law to identify how many components I should keep (=7) and I looked at the loadings for the first 7 components(=138genes). I re-run the PCA on those 138 and all samples cluster together. how come?

Also I removed these 138 genes from my original dataset, re-run the PCA and the samples cluster more or less exactly as before. I was expecting a change in the plot since those 138 genes are responsible for the most variation according to the package.

Last, how can I discriminate if the loadings/most variable genes identified are real and not just background? is there any parameter I should look at specifically? should I look at the normalised reads instead the Z scores (so perform the PCA on those 7770 genes but instead of using the Z scores, I should use the RPKM/FPKM)?

Thank you very much

Camilla

enter image description here

PCA PCAtools loadings RNA-Seq R • 1.9k views
ADD COMMENT
0
Entering edit mode

The package you suggested seems to work on untransformed count data so not in my case.I got negative values on FPKM expression but also with limma and ComBat. I compute the PCA on normlised reads so it makes sense to me do the batch correction and re-run the PCA on the normalised reads to be able to compare them and see if there is a batch effect and if it has been corrected with either limma/ComBat.

the dataset (I have 37 samples):

# A tibble: 6 x 38
  gene       a1     a2     a3    b1    b2    b3
  <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 7SK    1.73    0.559  0.283 86.0  68.7  15.3 
2 A2M    0.415   1.14   0.962 31.1  31.4  37.3 
3 A4GA~  0.382   0.176  0.316  5.34  4.47 10.0 
4 AAAS   3.13    5.22   5.02  28.8  24.2  19.9 
5 AACS  21.2    19.7   16.9   13.3  14.0  13.1 
6 AADAT  0.0351  0      0      6.02  5.14  5.15
# ... with 31 more variables: b4 <dbl>, b5 <dbl>,
#   b6 <dbl>, b7 <dbl>, b8 <dbl>, b9 <dbl>,
#   b10 <dbl>, b11 <dbl>, b12 <dbl>, b13 <dbl>,
#   b14 <dbl>, b15 <dbl>, b16 <dbl>, b17 <dbl>,
#   b18 <dbl>, b19 <dbl>, b20 <dbl>, b21 <dbl>,
#   b22 <dbl>, b23 <dbl>, b24 <dbl>, b25 <dbl>,
#   b26 <dbl>, b27 <dbl>, b28 <dbl>, b29 <dbl>,
#   b30 <dbl>, b31 <dbl>, c1 <dbl>, c2 <dbl>,
#   c3 <dbl>

here is the code:

library(limma)
batch <- c( 1, 1, 1, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 3,3,3 ) #identify batched
shared2 <- shared[,-(1)] #remove gene symbol column
y2 <- removeBatchEffect(shared2,batch=batch)

library(PCAtools)
library(tidyr)
y3 <- as.data.frame(y2) #convert as dataframe
y3 <- y3 %>% drop_na() #remove rows with NA
SE.matrix <-(as.matrix(y3)) #convert into matrix
idx <- colnames(SE.matrix) #define samples names

pRET <- pca(SE.matrix, scale=F) #compute PCA
biplot(pRET, lab = idx, title= "PCA shared genes",subtitle = "batch corrected") #biplot

and that's the data after the batch correction (just the first few samples and the first few rows, gene names are missing):

         a1        a2        a3        b1
1 11.699953 10.527037 10.250832 65.458811
2  9.327410 10.051305  9.874959 15.997706
3  2.840965  2.634769  2.775655  1.528488
          b2        b3        b4         b5
1 48.1810110 -5.235089 29.579511 18.1323110
2 16.2664063 22.136006 10.790906 19.0645063
3  0.6596177  6.224368  2.804108 -0.9509323
         b6        b7        b8        b9
1 -2.214589 -3.598889  0.856611 -3.206089
2 10.710806 11.061106 25.423006  9.474106
3  1.572348  2.771298  4.269548  6.374668

Is my code wrong to batch correct the dataset?

thank you!

Camilla

ADD REPLY
1
Entering edit mode

Hey again,

The package you suggested seems to work on untransformed count data so not in my case.I got negative values on FPKM expression but also with limma and ComBat

You mean limma::removeBatchEffect()? - it cannot be used on FPKM or any other normalised count data. The manual page states that the input object should be logged:

numeric matrix, or any data object that can be processed by getEAWP
containing log-expression values for a series of samples. Rows
correspond to probes and columns to samples.

So, if we have RNA-seq data, we could only apply limma::removeBatchEffect() to the logCPM values (EdgeR / limma-voom) or VST or rlog values (DESeq2).

----------------------

The fundamental issue here is [I think] that you only have FPKM expression units, correct? I think that we had a discussion in another thread about this, relating also to zFPKM. If you must batch correct the FPKM data, then I would at least log it and add a pseudo-count prior to batch correction. For example:

limma::removeBatchEffect(log2(x + 0.1))
ADD REPLY
1
Entering edit mode
4.3 years ago

Can you share some code, in particular when you filter for the 138 genes? If these are responsible for most variation, then filtering the dataset for these will, for all intents and purposes, not alter the bi-plots 'significantly'.

Last, how can I discriminate if the loadings/most variable genes identified are real and not just background? is there any parameter I should look at specifically? should I look at the normalised reads instead the Z scores (so perform the PCA on those 7770 genes but instead of using the Z scores, I should use the RPKM/FPKM)?

You can use the eigencorplot to check if any of the PCs are statistically significantly correlated to, e.g., RIN, batch, or some other potential confounder / unanticipated source of variation.

I am the PCAtools [main] developer and it's my view that the package should be used as one of many lines of evidence in the realm of feature selection, i.e., for choosing a broad range of features (here, genes) that are later further tested via differential expression, clustering, model prediction, regression, etc. PCAtools is obviously good for generic PCA functionality like QCing data, too.

For RPKM / FPKM, I would definitely be enabling scaling here.

ADD COMMENT
0
Entering edit mode

I am the PCAtools [main] developer and it's my view that the package should be used as one of many lines of evidence in the realm of feature selection, i.e., for choosing a broad range of features (here, genes) that are later further tested via differential expression, clustering, model prediction, regression, etc.<

I agree with you and that's what I am also trying to do but I wanted to know if those 138 are due to noise that might arise since the samples have not been sequenced at the same time. I uploaded the plots. let me know if you can see it for some reason I have problem with that!

#Tidy the dataset
allSE2bis2 <- shared_7770 %>% drop_na() #remove rows with NA from the merged filed
rnames <- allSE2bis2$gene #select name
allSE2bis2 <- allSE2bis2 [-c(1)] # remove gene symbol column
SE.matrix <-(as.matrix(allSE2bis2))
rownames(SE.matrix) <- rnames # assign row names
idx <- colnames(SE.matrix)

#cpmpute pca
library(PCAtools)
p <- pca(SE.matrix, scale=F)

#horn law
horn <- parallelPCA(SE.matrix)
horn$n #retain 7 PC

#plot loadings
plotloadings(p,components = getComponents(p, seq_len(7)), title = 'Loadings plot', subtitle = 'PC1 - PC7 (not all genes are shown)', shape = 21, labSize = 4, absolute = T, shapeSizeRange = c(1, 12), drawConnectors = F)

plot loadings

in case not visible those are the variable retained for the first 7 PCs:

-- variables retained:
SCO2, HIST1H1C, AGR3, APP, EEF1A1, FTH1, RPL7, RPLP1, TECTA, TECTB, HES7, SCN1A, ANXA1, NT5E, PHF11, ABLIM2, ACBD7, ADI1, AK8, ALDH1A2, ALS2CL, AP2A2, ATP11A, BEGAIN, BICC1, BIK, BRD2, CARS, CASC4, CCDC13, CDO1, CGNL1, CHRDL1, CHST9, COG7, COL14A1, COL1A2, COL9A1, COLEC12, CPZ, CRLF3, CTTNBP2, DLK2, DNAJC7, DUSP19, EGF, EHD3, ENKUR, ENPP2, ENTPD1, EXOC4, FA2H, FABP5, FAIM2, FAM149A, FANCA, FGD3, FNDC7, GDAP1L1, GGH, HDAC11, HS3ST5, HVCN1, IL1RAP, ITM2A, ITPKB, JAM2, KDM2B, KIT, KLF13, LOXHD1, LRP2, LSM14A, MAOB, MATN4, MCTS1, METRNL, MGST3, MLXIP, MME, MMP2, MPRIP, MRPL45, MSRA, MYBPC1, MYLK3, NAP1L4, NDST1, NSG1, PAPSS2, PBX1, PCMTD2, PCYT1B, PDE10A, PDGFD, PDGFRL, PENK, PIAS2, PIP4K2A, PIP5K1B, PITPNM3, PRKG1, PROM1, PRRG1, PUF60, RAD17, RUNDC3B, SBNO2, SCRG1, SDC1, SFRP2, SH3GL3, SIK1, SLC17A8, SMOC2, SMTNL2, STARD5, STXBP6, SULT4A1, SUSD4, SYNE2, SYNRG, TBC1D9, TCF20, TEKT1, THSD4, TJP1, TMEM52, TNFAIP8, TNKS, TSPAN12, TYMS, U2AF1, VAPA, VAV3, VWA3B, WNT5A

then I create a new dataframe selecting the "variable retained":

#select vriable retained
variableRet <- shared_7770[shared_7770$gene %in% c("SCO2", "HIST1H1C", "AGR3", "APP", "EEF1A1", "FTH1", "RPL7", "RPLP1", "TECTA", "TECTB", "HES7", "SCN1A", "ANXA1", "NT5E", "PHF11", "ABLIM2", "ACBD7", "ADI1", "AK8", "ALDH1A2", "ALS2CL", "AP2A2", "ATP11A", "BEGAIN", "BICC1", "BIK", "BRD2", "CARS", "CASC4", "CCDC13", "CDO1", "CGNL1", "CHRDL1", "CHST9", "COG7", "COL14A1", "COL1A2", "COL9A1", "COLEC12", "CPZ", "CRLF3", "CTTNBP2", "DLK2", "DNAJC7", "DUSP19", "EGF", "EHD3", "ENKUR", "ENPP2", "ENTPD1", "EXOC4", "FA2H", "FABP5", "FAIM2", "FAM149A", "FANCA", "FGD3", "FNDC7", "GDAP1L1", "GGH", "HDAC11", "HS3ST5", "HVCN1", "IL1RAP", "ITM2A", "ITPKB", "JAM2", "KDM2B", "KIT", "KLF13", "LOXHD1", "LRP2", "LSM14A", "MAOB", "MATN4", "MCTS1", "METRNL", "MGST3", "MLXIP", "MME", "MMP2", "MPRIP", "MRPL45", "MSRA", "MYBPC1", "MYLK3", "NAP1L4", "NDST1", "NSG1", "PAPSS2", "PBX1", "PCMTD2", "PCYT1B", "PDE10A", "PDGFD", "PDGFRL", "PENK", "PIAS2", "PIP4K2A", "PIP5K1B", "PITPNM3", "PRKG1", "PROM1", "PRRG1", "PUF60", "RAD17", "RUNDC3B", "SBNO2", "SCRG1", "SDC1", "SFRP2", "SH3GL3", "SIK1", "SLC17A8", "SMOC2", "SMTNL2", "STARD5", "STXBP6", "SULT4A1", "SUSD4", "SYNE2", "SYNRG", "TBC1D9", "TCF20", "TEKT1", "THSD4", "TJP1", "TMEM52", "TNFAIP8", "TNKS", "TSPAN12", "TYMS", "U2AF1", "VAPA", "VAV3", "VWA3B", "WNT5A"), ]

#remove genes from"variable retained" from the original dataset
library(dplyr)
library(tidyverse)
shared <- anti_join (shared_7770, variableRet, by ="gene") #7633

#compute PCA again
allS2 <- shared %>% drop_na() #remove rows with NA from the merged filed
rnames <- allS2$gene#select name
allS2 <- allS2[-c(1)] # remove gene symbol
allS2matrix <-(as.matrix(allS2))
rownames(allS2matrix) <- rnames # assign row names
idx <- colnames(allS2matrix)

#PCA
pRET <- pca(allS2matrix, scale=F)
#plot
biplot(pRET, lab = idx)

2nd biplot

ADD REPLY
0
Entering edit mode

Hey, if you scale the data, how does the bi-plot change? The way that the samples group at left is somewhat worrying, particularly when considered along with the fact that all other samples are 'clumped' together into a 'furrball' at right.

Do you know what is the batch effect?, i.e., the sample-to-batch assignment?

Also, does allS22 exist in your environment?

ADD REPLY
0
Entering edit mode

allS22 my mistake. I corrected it.

if I scale that's the first plot: plot and this is are the loadings: loadings

ADD REPLY
0
Entering edit mode

I have tried to correct for batch effect using limma and I have plot with the barpots the dtaset before and after the correction but I don't understand if my dataset has batch effec that require to be corrected since the plots don't change much. before after correction.

ADD REPLY
0
Entering edit mode

A PCA bi-plot after batch correction would help.

ADD REPLY
0
Entering edit mode

I have noticed that some of the values batch-corrected are negative. Is it normal? I have tried also to correct with ComBat and I still have the negative values and in addition, a lot genes in quite a lot of samples has 0.00000000. Is it normal or is my code wrong? here the biplot after the batch correction with limma: biplot

ADD REPLY
0
Entering edit mode

You mean negative values in the PCA bi-plot? - those are eigenvalues, so, it is somewhat expected that they will be positive and negative.

Unfortunately, I cannot see your entire pipeline / code; so, I cannot comment on it. Can you share it, please?

If you are batch-correcting FPKM expression units via ComBat, then, yes, you will get negative values, which makes no sense because negative expression makes no biological sense. Batch correcting RNA-seq data via ComBat is not recommended. ComBat was designed for microarray data, which is [usually] measured on the log [base 2] scale.

However, there is now ComBat-Seq ( https://github.com/zhangyuqing/ComBat-seq ) which claims to be able to correct for batch in RNA-seq data.

Is FPKM all that you can get? FPKM units are not recommended for any analysis where cross-sample comparisons are being performed. If you can get the raw counts, you can have greater flexibility in your analysis pipeline, and then more easily follow the many tutorials that are on the World Wide Web.

ADD REPLY

Login before adding your answer.

Traffic: 2505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6