Hi,
I have a dataset with 7770 genes and 37 samples (these samples come from different datasets and at the moment I cannot use the raw data so each dataset has been converted into Z scores).
I run the PCA (center= T, scale=F
, no log transformed) on them and I observed that my samples do not cluster together (or some of them don't, in theory you should be able to see the plot for the first 2 dimensions, I know overall the variation explained by PC1 and PC2 is only 50% so I should plot also the PC3)! so I wanted to know if there are any specific genes to drive my samples far apart from each other so I looked the the "loadings" (with a plotloadings
function in PCAtools
). I calculated the Horn's law to identify how many components I should keep (=7) and I looked at the loadings for the first 7 components(=138genes). I re-run the PCA on those 138 and all samples cluster together. how come?
Also I removed these 138 genes from my original dataset, re-run the PCA and the samples cluster more or less exactly as before. I was expecting a change in the plot since those 138 genes are responsible for the most variation according to the package.
Last, how can I discriminate if the loadings/most variable genes identified are real and not just background? is there any parameter I should look at specifically? should I look at the normalised reads instead the Z scores (so perform the PCA on those 7770 genes but instead of using the Z scores, I should use the RPKM/FPKM)?
Thank you very much
Camilla
The package you suggested seems to work on untransformed count data so not in my case.I got negative values on FPKM expression but also with
limma
andComBat
. I compute the PCA on normlised reads so it makes sense to me do the batch correction and re-run the PCA on the normalised reads to be able to compare them and see if there is a batch effect and if it has been corrected with eitherlimma
/ComBat
.the dataset (I have 37 samples):
here is the code:
and that's the data after the batch correction (just the first few samples and the first few rows, gene names are missing):
Is my code wrong to batch correct the dataset?
thank you!
Camilla
Hey again,
You mean
limma::removeBatchEffect()
? - it cannot be used on FPKM or any other normalised count data. The manual page states that the input object should be logged:So, if we have RNA-seq data, we could only apply
limma::removeBatchEffect()
to the logCPM values (EdgeR / limma-voom) or VST or rlog values (DESeq2).----------------------
The fundamental issue here is [I think] that you only have FPKM expression units, correct? I think that we had a discussion in another thread about this, relating also to zFPKM. If you must batch correct the FPKM data, then I would at least log it and add a pseudo-count prior to batch correction. For example: