Hi everybody!
I know that there are a lot of questions about how to adjust the RNA-Seq data if there is a batch effect in the samples, but after reading a lot of posts, questions and recommendations, I still have dobts on how to deal with my samples.
I have RNA-seq data from 2 different samples (control and treatment), which I want to compare and result in DEG list. For each sample, I have 3 replicates.
I have used edgeR to normalize the data (method=TMM), and after that I have performed the MDSplot and PCA plot. In both of them I see that there is a batch effect between the samples collected on the same day, as instead of seeing two separated groups between control and treatment, I see 3 groups with pairs of control_1
and treatment_1
, control_2
and treatment_2
and control_3
and treatment_3
. I see this in both MDSplot and PCAplot.
So I have been reading in biostars, and decided to perform batch effect correction in edgeR defining "batch" and unsing in in the design matrix for the DE analysis.
But to see if this adjusting of the batch effect works I have used "removeBatchEffect" function to then plot MDS and PCA and see if I obtain that separation. I think that for MDS plot more or less it is correct the separation, but again in PCA plot there is something with control_1 and treatment_1 that doesn't work even after the batch adjustment.
How can I solve this to be sure that when performing the DE analysis I am considering properly the samples and replicates, and finally have a good looking MDS and PCA plots?
Thanks in advance!!
Iraia
Please edit your original post and use these instructions to add images inline: How to add images to a Biostars post
Thanks for showing how to do it!
Strange. Which package is the "PCAplot" function from? Why is there a circle? and how much of the variation does the PCA 1 and 2 explain? I assume the "MDSplot" function is edgeR's own right?
How did you use the "removeBatchEffect"? What input did you give?
Hi again! Thanks for answering. I constructed the PCAplot from a short pipeline I found online in a blog. . Maybe someone can suggest a better way to create the PCA plot, I would be very happy if someone could share it with me.
In the before removing batch effect PCAplot the variation explain with PCA1 and PCA2 are:
In the after removing batch effect PCAplot the variation explain with PCA1 and PCA2 is:
On the other way, yes MDSplot function is from edgeR. And now I am going to write the pipeline used for edgeR normalization and plots generation to see if I am correct in the followed steps, and on the same way answerd your question about "removeBatchEffect" use:
I am going to try to add the images in another way following the guides "genomax" mentions.
If you need more information about my followed steps or something more tell me. Any appreciation, recomendation or help will be welcomed! Thansk,
Iraia
For the PCA bi-plots, you should be plotting the
x
component, not therotation
component. So, it should beTMM.pca$x
, notTMM.pca$rotation
. DESeq2's PCA function and my own PCAtools package plot thex
component.To read more:
This is not true. In RNA-seq and expression studies, control and treatment do not necessarily have to segregate on a bi-plot.
Perhaps you should explain more about the laboratory preparatory work that you performed: How was RNA isolation and extraction performed?; which kits were used?; how was sequencing performed and to which depth?
I find that many people worry too much about batch effects. The key to tackling batch is in your experimental design, which you need to describe in detail here.
If the poster didn't transpose his count matrix before doing prcomp, then
$rotation
might be what he wants.Not sure about that. It looks like
prcomp()
was used. The variable names in prcomp are confusing.rotation
is just the component loadings, whereasx
is the original data (possibly centered and scaled) multiplied by the loadings, which is what people usually plot. I have very rarely seen people make bi-plots of component loadings.Let's do an example. Create random data-matrix with genes as rows and samples as columns:
Perform PCA:
Transpose the input data (
t(a)
) to see what happens:So, iraia.munoa, if your input data has genes as rows and samples as columns, you need to run prcomp like this:
Then plot the
x
variable ofpca
The important thing is that you use the x-component and that you use log transformed data - and I would also suggest that you use scale = TRUE in prcomp.
so you are agree with Kevin Blighe in doing the transposition? I would try it right know and tell you all about the result. Thanks everybody!
Hi again! I think I have manage to follow your recomendations. I write them right here to see if I am correct (after TMM normalization, log transformation and removing batch effect):
Defining prcomp with scale=TRUE, and transposing to use the x values:
Plotting the results:
Conclusions:
Is this correct? I mean, I am working with this data for my phd project, would it be correct to use it in this way? or someone recomends me to follow other steps? And something more important, after this steps and with that PCAplots would you follow the proccess and do the DE, believing in the results? I would expect that after removing batch effect, all control samples would be grouped in a corner and all treatment samples in the other corner, I mean with a good separation, and I only see this clasification if I draw a line from the upper left corner to the bottom right corner, and I don't know if it is enough to explain correctly sample distribution. Maybe I have an incorrect understanding of how to understand the PCA plot significance in analyzing RNA-Seq data, so any explanaition would be welcomed!
Thanks in advance
The PCA bi-plot looks as expected now, i.e., using the
x
value. The batch correction seems to have clearly identified a batch effect and adjusted for it, too. You are still not showing the percent explained variation on the axes on the bi-plots, though (?).Oh! sorry, with my coments I did not want you to say if everything I have done is correct or not. I only want recomendation of which steps to follow or detection if something I have shown here is correct or not. I mean, we don't have bioinformaticians in our research group, so I am trying to understand and learn everything by myself (as I think a lot of researchers do with sequencing analysis) and trying to do the job in the better possible way. So I only wanted to know if after the batch effect adjust it is a good idea to follow the DE analysis, I mean if after looking at the second PCA plot, experts in the field would say "Well, now the sample clusters are better, and it is worthy to follow with the DE analysis", or on the other hand "Well, the adjust is correct, but maybe there are another steps worthy to follow before DE analysis".
Even so, I will try your idea of doing the same DE analysis, with data before adjustment and after adjustment and compare both DEG lists (and of course the comparisons of all the plots you mentioned above).
Answering your questions, the variation percent is:
68% and 77% of cumulative variance is a small proportion?
And finally, my goal of using removeBatchEffect() and doing before and after MDS and PCAplots was to see if after adjusting the batch effect, the clusterization of samples was better. But as I have read, I can't use the data after removing the batch effect for DEG analysis right? I need to do the edgeR protocol (or DESeq) but taken into account the "batch" in the matrix design right?
Thanks again
I see what you mean. For your differential expression analysis, just include
batch
in the design formula and use the un-adjusted normalised counts. The batch effect looks consistent in your study, so, the regression model should be easy to fit by EdgeR's internal engine.Please take time to read responses here: https://support.bioconductor.org/p/87718/
Thanks again for all your answers, you have helped me a lot! If someone more wants to add some other points of view they are also welcome to writ in this post.
What about the variance values? Are they enough to focud on PC1 vs PC2 or do I need to change to PC3 or something like that?
Regarding the explained variation, nothing seems out of the ordinary to me.
Just regarding these lines:
This should be:
..or:
Take a look here: https://support.bioconductor.org/p/116614/#116621
I didn't know about "prior.count". I have read what Gordon says in the post you linked, and I have used it in my samples, and the MDSplot looks the same. I don't understand really well why I should use it. It is to avoid that 0 counts become Inf when doing the log transformation? And how do I know which number to chose there? why 2? Thanks,
To follow with RamRS answer, I didn't know the need to transpose the count matrix, and I simply followed a pipeline posted in my first reply. I though that it was correct to use it as a global value for each variables, as they explain the rotations value as coefficients of the linear combinations of the continuous variables.