I'm working with RNA-seq data and I 'm trying to figure out if in my data there is a batch effect. I normalized my data in TPM through Salmon and then I run the following code
This plot won't tell you whether you have batch effects or not, you have to look at the biplot for this to see whether samples cluster unexpectedly, or run tools like sva to check for hidden unwanted variation. To do so you probably want to summarize your salmon transcript abundance estimations to the gene level, see tximport package, and then properly normalize the counts, e.g. via DESeq2::vst or the normalization method in edgeR.
To oversimplify it a bit, PCA attempts to explain the variance in data with many dimensions (such as tens of thousands of genes), in much fewer dimensions.
For RNA-seq, each principle component (PC) is a collection of genes that explain some percentage of the total variance in your data. Take a hypothetical ideal situation where you have 3 WT samples and 3 KD samples that you did RNA-seq for. Most of the variance in the data should occur in the collection of genes that are up and downregulated between the two conditions. Therefore, a single principle component may be able to explain most of the variance in the data because these genes explain the separation of the two conditions.
Let's now imagine that an overworked scientist made a slight mistake in sample collection for the last two KD samples by growing the samples at a slightly higher temperature than the others, and this mistake propagates into changes in gene expression for those two samples compared to all others. You now have two overlapping transcription profiles in your data, the first being the separation of condition (WT vs KD), and the other being the separation of batch (normal vs slightly elevated temperature). Your first PC may still separate the WT samples from the KD samples, but you may notice that only 75% of the variance in the data is explained this time, instead of the close to 100% you got from the ideal situation. What you do notice now though is that you have a second PC that explains 25% of variance in your data. When you plot this dimension you can clearly see that the two samples grown in the slightly elevated condition are being separated from the other 4 samples grown at the correct temperature. This means that the second PC is capturing the variance in the data introduced by the temperature difference.
What your scree plot is telling you is that you have roughly 4-7 PCs that may be of interest in your data. Without knowing how many samples and conditions you have it would be difficult to give any specific reason for this. You could make a pairs plot from the excellent library PCAtools to explore how the samples are being separated by the different dimensions. Using the same library, you could also look at the loading plots to see the genes that are contributing to each dimension.
Thank you for your very clear explanation. Actually I have a total of 60 samples: 30 tissues from one in individual and 30 from another one (e.g liver1 and liver2, lung1 and lung2 and so on).
What your scree plot is telling you is that you have roughly 4-7 PCs that may be of interest in your data
It means that most variation is captured by the first few PCs but as said above this is not really relevant in terms of batch effects. Look at a biplot to make statements about relevant batch effects. If you need guidance then please post a plot, e.g. check the plotPCA function from DESeq2 or the PCAtools Bioconductor package.
Hi,
thank you for your answer!
These are 2 plot from plotPCA function from the DESeq2. A batch that I could consider is the different days of library preparation (e.g Date1 and Date2).
This plot won't tell you whether you have batch effects or not, you have to look at the biplot for this to see whether samples cluster unexpectedly, or run tools like
sva
to check for hidden unwanted variation. To do so you probably want to summarize your salmon transcript abundance estimations to the gene level, see tximport package, and then properly normalize the counts, e.g. viaDESeq2::vst
or the normalization method inedgeR
.This might be helpful: https://statquest.org/pca-clearly-explained/
Please don't delete a question after someone spends their time to help you with it.