Hello everyone,
I have a question concerning RNA-seq analysis, particularly about the differentially expressed genes. To provide some background, we used three biological replicates for each group where cell lines exposed to arsenic or not for a specific time period. We dealt with batch effects, as two replicates from each group were sequenced later. To mitigate these batch effects, we employed ComBat_seq, which significantly reduced the batch effect. We also implemented a count threshold to filter out lowly expressed genes across the groups, setting a minimum of 10 counts in each group. However, the resulting volcano plots looked quite unusual, with most of the differentially expressed genes (DEGs) falling within a log2 fold change of 1. I'm unsure how to interpret this. The FASTQC reports were satisfactory, and all samples showed more than 85% of mapped reads. Could this pattern be related to the use of ComBat_seq, where transcript count levels might have been altered by the algorithm to correct for batch effects, or might there be another explanation? Thank you!
PCA Plot after batch correction:
Volcano plot showing DEGs with no lfcshrinkage:
Volcano plot showing DEGs with lfcshrinkage:
There is a good number of DEGs in the top plot beyond abs(log2FC) > 1. The shape of that plot generally looks better save for the compression in the middle because of your X-axis scale (see below). There are savvier RNA-seq users on this forum than yours truly and maybe they will tell you that removing low counts is justified, but I never do that. Never had to deal with the batch effect you described.
I would suggest that you change the X-axis in these plots. By stretching it in the [-7, 7] range, you are unnecessarily making it difficult to see the business part of your plot. The range should be at most [-5, 5] for the top plot if you wanted to keep it symmetric, but [-5, 3] would also work.
Thank you for the suggestion. Limiting the x-axis did improve the visuals. However, there are only a handful of DEGs that lie beyond abs(log2FC) > 1 with most of them nearing 1. I'm uncertain if this reflects the inherent nature of the data.