Hi all,
I have three different datasets (datasets 1, 2, and 3, each coming from a different paper), and in each dataset I am comparing the same two conditions, A and B. I would like to do differential expression (DE) analysis between A and B for each dataset and then compare the DE genes across the datasets. I created a PCA plot of all the samples (conditions A and B) from the three datasets by merging the raw counts and then normalizing and applying the variance-stabilizing transformation (VST) in DESeq2. The plot shows that the samples cluster by dataset: the A and B samples from one dataset group together, with little separation between them, rather than the A samples (or the B samples) from different datasets grouping together.
Should I first batch-correct the raw count matrix (containing all the samples from the three datasets) and then use the batch-corrected counts to do the DE analysis (A vs. B) separately for datasets 1, 2, and 3?
If I do have to do batch correction, should I compare different batch-correction methods (DESeq2/limma, SVA, ComBat-seq)? Do these methods give similar results, so that any of them can be used?
Thanks in advance for all your help. I apologize if this question has been addressed before on Biostars. In that case, I would appreciate it if you could send me the link.
Why do you want to combine the studies? Is each study underpowered? Another good option would be to do a meta-analysis, e.g. with RobustRankAggreg. That will tell you which genes change consistently between these conditions, and you do not have to bother with batch correction.
Thanks for your reply. Each dataset is obtained from a different cell type. I am interested in seeing whether the same genes change in expression between conditions A and B in the three cell types, or whether different genes change depending on the cell type. I will look into RobustRankAggreg. Thanks for suggesting that.
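To make the meta-analysis idea concrete, here is a minimal sketch of rank aggregation across per-dataset DE results. It is not the RRA algorithm that RobustRankAggreg implements; it only averages normalized ranks, and the gene names and rankings are made-up toy values.

```python
# Toy per-dataset gene rankings for A vs. B, most significant first.
# Gene names and orderings are hypothetical, for illustration only.
rankings = {
    "dataset1": ["geneA", "geneB", "geneC", "geneD"],
    "dataset2": ["geneB", "geneA", "geneD", "geneC"],
    "dataset3": ["geneA", "geneC", "geneB", "geneD"],
}


def mean_rank_aggregate(rankings):
    """Sort genes by their mean normalized rank across all rankings.

    Only genes present in every ranking are scored. A lower score means
    the gene is consistently near the top of each list.
    """
    shared = set.intersection(*(set(r) for r in rankings.values()))
    scores = {}
    for gene in shared:
        # Normalize each rank to (0, 1] so lists of different length compare.
        norm_ranks = [(r.index(gene) + 1) / len(r) for r in rankings.values()]
        scores[gene] = sum(norm_ranks) / len(norm_ranks)
    # Break ties alphabetically so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))


consensus = mean_rank_aggregate(rankings)
for gene, score in consensus:
    print(gene, round(score, 3))
```

Because each dataset is ranked on its own, batch differences between the datasets never enter the comparison.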
RNA-seq is sensitive to batch effects. If each cell type was prepped by a different lab, you won't be able to distinguish differences due to cell type from differences due to being prepped by a totally different lab. And there is no magic way to remove the batch effect while preserving cell-type differences.
The typical way to deal with batch effects is not to alter the counts, but to include batch as a term in the design. But again, you won't be able to include both batch and cell type in your design, because they are the same thing.
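A small sketch of why batch and cell type cannot both go in the design: when every lab corresponds to exactly one cell type, their indicator columns are identical, so the design matrix has fewer independent columns than coefficients. The sample layout below is hypothetical (one A and one B sample per dataset), and the rank helper is a plain Gaussian elimination written for this example. DESeq2 reports this situation as a model matrix that is not full rank.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current pivot row with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical layout: (batch, cell type, condition) for each sample.
samples = [
    ("lab1", "cell1", "A"), ("lab1", "cell1", "B"),
    ("lab2", "cell2", "A"), ("lab2", "cell2", "B"),
    ("lab3", "cell3", "A"), ("lab3", "cell3", "B"),
]


def design_row(batch, cell, cond):
    """Treatment-coded row: intercept, batch, cell type, condition."""
    return [
        1,
        int(batch == "lab2"), int(batch == "lab3"),
        int(cell == "cell2"), int(cell == "cell3"),
        int(cond == "B"),
    ]


design = [design_row(*s) for s in samples]
n_coefficients = len(design[0])
rank = matrix_rank(design)
print(f"coefficients: {n_coefficients}, rank: {rank}")
```

The rank comes out lower than the number of coefficients, which means the batch and cell-type effects cannot be estimated separately from these data.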
Thanks so much! I completely agree. I have been struggling with coming up with a good way to do this analysis. Your insights are very helpful.
I would do exactly as swbarnes2 says. Because you're using data from distinct cell types, I would also perform unsupervised analysis on each dataset separately, as you have done with PCA. Even if conditions A and B separate in all three cell types, that does not guarantee the separation is driven by the same genes in each dataset. Therefore I would also do a post hoc comparison: run each test individually, then compare the correlation of the Wald statistics between datasets (D1 x D2, D1 x D3 and D2 x D3).
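As a rough sketch of that last step, assuming the per-gene Wald statistics from each separate A vs. B test have already been exported (for example the `stat` column of a DESeq2 results table), the pairwise correlations could be computed like this. The numbers are toy values, not real results, and Pearson correlation is just one reasonable choice; a rank-based correlation would work the same way.

```python
from itertools import combinations
from math import sqrt

# Hypothetical Wald statistics (A vs. B) per gene for each dataset.
wald = {
    "D1": {"geneA": 5.0, "geneB": -3.0, "geneC": 0.5, "geneD": -1.0},
    "D2": {"geneA": 4.0, "geneB": -2.5, "geneC": 1.0, "geneD": -0.5},
    "D3": {"geneA": -1.0, "geneB": 2.0, "geneC": 0.2, "geneD": 3.0},
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise_correlations(wald):
    """Correlate Wald statistics for every pair of datasets on shared genes."""
    result = {}
    for d1, d2 in combinations(sorted(wald), 2):
        shared = sorted(set(wald[d1]) & set(wald[d2]))
        result[(d1, d2)] = pearson(
            [wald[d1][g] for g in shared],
            [wald[d2][g] for g in shared],
        )
    return result


correlations = pairwise_correlations(wald)
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

A high positive correlation for a pair suggests the same genes respond in the same direction in both cell types, while a correlation near zero or below suggests the response is cell-type specific.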