Question

Batch effect correction to compare differentially expressed genes across datasets

0

3.6 years ago

mmitra ▴ 60

Hi all,

I have three different datasets (datasets 1, 2, and 3, each from a different paper), and in each dataset I am comparing the same two conditions, A and B. I would like to do differential expression (DE) analysis between A and B within each dataset and then compare the DE genes across datasets. I made a PCA plot of all samples (conditions A and B) from the three datasets by merging the raw counts and then applying DESeq2 normalization and the vst transformation. The plot shows that samples group by dataset: the A and B samples from the same dataset cluster together (with little separation between them), rather than A samples or B samples from different datasets clustering together.

Should I first batch-correct the raw count matrix (containing all the samples from the three datasets), and then use these batch-corrected counts to do the DE analysis (between conditions A and B) separately for datasets 1, 2, and 3?

If I do have to do the batch correction, should I compare different batch-correction methods (DESeq2/limma, SVA, ComBat-seq)? Do these methods give similar results, so that any of them can be used?

Thanks in advance for all your help. I apologize if this question has been addressed before on Biostars. In that case, I would appreciate it if you could send me the link.

batch batch-effect DESeq2 rna-seq • 6.2k views

ADD COMMENT • link updated 19 months ago by Ram 45k • written 3.6 years ago by mmitra ▴ 60

1

Why do you want to combine the studies? Is each study underpowered? Another good option would be a meta-analysis, e.g. with RobustRankAggreg. That will tell you which genes change consistently between these conditions, and you do not have to bother with batch correction.
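To make the rank-aggregation idea concrete, here is a minimal sketch in Python. It is not the RobustRankAggreg algorithm; it uses plain mean-rank aggregation as a simpler stand-in, and the gene names and rankings are made up for illustration. Each input list is assumed to hold one dataset's genes ordered from most to least significant.

```python
def mean_rank_aggregate(ranked_lists):
    """Order genes by their average position across several ranked lists.

    A gene missing from a list is given a rank one past the end of that list.
    This is plain mean-rank aggregation, not the RobustRankAggreg method.
    """
    genes = set().union(*ranked_lists)
    scores = {}
    for gene in genes:
        ranks = [
            lst.index(gene) + 1 if gene in lst else len(lst) + 1
            for lst in ranked_lists
        ]
        scores[gene] = sum(ranks) / len(ranks)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(genes, key=lambda g: (scores[g], g))


# Hypothetical per-dataset rankings (most significant gene first).
dataset1 = ["geneA", "geneB", "geneC", "geneD"]
dataset2 = ["geneB", "geneA", "geneD", "geneC"]
dataset3 = ["geneA", "geneC", "geneB", "geneE"]

consensus = mean_rank_aggregate([dataset1, dataset2, dataset3])
print(consensus)
```

Because only within-dataset ranks are used, no counts are merged across datasets, which is why this kind of approach sidesteps batch correction.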

ADD REPLY • link 3.6 years ago by ATpoint 90k

0

Thanks for your reply. Each dataset comes from a different cell type. I am interested in whether the same genes change in expression between conditions A and B in all three cell types, or whether different genes change depending on the cell type. I will look into RobustRankAggreg. Thanks for suggesting that.

ADD REPLY • link 3.6 years ago by mmitra ▴ 60

3

RNA-seq is sensitive to batch effects. If each cell type was prepped by a different lab, you won't be able to distinguish differences due to cell type from differences due to being prepped by a totally different lab. And there is no magic way to remove the batch effect while preserving cell-type differences.

The typical way to deal with batch effects is not to alter the counts, but to include batch as a term in the design. But again, you won't be able to include both batch and cell type in your design, because they are the same thing.
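The confounding can be checked numerically. The sketch below (Python, standard library only) builds a treatment-coded design matrix for a hypothetical sample sheet in which each paper contributes exactly one cell type, and shows that adding both batch and cell type yields more columns than the matrix has rank. The sample labels and helper names are assumptions made for illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: column depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def dummy_columns(labels):
    """Indicator columns for a factor, dropping the first level as reference."""
    levels = sorted(set(labels))[1:]
    return [[1 if lab == lev else 0 for lev in levels] for lab in labels]


def design_matrix(*factors):
    """Intercept plus treatment-coded columns for each factor, one row per sample."""
    rows = [[1] for _ in factors[0]]
    for labels in factors:
        rows = [row + extra for row, extra in zip(rows, dummy_columns(labels))]
    return rows


# Hypothetical sample sheet: every paper (batch) studied exactly one cell type.
batch = ["paper1"] * 4 + ["paper2"] * 4 + ["paper3"] * 4
cell_type = ["typeX"] * 4 + ["typeY"] * 4 + ["typeZ"] * 4
condition = ["A", "A", "B", "B"] * 3

full = design_matrix(batch, cell_type, condition)
reduced = design_matrix(batch, condition)

print("batch + cell_type + condition:", len(full[0]), "columns, rank", matrix_rank(full))
print("batch + condition:", len(reduced[0]), "columns, rank", matrix_rank(reduced))
```

When the rank is lower than the number of columns, the batch and cell-type coefficients cannot be estimated separately, which is the situation described above.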

ADD REPLY • link 3.6 years ago by swbarnes2 15k

0

Thanks so much! I completely agree. I have been struggling with coming up with a good way to do this analysis. Your insights are very helpful.

ADD REPLY • link 3.6 years ago by mmitra ▴ 60

0

I would do exactly as swbarnes2 says. Because you're using data from distinct cell types, I would also perform unsupervised analysis on each dataset separately, as you have done with PCA. Even if conditions A and B separate in all three cell types, that does not guarantee the separation is driven by the same genes in each dataset. Therefore I would also do a follow-up analysis that compares the per-gene Wald statistics between the individual tests by pairwise correlation (D1 x D2, D1 x D3 & D2 x D3).
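The pairwise comparison could look roughly like the following Python sketch. It assumes each dataset's DE results have already been reduced to a mapping from gene to Wald statistic; the gene names and values here are invented for illustration, and Pearson correlation is one possible choice (a rank-based correlation would be another).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise_stat_correlation(stats_by_dataset):
    """Correlate per-gene statistics for every pair of datasets.

    Only genes present in both datasets of a pair are compared.
    """
    result = {}
    for d1, d2 in combinations(sorted(stats_by_dataset), 2):
        shared = sorted(set(stats_by_dataset[d1]) & set(stats_by_dataset[d2]))
        x = [stats_by_dataset[d1][g] for g in shared]
        y = [stats_by_dataset[d2][g] for g in shared]
        result[(d1, d2)] = pearson(x, y)
    return result


# Made-up Wald statistics (condition B vs A) per gene for each dataset.
wald = {
    "D1": {"gene1": 5.0, "gene2": -3.0, "gene3": 0.5, "gene4": 2.0},
    "D2": {"gene1": 4.0, "gene2": -2.5, "gene3": 1.0, "gene4": 1.5, "gene5": 0.2},
    "D3": {"gene1": -1.0, "gene2": 2.0, "gene3": 0.3, "gene4": -0.5},
}

correlations = pairwise_stat_correlation(wald)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

A high positive correlation for a pair suggests the same genes move in the same direction in both datasets; a correlation near zero or negative suggests the A-versus-B response differs between them.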

ADD REPLY • link 3.6 years ago by yhoogstrate ▴ 150

score 2 · Answer 1 · 2022-04-20

2

3.6 years ago

andrew.j.skelton73 6.7k

As others have said, the answer here is not batch correction and combination, but leveraging the power in each dataset individually. I'm a big fan of the Mitch framework for these occasions, and it fits your problem pretty much perfectly!

ADD COMMENT • link 3.6 years ago by andrew.j.skelton73 6.7k

0

Thanks a lot! I will try this framework. It is great to know about these tools.

ADD REPLY • link 3.6 years ago by mmitra ▴ 60

score 0 · Answer 2 · 2022-04-20

0

3.6 years ago

Pappu ★ 2.1k

Use ~condition+batch as the design formula when you run DESeq2.

ADD COMMENT • link 3.6 years ago by Pappu ★ 2.1k

0

Since it behaves like a regression model, I believe you first need to add batch effects and then your condition: ~batch + condition

ADD REPLY • link 3.6 years ago by yhoogstrate ▴ 150

1

The results will be the same. Since most functions test the last variable in the design by default, the manual suggests using ~batch + condition, as explained here: https://support.bioconductor.org/p/121408/ In any case, the sva package is better suited for batch effect correction.
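The claim that term order does not change the fit can be illustrated with an ordinary least-squares toy model in Python. This is a simplified stand-in, not DESeq2's negative binomial GLM: the expression values are invented, and the helper functions are hypothetical. Exact fractions are used so the two fits can be compared for strict equality.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


def ols(columns, y):
    """Least-squares coefficients keyed by column name, via the normal equations."""
    names = list(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in names]
           for i in names]
    xty = [sum(p * v for p, v in zip(columns[i], y)) for i in names]
    return dict(zip(names, solve(xtx, xty)))


# Toy data: 8 samples, 2 batches x 2 conditions, invented log-expression values.
intercept = [1] * 8
batch = [0, 0, 0, 0, 1, 1, 1, 1]
condition = [0, 0, 1, 1, 0, 0, 1, 1]
y = [5, 6, 8, 9, 9, 10, 12, 14]

fit_condition_last = ols({"intercept": intercept, "batch": batch, "condition": condition}, y)
fit_batch_last = ols({"intercept": intercept, "condition": condition, "batch": batch}, y)

# Same coefficients either way; only the position of the terms differs.
print(fit_condition_last)
print(fit_batch_last)
print("last term:", list(fit_condition_last)[-1], "vs", list(fit_batch_last)[-1])
```

The estimates agree in both orderings; what changes is which term comes last, and that matters only for tools that report the last term when no contrast is specified.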

ADD REPLY • link 3.6 years ago by Pappu ★ 2.1k