I have an RNA-Seq count table containing approximately 400 patients with similar diseases, sequenced in four batches. (Among the batches, we have noticed a strong effect in batch 4 and a "moderate" effect in batch 1.) Our working group focuses primarily on a single disease, and within this disease I am particularly interested in subgroups that have received different treatments. For example, one of my analysis groups consists of only five individuals sampled at two different time points. (I am well aware that with such a small sample size the analysis is underpowered.) My goal is to identify differentially expressed genes.
The challenge I'm facing is whether to include batch in the DESeq2 design. Doing so may not work well, because the batch effect could be poorly estimated in such a small subgroup (e.g., two samples from batch 1 and one from each of the other batches). In such cases, the inter-individual effect might be confounded with the batch effect.
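To make the confounding concern concrete, here is a minimal sketch in plain Python (not DESeq2 itself) that builds the design matrices for the five-individual, two-time-point example and checks their rank. The patient labels, the batch assignment (read as two individuals in batch 1 and one in each other batch), and above all the assumption that both time points of an individual were sequenced in the same batch are hypothetical; the helper names `matrix_rank`, `dummies`, and `build` are illustrative only.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def dummies(values):
    """Treatment-coded indicator columns; the first level is the reference."""
    levels = sorted(set(values))[1:]
    return [[1 if v == level else 0 for level in levels] for v in values]


# Hypothetical subgroup: five individuals, two time points each.
# ASSUMPTION: both samples of an individual were sequenced in the same batch.
batch_of = {"P1": 1, "P2": 1, "P3": 2, "P4": 3, "P5": 4}
samples = [(ind, t) for ind in sorted(batch_of) for t in (0, 1)]

ind_block = dummies([ind for ind, _ in samples])
batch_block = dummies([batch_of[ind] for ind, _ in samples])
time_block = [[t] for _, t in samples]


def build(*blocks):
    """Intercept column followed by the given blocks, one row per sample."""
    return [[1] + [x for block in blocks for x in block[i]]
            for i in range(len(samples))]


paired = build(ind_block, time_block)                     # ~ individual + time
paired_batch = build(ind_block, batch_block, time_block)  # ~ individual + batch + time
batch_only = build(batch_block, time_block)               # ~ batch + time

for name, design in [("individual + time", paired),
                     ("individual + batch + time", paired_batch),
                     ("batch + time", batch_only)]:
    print(name, "columns:", len(design[0]), "rank:", matrix_rank(design))
```

Under these assumptions the `individual + batch + time` matrix has more columns than its rank, so batch cannot be estimated separately from the individual terms (DESeq2 rejects a design like this as not full rank). The paired `individual + time` design is full rank, and because batch is constant within each individual here, the individual terms already absorb it. If the two time points of an individual were instead split across batches, this would no longer hold.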
To overcome this challenge, I'm considering running ComBat-Seq on the entire initial population before conducting the DESeq2 analysis on the subgroup of interest. By doing so, I hope to mitigate the impact of batch effects and improve the accuracy of the differential gene expression analysis within the subgroup. What do you think?
I apologize if my post is not clear enough; this is my first time posting here. Your insights and suggestions would be highly appreciated. Thank you.
I've never used ComBat-Seq, but I think that two approaches with more widely used tools could generally go as follows:
1. Pass batch into the DESeq2 design matrix, find the logFC (for the comparison of interest) and unadjusted p-value for your genes of interest, and simply report those.
2. Use ComBat to get a normalized matrix and then simply plot the values for your genes/samples of interest by group. I don't think it's good practice to perform any statistics here, but you say this work is exploratory, so this could give you an intuitive feel for what is happening.

Hi there and thank you for your reply!
In my situation, I'm looking for an alternative approach because of the limited number of samples I have. With so few samples it is practically impossible to estimate the effect of four batches accurately in some of my tiny datasets, which can consist, for example, of five individuals compared across two conditions with DESeq2. This design risks confounding the batch effect with the between-individual variability (I am fully aware of the statistical power limitations inherent in this hypothetical analysis).
Nevertheless, there is evidence of batch effects throughout the 400-patient cohort (as observed with UMAP, MDS, SOM and similar methods). Therefore, we would like to address this effect during the preprocessing phase, if possible, so that the downstream analysis has more statistical power. Note that our dataset consists of RNA-Seq data rather than microarray data, which is why we chose ComBat-Seq.
I understand that conventional wisdom urges including any covariate directly in the model used for the comparison rather than removing its effect in a prior model. If no other recourse is available, I would perform a standard DESeq2 analysis without batch in the design matrix. However, since my background is in statistics rather than bioinformatics, I'm curious whether you've encountered a similar case, for example one where batch correction was applied to an entire cohort before running DESeq2 on a small subset without batch in the design.
I hope my problem statement is clear enough, and thank you for your time!