I have ~1000 RNA-seq samples from 100 donors and am using edgeR to analyse them. Tissue from each of the 100 donors was either treated with one of 9 different chemicals (A, B, C, ...) or left untreated (control).
Unfortunately, I had to remove a few of the control samples (only 3) because of a low number of reads (<5e6). This means that I have some "unpaired" samples, i.e. they received a treatment but I have no control for them. Is it correct that I can still formulate a design like the one below if I am only interested in the average difference between, for example, treatment5 and the control?
#my design (for simplicity all batch effects are ignored)
design = ~ patient.id + treatment
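To make the question concrete, here is a minimal sketch of such a blocked design on toy data. Everything in it is an assumption for illustration: the 6 donors, the 3 conditions, the simulated counts and the choice of the quasi-likelihood pipeline. It only shows that the design matrix stays full rank when one donor has lost its control, so the model can still be fitted.

```r
library(edgeR)

set.seed(1)

# Hypothetical toy layout: 6 donors x 3 conditions (control, A, B)
meta <- expand.grid(treatment = c("control", "A", "B"),
                    patient.id = paste0("P", 1:6))
meta$treatment  <- factor(meta$treatment, levels = c("control", "A", "B"))
meta$patient.id <- factor(meta$patient.id)

# Drop the control sample of donor P1 to mimic an "unpaired" donor
meta <- meta[!(meta$patient.id == "P1" & meta$treatment == "control"), ]

# Additive model: donor as blocking factor, treatment as factor of interest
design <- model.matrix(~ patient.id + treatment, data = meta)

# The design is still of full column rank, so all coefficients are estimable
stopifnot(qr(design)$rank == ncol(design))

# Simulated counts (no real signal), only to show that the fit runs
counts <- matrix(rnbinom(500 * nrow(meta), mu = 100, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), NULL))

y   <- DGEList(counts = counts)
y   <- calcNormFactors(y)
y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Average within-donor difference between treatment A and control
res <- glmQLFTest(fit, coef = "treatmentA")
topTags(res)
```

In this sketch donor P1 has no control, but its A and B samples still enter the dispersion estimates and the comparison between treatments.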
A related question: because I have so many samples, I am using sva to adjust for batch effects. Is it correct that, after using this approach, I can remove patient.id from my design because the identified surrogate variables account for patient-specific effects?
#my design after SVA
design = ~ SV1 + SV2 + SV3 + SV4 + SV5 ... + SV12 + treatment
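For context, a minimal sketch of how surrogate variables could be estimated and added to a design of this form. It assumes sva's svaseq() on raw counts; the simulated data, the hidden batch, and the fixed n.sv = 2 are all hypothetical choices for illustration, not my actual settings.

```r
library(sva)

set.seed(1)

# Hypothetical toy data: 20 samples, half control and half treated
meta <- data.frame(
  treatment = factor(rep(c("control", "A"), times = 10),
                     levels = c("control", "A"))
)

# Hypothetical unobserved batch that shifts the first 300 genes
hidden <- rep(c(0, 1), each = 10)
counts <- matrix(rnbinom(1000 * 20, mu = 200, size = 5), nrow = 1000)
counts[1:300, hidden == 1] <- counts[1:300, hidden == 1] * 3L

# Full model contains the variable of interest, null model does not
mod  <- model.matrix(~ treatment, data = meta)
mod0 <- model.matrix(~ 1, data = meta)

# Estimate surrogate variables from the counts (n.sv fixed here for simplicity)
svobj <- svaseq(counts, mod, mod0, n.sv = 2)
sv <- svobj$sv
colnames(sv) <- paste0("SV", seq_len(ncol(sv)))

# Design with the surrogate variables as additional covariates
design <- model.matrix(~ SV1 + SV2 + treatment, data = cbind(meta, sv))
colnames(design)
```

Whether the surrogate variables really capture the donor effect can then be checked, for example by testing each SV for association with patient.id, before deciding to drop patient.id.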
Any input is very much appreciated!
Cheers
Thank you very much for your answer, that was very helpful!
I did some more investigation of my data and totally agree with you. In my case, SVA was unable to "pick up" the donor effect (i.e. there is still a significant association between donor and PC1-5). To account for this, my idea was to include the donor in the model that I give to sva, like so:
Do you by chance know if this is the correct approach, or should mod0 look like this: model.matrix(~1, colData(dds))?
Furthermore, downstream I am using limma::removeBatchEffect for visualizations. Is the following usage correct in this case, given my SVA code?
Any help is much appreciated!
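To illustrate the two questions above, here is a minimal sketch on simulated data. It is an assumption-laden example, not my actual code: the donor and treatment layout, the counts, n.sv = 2 and the use of edgeR's cpm() for the log-expression values are all hypothetical. It follows the sva convention that the null model holds the known adjustment variables (here the donor) and only the variable of interest is left out.

```r
library(sva)
library(limma)
library(edgeR)

set.seed(1)

# Hypothetical toy layout: 8 donors x 3 conditions
meta <- expand.grid(treatment = c("control", "A", "B"),
                    donor = paste0("D", 1:8))
meta$treatment <- factor(meta$treatment, levels = c("control", "A", "B"))
meta$donor     <- factor(meta$donor)

counts <- matrix(rnbinom(1000 * nrow(meta), mu = 200, size = 5), nrow = 1000)

# Donor is a known adjustment variable, so it appears in both models;
# only treatment (the variable of interest) is missing from mod0
mod  <- model.matrix(~ donor + treatment, data = meta)
mod0 <- model.matrix(~ donor, data = meta)
svobj <- svaseq(counts, mod, mod0, n.sv = 2)

# For visualization only: remove donor and surrogate-variable effects from
# log-CPM values while protecting the treatment effect via `design`
logcpm  <- cpm(counts, log = TRUE, prior.count = 2)
cleaned <- removeBatchEffect(logcpm,
                             batch      = meta$donor,
                             covariates = svobj$sv,
                             design     = model.matrix(~ treatment, data = meta))
dim(cleaned)
```

The cleaned matrix would only be used for plots such as PCA or heatmaps. For the differential expression test itself, donor and the surrogate variables would stay in the design rather than being removed from the data beforehand.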