Hello there, hope all of you are fine. I do hope you are enjoying this weekend.
In my limited experience, I have always had to deal with samples coming from different batches (i.e. coming from different hospitals, or from experiments done on different days). One postdoc in my lab showed me how to deal with batch effects using the SVA package. I guess it is a brilliant idea to work with that but, if I am not mistaken, it is not the best tool for batch effects when, as in my case, the source of confounding is known (in other words, I know that the samples were generated on different days or come from different hospitals).
My first question is: at first glance, by looking at the PCA plot from an experiment, how can you determine (i.e. be absolutely sure) whether your samples are biased or not? How can you be sure that your samples need a correction if they cluster as expected? And if they don't cluster as you would expect, how can you know whether this is due to real biology or not? What if you are over-correcting the samples and removing relevant biological signal, which in turn makes it impossible to find the genes that are truly changing? How can you know that?
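One way to make the PCA question less subjective is to ask how much of each leading PC is explained by the known batch label versus the condition label. The sketch below does this in base R on simulated data; the sample layout, the effect sizes and the helper `r2()` are made up for illustration and are not taken from any real experiment. It cannot prove that a difference is real biology, it only shows which known factor each PC tracks.

```r
set.seed(1)

# Hypothetical design: 12 samples, 2 conditions, 2 batches, balanced.
samples <- data.frame(
  condition = factor(rep(c("healthy", "cancer"), each = 6)),
  batch     = factor(rep(c("day1", "day2"), times = 6))
)

# Simulated log-expression: 500 genes x 12 samples of pure noise ...
n_genes <- 500
expr <- matrix(rnorm(n_genes * nrow(samples)), nrow = n_genes)

# ... plus a batch shift on the first 200 genes
is_day2 <- samples$batch == "day2"
expr[1:200, is_day2] <- expr[1:200, is_day2] + 2

# ... and a biological shift on the last 50 genes.
is_cancer <- samples$condition == "cancer"
expr[451:500, is_cancer] <- expr[451:500, is_cancer] + 2

# PCA on samples (prcomp expects samples in rows).
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# R^2 of a one-way model: share of a PC's variance explained by a factor.
r2 <- function(pc, f) summary(lm(pc ~ f))$r.squared

assoc <- data.frame(
  PC        = paste0("PC", 1:3),
  batch     = sapply(1:3, function(i) r2(pca$x[, i], samples$batch)),
  condition = sapply(1:3, function(i) r2(pca$x[, i], samples$condition))
)
print(assoc)
```

If a leading PC is explained mostly by batch, that is a hint that batch belongs in the model. If batch and condition explain the same PC, the two are confounded in the design and no correction can separate them afterwards.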
My second question is more practical and, basically, linked to my inexperience. Given that I have always used the SVA package by slavishly following the postdoc's recommendation (and she had her own doubts), I would like to deal with this problem properly, hopefully by designing a model matrix. I have no idea where to start, and if you could help me with some advice, that would be grand! I really need your help guys, because I am on my own now and don't know who to ask. I have found this link but it seems to be a bit advanced for me.
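As a starting point for the model matrix question, here is a minimal base R sketch. The sample table, the hospital labels and the balanced layout are assumptions for illustration. It builds a design with the known batch as a covariate and checks that batch and condition are not perfectly confounded, since a confounded design cannot be fitted.

```r
# Hypothetical sample table: two conditions, each collected at two hospitals.
sample_table <- data.frame(
  sample    = paste0("s", 1:12),
  condition = factor(rep(c("healthy", "cancer"), each = 6),
                     levels = c("healthy", "cancer")),
  batch     = factor(rep(c("hospitalA", "hospitalB"), times = 6))
)

# Cross-tabulate first: every condition should appear in more than one batch.
print(table(sample_table$condition, sample_table$batch))

# Known batch goes in as a covariate; the variable of interest goes last.
design <- model.matrix(~ batch + condition, data = sample_table)
print(colnames(design))

# A usable design has as many independent columns as coefficients.
is_full_rank <- qr(design)$rank == ncol(design)
print(is_full_rank)

# For comparison, a fully confounded layout: all healthy samples from
# hospitalA, all cancer samples from hospitalB.
confounded <- transform(
  sample_table,
  batch = factor(ifelse(condition == "healthy", "hospitalA", "hospitalB"))
)
design_bad <- model.matrix(~ batch + condition, data = confounded)

# The batch and condition columns are now identical, so the rank drops.
is_confounded <- qr(design_bad)$rank < ncol(design_bad)
print(is_confounded)
```

The same formula, `~ batch + condition`, is what DESeq2 and limma accept as a design, so the rank check can be done before any counts are loaded.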
Thanks Carlo for your reply, and thanks to everyone else who will provide further help with this big question mark in my head!
So let's say we really want to know which genes are differentially expressed among neutrophils_cancer_type1, neutrophils_cancer_type2 and neutrophils_healthy.
I have not included 'factor2' as in Carlo's example because I don't think there is another variable here.
So, question number 1: Can you confirm this?
OK, let's go ahead. I am using Kallisto to perform pseudoalignment and then DESeq2 via tximport (i.e. to generate a txi.kallisto.tsv count matrix that will be used as input for DESeq2). So, once you've generated your sample table, if your samples come from the same batch you are ready to go with the following:
But, in this case, should I consider something like that?:
Then, create with DESeq2 (via limma package)
So, if I got it right, by doing this you would be able to take the batch effect into account when doing differential expression analysis? So, by doing this from now on
you can consider the latter dds free from batch effect, because you have accounted for it in the aforementioned code?
I think I am missing something, here.
The vsd stuff is just for the PCA; you need to change your design to ~batch+condition.
Thanks Devon. So, after having read here and here, it seems clear that I have to do:
and, by doing this,
one should have taken into account the fact that there's a batch effect in the experiment, and thus that the genes found to be differentially expressed truly depend on the biological effect?
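To make the two roles concrete, here is a minimal sketch of the split discussed above: batch in the design for the DE test, and limma's batch removal on the transformed values for the PCA plot only. It assumes DESeq2 and limma are installed. The counts come from DESeq2's `makeExampleDESeqDataSet()` rather than from the data in this thread, and the `batch` labels are made up.

```r
library(DESeq2)
library(limma)

set.seed(1)

# Simulated counts standing in for real data: 1000 genes, 12 samples,
# with a `condition` factor (levels A and B) already in colData.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Hypothetical known batch label, balanced across conditions.
dds$batch <- factor(rep(c("day1", "day2"), times = 6))

# Differential expression: keep the raw counts and put batch in the design.
design(dds) <- ~ batch + condition
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "B", "A"))
summary(res)

# Visualization only: transform, then remove the batch term from the
# transformed values while protecting the condition effect.
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
mm <- model.matrix(~ condition, colData(vsd))
assay(vsd) <- limma::removeBatchEffect(assay(vsd), batch = vsd$batch, design = mm)
plotPCA(vsd, intgroup = c("condition", "batch"))
```

The batch-removed matrix only feeds the plot. The test itself runs on the original counts, with batch handled by the model.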
Correct, this controls for the batch effect. It's not the same as what limma is doing, but you'll have to stick to limma if you really want to use limma::removeBatchEffect() for the DE step.
Thanks for clarifying that. In terms of biological results (few false positives and a good chance of truly addressing the biological question), what's the difference between controlling for the batch effect and removing it from the analysis? What's the advantage of controlling for a bias in your experiment rather than removing the source of the bias? Thanks
I'll quote from limma's manual:
So the limma authors suggest not using it for differential expression. I expect that in practice doing DE on batch-effect-corrected data leads to higher false-positive and false-negative rates. One would have to search the Bioconductor site for discussion of this though.
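One part of the difference can be shown with an ordinary linear model on a single simulated gene. This is a toy sketch with made-up effect sizes and an assumed unbalanced layout, not what DESeq2 or limma do internally. Subtracting the estimated batch shift and then testing without batch gives the same condition estimate, but the second model no longer knows that a coefficient was spent on batch, so its standard error comes out too small.

```r
set.seed(42)

# One hypothetical gene in 12 samples; batch and condition are
# deliberately unbalanced (but not fully confounded).
condition <- factor(c(rep("healthy", 6), rep("cancer", 6)),
                    levels = c("healthy", "cancer"))
batch <- factor(c(rep("day1", 4), rep("day2", 2),
                  rep("day1", 2), rep("day2", 4)))
y <- 5 + 1.5 * (batch == "day2") + 0.8 * (condition == "cancer") +
  rnorm(12, sd = 0.7)

# (a) Control for batch: it stays in the model used for testing.
fit_control <- lm(y ~ batch + condition)

# (b) Remove batch first: subtract the estimated batch shift,
#     then test as if batch had never existed.
batch_shift <- coef(fit_control)["batchday2"]
y_removed <- y - batch_shift * (batch == "day2")
fit_removed <- lm(y_removed ~ condition)

# Standard error of the condition coefficient in a fitted model.
se <- function(fit) summary(fit)$coefficients["conditioncancer", "Std. Error"]

comparison <- data.frame(
  approach    = c("control (batch in model)", "remove, then test"),
  estimate    = unname(c(coef(fit_control)["conditioncancer"],
                         coef(fit_removed)["conditioncancer"])),
  std_error   = c(se(fit_control), se(fit_removed)),
  residual_df = c(df.residual(fit_control), df.residual(fit_removed))
)
print(comparison)
```

The "remove, then test" row reports one extra residual degree of freedom and a smaller standard error for the same estimate, which is the kind of overconfidence that inflates false positives.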