Suppose I perform an RNA-seq experiment and make three different comparisons in it using, say, DESeq2 (i.e. WT timepoint1 vs MUT timepoint1; WT timepoint2 vs WT timepoint1; MUT timepoint2 vs MUT timepoint1).
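Concretely, the three comparisons would look something like this (only a sketch, using a combined group factor as in the DESeq2 vignette; dds, genotype and timepoint are placeholder names):

library(DESeq2)
dds$group <- factor(paste0(dds$genotype, dds$timepoint))  # e.g. WTt1, WTt2, MUTt1, MUTt2
design(dds) <- ~ group
dds <- DESeq(dds)
res1 <- results(dds, contrast = c('group', 'MUTt1', 'WTt1'))   # MUT vs WT at timepoint1
res2 <- results(dds, contrast = c('group', 'WTt2', 'WTt1'))    # WT timepoint2 vs timepoint1
res3 <- results(dds, contrast = c('group', 'MUTt2', 'MUTt1'))  # MUT timepoint2 vs timepoint1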
In each comparison, a statistical test is performed on each of many genes; so within each comparison, a correction for multiple hypotheses has to be applied.
But I perform three comparisons, not one. The chance of falsely rejecting H0 for some gene increases with each additional comparison. So should I also correct for the number of comparisons, and if not, why not?
To further play the devil's advocate: suppose that not only do I perform this experiment, but another laboratory performs exactly the same experiment. That laboratory also runs a statistical test, attempting to replicate my result. Every time we perform a statistical test, we increase the chance of a false positive. So why would the replication of the experiment by another lab not count as a reason to correct for multiple hypotheses?
In other words, what are the necessary and sufficient conditions for applying a multiple hypothesis correction?
Related to the second question (should I correct for other laboratories also testing the data?): no, you should not correct for that, and it would be impossible anyway. But you are right that it increases the chance of a false positive. This matter was explored in Ioannidis' famous paper Why most published research findings are false.
Regarding the first issue: is there a standard way to adjust for the multiple comparisons being made? Could you refer me to any resource? I have not seen this issue addressed in introductory-level DESeq2 tutorials, and I don't think it appears in the DESeq2 vignette either.
I would take the universe of unadjusted p-values from your runs, put them in p, and use
p.adjust(p, method = 'BH', n = length(p))
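For instance, a minimal sketch, assuming res1, res2 and res3 are the DESeq2 results objects from the three contrasts (hypothetical names):

p <- c(res1$pvalue, res2$pvalue, res3$pvalue)  # pool the raw p-values from all three contrasts
p <- p[!is.na(p)]                              # DESeq2 sets some p-values to NA (e.g. count outliers)
padj_global <- p.adjust(p, method = 'BH')      # one global BH adjustment; n defaults to length(p)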
There is an interesting post by Gordon Smyth (author of the limma package) on this question. I admit I don't quite follow his logic. Regarding your suggestion of pooling the p-values from all runs before adjusting, he specifically advises against it.
The logic of this is based on how the BH p-value correction depends on the distribution of the raw p-values. If the p-values are uniformly distributed (few DE genes, or at least not many more DE genes than expected if there is no factorial effect), then the BH correction will be very stringent, and almost no adjusted p-value will be "significant". On the other hand, if the p-value distribution is skewed towards low values (many DE genes), then the correction is less stringent and most DE genes will remain significant after correction.
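A quick way to see this is a toy simulation (not tied to any real dataset):

set.seed(1)
p_null <- runif(10000)                          # all genes null: uniform p-values
p_mixed <- c(rbeta(3000, 0.1, 1), runif(7000))  # 30% DE genes: p-values skewed towards 0
sum(p.adjust(p_null, method = 'BH') < 0.05)     # essentially 0 genes pass
sum(p.adjust(p_mixed, method = 'BH') < 0.05)    # roughly 2000 genes pass

With uniform p-values BH declares almost nothing significant, while at the same cutoff the skewed distribution keeps most of the small p-values significant.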
In simple terms, Gordon's post explains that if you mix p-values from different contrasts before applying the BH correction, then you "blur" the [p-value distribution] -> [correction stringency] relationship and might introduce bias.
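Here is a toy illustration of that blurring (simulated; contrast A has strong signal, contrast B is pure null):

set.seed(2)
pA <- c(rbeta(2000, 0.1, 1), runif(8000))  # contrast A: many DE genes
pB <- runif(10000)                         # contrast B: no true effects
sum(p.adjust(pB, method = 'BH') < 0.05)    # per-contrast BH on B: essentially 0 discoveries
pooled <- p.adjust(c(pA, pB), method = 'BH')
sum(pooled[10001:20000] < 0.05)            # pooled BH: a few dozen B "discoveries", all false

In this simulation the pooled list as a whole still has roughly 5% false discoveries, but every discovery attributed to contrast B is false; the signal-rich contrast carries the null contrast past the threshold.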
Gordon Smyth seems to say that if one controls the FDR within each comparison, then one doesn't need to worry about multiplicity across many comparisons (contrasts) when looking at the overall FDR, since FDR is a scalable quantity: for example, if comparison A yields 1000 genes at 5% FDR (~50 expected false positives) and comparison B yields 200 genes at 5% FDR (~10 expected), the combined list of 1200 genes still contains ~5% expected false discoveries. (Though I am not 100% sure it applies to Benjamini-Hochberg from his answer.)
That makes no sense to me. If I do 100 experiments and look at one gene in each, then I have to correct for multiple comparisons post hoc but not a priori.