This is what it says on PANTHER with regard to the test:
The expression data analysis statistics now include a Bonferroni correction for multiple testing. The Bonferroni correction is important because we are performing many statistical tests (one for each pathway, or each ontology term) at the same time. This correction multiplies the single-test P-value by the number of independent tests to obtain an expected error rate.
For pathways, we now correct the reported P-values by multiplying by the number of associated pathways with two or more genes. Some proteins participate in multiple pathways, so the tests are not completely independent of each other and the Bonferroni correction is conservative. For ontology terms, the simple Bonferroni correction becomes extremely conservative because parent (more general) and child (more specific) terms are not independent at all: any gene or protein associated with a child term is also associated with the parent (and grandparent, etc.) terms as well.
To estimate the number of independent tests for an ontology, we count the number of classes with at least two genes in the reference list that are annotated directly to that class (i.e. not indirectly via an annotation to a more specific subclass).
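For concreteness, the correction described above amounts to something like the following sketch. The pathway names, P-values, and test count are made up for illustration; this is not PANTHER's actual code.

```python
# Hypothetical single-test P-values for a handful of pathways.
raw_pvalues = {"pathway_A": 0.0004, "pathway_B": 0.012, "pathway_C": 0.2}

# The multiplier described above: the number of pathways (or ontology terms)
# with at least two genes, counted by direct annotation for ontologies.
# The value 150 is an assumed number, purely for illustration.
n_independent_tests = 150

# Bonferroni: multiply each P-value by the number of tests, capped at 1.
bonferroni = {name: min(p * n_independent_tests, 1.0)
              for name, p in raw_pvalues.items()}

for name, p_adj in bonferroni.items():
    print(f"{name}: corrected P = {p_adj:.4f}")
```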
So I have been submitting gene lists with LFCs to PANTHER. When I apply the correction, I get no results. However, when I run PANTHER without it, I get lots of significant results. How important is the correction to the results? Is it too stringent?
You're right that the corrections assume independence of the tests. However, the Bonferroni correction in the case of non-independence is over-conservative (see for example here), so it's OK to use if you're fine with that. The same goes for the Benjamini-Hochberg FDR correction; see this paper. So using these correction approaches when the tests are not independent is perfectly justified.
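As a minimal illustration, here is a sketch comparing Bonferroni, Benjamini-Hochberg (BH), and the Benjamini-Yekutieli (BY) variant, which is valid under arbitrary dependence. The P-values are made up:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up single-test P-values.
pvals = np.array([0.0001, 0.0005, 0.003, 0.01, 0.04, 0.2, 0.5])

# Compare three correction methods at alpha = 0.05.
for method in ("bonferroni", "fdr_bh", "fdr_by"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} significant, adjusted = {np.round(p_adj, 4)}")
```

BY pays for its validity under any dependence structure with extra conservativeness, which is the same trade-off discussed above.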
What do you mean by this? Bonferroni does strictly control the type I error rate.
The problem with the Bonferroni correction is that for large numbers of tests it becomes far too conservative, to the point where one doesn't find anything significant, which is why in such cases people prefer to use the FDR approach.
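A quick way to see this: the per-test Bonferroni threshold is alpha/m, so it shrinks linearly with the number of tests m. A minimal sketch, purely illustrative:

```python
alpha = 0.05
for m in (10, 100, 1_000, 10_000):
    # Bonferroni: every test must individually beat alpha / m.
    # BH, by contrast, uses the step-up thresholds alpha * k / m,
    # which grow with the rank k of the sorted P-values.
    print(f"m = {m:>6}: Bonferroni requires P < {alpha / m:.1e} per test")
```

With thousands of GO terms, that per-test bar can easily be out of reach for any real effect.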
There's no single correct way of dealing with multiple testing. It really depends on the situation and on how costly false positives (type I errors) are relative to false negatives (type II errors).
I think we'll just have to agree to disagree here. I understand what you're saying, but interpreting results obtained with these methods when the test assumptions don't hold is questionable, whether conservative or not, in whatever sense and to whatever degree. The a priori probability that two dependent hypotheses are both false is not 25% (as it would be if they were independent at 50% each); it's unknown until you have priors to assess the relationship. So if you can't make an a priori assumption there, why does it make sense to make an assumption about the family-wise error rate or the false discovery rate?
I agree that in general one should make sure assumptions hold when applying statistical tests, but I think you're missing the point of what conservative means. It means that the p-value you obtain is guaranteed to be greater than the real one. When the assumptions are met, the FDR procedure gives you the expected proportion of false positives, but when the tests are not independent, it gives you an upper bound. It could still be useful to know that you have fewer than 10% false positives.
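A small simulation sketch of what that guarantee looks like under dependence: with strongly positively correlated null statistics, the realized family-wise error rate after a Bonferroni cut stays below the nominal alpha. All parameters here are arbitrary choices for illustration:

```python
import numpy as np
from scipy.special import ndtr  # standard normal CDF

rng = np.random.default_rng(0)
n_tests, n_sims, alpha, rho = 50, 2_000, 0.05, 0.8

fwer_hits = 0
for _ in range(n_sims):
    # Correlated null test statistics via a shared common factor.
    shared = rng.standard_normal()
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal(n_tests)
    p = 2 * (1 - ndtr(np.abs(z)))          # two-sided P-values, all nulls true
    fwer_hits += np.any(p < alpha / n_tests)  # any false rejection after Bonferroni?

print(f"empirical FWER: {fwer_hits / n_sims:.3f} (nominal bound: {alpha})")
```

The empirical rate comes out well below 0.05: the control still holds, it's just looser than it needs to be, which is exactly what "conservative" means here.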
Sure, but conservative to what degree? How conservative is too conservative, even? That makes something like the OP's issue impossible to address using these types of statistical tests. If there is a potentially better statistical method available to interpret the data, then one should use that instead, right?
You know to what degree it's conservative: you get an upper bound on the p-value. There is no method better than another for multiple testing adjustment; whichever you choose gives you a trade-off between false positives and false negatives. People are usually concerned with removing false positives, but the cost of being sure you have none can be that you've also removed many true positives (sometimes all of them).

In the case of GO term enrichment, the question is: what is the cost of considering a list of genes to be enriched in a particular term? If we want to characterize the list, or indirectly the process generating the list, we don't want too many mistakes, but on the other hand we don't like having nothing to report, so false negatives are an issue too, even though few people acknowledge it.

What's the solution then? In my opinion, if one doesn't like the probabilistic treatment of the data, one should design experiments that directly and unambiguously address the question of interest. This usually means focused experiments. When this is not possible or not the goal, the other option is to try and minimize the number of tests. First, consider that statistical significance doesn't mean biological relevance, and in some instances the null hypothesis of the test is not even biologically credible, so reasoning with domain knowledge should be preferable to blind statistical tests. Second, one can be smarter in the choice of tests: for example, instead of testing the whole GO like most tools do, why not test only pertinent terms (e.g. as in this paper, or using domain knowledge or even common sense; why test for organism development terms when one is doing experiments on cells in culture)? A sketch of this idea follows below.

Statistical testing in biology is an ill-posed problem (or maybe simply misused). The question most people want to address is whether their hypothesis is true. This is not the question that statistical tests answer. The tests measure how compatible your data are with the null hypothesis, which usually is some sort of model for random data generation. Therefore rejecting the null hypothesis doesn't mean that the hypothesis the experiment was designed to assess is true. So p-values do not offer support for a particular model or hypothesis; they only measure how improbable the observed data would be if the null hypothesis were true, which is usually not the question of interest.
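Here is a minimal sketch of the "test only pertinent terms" idea. The term names and P-values are hypothetical; the point is just that shrinking the tested set shrinks the correction factor:

```python
# Hypothetical enrichment P-values for four GO terms.
raw = {
    "immune response":         0.0008,
    "cytokine signaling":      0.015,
    "organism development":    0.03,  # arguably not pertinent for cultured cells
    "embryonic morphogenesis": 0.05,  # likewise
}

# Drop terms that domain knowledge says cannot apply to the experiment.
not_pertinent = {"organism development", "embryonic morphogenesis"}
pertinent = {t: p for t, p in raw.items() if t not in not_pertinent}

# Bonferroni with m = number of terms actually tested.
for label, tests in (("whole set", raw), ("pertinent only", pertinent)):
    m = len(tests)
    hits = [t for t, p in tests.items() if min(p * m, 1.0) < 0.05]
    print(f"{label}: m = {m}, significant after Bonferroni: {hits}")
```

In this toy example, "cytokine signaling" survives the correction only in the pertinent-only analysis: the same data, but fewer tests to pay for.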