Hi all!
Just came across this paper (Francis et al.), one from the new "too good to be true" trend, and now I'm totally confused.
The paper provides a statistical estimate of excess success in a set of studies published in Science, using the reported effect and sample sizes as well as the ratio of null hypothesis acceptances to rejections. It's quite intuitive that when authors support their finding with 20 t-tests with extremely low effect sizes and P-values very close to 0.05, the finding seems very questionable.
To investigate this problem, the authors have chosen the P-TES (Test for Excess Significance) metric, calculated as the product of the success probabilities of the individual statistical tests given their effect sizes, e.g.
"The estimated probability that five experiments like these would all produce successful outcomes is the product of the five joint probabilities, P-TES = 0.018."
As each probability of success is <= 1, for a paper with a long list of experiments it is highly likely that we end up with P-TES < 0.05. In other words, the P-TES score is heavily dependent on the complexity of the phenomenon under study.
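To make that concrete, here is a minimal sketch of the P-TES calculation as I understand it (the per-experiment success probabilities below are invented, not taken from the paper): the score is just the product of the estimated per-experiment powers, so it can only shrink as experiments accumulate.

```python
import numpy as np

# hypothetical per-experiment success probabilities (estimated power);
# these numbers are made up purely for illustration
powers = np.array([0.62, 0.55, 0.71, 0.48, 0.58])

# P-TES: probability that every one of these experiments succeeds,
# assuming they are independent
p_tes = np.prod(powers)
print(f"P-TES = {p_tes:.3f}")     # ~0.07 for these invented values

# the running product can only shrink as more experiments are added
print(np.cumprod(powers))
```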
The authors suggest extending their methodology to check papers in the field of biology. In bioinformatics, we usually provide lots of complementary analyses of the phenomenon under study, e.g. performing RNA-Seq, Methyl-Seq and ChIP-Seq under multiple conditions for a given transcription factor, checking for over-representation of its motif, etc. Would this automatically render a thorough bioinformatics analysis as having an "excess probability of success"?
Am I missing something critical here??
I think that Francis et al. make some good points; he also published a related paper: Francis, G. "Too much success for recent groundbreaking epigenetic experiments." Genetics 198.2 (2014): 449-451. Any scientific domain that relies heavily on statistical analyses is open to such criticism...
The assumption that the experiments are statistically independent is a rather strong one. I don't think a finding with P = 0.04 has that much impact, yet treating a scientific paper as a set of unconnected experiments seems rather strange. I believe a more thorough approach should be used instead, e.g. considering a tree-like structure of decisions about which experiment to perform next and using a Bayesian framework to calculate the joint probability of success.
Experiments do not have to be unconnected to be statistically independent. Statistical independence is what makes multiplying the probabilities appropriate. The experiments themselves are connected by the authors' proposed relation to their theory.
I agree that statistical independence is quite a different thing from the logical connectivity of a paper. Still, a very common situation in research is the following:
One performs a pilot experiment and observes a high value of variable X in the group of interest; then, given it is known from the literature that X and another variable Y are highly positively correlated, one would also be interested in checking what happens with Y, since it characterizes an important factor. Would it be right to just multiply the probabilities of success for P(x>x0)<0.05 and P(y>y0)<0.05 in this case?
If you are measuring X and Y in a common sample, then you have to take the correlation between them into account, which will always give you a smaller value than just multiplying the individual probabilities of X and Y. You can see examples of this in the PLOS One paper, where sometimes the original paper reported multiple measures from a single sample and reported the correlation between them (or the sample correlation can be computed from other statistics). We then used Monte Carlo simulations to estimate the probability of success for both measures.
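For illustration, a rough sketch of this kind of Monte Carlo estimate for two measures taken on a common sample (the effect sizes, sample size and correlation below are invented, and this is not the code used in the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, rho = 30, 0.6                    # sample size and within-sample correlation
dx, dy = 0.5, 0.4                   # assumed true effects (in SD units)
n_sim = 50_000

# simulate n_sim replications of measuring correlated X and Y on one sample
cov = [[1.0, rho], [rho, 1.0]]
data = rng.multivariate_normal([dx, dy], cov, size=(n_sim, n))   # (n_sim, n, 2)

# one-sample t-tests of X and Y against 0 in every simulated replication
p = stats.ttest_1samp(data, 0.0, axis=1).pvalue                  # (n_sim, 2)
sig_x, sig_y = p[:, 0] < 0.05, p[:, 1] < 0.05

print("P(X significant)         ", sig_x.mean())
print("P(Y significant)         ", sig_y.mean())
print("naive product            ", sig_x.mean() * sig_y.mean())
print("joint P(both significant)", (sig_x & sig_y).mean())  # differs when rho != 0
```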
Indeed, in this case P(X,Y) != P(X) * P(Y) for the observables X and Y, but the probabilities being multiplied are more complex ones, i.e. P(p-value < 0.05).
And given that the probability of rejecting the null hypothesis depends on whether the null is actually false, the probability in question also depends on a binary random variable: the state of the null hypothesis in a given experiment. So to multiply those probabilities you must also ensure that the null hypotheses are independent, while they could actually be dependent for a linked set of experiments. Please correct me if I'm wrong.
I'm not sure I follow your comment, but I will try to address it. When a successful outcome is to reject the null, we estimate the probability of success by taking the observed effect size and using it to estimate experimental power. It's not a binary random variable because the effect size estimates the magnitude of the effect. Thus, p=0.03 would give an estimated power of 0.58, while p=0.01 would give an estimated power of 0.73. (The full calculation involves computing an effect size, which requires knowing the sample sizes, and then estimating power from the effect size and sample sizes. To a first approximation, you can go straight from the p-value to power.)
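To make that first approximation concrete, here is a quick sketch using a two-sided z-test as a stand-in for the full effect-size/sample-size calculation (illustrative only, not the exact procedure from the paper):

```python
from scipy.stats import norm

def approx_power_from_p(p, alpha=0.05):
    """Treat the observed two-sided p-value as fixing the true effect,
    then ask how often a replication would clear the alpha criterion."""
    z_obs = norm.isf(p / 2)           # z corresponding to the observed p
    z_crit = norm.isf(alpha / 2)      # two-sided significance cutoff
    # chance that a replicate z, centred on z_obs, lands beyond the cutoff
    return norm.sf(z_crit - z_obs) + norm.cdf(-z_crit - z_obs)

for p in (0.05, 0.03, 0.01):
    print(p, round(approx_power_from_p(p), 2))
# -> roughly 0.50, 0.58 and 0.73, in line with the numbers quoted above
```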
You are correct about the dependence of the hypotheses. For example, in psychology it is common to look for a significant interaction and then look at contrasts to help understand the interaction. Often a successful experiment requires a significant interaction and a particular pattern of significant and non-significant outcomes for the contrasts. We took all that into account with our Monte Carlo simulations.
There were only a few cases in the PLOS One paper where tests were performed between experiments. Sometimes that prevented us from analyzing a paper (because we could not estimate success for four or more experiments).
In short, the TES is kind of like a model checking procedure. We suppose that the theory is correct and that the effects are as identified by the reported experiments. With that as a starting point, we estimate the probability of the reported degree of success, as defined by the hypothesis tests, using the same analysis as was used by the original authors.
I agree that it looks like a vicious circle of p-values. Anyway, what seems strange to me is that even if the same result were reproduced with P = 0.03 by three independent groups, it would become even more suspicious under the proposed framework. To me, just showing a boxplot with outliers and reporting the effect size is far more informative for distinguishing important findings from p-hacked ones.
Even for the same effect and the same sample sizes, the p value should vary (a lot) from study to study, just due to random sampling. If four out of four experiments produce a p-value less than the .05 criterion, then that suggests that the experimental design (taking into account the sample sizes and the effect size) should usually produce a p-value much smaller than .05. If experiments often produce p-values around .03, then they should sometimes produce values larger than .05. The absence of the non-significant findings suggests that something is wrong in reporting, sampling, analyzing, or theorizing.
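A quick simulation of that point (the effect size and group size below are invented so that a typical outcome is around p = .03): even with a real effect, replicates of such a design miss the .05 cutoff a large fraction of the time, so four significant results out of four attempts should not be the usual pattern.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, n_sim = 0.55, 31, 50_000          # two-sample design with modest power

a = rng.normal(d, 1, size=(n_sim, n))   # group with a true effect of d SDs
b = rng.normal(0, 1, size=(n_sim, n))   # control group
p = stats.ttest_ind(a, b, axis=1).pvalue

power = (p < 0.05).mean()
print("estimated power            ", round(power, 2))
print("fraction of p-values >= .05", round((p >= 0.05).mean(), 2))
print("chance of 4 hits out of 4  ", round(power ** 4, 2))
```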
I was simply saying that if one takes all the papers on a given phenomenon, say 50 papers each containing 10 experiments with a probability of success of 0.99, one automatically gets P-TES < 0.01. Is there any way to correct for this? On the other hand, if one measures enrichment for 10,000 ChIP-Seq peaks, one could get P = 10^-20 simply due to some bias, which won't be reproduced at all in subsequent studies with careful controls.
There is nothing to correct. For 500 experiments that each have a success probability (power) of 0.99, the expected number of successful (significant) outcomes is 500 * 0.99 = 495. Now, in this case we can see that we are only overly successful by 5 experiments, but in any real world situation we do not know that the true power is 0.99. So, when we see 500 successful experiments with an estimated power of 0.99 all we know is that something is odd. We do not know how odd things are, so an experiment set like that should be carefully scrutinized for sources of bias. Of course, the whole analysis is based on the assumption that the 500 studies are related to a single theory. If they are just 50 papers studying different things, then we need not be concerned. That is, we would be introducing the bias ourselves by grouping these studies together and leaving out other (maybe non-significant) studies.
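For completeness, the arithmetic behind that example: the expected shortfall is only about 5 non-significant results, yet all 500 experiments coming out significant is itself a sub-1% event under independence.

```python
n_exp, power = 500, 0.99
print("expected successes:", n_exp * power)             # 495.0
print("P(all significant):", round(power ** n_exp, 4))  # ~0.0066
```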