Judging by the power analysis tools I've seen for RNA-seq (https://cqs-vumc.shinyapps.io/rnaseqsamplesizeweb/ ; http://www2.hawaii.edu/~lgarmire/RNASeqPowerCalculator.htm ), many published RNA-seq studies look wildly underpowered, but no one ever seems to bat an eye about it. I very often see studies with 6, 8, or 10 samples per group, which these calculators suggest can't reliably detect much of anything. I also see studies reporting only raw p-values and not FDR-corrected ones.
Why do we accept this, and why does nobody talk about it?
I also see studies reporting only raw p-values and not FDR-corrected ones.
I've had the opposite experience. Any RNAseq researcher worth their salt would report padj, not p. They might call it p-value on the plot, but explain it's adjusted for multiple tests elsewhere. If they're not, the statistics aren't sound.
The number of samples is limited by cost and practical, real-world constraints. With the variety in phenotypes and the increasingly personalized direction medicine is heading, large cohorts that conform to a uniform phenotype are hard to come by, or can only be assembled by ignoring known biological differences. IMO we need to change our methods to work with smaller sample sizes, not ignore biology for the sake of statistical significance.
I just want to add that statistical power refers to type II error, and even when power is low, I routinely see hundreds to thousands of genes differentially expressed between conditions because the effect sizes are large. This is often enough to make an informed judgment about the hypothesis being tested. It's a case of shooting for good enough, simply because adding replicates is expensive and subject to diminishing returns.
There is a big difference between 3 replicates of a cell line or of an inbred model organism raised under laboratory conditions, and 3 replicates of samples from outbred organisms raised in the wild (like primary human samples).
Not to sound too cynical, but in most RNA-seq studies I see, hundreds of genes come out differentially expressed. GSEA, DAVID, or Ingenuity (at least one of them) will make up a story about them that justifies a paper. Experiments only require a high number of samples if you actually have a scientific theory a priori.
This is my experience the vast majority of the time. And I think the saddest part is that the bioinformaticians who analyze these underpowered studies and sign off on using raw p-values (example I gave in a reply to the top answer: https://www.nature.com/articles/nm.4386 ; "An individual gene was called differentially expressed if the P-value of its t-statistic was at most 0.05.") know for a fact that the data they're analyzing, where adjusted p-values show 0 DEGs, is crap. But when they get the opportunity to do the analysis on a prominent PI's paper, why in the world would they turn that down?
We absolutely should not accept the use of raw, unadjusted p-values in RNAseq experiments. Thankfully this is much less common now than it used to be, particularly in the early days of microarrays.
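To see why, here's a toy simulation (a sketch in Python with numpy/scipy/statsmodels, using entirely made-up data, not any particular study's): 20,000 "genes" with no true signal at all still produce on the order of a thousand hits at raw p < 0.05, while a Benjamini-Hochberg adjustment correctly reports essentially none.

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    n_genes, n_per_group = 20_000, 5

    # Two groups drawn from the SAME distribution: every "hit" is a false positive.
    a = rng.normal(size=(n_genes, n_per_group))
    b = rng.normal(size=(n_genes, n_per_group))
    pvals = stats.ttest_ind(a, b, axis=1).pvalue

    raw_hits = (pvals < 0.05).sum()
    adj_hits = multipletests(pvals, alpha=0.05, method="fdr_bh")[0].sum()
    # Expect ~1,000 raw "hits" (5% of 20,000) and ~0 after BH adjustment.
    print(f"raw p<0.05: {raw_hits}; BH FDR<0.05: {adj_hits}")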
Estimation of power in RNAseq is very difficult. Part of this is that we often have no idea what effect sizes to expect, and asking what the dispersion is doesn't really even make sense, because we have 20,000 genes and each will have a different dispersion; finally, we often have no idea how many genes we expect to be DE. Empirical studies of power in RNAseq experiments suggest that you get 80% power to detect 2-fold changes at 5 replicates, and recommend doing 6 so that a poor-quality replicate can be discarded.
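As a crude illustration of why the per-gene dispersion matters so much, here is a simplified normal-theory sketch: a plain two-sample t-test power calculation on log2 expression, not a real RNAseq power tool (which would model negative binomial counts), and the per-gene SD values are assumptions for illustration only.

    from statsmodels.stats.power import TTestIndPower

    solver = TTestIndPower()
    log2_fc = 1.0              # a 2-fold change
    alpha = 0.05 / 20_000      # crude Bonferroni stand-in for genome-wide testing

    for sd in (0.3, 0.7, 1.2): # low-, mid-, and high-variability genes
        n = solver.solve_power(effect_size=log2_fc / sd, alpha=alpha, power=0.8)
        print(f"per-gene SD {sd}: ~{n:.0f} replicates per group")

The required sample size swings from a handful of replicates to many dozens depending purely on which dispersion you assume, which is exactly why a single headline power number is hard to give.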
We put up with the current situation for many reasons, but it is worthwhile noting that RNAseq experiments are rarely hypothesis tests of the DE status of particular genes (which is what power calculations address). Power to detect pathways, or correlation structures, genotype-expression relationships etc, is not captured by power calculations on individual genes. Added to this, RNAseq experiments are more often hypothesis generating than hypothesis testing, and studies should contain further tests of the hypotheses generated from the RNAseq experiment.
Part of this is that we often have no idea what effect sizes to expect, and asking what the dispersion is doesn't really even make sense
This is a good point, and it passes through my mind every time I'm asked to do a power analysis / sample-size calculation for an expression study. With no previous literature, one can only come up with different scenarios in which the effect size changes.
If I'm asked to do this, I just point to Schurch et al. and say "they were using a clonal yeast population, and yours probably has higher variance". People soon learn not to ask :D.
But I'm serious that having sufficient power to detect whether gene A is DE is a very different question from asking whether a set of genes tends to behave in a particular way, which is usually what we are asking when we use RNAseq for hypothesis testing (rather than hypothesis generation).
There are studies in top journals that use human tissue for RNA-seq where the sample sizes are so small that the adjusted p-values don't show any DEGs. Here is one example: https://www.nature.com/articles/nm.4386
"An individual gene was called differentially expressed if the P-value of its t-statistic was at most 0.05."
This is human tissue from a clinical population with major depressive disorder, whose post-mortem brains end up being used in an underpowered study. There's something wrong and perverse to me about that.
This is indeed a paper full of problems. My guess is that it wasn't reviewed by a statistician.
The authors do do NanoString validation of 20 of their DEGs and use a validation cohort to measure the validation rate: 85% in males and 75% in females. One way to look at this is that they have a false discovery rate (FDR) of ~25% (this isn't quite right, because their validation set is not a gold standard).
It strikes me that the sex difference was probably a post-hoc hypothesis after their main hypothesis didn't turn up anything interesting. But they have at least done it properly: when a hypothesis is generated on one dataset, it should be tested on a different one.
However, they really should be using a difference-in-differences or interaction design to test the difference between sexes. This is one of my pet peeves.
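For concreteness, here's a minimal sketch (Python/statsmodels, on simulated data rather than the paper's) of what testing the interaction directly looks like for a single gene:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 40  # a hypothetical design: 10 samples per condition-by-sex cell
    df = pd.DataFrame({
        "condition": np.repeat(["control", "case"], n // 2),
        "sex": np.tile(np.repeat(["F", "M"], n // 4), 2),
    })
    # Simulate a gene whose case effect exists in males only.
    male_case = (df["condition"] == "case") & (df["sex"] == "M")
    df["expr"] = rng.normal(size=n) + 1.5 * male_case

    # One model; the condition-by-sex interaction term directly tests
    # "does the effect differ between the sexes?", rather than comparing
    # separate within-sex significance verdicts.
    fit = smf.ols("expr ~ condition * sex", data=df).fit()
    print(fit.summary().tables[1])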
You'll notice that none of their findings are based on significant differences in particular genes, as I was saying; everything is about enrichment over different sets, so to a certain extent the significance of individual genes is immaterial. However, the mistake they make again and again is assuming that a p-value > 0.05 means an effect definitely isn't present: if there is no significant overlap between sets, they conclude the sets are different; if there are no significantly similar modules, they conclude the modules are different.
Finally, they do do an experimental test of their RNAseq-derived hypothesis by doing RNAi on a couple of sex-specific genes and measuring behavioral traits. I'm not an expert in behavioral tests, so I can't comment on them, other than to say they make the same mistake again: they assume that because the behavioral change is significant in one sex but not the other, the two sexes are different, rather than directly testing the difference between the sexes.
All in all, this paper has quite a lot of statistical problems, but actually, I think using p-values rather than FDRs is probably the least of them.
I appreciate your points. Are there established methods for doing power analyses to detect pathways, correlation structures, genotype-expression relationships etc.?
The issues I see with what I like to call "behavioral neurogenetics" stem from the general workflow I see in a lot of papers:
1. Run a behavioral experiment with mice/rats (e.g., food/drug self-administration, forced swim test, sucrose preference test)
2. Sequence a brain region of choice, run DE analysis
3. Do some bioinformatics (pathway enrichment, co-expression networks), pick a gene you want to do RNAi/siRNA/CRISPR on
4. Do RNAi/siRNA/CRISPR on the gene, run the same behavior as (1) and a bunch of others
5. Publish whatever is significant; I almost never see any multiple testing correction done on behavior (see the sketch after this list)
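On that last point, the correction itself would be trivial to apply. A minimal sketch, with made-up p-values standing in for ten hypothetical behavioral endpoints from a single knockdown experiment:

    from statsmodels.stats.multitest import multipletests

    behavior_pvals = [0.004, 0.03, 0.08, 0.12, 0.24, 0.31, 0.45, 0.52, 0.76, 0.91]
    reject, adjusted, _, _ = multipletests(behavior_pvals, alpha=0.05, method="fdr_bh")
    for p, q, sig in zip(behavior_pvals, adjusted, reject):
        print(f"p = {p:.3f} -> q = {q:.3f}{' *' if sig else ''}")
    # Only p = 0.004 survives; the nominally "significant" p = 0.03 does not.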
A couple of other questions:
If the purpose of RNA-seq is not to test DE of individual genes, but to look at networks, pathways, etc., how much does that actually increase power? If the enrichment of 1000 different pathways is checked rather than the DE of 10-14k genes, does 4-8 samples per group even get you there?
Have there been bootstrapping studies like Schurch et al. but with mice/rats instead of yeast mutants, which as you said in the other reply have less variance than other experimental conditions?
Are there established methods for doing power analyses to detect pathways, correlation structures, genotype-expression relationships etc.
In some specific cases, perhaps. I think there is some work on power estimation for eQTL analysis (genotype-expression relationships).
Not that I'm aware of for the others. I think there would be far too many unknown/approximated/guessed parameters to make such an exercise useful. Although there are some rules of thumb (I probably wouldn't do co-expression analysis on less than, say, 80 samples, although I think the WGCNA authors say 30).
If the enrichment of 1000 different pathways is checked rather than the DE of 10-14k genes, does 4-8 samples per group even get you there?
Absolutely depends on the system and the size of the effects. I'd guess for studies in model organisms or cell lines, probably yes. Even in humans it can: cancer studies using only a small number of samples regularly turn up the same pathways. For other things (like measurements of "normal" physiology in outbred populations, like humans), probably not, but who knows? Might be an interesting master's project.
Have there been bootstrapping studies like Schurch et al. but with mice/rats instead of yeast mutants, which as you said in the other reply have less variance than other experimental conditions
Not that I'm aware of. Such an experiment would be incredibly expensive and difficult, and would probably only tell you about the particular intervention you were making in that case, not any other.
the general workflow I see in a lot of papers
Steps 2-3 sound like classic hypothesis generation, step 4 like hypothesis testing. I'd say that generally it's okay as long as step 4 is done properly (which in the case above it doesn't sound like it was): you are mining/dredging large datasets for hypotheses, but that's okay, as long as you test them in an independent experiment.
Of course, it's a problem if you test many "hypotheses" but only report the ones that work, or don't apply multiple testing correction.
Absolutely depends on the system and the size of the effects. I'd guess for studies in model organisms or cell lines, probably yes.
Follow-up: if 4-8 samples per group is probably sufficient for cell lines and model organisms, but most of those enrichment tools take as input lists of DE genes whose detection is underpowered, doesn't this reduce the confidence in the pathway enrichment?
Of course, it's a problem if you test many "hypotheses" but only report the ones that work, or don't apply multiple testing correction.
This is a major issue. I feel like behavioral neurogenetics is a combinatorial wellspring of papers that'll never run out because of these corners that are cut. Run an underpowered RNA-seq study with 4-6 samples/group, run DE on what is to a large degree noise, pick a gene to knock out/down/overexpress, run a bunch of behavior (at least one result will be significant), publish.
Your point about low-sample RNA-seq studies regularly turning up the same pathways gives me hope, though, and I appreciate it very much. Do you have any references about this, i.e., that even low-sample studies in the same domain show consistent pathway enrichment? I'd be curious to read about it alongside Purvesh Khatri's work on how gene ontology results change over time because the annotation databases change over time. https://www.nature.com/articles/s41598-018-23395-2
I'm afraid I don't have any particular literature off the top of my head, but consider:
You have a 5 vs 5 DE experiment and find 500 genes with fold-change > 2 at an FDR of 5%. Out of ~20,000 genes, that's approximately 2.5%.
Now assume that you have a pathway you are interested in with 100 genes in it. Let's say that the "true" mean of the fold-change distribution is > 2 for 50% of them, so 50 genes. Now let's assume your experiment has only 50% power, so you only find half of them: 25 genes, or 25% of the gene set.
Thus there is a 10x enrichment (25% / 2.5%) with a Fisher's exact p-value < 2.2e-16.
Now let's do this with nominal p-values rather than FDRs. Let's assume that using a nominal p-value, rather than a q-value, gives you an FDR of 75%; that is, 75% of your "DE genes" are actually false positives.
In the same experiment, you'd have 2,000 genes called "DE", or 10% of all genes. In your category you'd pick up roughly another 7.5 false positives on top of the 25 true positives, so ~33 genes out of 100, or ~33%. That's still an enrichment of ~3.3x and a p-value of about 1e-11.
So a high FDR doesn't reduce the power to detect strong enrichment of a big pathway by that much. Obviously this is not true for smaller pathways and weaker enrichment.
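If you want to check this arithmetic, here it is in Python with scipy (the counts are the hypothetical ones from this example: 20,000 genes total, a 100-gene pathway):

    from scipy.stats import fisher_exact

    TOTAL, PATHWAY = 20_000, 100

    def enrichment(n_de, n_de_in_pathway):
        # 2x2 table: pathway vs rest of genome, DE vs not DE
        table = [[n_de_in_pathway, PATHWAY - n_de_in_pathway],
                 [n_de - n_de_in_pathway, (TOTAL - PATHWAY) - (n_de - n_de_in_pathway)]]
        _, p = fisher_exact(table, alternative="greater")
        fold = (n_de_in_pathway / PATHWAY) / (n_de / TOTAL)
        return fold, p

    print(enrichment(500, 25))   # FDR-controlled list: ~10x enrichment, tiny p
    print(enrichment(2000, 33))  # nominal-p list: ~3.3x, p still vanishingly small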
Yes, there is a disparity between different regions of the world in the money available for research. In Latin American countries, within the money that is available for research, there is even less for computational biology / bioinformatics, with one exception being São Paulo. Nevertheless, this does not reflect the talent and ability of Hispanic researchers.
Yes sir.
Agreed. On some of our manuscripts, a few reviewers ignored/forgot that and asked: why do you have 3 replicates, why not 10? We sadly answer that it is what the budget allows us.
As for research in bioinformatics/compbio itself, I almost never see grants for it; you need to tie the software development to some experimental project.
As someone who works with RNA-seq quite frequently, both on cell lines and human samples, I feel compelled to comment here. No serious researcher would, at the outset, want to do an RNA-seq study (or any NGS study, for that matter) without a proper number of replicates.
In my experience, we have included at least 3 replicates in all our studies. This is manageable for the most part with cell lines, where sample material is readily available even if something goes wrong in the sample-processing steps; in most cases there is also enough material left over after sequencing that it can easily be used if something goes wrong. However, if you are working with patient samples, a lot of issues can arise in the processing pipeline and some samples/replicates have to be discarded. The quality of the samples themselves can also be quite poor (e.g., degraded tissue from paraffin-embedded samples). Obtaining human samples is not an easy or quick process, as you have to go through a lot of review regarding ethical concerns.

In some cases the samples themselves may be quite rare (e.g., for a rare disease) and it is simply not possible to get enough samples for a study that fulfills the needed statistical rigor. In those cases the researcher has to make a judgment between not publishing the result and publishing it (despite low statistical power), to share potentially useful information with the larger research community. The reviewers examining such papers may also overlook the subpar statistics in favor of the biologically important message the paper might convey. In a common scenario, there may also be a lot of pressure on the researcher to publish, from funding sources and collaborators; even if it were possible to get additional samples, the timeline might not be conducive for everyone involved.
So yes, in an ideal world it would really be best if studies were sufficiently powered; however, in the real world it's not always possible due to multiple factors. The onus then lies on the reader to make an informed judgment about the publication they are reading.
That seems great and you should consider yourself fortunate. I am extremely happy if I can get 3.
I don't disagree that 3 cell-line replicates and 3 primary human samples are very different things, but 3 of either is better than 1.
I think this topic is better suited as a Forum discussion than a Question, and I'm making appropriate changes.
A late reply, but the authors do mention FDR-corrected p-values in the Methods section of that paper (https://www.nature.com/articles/nm.4386).