7.0 years ago
Sharon
I noticed some pvalues/FDR like the following:
gene logFC logCPM Pvalue FDR
gene1 8.78610478309506 5.55934275062716 6.98629850992379e-110 1.33333507061895e-105
gene2 8.34642639375796 7.09746407505299 4.20221256332387e-89 4.0099613385518e-85
Are these normal? The counts for those genes seem right, but what about the p-values?
Salmon does not give such values, as far as I know. You must be plugging the output of Salmon into some DE tool and performing a differential expression analysis, which gives you the logFC for the condition you are testing along with the rest of the metrics.
Sorry, edited. I meant edgeR after Salmon.
I'm going to take a guess that your sample numbers are low, or the groups that you're comparing are unbalanced, e.g., comparing 50 samples versus 3. You will obtain unreliable P values in both of these situations.
Unbalanced is still OK, but 50 vs 3 totally makes me sad. The mean-variance fit doesn't really work with such an unbalanced design. To be honest, biologists need to understand this as well: such designs are only good for exploratory analyses rather than confirmatory ones. However, when you say p-values, are they FDR-corrected or the initial p-values?
I am comparing 10 control samples vs 10 tumor samples. What do you mean by unbalanced? For example, gene1 above, with its FDR of order 1e-105, has counts less than 20 in each control sample and counts >2000 in each tumor sample. What do you think?
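As a rough sanity check on the gene1 numbers quoted in this thread (every control below 20, every tumor above 2000), the Python sketch below works out two things: a lower bound on the log2 fold change implied by those raw counts, and the smallest two-sided P value that an exact rank-based (permutation) test could return with 10 vs 10 samples. It is only an illustration: it ignores library-size normalisation and is not what edgeR computes. The bounds 20 and 2000 come from the post; everything else is arithmetic.

```python
from math import comb, log2

# Group sizes and count bounds as described in the thread.
n_control, n_tumor = 10, 10
max_control_count = 20     # "counts less than 20 in each control sample"
min_tumor_count = 2000     # "counts >2000" in each tumor sample

# Raw-count lower bound on the fold change (no normalisation applied).
min_log2_fc = log2(min_tumor_count / max_control_count)

# Number of ways to choose which 10 of the 20 samples are labelled "control".
n_labelings = comb(n_control + n_tumor, n_control)

# With perfectly separated groups, only the observed labelling and its
# mirror image are as extreme, so this is the smallest two-sided P value
# an exact permutation / rank test can produce for this design.
min_exact_p = 2 / n_labelings

print(f"log2 fold change is at least {min_log2_fc:.2f}")
print(f"labelings: {n_labelings}, smallest exact two-sided P: {min_exact_p:.3g}")
```

So a label-shuffling test bottoms out around 1e-5 for a 10 vs 10 design. Values like 1e-110 can only come from a parametric model (edgeR fits a negative binomial) evaluated far out in its tail, which is why the exact magnitude is hard to interpret even when the gene is clearly different between the groups.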
A study size of 20 is very low and, from my perspective, helps to explain the very low P values (but doesn't confirm that it's the sole issue). To give you an idea of why this happens:
Having just 20 samples will not give a global / 'holistic' representation of the disease/condition that you are studying. With such low numbers, there is a high probability that you will observe many transcripts that are lowly expressed throughout one group and highly expressed throughout the other. These will be assigned very low P values, and rightly so. However, if you had 20,000 samples, then you would have a much greater representation of expression profiles and your P values would be more 'normal', in both the human interpretable sense of normality and also the statistical sense of normality, i.e., in a well-powered study, all P values from differential expression would line up nicely on a Quantile-Quantile plot.
Ideally there should have been some power analysis done prior to your study in order to determine ideal sample numbers (vchris alludes to study design in his/her comment above).
The only situation in which I would expect extremely low P values like these in a well-powered study would be a gene knockout. However, even then, due to the way that expression data is normalised, a gene knockout's statistical significance may not be what was expected.
Just to be sure, could you also plot a histogram of your normalised and then logged counts?
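To make "normalised and then logged counts" a bit more concrete, here is a minimal Python sketch of a log2 counts-per-million transform. The function name `log_cpm`, the `prior_count` of 0.5, the example counts, and the library sizes are all made up for illustration. edgeR's own `cpm(..., log=TRUE)` handles the prior count and normalisation factors differently, so treat this as a simplified stand-in rather than a reimplementation.

```python
from math import log2

def log_cpm(count, lib_size, prior_count=0.5):
    """Simplified log2 counts-per-million for one count in one sample.

    prior_count is a small offset (an assumed value) that keeps zero
    counts from producing log2(0).
    """
    return log2((count + prior_count) * 1e6 / lib_size)

# Hypothetical counts for one gene and hypothetical library sizes.
control_counts = [12, 8, 15, 19, 5]
tumor_counts = [2100, 2600, 3050, 2250, 4100]
lib_sizes = [20_000_000] * 5  # assumed 20 million reads per sample

control_logcpm = [log_cpm(c, n) for c, n in zip(control_counts, lib_sizes)]
tumor_logcpm = [log_cpm(c, n) for c, n in zip(tumor_counts, lib_sizes)]

print("control:", [round(v, 2) for v in control_logcpm])
print("tumor:  ", [round(v, 2) for v in tumor_logcpm])
```

Running the same transform over every gene and every sample gives the values whose histogram is being asked for here.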
Ok, I will double check this and get back, thanks Kevin.
Initial p-values, but the FDR is also very low.
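For what it's worth, the FDR column in the table above is consistent with a Benjamini-Hochberg adjustment (the default in edgeR's `topTags`), where the k-th smallest P value is multiplied by m/k for m tested genes. Dividing the FDR by the P value gives about 19085 for gene1 and about half of that for gene2, which fits rank 1 and rank 2 out of 19085 genes. The number 19085 is inferred from those ratios and is not stated anywhere in the thread, and the filler P values of 0.5 below are made up purely to pad the list. A minimal Python sketch:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# The two P values from the table, padded with hypothetical filler values
# so that the total number of tests matches the inferred m = 19085.
n_genes = 19085  # assumption inferred from FDR / P value for gene1
pvals = [6.98629850992379e-110, 4.20221256332387e-89] + [0.5] * (n_genes - 2)

fdr = benjamini_hochberg(pvals)
print("gene1 FDR:", fdr[0])
print("gene2 FDR:", fdr[1])
```

The point is that the FDR here is just the raw P value scaled by a factor of at most about 2e4, so a P value of order 1e-110 will always give an FDR that is still astronomically small.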
Can you post a histogram of your P-values?
Not sure if this looks okay
https://ibb.co/npMqL6
It looks unusual - not normal. I have responded further above.
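For anyone wanting to reproduce the P-value histogram asked for in this thread without a plotting library, a minimal Python sketch follows. The function name `pvalue_histogram`, the choice of 20 bins, and the example P values are all hypothetical. The example mixes an evenly spread background (what null genes are expected to look like) with a batch of very small values (what truly changed genes tend to add).

```python
def pvalue_histogram(pvalues, n_bins=20):
    """Count P values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"P value out of range: {p}")
        # p == 1.0 would index one past the end, so clamp to the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical data: 1000 evenly spread "null" P values plus 200 tiny ones.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
signal_like = [1e-6] * 200
counts = pvalue_histogram(null_like + signal_like)

# Crude text rendering, one row per bin.
for b, c in enumerate(counts):
    print(f"{b / 20:.2f}-{(b + 1) / 20:.2f} {'#' * (c // 10)} {c}")
```

A roughly flat histogram with a spike in the first bin is the commonly expected shape when some genes are truly differentially expressed. Other shapes, such as a hump in the middle or a pile-up near 1, are usually taken as a hint that something in the model or the filtering deserves a second look.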