Hello,
I have RNA-seq data for two cell conditions. I want to test the hypothesis that a given group of genes (~1000 genes) contains a higher proportion of significantly upregulated genes than the set of all significantly differentially expressed genes genome-wide (~7000 genes).
Say the subset of interest contains X genes, of which X1 are upregulated and X2 are downregulated. The full set contains Y genes, of which Y1 are upregulated and Y2 are downregulated. How do I calculate the P value? I am looking for a simple equation.
Thank you!
P.S. Previous testing of this hypothesis with quantitative tests (comparing log2 fold changes) was not very successful: it gave a statistically significant but quite small difference of about 0.2-0.3 on the log2 scale (see details in the previous thread here). However, there is a very large difference in the number of genes that become upregulated in the given subset versus all genes.
I'm not sure I follow. Do you really have 7k differentially expressed genes? That seems like an awful lot. Is it possible that you've got some unaccounted-for bias in your experiment?
I meant all genes which have a statistically significant change (PPDE>0.95). This does not mean they have to have large log2 fold changes.
Nonetheless, if half of the genes assayed are 'significant' by some measure, shouldn't you be questioning the validity of that measure?
It really depends on the number of genes tested. I don't know where you got the information that 7000 genes = half the genes tested in this case... It could be much more, depending on the organism and whether the OP also includes ncRNA genes in his analysis.
I guess that, ideally, the change for every expressed gene would come out as significant; this is never reached simply because there are not enough replicates, etc. This is different from calling differentially expressed genes, where you also set a threshold on the log2 fold change.
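Coming back to the "simple equation" asked for in the question: one common way to treat counts like X, X1, Y and Y1 is a hypergeometric (one-sided Fisher's exact) test. This is only a sketch, and it assumes that the X genes of the subset are themselves part of the Y significant genes and that genes can be treated as independent draws, which is rarely strictly true for expression data. Under those assumptions, the probability of seeing X1 or more upregulated genes in the subset by chance is

P = sum over k from X1 to min(X, Y1) of C(Y1, k) * C(Y - Y1, X - k) / C(Y, X)

where C(n, k) is the binomial coefficient. The function name and the example counts below are made up for illustration; they are not taken from the actual data.

```python
from math import comb


def hypergeom_upper_tail(x1: int, x: int, y1: int, y: int) -> float:
    """Return P(K >= x1) for K ~ Hypergeometric(population=y, successes=y1, draws=x).

    x1: upregulated genes observed in the subset of interest
    x:  size of the subset of interest
    y1: upregulated genes among all significant genes
    y:  number of all significant genes (assumed to include the subset)
    """
    if not (0 <= x1 <= x <= y and 0 <= y1 <= y):
        raise ValueError("counts must satisfy 0 <= x1 <= x <= y and 0 <= y1 <= y")

    # Sum the probability mass over every outcome at least as extreme as x1.
    # math.comb returns 0 for impossible combinations, so no extra bounds
    # check is needed for the second factor.
    k_max = min(x, y1)
    numerator = sum(
        comb(y1, k) * comb(y - y1, x - k) for k in range(x1, k_max + 1)
    )

    # Exact integer arithmetic above; a single division at the end gives the
    # P value as a float (very small tails may underflow to 0.0).
    return numerator / comb(y, x)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the call:
    # 700 of 1000 subset genes are up, versus 3500 of 7000 overall.
    p_value = hypergeom_upper_tail(x1=700, x=1000, y1=3500, y=7000)
    print(f"one-sided P value: {p_value:.3g}")
```

If SciPy is available, scipy.stats.fisher_exact on the 2x2 table [[X1, X2], [Y1 - X1, Y2 - X2]] with alternative="greater" should give the same one-sided P value. Note that the second row is the genes outside the subset, not the full Y1 and Y2, because the subset is counted inside the total.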