Statistical test to compare the number of upregulated genes in a subgroup of genes versus all genes?
8.1 years ago
biostart ▴ 370

Hello,

I have RNA-seq data for two cell conditions. I want to test the hypothesis that a given group of genes (~1000 genes) contains a larger proportion of significantly upregulated genes than the set of all significantly differentially expressed genes genome-wide (~7000 genes).

Say the subset of interest contains X genes, of which X1 are upregulated and X2 are downregulated. The full set contains Y genes, of which Y1 are upregulated and Y2 are downregulated. How do I calculate the P value? I am looking for a simple equation.

Thank you!

P.S. Previous testing of this hypothesis based on quantitative tests (comparing log2 fold changes) was not very successful, resulting in a statistically significant but quite a small difference of about 0.2-0.3 on the log2 scale (see details in the previous thread here). However, there is a very large difference in the numbers of genes which become upregulated in a given subset versus all genes.

rna-seq • 3.9k views

I'm not sure I follow. Do you really have 7k differentially expressed genes? That seems like an awful lot. Is it possible that you've got some unaccounted-for bias in your experiment?


I meant all genes that show a statistically significant change (PPDE>0.95). This does not mean they have to have large log2 fold changes.


Nonetheless, if half of the genes assayed are 'significant' by some measure, shouldn't you be questioning the validity of that measure?


It really depends on the number of genes tested. I don't know where you got the information that 7000 genes = half the genes tested in this case... The total could be much larger, depending on the organism and on whether the OP also includes ncRNA genes in the analysis.


I guess that, ideally, the change would be significant for all expressed genes; in practice this is never reached, simply because there are not enough replicates, etc. This is different from calling differentially expressed genes, where you also set a threshold on the log2 fold change.

8.1 years ago
vakul.mohanty ▴ 270

A hypergeometric enrichment test might do the job.
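To make the suggestion concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test using only the Python standard library. It assumes the subset of interest is contained in the full set of significant genes, so that N = Y, K = Y1, n = X and k = X1 in the question's notation. The function names and the counts in the example are hypothetical and are not taken from the thread.

```python
from math import lgamma, exp

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_upper_tail(k, N, K, n):
    """P(at least k upregulated genes) when n genes are drawn without
    replacement from N genes, K of which are upregulated."""
    lo = max(k, n - (N - K), 0)   # smallest count that is both >= k and possible
    hi = min(n, K)                # largest possible count
    log_denom = log_choose(N, n)
    total = sum(
        exp(log_choose(K, i) + log_choose(N - K, n - i) - log_denom)
        for i in range(lo, hi + 1)
    )
    return min(1.0, total)

# Hypothetical counts: 7000 significant genes, 3500 of them upregulated;
# the subset has 1000 genes, 700 of them upregulated.
p_value = hypergeom_upper_tail(k=700, N=7000, K=3500, n=1000)
print(f"one-sided enrichment P value: {p_value:.3g}")
```

Working in log space with `lgamma` keeps the binomial coefficients from overflowing at gene counts in the thousands. This upper tail is the same quantity as a one-sided Fisher's exact test on the corresponding 2x2 table.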

8.1 years ago

Just use a Fisher's exact test. The equation is quite simple.

You can do it easily with R or Excel or even online.
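For reference, the equation can be written as follows, assuming the subset is part of the full set, so that the 2x2 table compares the subset with the remaining genes: a = X1 and b = X2 (up/down in the subset), c = Y1 - X1 and d = Y2 - X2 (up/down in the rest). The cell labels a, b, c, d are introduced here for illustration only.

```latex
% Probability of observing one particular 2x2 table with fixed margins
p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!},
\qquad n = a + b + c + d
```

The P value is then the sum of p over the observed table and all tables with the same row and column totals that are at least as extreme as the observed one.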


Thanks, it looks like it is indeed either Fisher's test or chi-square test. Do you think Fisher is better in this case?


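Yes, Fisher's exact test is the safer default for a 2x2 table: it is exact and does not rely on the large-sample approximation behind the chi-square test. With counts in the hundreds or thousands, the two tests give very similar P values. Fisher's test is also widely used in scientific publications.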
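To see how the two tests compare on a table of this size, here is a rough sketch using only the Python standard library. The function names and the counts are hypothetical. The two-sided Fisher P value sums the probabilities of all tables that are no more likely than the observed one, and the chi-square test is Pearson's test with 1 degree of freedom and no continuity correction.

```python
from math import lgamma, exp, erfc, sqrt

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_choose(row1 + row2, col1)

    def table_prob(x):
        # Probability of the table whose top-left cell equals x (margins fixed)
        return exp(log_choose(row1, x) + log_choose(row2, col1 - x) - log_denom)

    p_obs = table_prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties are not lost to floating-point rounding
    total = sum(p for p in map(table_prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and P value (1 df, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, erfc(sqrt(chi2 / 2))  # survival function of chi-square, 1 df

# Hypothetical table: rows = subset / rest, columns = up / down
table = (700, 300, 2800, 3200)
print("Fisher P:", fisher_exact_two_sided(*table))
print("chi-square statistic and P:", chi_square_2x2(*table))
```

The two tests differ mainly when some expected cell counts are small (a common rule of thumb is below 5), which is where the chi-square approximation becomes unreliable.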

8.1 years ago
Whoknows ▴ 960

Hi,

I don't know how many replicates you have in each condition, but ideally you would have at least 3-4 replicates per condition. You can also tighten your criteria to get fewer significant DE genes, e.g. require log2(fold-change) >= 1 or >= 2, and try FDR/Q-value < 0.05, < 0.01 or even < 0.001.
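As an illustration of this kind of filtering, here is a minimal sketch in plain Python. It computes Benjamini-Hochberg adjusted P values (one common way to obtain FDR/Q-values; the DE tools themselves may use a different procedure) and then keeps genes that pass both cutoffs. The gene names, numbers, and function names are hypothetical, and the fold-change cutoff is applied to the absolute value so that both up- and downregulated genes are kept.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def select_de_genes(genes, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """genes: list of (name, log2_fold_change, raw_p_value) tuples.
    Keep genes with |log2FC| >= lfc_cutoff and adjusted P < fdr_cutoff."""
    qvalues = benjamini_hochberg([p for _, _, p in genes])
    return [
        name
        for (name, lfc, _), q in zip(genes, qvalues)
        if abs(lfc) >= lfc_cutoff and q < fdr_cutoff
    ]

# Hypothetical input
genes = [
    ("geneA", 2.1, 0.005),
    ("geneB", 0.3, 0.010),
    ("geneC", -1.5, 0.030),
    ("geneD", 1.2, 0.040),
]
selected = select_de_genes(genes, lfc_cutoff=1.0, fdr_cutoff=0.05)
print(selected)
```

Raising `lfc_cutoff` or lowering `fdr_cutoff` shrinks the list of significant DE genes, which is the effect described above.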

Additionally, depending on your RNA-seq pipeline, you can use different tools for finding significant DE genes:

  1. TopHat-Cufflinks: it reports its own Q-value/FDR for choosing significant DE genes.
  2. Bowtie/TopHat with HTSeq: you can use a variety of tools such as DESeq, DESeq2, edgeR or even limma for finding significant DE genes.

You could also do this with many other statistical tools or R packages; for that reason, take a look at this page:

False Discovery Rate Analysis in R

This paper might help you learn more about RNA-seq statistics:

Normalization, testing, and false discovery rate estimation for RNA-sequencing data
