Question

Statistical test to compare the number of upregulated genes in a subgroup of genes versus all genes?

0

8.2 years ago

biostart ▴ 370

Hello,

I have RNA-seq data for two cell conditions. I want to test the hypothesis that a given group of genes (~1000 genes) contains a higher proportion of significantly upregulated genes than the set of all significantly differentially expressed genes genome-wide (~7000 genes).

Say the subset of interest contains X genes, of which X1 are upregulated and X2 are downregulated. The total set contains Y genes, of which Y1 are upregulated and Y2 are downregulated. How do I calculate the P value? I am looking for a simple equation.

Thank you!

P.S. My previous attempt to test this hypothesis with quantitative tests (comparing log2 fold changes) was not very successful: it gave a statistically significant but quite small difference of about 0.2-0.3 on the log2 scale (see details in the previous thread here). However, there is a very large difference in the number of genes that become upregulated in the given subset versus all genes.

rna-seq • 3.9k views

ADD COMMENT • link updated 8.1 years ago by vakul.mohanty ▴ 270 • written 8.2 years ago by biostart ▴ 370

0

I'm not sure I follow. Do you really have 7k differentially expressed genes? That seems like an awful lot. Is it possible that you've got some unaccounted-for bias in your experiment?

ADD REPLY • link 8.1 years ago by russhh 5.7k

0

I meant all genes which have a statistically significant change (PPDE>0.95). This does not mean they have to have large log2 fold changes.

ADD REPLY • link 8.1 years ago by biostart ▴ 370

1

Nonetheless, if half of the genes assayed are 'significant' by some measure, shouldn't you be questioning the validity of that measure?

ADD REPLY • link 8.1 years ago by russhh 5.7k

1

It really depends on the number of genes tested. I don't know where you got the information that 7000 genes = half the genes tested in this case... It could be much more, depending on the organism and on whether the OP also includes ncRNA genes in the analysis.

ADD REPLY • link 8.1 years ago by Carlo Yague 8.9k

0

I guess that, ideally, the changes for all expressed genes would be significant, which is never achieved in practice simply because there are not enough replicates, etc. This is different from differentially expressed genes, where you set a threshold on the log2 fold change.

ADD REPLY • link 8.1 years ago by biostart ▴ 370

score 2 · Answer 1 · 2016-10-04

2

8.1 years ago

vakul.mohanty ▴ 270

A hypergeometric enrichment test might do the job.
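
A minimal sketch of how such a test could be computed, using the question's notation and only the Python standard library. It assumes that the subset of X genes is contained in the total set of Y genes, and that the P value of interest is one-sided (at least X1 upregulated genes in the subset). The function name `hypergeom_upper_tail` and the example counts are invented for illustration and do not come from the thread.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(at least k upregulated genes among n genes drawn without
    replacement from N genes, of which K are upregulated)."""
    total = comb(N, n)
    hi = min(n, K)
    # Sum exact integer counts first and divide once to limit rounding error.
    # comb() returns 0 for impossible tables, so no extra bounds check is needed.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, hi + 1))
    return favourable / total


# Hypothetical counts, invented for illustration only (not from the thread).
Y, Y1 = 7000, 3500   # all significant DE genes, and the upregulated ones among them
X, X1 = 1000, 700    # subset of interest, and the upregulated ones among them
p_value = hypergeom_upper_tail(X1, X, Y1, Y)
print(f"one-sided P = {p_value:.3g}")
```

In this mapping, N = Y, K = Y1, n = X and k = X1. If the subset is not part of the total set, the mapping would need to change.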

ADD COMMENT • link 8.1 years ago by vakul.mohanty ▴ 270

score 1 · Answer 2 · 2016-10-04

1

8.1 years ago

Carlo Yague 8.9k

Just use a Fisher's exact test. The equation is quite simple.

You can do it easily with R or Excel or even online.
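
As a rough illustration of what the test computes, here is a standard-library Python sketch for a 2x2 table. It assumes the subset is contained in the total set, so the table rows are "in subset" (X1, X2) and "not in subset" (Y1 - X1, Y2 - X2), with columns "up" and "down". The function name `fisher_exact_2x2` and the counts are hypothetical, and the two-sided P value uses the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (odds_ratio, two_sided_p) for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def weight(x):
        # Number of tables with x in the top-left cell, all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = weight(a)
    # Exact integer weights avoid floating-point ties when comparing tables.
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, extreme / denom


# Hypothetical counts, invented for illustration only (not from the thread).
X1, X2 = 700, 300      # subset: upregulated, downregulated
Y1, Y2 = 3500, 3500    # all significant DE genes: upregulated, downregulated
odds, p = fisher_exact_2x2(X1, X2, Y1 - X1, Y2 - X2)
print(f"odds ratio = {odds:.2f}, two-sided P = {p:.3g}")
```

If SciPy is available, `scipy.stats.fisher_exact` applied to the same table should be the more convenient route; the sketch above is only meant to show the arithmetic.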

ADD COMMENT • link 8.1 years ago by Carlo Yague 8.9k

0

Thanks, it looks like it is indeed either Fisher's test or the chi-square test. Do you think Fisher's test is better in this case?

ADD REPLY • link 8.1 years ago by biostart ▴ 370

1

Yes, in almost all cases Fisher's test is best. It is also used a lot in scientific publications.
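
For comparison, a sketch of the Pearson chi-square test on the same kind of 2x2 table, again in standard-library Python. Fisher's test is exact, whereas the chi-square test relies on a large-sample approximation, so the two tend to differ mainly when expected cell counts are small; with counts in the hundreds or thousands they usually lead to the same conclusion. The function name `chi_square_2x2` and the counts are hypothetical, and no continuity correction is applied.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and P value (1 degree of freedom,
    no continuity correction) for the table [[a, b], [c, d]].
    All four margins must be non-zero."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


# Hypothetical counts, invented for illustration only (not from the thread):
# rows are "in subset" / "not in subset", columns are "up" / "down".
stat, p = chi_square_2x2(700, 300, 2800, 3200)
print(f"chi-square = {stat:.1f}, P = {p:.3g}")
```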

ADD REPLY • link 8.1 years ago by Carlo Yague 8.9k

score 0 · Answer 3 · 2016-10-04

Hi,

I don't know how many replicates you have in each condition, but it would be great if you had at least 3-4 replicates per condition. You can tighten your criteria to get fewer significant DE genes, e.g. require log2(fold-change) >= 1 or >= 2, and try FDR/Q-value < 0.05, < 0.01 or even < 0.001.
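
As a sketch of what applying such thresholds could look like once a results table is loaded, here is a small Python example. The record layout (`gene`, `log2fc`, `qvalue`), the function name `filter_de_genes` and the values are all hypothetical; using the absolute log2 fold change to keep both up- and downregulated genes is also an assumption.

```python
def filter_de_genes(genes, min_abs_log2fc=1.0, max_qvalue=0.05):
    """Keep genes whose absolute log2 fold change and Q value pass the cutoffs."""
    return [
        g for g in genes
        if abs(g["log2fc"]) >= min_abs_log2fc and g["qvalue"] < max_qvalue
    ]


# Made-up records, for illustration only.
results = [
    {"gene": "geneA", "log2fc": 2.3, "qvalue": 0.0004},
    {"gene": "geneB", "log2fc": 0.25, "qvalue": 0.0100},   # significant but small change
    {"gene": "geneC", "log2fc": -1.4, "qvalue": 0.0300},
    {"gene": "geneD", "log2fc": 1.8, "qvalue": 0.2000},    # large change, not significant
]

strict = filter_de_genes(results, min_abs_log2fc=1.0, max_qvalue=0.05)
up = [g["gene"] for g in strict if g["log2fc"] > 0]
down = [g["gene"] for g in strict if g["log2fc"] < 0]
print(up, down)
```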

Additionally, depending on your RNA-seq pipeline, you can use different tools for finding significant DE genes:

TopHat-Cufflinks: it has its own Q-value/FDR for choosing significant DE genes.
Bowtie/TopHat - HTSeq: you could use a variety of tools such as DESeq, DESeq2, edgeR or even limma for finding significant DE genes.

However, you could also do it with many other statistical tools or R packages; for that reason, take a look at this page:

False Discovery Rate Analysis in R
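
To make the FDR idea concrete, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, which is one common way of turning raw P values into FDR-adjusted values. It is not taken from the linked page, and the tools mentioned above may use other procedures; the function name `benjamini_hochberg` and the P values are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```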

This paper might help you to learn more about RNA-seq statistics:

Normalization, testing, and false discovery rate estimation for RNA-sequencing data