Question

Which statistical test should be used to test for enrichment of gene lists?

10.5 years ago

Ashutosh ▴ 10

In a typical ChIP-Seq experiment, we found peaks for a transcription factor "ABC" on 590 genes. Of these 590 genes, 242 are classified as "ORFs". The genome contains 6226 genes in total, of which 5420 are "ORFs". Are the "ABC"-bound genes enriched for ORFs, or is the association of "ABC" with ORFs explained by chance? Which statistical test should be used? Can it be done in R, and how? Is there any other way to do this?

I learned from one forum that the way to test for enrichment of gene lists is a hypergeometric test or, equivalently, a one-sided Fisher's exact test. Although I am not very familiar with R, I adapted other examples to run Fisher's Exact Test for count data, and my output looks like this:

> fisher.test(matrix(c(242,5178,348,458),nrow=2,ncol=2),alternative="greater")

 Fisher's Exact Test for Count Data
data:  matrix(c(242, 5178, 348, 458), nrow = 2, ncol = 2)
p-value = 1
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.05224986        Inf
sample estimates:
odds ratio
0.06156519

If my analysis is correct, it suggests that factor "ABC" is present on ORFs only by chance. Is that conclusion right? If not, what should the conclusion be? Please help.

Tags: Significance-test, ChIP-Seq, R

Ram · Answer 1 · 2014-06-09

Yes, you can use Fisher's exact test, the hypergeometric test, or a variety of other methods to test for enrichment. But for the moment, forget statistics: just look at the data and re-examine your conclusion. If 5420 of 6226 genes are classified as ORFs and you have 590 bound genes, how many of those would you expect to be ORFs? 590 * 5420/6226 ≈ 514. How many do you actually observe? 242. Thus you see a depletion of ORFs in your data set.

You might turn the question around and ask: are your peaks "enriched" for non-ORFs? The genome has 6226 - 5420 = 806 non-ORFs, so the number you would expect by chance is about 76 (590 * 806/6226), yet you observe 348. If you ask the question this way, you change your matrix to matrix(c(348, 458, 242, 5178), 2, 2), and the p-value drops, because the chance of seeing 348 or more non-ORFs by chance is very low. The way you asked the question before was: what is the likelihood of seeing at least 242 ORFs bound by chance? That gave you a p-value of 1, because the number expected by chance is about 514, far more than 242.

Play with the numbers and the phrasing of your hypothesis to get a sense of how it works.
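The arithmetic above can be sketched in R as follows. This is a minimal sketch using only the counts given in the question. The variable names and the row and column labels are illustrative choices, not part of the original posts.

```r
# Counts taken from the question.
total_genes <- 6226   # all genes in the genome
total_orfs  <- 5420   # genes classified as ORFs
bound_genes <- 590    # genes with an "ABC" peak
bound_orfs  <- 242    # bound genes that are ORFs

# Derived counts for the remaining cells of the 2x2 table.
total_non_orfs   <- total_genes - total_orfs          # 806
bound_non_orfs   <- bound_genes - bound_orfs          # 348
unbound_orfs     <- total_orfs - bound_orfs           # 5178
unbound_non_orfs <- total_non_orfs - bound_non_orfs   # 458

# Expected counts among bound genes if binding ignored gene class.
expected_orfs     <- bound_genes * total_orfs / total_genes
expected_non_orfs <- bound_genes * total_non_orfs / total_genes

# Labeled table: rows = binding status, columns = gene class.
tab <- matrix(
  c(bound_orfs, unbound_orfs, bound_non_orfs, unbound_non_orfs),
  nrow = 2,
  dimnames = list(binding = c("bound", "unbound"),
                  class   = c("ORF", "non-ORF"))
)

# Original phrasing: are ORFs over-represented among bound genes?
p_orf_enriched <- fisher.test(tab, alternative = "greater")$p.value

# Flipped phrasing: are non-ORFs over-represented among bound genes?
# Reordering the columns reproduces matrix(c(348, 458, 242, 5178), 2, 2).
p_non_orf_enriched <- fisher.test(tab[, c("non-ORF", "ORF")],
                                  alternative = "greater")$p.value

print(round(c(expected_orfs = expected_orfs,
              expected_non_orfs = expected_non_orfs), 1))
print(c(p_orf_enriched = p_orf_enriched,
        p_non_orf_enriched = p_non_orf_enriched))
```

Labeling the rows and columns with dimnames makes it harder to swap cells by accident, which is the easiest mistake to make when the matrix is typed in as a bare vector of four numbers.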

score 1 · Answer 2 · 2014-06-09

10.5 years ago

Devon Ryan 104k

Your analysis simply shows that there's no enrichment for ORFs among the genes bound by the transcription factor (in fact, there's a significant depletion). Without knowing anything about your model organism or the transcription factor in question, it's impossible to say any more than that.
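One way to test the depletion directly, rather than reading it off a failed enrichment test, is to keep the same table and switch to the lower tail. The sketch below assumes the matrix from the question. It also checks the equivalence with the hypergeometric test mentioned in the question by calling phyper with the table margins. The variable names are illustrative.

```r
# Same table as in the question: column 1 = ORFs, column 2 = non-ORFs;
# row 1 = bound by "ABC", row 2 = not bound.
tab <- matrix(c(242, 5178, 348, 458), nrow = 2, ncol = 2)

# One-sided Fisher's exact test for depletion of ORFs among bound genes
# (alternative hypothesis: true odds ratio is less than 1).
depletion <- fisher.test(tab, alternative = "less")

# Equivalent hypergeometric lower tail: probability of drawing 242 or
# fewer ORFs when 590 genes are sampled from 5420 ORFs and 806 non-ORFs.
p_hyper <- phyper(242, m = 5420, n = 806, k = 590, lower.tail = TRUE)

print(depletion$p.value)
print(p_hyper)
```

The two p-values should agree, since the one-sided Fisher's exact test conditions on the same margins that define the hypergeometric distribution.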

Thanks, Devon. My model organism is yeast, and I am looking at the occupancy of a histone chaperone on ORFs.

Reply · 10.5 years ago by Ashutosh ▴ 10