Question

Hypergeometric test for overlapped genes

11.3 years ago

Nitin ▴ 170

Hello,

I have compared 1370 genes obtained from a ChIP-seq analysis with 652 differentially regulated genes obtained by analyzing an Affymetrix Mouse Genome 430 2.0 array. When I intersect these two gene lists, I get 37 genes in common.

Now I would like to calculate the significance of this overlap. I was thinking of using phyper in R, but it requires the total number of genes, and I am confused about which number to give. Should I use the total number of probes on the Affymetrix chip, or the number of genes in the whole mouse genome, as covered by the ChIP-seq data?

One more question: can anybody suggest how exactly to perform this significance test in R?

Thanks,
Sai

gene-expression ChIP-Seq • 7.4k views

ADD COMMENT • link updated 4.0 years ago by Ram 45k • written 11.3 years ago by Nitin ▴ 170

Sorry for the big letters... I don't know why they look like that...

ADD REPLY • link 8.4 years ago by Gema Sanz ▴ 90

I think that whatever the result of a hypergeometric test might be in this case, it should be interpreted with a lot of caution.

It is well known that ChIP-seq, RNA-seq and other NGS-based assays of the epigenome and transcriptome measure a lot of correlated outcomes (e.g. shared epigenetic programs affecting many genes).

It is very conceivable that, in an example like the one you describe, a much smaller number of independent factors is what leads you to detect 37 genes in the overlap bin. If this is the case, a test like the hypergeometric, which treats every gene as independent, would be anti-conservative (its p-values would be too small), and a lot of readers and reviewers might distrust the result for that reason.

ADD REPLY • link 8.4 years ago by LauferVA 4.8k
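
The concern in the comment above can be illustrated with a toy simulation. This is only a sketch under invented assumptions: genes are grouped into hypothetical co-regulated modules (100 modules of 20 genes), each assay flags 10 whole modules at random, and the two assays are independent, so any overlap is pure chance. None of these numbers are estimated from real data.

```r
set.seed(1)

# ASSUMPTION: 100 hypothetical modules of 20 perfectly co-regulated genes each
n_modules   <- 100
module_size <- 20
N           <- n_modules * module_size
module_of   <- rep(seq_len(n_modules), each = module_size)

# One null experiment: two independent assays each flag 10 whole modules,
# then a gene-level hypergeometric enrichment test is applied to the overlap
one_null_p <- function() {
  a <- module_of %in% sample(n_modules, 10)
  b <- module_of %in% sample(n_modules, 10)
  k <- sum(a & b)
  phyper(k - 1, sum(a), N - sum(a), sum(b), lower.tail = FALSE)  # P(overlap >= k)
}

p   <- replicate(2000, one_null_p())
fpr <- mean(p < 0.05)  # fraction of null experiments called significant
print(fpr)
```

Under these assumptions the false-positive rate at the 0.05 level is far above 5% (about a quarter in expectation), because the number of independent units is the number of modules, not the number of genes.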

Ram · Answer 1 · 2014-05-21

I recommend using the number of genes on the Affy array as the total (that would be the smallest candidate). But first you should filter the 1370 ChIP-seq genes so that only genes also present on the Affy array remain. And you should count genes, not probes, as there are several probes per gene. Then compute the mid-p-value 1 - F(37, 652, x, X) + 0.5 * P(37, 652, x, X), where x is the number of ChIP-seq genes, X is the total number of genes, F(.) is the cumulative distribution function and P(.) is the probability mass function of the hypergeometric distribution.

P.S. In any case, the p-value does not look significant. Have you tried checking up- and down-regulated genes separately or, better, using GSEA as suggested by Devon Ryan?
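
The formula in the answer above maps onto R's `phyper` and `dhyper`. The sketch below uses placeholder values that are not from the thread: `chip` keeps all 1370 ChIP-seq genes (the answer recommends first filtering them to genes on the array, which would lower this number), and `total` is set to 18000 purely for illustration.

```r
overlap <- 37     # genes in both lists
de      <- 652    # differentially expressed genes
chip    <- 1370   # ASSUMPTION: unfiltered ChIP-seq gene count (x)
total   <- 18000  # ASSUMPTION: illustrative background size (X)

# phyper(q, m, n, k): m = marked genes (DE), n = unmarked genes, k = genes drawn (ChIP)
upper <- phyper(overlap, de, total - de, chip, lower.tail = FALSE)  # 1 - F(37) = P(overlap > 37)
point <- dhyper(overlap, de, total - de, chip)                      # P(overlap == 37)
mid_p <- upper + 0.5 * point

expected <- de * chip / total  # overlap expected by chance
print(c(expected = expected, mid_p = mid_p))
```

With these placeholder numbers the overlap expected by chance is about 50 genes, more than the 37 observed, so the mid-p-value comes out close to 1.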

Ram · Answer 2 · 2014-05-21

11.3 years ago

Devon Ryan 105k

Take as the total number of genes (N) the intersection of the genes you looked at via ChIP-seq (all of them in the annotation you used, though this will realistically be a bit of an overestimate) and those probed on the array (N.B., genes, not probes, as pointed out by mikhail.shugay). In R, that could be done with:

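# N = number of genes assayed by both ChIP-seq and the array (set this for your data)
# Rows: ChIP-seq status; columns: expression status (matrix() fills column-wise)
m <- matrix(c(37, 652-37, 1370-37, N-652-1370+37), ncol=2,
    dimnames=list(c("ChIP.Sig", "ChIP.NoSig"), c("RNA.Sig","RNA.NoSig")))
fisher.test(m)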

You'll find that if N is ~18000 or bigger then this isn't significant.

BTW, you might consider something like GSEA, where you use the genes showing peaks in their promoter (or wherever else you're looking) as the gene set. This has the advantage that it looks at the genes as a group, rather than relying on individually significant adjusted p-values. One could easily consider even more complicated tests (e.g., what if many of the DE genes have promoters with low mappability?), but I'd probably avoid going really whole-hog without reason.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 11.3 years ago by Devon Ryan 105k
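
Since the question asked about `phyper`, here is a sketch relating it to the `fisher.test` table in the answer above. The one-sided (enrichment) Fisher p-value is the hypergeometric upper tail P(overlap >= 37). The helper name `enrichment_p` and the values of N tried here are arbitrary illustrations, not recommendations.

```r
# Compare fisher.test (one-sided) with phyper for a given background size N
enrichment_p <- function(N, overlap = 37, rna = 652, chip = 1370) {
  m <- matrix(c(overlap, rna - overlap, chip - overlap, N - rna - chip + overlap),
              ncol = 2)
  c(N        = N,
    expected = rna * chip / N,  # overlap expected by chance
    p_fisher = fisher.test(m, alternative = "greater")$p.value,
    p_phyper = phyper(overlap - 1, rna, N - rna, chip, lower.tail = FALSE))
}

# ASSUMPTION: illustrative background sizes only
res <- t(sapply(c(18000, 22000, 26000), enrichment_p))
print(res)
```

The two p-value columns agree. For N = 18000 the expected chance overlap (about 50) already exceeds the observed 37, so there is no evidence of enrichment; the expected overlap only drops to 37 when N is around 24000.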

Hi,

I followed your suggestion to calculate the p-value of my microarray and ChIP-seq data:

# bound chip-seq = 5673
# down microarray = 3975
# overlap = 1156
# N = 17970 

m <- matrix(c(1156,3975-1156,5673-1156,17970-3975-5673+1156), ncol=2,
            dimnames=list(c("ChIP.Sig", "ChIP.NoSig"), c("RNA.Sig","RNA.NoSig")))
fisher.test(m)

# Fisher's Exact Test for Count Data

data:  m
p-value = 0.0001287
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7958616 0.9299321
sample estimates:
odds ratio 
0.8604837

Then I wanted to check whether a random subset of the same size could be significant too; the overlap of the random set with the bound genes was 683. But when I do the calculation, the p-value is much more significant (p-value < 2.2e-16, the smallest value the Fisher test reports in R). I don't understand why a lower overlap can be more significant.

Am I missing something? Maybe I don't really need to check with a random gene set? This is for a reply to reviewers, who suggest that the binding of my TF won't be significant compared with a random subset. Is the p-value I got from the Fisher test enough for the reply?

Thanks in advance, Gema

ADD REPLY • link updated 8.4 years ago by GenoMax 153k • written 8.4 years ago by Gema Sanz ▴ 90
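
One way to read the output in the comment above, offered as a sketch rather than a definitive answer: `fisher.test` is two-sided by default, so a small p-value can also mean fewer shared genes than expected by chance, which is what an odds ratio below 1 indicates. The code below recomputes the expected overlap from the numbers in the comment and runs one-sided tests for the observed overlap (1156) and the random-set overlap (683), assuming, as the comment states, that the random set has the same size as the down-regulated list. The helper name `overlap_tests` is made up for this example.

```r
# Expected overlap, odds ratio and one-sided Fisher tests for a given overlap
overlap_tests <- function(overlap, chip = 5673, rna = 3975, N = 17970) {
  m <- matrix(c(overlap, rna - overlap, chip - overlap, N - rna - chip + overlap),
              ncol = 2)
  c(overlap    = overlap,
    expected   = chip * rna / N,  # overlap expected by chance
    odds_ratio = unname(fisher.test(m)$estimate),
    p_enriched = fisher.test(m, alternative = "greater")$p.value,
    p_depleted = fisher.test(m, alternative = "less")$p.value)
}

res <- rbind(observed = overlap_tests(1156), random = overlap_tests(683))
print(res)
```

Both overlaps fall below the roughly 1255 genes expected by chance, and 683 falls much further below, which is why its two-sided p-value is smaller. A claim of enrichment would need the overlap to exceed the expectation and `p_enriched` to be small.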

I have formatted your code correctly. In future use the icon shown below (after highlighting the text you want to format as code) when editing.

How To Ask Good Questions On Technical And Scientific Forums

ADD REPLY • link 8.4 years ago by GenoMax 153k