Question

Statistical analysis of overlap between RNA-seq and ChIP-seq datasets

0

2.6 years ago

Marco Pannone ▴ 810

Hey everybody

I am working on correlating results from RNA-seq and ChIP-seq together.

To simplify, let's say I have 2 datasets, x and y, each represented by a list of DEGs identified by RNA-seq analysis, and another dataset, z, represented by a list of genes that show enrichment of a particular histone mark in ChIP-seq analysis.

My aim is to overlap these 3 datasets (x, y and z) to evaluate how many genes are shared among all 3. I have done so by simply generating a Venn diagram.

However, I think I am missing statistical evidence. I am looking for a way to determine whether the overlaps I got are statistically significant. Is there a statistical test suitable for this case? Any other suggestion is highly appreciated.

Thanks in advance!

ChIP-seq RNA-seq • 1.1k views

updated 2.6 years ago by seidel 11k • written 2.6 years ago by Marco Pannone ▴ 810

1

This is a fairly simple problem if you simulate it.

The null hypothesis is that your overlap is indistinguishable from what you would get if these gene sets were just randomly sampled. So to generate your null, all you need to do is, for 10k+ iterations, randomly sample genes for your x, y, and z datasets, each time drawing as many genes as the corresponding original dataset contains, and record the three-way overlap. To reject the null at an alpha of 0.05, fewer than 5% of the simulated overlaps should be as large as or larger than your observed overlap.

For the universe of genes you're sampling from, you should restrict the gene set in the same way that you did in your analysis. For example, if you only considered protein-coding genes in your RNA-seq analysis, exclude everything but those from your gene universe.
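A minimal sketch of the simulation described above, in Python with only the standard library. The function name `triple_overlap_pvalue`, its parameters, and the add-one correction on the empirical p-value are illustrative assumptions, not something prescribed in this thread; `universe` is assumed to be the already-restricted list of genes you consider eligible.

```python
import random


def triple_overlap_pvalue(x, y, z, universe, n_iter=10_000, seed=0):
    """Empirical p-value for the three-way overlap of gene sets x, y and z.

    Null model: each set is an independent random draw, of the same size as
    the observed set, from the same gene universe.
    """
    universe = sorted(set(universe))  # fixed order keeps results reproducible
    x, y, z = set(x), set(y), set(z)
    observed = len(x & y & z)

    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(n_iter):
        # Draw random sets with the same sizes as the observed ones.
        sim_x = set(rng.sample(universe, len(x)))
        sim_y = set(rng.sample(universe, len(y)))
        sim_z = set(rng.sample(universe, len(z)))
        if len(sim_x & sim_y & sim_z) >= observed:
            as_extreme += 1

    # Add-one correction so the estimate is never exactly zero (assumption).
    p_value = (as_extreme + 1) / (n_iter + 1)
    return observed, p_value
```

Calling `triple_overlap_pvalue(x_genes, y_genes, z_genes, universe_genes)` returns the observed overlap and the fraction of simulated overlaps at least as large; with 10,000 iterations the smallest p-value this can report is about 1e-4.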

2.6 years ago by rpolicastro 13k

score 1 · Answer 1 · 2022-12-07

First off, I'm not a statistician, and I'm not aware of a specific test for a 3-way overlap. However, for the purpose of doing experiments, I find it useful to break this situation down into a series of questions that, when combined, give an informative answer.

To get dataset x, you study a process which reaches into the genes measured and hands you a set. If the process is random, the gene set will be random. Then you do a second experiment for dataset y, and you get a second gene set. If the process you are studying is the same, or related, for experiments 1 and 2, then the gene set that comes back from experiment 2 should resemble that of experiment 1.

You can use a hypergeometric test for this (see the urn problem). Experiment 1 tells you how to construct an urn of black and white balls for experiment 2. For example, of 15000 genes measured, experiment 1 hands you 500 as DE. So you set up an urn with 14500 white balls and 500 black balls, and when experiment 2 gives you 600 balls as a result, you can determine whether those 600 balls contain more of the black balls put there by experiment 1 than chance would predict, that is, whether your sample of 600 is enriched for black.

You can also use Fisher's exact test for this by constructing a 2x2 contingency table that cross-classifies every gene measured: rows are DE / not DE in experiment 1, columns are DE / not DE in experiment 2. You can then extend this to the ChIP-seq data by classifying genes as bound or not bound. I think there was some discussion here previously about how to combine P-values for the x/ChIP-seq overlap and the y/ChIP-seq overlap.
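The urn example can be sketched in Python with only the standard library. The function name `hypergeom_enrichment_p` is made up for illustration, and the overlap of 60 genes is a hypothetical number, since no observed overlap is given above. The upper tail computed here is the same quantity a one-sided Fisher's exact test gives on the corresponding 2x2 table.

```python
from math import comb


def hypergeom_enrichment_p(total, black, drawn, overlap):
    """P(at least `overlap` black balls) when drawing `drawn` balls without
    replacement from an urn of `total` balls, `black` of which are black."""
    upper = min(black, drawn)
    # Count the draws that contain i black balls, for i = overlap..upper.
    favourable = sum(
        comb(black, i) * comb(total - black, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total, drawn)


# Numbers from the urn example: 15000 genes measured, 500 DE in experiment 1
# (black balls), 600 DE in experiment 2 (balls drawn).
# The overlap of 60 is hypothetical; by chance alone you would expect about
# 600 * 500 / 15000 = 20 shared genes.
p_xy = hypergeom_enrichment_p(total=15000, black=500, drawn=600, overlap=60)
print(f"P(overlap >= 60) = {p_xy:.3g}")
```

The same function can be reused for the ChIP-seq comparison by treating bound genes as the black balls and a DE list as the draw.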