Hey everybody
I am working on correlating results from RNA-seq and ChIP-seq.
To simplify, let's say I have 2 datasets, x and y, each represented by a list of DEGs identified by RNA-seq analysis, and another dataset, z, represented by a list of genes that show enrichment of a particular histone mark in ChIP-seq analysis.
My aim is to overlap these 3 datasets (x, y and z) to evaluate how many genes overlap among all 3 datasets. I have done so by simply generating a Venn diagram.
However, I think I am missing statistical evidence. I am looking for a way to determine whether the overlaps I got are statistically significant. Is there a statistical test suitable for this case? Any other suggestions are highly appreciated.
Thanks in advance!
This is a fairly simple problem if you simulate it.
The null hypothesis is that your overlap is indistinguishable from what you would get if these gene sets were just randomly sampled. So to generate your null distribution, all you need to do is, for 10k+ iterations, randomly sample genes for your x, y, and z datasets, drawing as many genes for each as the original dataset contains, and record the three-way overlap each time. To reject the null at an alpha of 0.05, fewer than 5% of the simulated overlaps should be as large as or larger than your observed overlap. That fraction is your empirical p-value.
For the universe of genes you're sampling from, you should restrict the gene set in the same way that you did in your analysis. For example, if you only considered protein-coding genes in your RNA-seq analysis, exclude everything but those from your gene universe.
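Here is a minimal Python sketch of the simulation described above, in case it helps. The function name `triple_overlap_pvalue` and its parameters (`universe`, `n_iter`, `seed`) are my own placeholders, not from any package. It assumes each gene list is a collection of gene IDs and that `universe` is the restricted background you actually tested. The "+1" in the p-value is a common convention so that the estimate is never exactly zero; treat it as one reasonable choice rather than the only one.

```python
import random

def triple_overlap_pvalue(x, y, z, universe, n_iter=10_000, seed=0):
    """Empirical p-value for the three-way overlap of gene sets x, y, z.

    All names here are illustrative. Genes outside `universe` are dropped
    so that observed and simulated sets share the same background.
    """
    rng = random.Random(seed)
    background = sorted(set(universe))  # fixed order for reproducibility
    bg_set = set(background)
    x, y, z = set(x) & bg_set, set(y) & bg_set, set(z) & bg_set

    observed = len(x & y & z)

    # Count simulations whose overlap is at least as large as the observed one.
    hits = 0
    for _ in range(n_iter):
        sx = set(rng.sample(background, len(x)))
        sy = set(rng.sample(background, len(y)))
        sz = set(rng.sample(background, len(z)))
        if len(sx & sy & sz) >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    p_value = (hits + 1) / (n_iter + 1)
    return observed, p_value
```

You would call it as `observed, p = triple_overlap_pvalue(x_genes, y_genes, z_genes, universe_genes)` and reject the null if `p` is below 0.05. Note that the smallest p-value you can get is 1 / (n_iter + 1), so increase `n_iter` if you need finer resolution.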