Hi, I got stuck with an issue in analyzing my RNA-seq data. I have three tratments (Control, D1 and D2), and within each tratment I calculate DE genes between two groups (G1 and G2). Then I identify genes that become DE in both D1 and D2 compared to the control. Summarizing, I have (see also Figure):
- 15,000 genes representing the entire transcriptome
- 1,000 genes representing those that become DE as effect of treatment, filtering those that are DE in the control. This group is made as union of D1 and D2 DE genes, respectively 450 and 600 genes.
- Intersecting the DE genes, I found that 50 of them are common to both D1 and D2, 400 are specific to D1 and 550 are specific to D2.
Now I want to understand if this overlap is due by chance or not. So I'd like to have a feedback on my solution and, if wrong, a possible answer to my problem.
I have my observation: out of 1,000 DE genes common to both D1 and D2, 50 of them overlap. Then, I drawn 100 genes from both D1 and D2, and I counted how many overlap (bootstrapped 1,000 times). Let's say that the result of bootstrapped drawing is 10. Finally, I tested if the overlap observed and the one got by random drawing is different through a hypergeometric distribution.
I've done everything with R, using phyper
as follows:
Here I have to use 4 values:
x = 10 (overlapping genes from drawing)
m = 50 (overlapping observed)
n = 950 (total DE genes - m)
k = 100 (number of genes drawn from D1 and D2)
lower.tail=FALSE
I got 0.007
which, if I understood correctly and my procedure is correct as well, means that the overlap I observe between D1 and D2 is not due by chance.
Any opinion about my approach?
Thanks!
There also exists a Bioconductor package for these type of tests, GeneOverlap
Thanks for your reply! Can you clarify me which is the weakness of my approach: comparing the overlap between D1 and D2 to the overlap between two subgroups randmoly drawn from both D1 and D2? My supervisor suggested an approach similar to yours, so I acknoledge that my way is wrong, but I'm still failing to see the weakness of my approach.
If you randomly draw an infinite number of times then you'll approach the p-value from fisher's test.