If you want to know the chance of exactly q+1 overlapping genes, you should subtract two phyper calls like this:
phyper(q, m, n, k, lower.tail = F) - phyper(q+1, m, n, k, lower.tail = F)
(where q is the number of overlaps minus 1)
This is because phyper with lower.tail = F gives the chance of q+1 overlaps OR MORE. So subtracting the chance of q+2 overlaps or more gives you the probability of EXACTLY q+1 overlaps.
I checked it with the following script, which prints the probability distribution. The probabilities sum to 1 here, as they should.
tot <- 0
cat('o', 'p', '\n', sep = ' ')  # header: overlap size, probability
for (hits in 0:100) {
  # P(X = hits) = P(X > hits - 1) - P(X > hits)
  p <- phyper(q = hits - 1, m = 100, n = 20000 - 100, k = 100, lower.tail = FALSE) -
    phyper(q = hits, m = 100, n = 20000 - 100, k = 100, lower.tail = FALSE)
  tot <- tot + p
  cat(hits, p, '\n', sep = ' ')
}
tot
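As a cross-check, base R's dhyper returns the point probability directly, so the difference of two upper tails should agree with it. A minimal sketch, reusing the same illustrative numbers as the script above (two lists of 100 genes out of 20000 in total):

```r
# Parameters reused from the script above (illustrative values)
m <- 100           # genes in the first list
n <- 20000 - 100   # genes not in the first list
k <- 100           # genes in the second list
hits <- 0:100      # every possible overlap size

# P(X = hits) as a difference of two upper tails: P(X > hits - 1) - P(X > hits)
p_diff <- phyper(hits - 1, m, n, k, lower.tail = FALSE) -
  phyper(hits, m, n, k, lower.tail = FALSE)

# P(X = hits) taken directly from the density function
p_exact <- dhyper(hits, m, n, k)

all.equal(p_diff, p_exact)  # compares the two vectors within numerical tolerance
sum(p_exact)                # total probability over all overlap sizes
```

If all you need is the exact probability, dhyper(overlaps, m, n, k) is probably the simpler call; the subtraction is mainly useful for seeing how the tail probabilities relate.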
I thought I'd mention this because, while I was reading this, I was under the impression that the original question asked for the chance of an EXACT number of overlaps.
Here is a good post on the Stack Exchange stats Q&A about using the hypergeometric distribution for list overlap.
Yes, this is the correct way to do this.
Hello,
I have a similar question: can I use R's phyper in the above way to calculate the probability of an overlap occurring by chance between two lists of hit genes identified by two different algorithms on the same tissue?
Thank you very much!
I think so, as long as the total number of genes is identical between both runs. Note that the probability is that of two independent random draws ('totally random'), which may not be such a good benchmark for comparing two algorithms.
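For the two-list case, a minimal sketch might look like the following. All numbers are made up for illustration and the variable names are hypothetical; the key assumption is that both algorithms were run on the same background of N genes.

```r
# Hypothetical numbers, for illustration only
N       <- 20000  # total genes tested by both algorithms (same background)
size_a  <- 150    # hit genes reported by algorithm A
size_b  <- 120    # hit genes reported by algorithm B
overlap <- 12     # genes reported by both

# P(X >= overlap) under two independent random draws.
# phyper(q, ..., lower.tail = FALSE) gives P(X > q), so pass overlap - 1.
p_at_least <- phyper(overlap - 1, m = size_a, n = N - size_a, k = size_b,
                     lower.tail = FALSE)

# P(X = overlap): the chance of exactly this many shared genes
p_exactly <- dhyper(overlap, m = size_a, n = N - size_a, k = size_b)

p_at_least
p_exactly
```

The "at least" version is the one usually reported as an enrichment p-value; the "exactly" version is the quantity discussed at the top of this thread.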
Are the two lists from the same cell type?