Suppose I have a list of genes
mygenes: gene13, gene2, gene111.
And I am given two other lists of genes
gene_categoryA: gene1, gene2, gene44, gene111.
gene_categoryB: gene13, gene34.
After comparing mygenes against gene_categoryA, we see that there are 2 genes of categoryA in mygenes.
What I want to know is whether the occurrence of these 2 genes (gene2 and gene111) is significantly more than expected.
What is the best way to go about it?
Thanks Damian. What if I have more than 2 gene categories to compare, e.g. gene_categoryA, ..., gene_categoryK? Another way to look at it is that the balls in the urn are now not only red and black, but of more colours. How can I modify your code for that? The task is still the same, namely to check whether my set of genes is significantly enriched in genes from gene_categoryA.
You would have to use a multivariate hypergeometric distribution. I am not sure if scipy has that function.
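For what it's worth, the multivariate hypergeometric probability is easy to write down by hand. Below is a minimal sketch using only the standard library; the function name is made up for illustration, and it assumes the categories do not overlap and that the genes listed in the question are the whole background. Newer SciPy releases also provide scipy.stats.multivariate_hypergeom, which should give the same number.

```python
from math import comb, prod

def multivariate_hypergeom_pmf(counts, sizes):
    """P(drawing exactly counts[i] genes from category i, for every i).

    sizes[i]  -- number of genes in category i (categories assumed disjoint)
    counts[i] -- number of drawn genes that fall in category i
    """
    M = sum(sizes)    # total genes in the urn (ASSUMPTION: all listed genes)
    N = sum(counts)   # size of the drawn gene list
    ways = prod(comb(K, k) for K, k in zip(sizes, counts))
    return ways / comb(M, N)

# Toy numbers from the question: categoryA has 4 genes, categoryB has 2,
# and mygenes contains 2 genes from A and 1 gene from B.
p = multivariate_hypergeom_pmf(counts=[2, 1], sizes=[4, 2])
print(p)
```

This gives the probability of one exact split across all categories, which is a different question from whether a single category is over-represented.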
If the question is the same (i.e. check whether the set of genes is significantly enriched in genes from gene_categoryA), then I don't see why it should matter how many categories there are. After all, we can lump all the other categories together as "non-A categories" and proceed the same way to calculate the probability that category A is over-represented in our gene list. Am I missing something here?
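A minimal sketch of that "A versus non-A" idea, using only the standard library. The function name and the choice of background are assumptions made for illustration: here the background is just the union of the two categories listed in the question, whereas a real analysis would normally use all annotated genes.

```python
from math import comb

gene_categoryA = {"gene1", "gene2", "gene44", "gene111"}
gene_categoryB = {"gene13", "gene34"}
mygenes = {"gene13", "gene2", "gene111"}

# ASSUMPTION: the background ("urn") is only the genes listed in the question.
universe = gene_categoryA | gene_categoryB

def enrichment_pvalue(selected, category, universe):
    """One-sided hypergeometric test: P(X >= k) for category vs. everything else."""
    M = len(universe)               # all genes in the urn
    n = len(category & universe)    # "red" balls: genes in the category
    N = len(selected & universe)    # number of balls drawn: my gene list
    k = len(selected & category)    # observed overlap
    # Sum the upper tail of the hypergeometric pmf, from k up to the maximum overlap.
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return k, tail / comb(M, N)

k, p_value = enrichment_pvalue(mygenes, gene_categoryA, universe)
print(k, p_value)
```

Every gene outside category A, whatever category it belongs to, is simply counted in M - n, so the number of other categories never enters the calculation. With a background this tiny the p-value is of course not meaningful; the numbers are only there to show the mechanics.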
@Damian: I was wondering why you are subtracting 1 from arg[4] when you are calculating the survival function. The same question can be asked about adding 1 to arg[4] when calculating the CDF. Is it because we are working with discrete values, and to include the instance X=x in the calculation we have to either add or subtract 1?
I always forget how these two functions (cdf, sf) go, in terms of whether they are off by one or not when you want to do > or >=.
I think I got an e-mail from someone asking this same question earlier this year. It turns out that the "p-value <=" portion of the above script already calculates <= (the CDF includes the observed value), so it is unnecessary to add 1. The "p-value >=" portion is still correct: the sf (survival function) calculates a strict >, so subtracting 1 from the argument turns it into >=.
I've edited the post to reflect this. I feel bad now for propagating wrong information.
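To make the off-by-one concrete, here is a small sketch that rebuilds cdf and sf by hand with the same conventions scipy.stats.hypergeom uses (cdf(k) is P(X <= k), sf(k) is P(X > k)). The helper functions are written out for illustration and are not the code from the post; the numbers are toy values taken from the gene lists in the question, assuming those six genes are the whole background.

```python
from math import comb

def pmf(i, M, n, N):
    """P(X = i) for a hypergeometric draw of N from M, with n 'successes' in the urn."""
    return comb(n, i) * comb(M - n, N - i) / comb(M, N)

def cdf(k, M, n, N):
    """P(X <= k): inclusive of k, so no +1 is needed for '<='."""
    return sum(pmf(i, M, n, N) for i in range(0, k + 1))

def sf(k, M, n, N):
    """P(X > k): strictly greater, i.e. 1 - cdf(k)."""
    return 1.0 - cdf(k, M, n, N)

# ASSUMPTION: toy values. 6 genes in total, 4 in category A, 3 drawn, 2 observed in A.
M, n, N, k = 6, 4, 3, 2

p_le = cdf(k, M, n, N)      # P(X <= 2)
p_gt = sf(k, M, n, N)       # P(X >  2), excludes the observed value
p_ge = sf(k - 1, M, n, N)   # P(X >= 2), includes the observed value

print(p_le, p_gt, p_ge)
```

The difference between p_ge and p_gt is exactly pmf(k, M, n, N), the probability of the observed value itself, which is why the -1 is needed for the ">=" p-value but no +1 is needed for the "<=" one.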