I am using DAVID's Functional Annotation Clustering tool, and I wonder how DAVID's algorithm tests the null hypothesis that the enrichment of an annotation is purely due to chance. Could anyone explain this to me in a simple way?
I am a little confused because, for example,
one of the annotation clusters has only 7 genes, AND its enrichment score is 1.23 with a p-value of 2.8E-2:
Clustered terms are DNA-binding region:ETS (7 genes), Ets (7 genes), Domain:PNT (4 genes), ETS (7 genes), SAM PNT (4 genes)
But another annotation cluster has 86 genes, BUT its enrichment score is only 0.05 with a p-value of 1.0E0:
Clustered terms are mitochondrial lumen (19 genes), mitochondrial matrix (19 genes), mitochondrion (86 genes), mitochondrial part (43 genes), mitochondrion (59 genes)
So a higher number of overlapping genes between the GOTERMs doesn't necessarily mean a higher enrichment score and a lower p-value? I am still confused about how the first annotation cluster above, with only 7 genes overlapping among its GOTERMs, gets a lower p-value than the second cluster, where at least 19 genes overlap among the GOTERMs.
Thank you!
These values make sense to me: the higher the enrichment score, the better, and higher enrichment scores come with lower p-values, because the p-value is the probability of obtaining the corresponding enrichment by chance.
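The enrichment score does not depend on the raw number of overlapping genes. DAVID takes a plain gene list, and for each term it compares the number of list genes carrying the annotation with the number expected from the background, using the EASE score (a modified, more conservative Fisher exact test). The enrichment score of a cluster is the geometric mean of its member terms' p-values on a -log10 scale. Thus it makes sense that you can get a higher enrichment score with few genes: 7 genes from a small family such as the ETS domain proteins can be far more than expected by chance, while 86 mitochondrial genes may be roughly what a random list of the same size would contain, because mitochondrion is a very large category. But it is hard to tell without knowing your data...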
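To make the point about small versus large categories concrete, here is a minimal Python sketch. It is not DAVID's code: it uses a plain hypergeometric upper tail with one hit removed, in the spirit of the EASE score, and DAVID's exact table construction may differ. The list size (500), background size (20000), and category sizes (28 ETS genes, 3400 mitochondrial genes) are made-up numbers for illustration only, and the function names are hypothetical.

```python
from math import comb, log10

def upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    hits = range(k, min(n, K) + 1)
    return sum(comb(K, i) * comb(N - K, n - i) for i in hits) / total

def ease_like_pvalue(list_hits, list_total, pop_hits, pop_total):
    """Conservative p-value: one hit is removed before testing (EASE-style assumption)."""
    return upper_tail(max(list_hits - 1, 0), list_total, pop_hits, pop_total)

def enrichment_score(pvalues):
    """Geometric mean of the member p-values, expressed on a -log10 scale."""
    return -sum(log10(p) for p in pvalues) / len(pvalues)

# Hypothetical numbers: a 500-gene list against a 20000-gene background.
p_ets = ease_like_pvalue(7, 500, 28, 20000)      # small category, ~0.7 hits expected
p_mito = ease_like_pvalue(86, 500, 3400, 20000)  # huge category, ~85 hits expected

print(f"ETS:          p = {p_ets:.2e}")
print(f"mitochondrion: p = {p_mito:.2f}")
print(f"score of a cluster with p-values 0.01 and 0.01: {enrichment_score([0.01, 0.01]):.2f}")
```

With these assumed sizes, 7 hits in the small category are about ten times the expected count, while 86 hits in the large category are almost exactly the expected count, so only the first one looks enriched even though it involves far fewer genes.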