Question

DAVID Functional Annotation Clustering Analysis

9.8 years ago

parksuhong ▴ 10

I am using DAVID's Functional Annotation Clustering tool, and I wonder how DAVID's algorithm tests the null hypothesis that the enrichment of an annotation is purely due to chance. Could anyone explain it to me in a simple way?

I am a little confused because, for example:

One annotation cluster has only 7 genes, AND its enrichment score is 1.23 with a p-value of 2.8E-2:

The clustered terms are DNA-binding region:ETS (7 genes), Ets (7 genes), Domain:PNT (4 genes), ETS (7 genes), SAM PNT (4 genes)

But another annotation cluster has 86 genes, BUT its enrichment score is only 0.05 with a p-value of 1.0E0:

The clustered terms are mitochondrial lumen (19 genes), mitochondrial matrix (19 genes), mitochondrion (86 genes), mitochondrial part (43 genes), mitochondrion (59 genes)

So a higher number of overlapping genes between the GOTERMs doesn't necessarily mean a higher enrichment score and a lower p-value? I am still confused about how the first annotation cluster above, with only 7 genes overlapping among its GOTERMs, has a lower p-value than the second cluster, where at least 19 genes overlap among the GOTERMs.

Thank you!

RNA-Seq ChIP-Seq sequencing • 6.0k views

updated 2.9 years ago by Ram 45k • written 9.8 years ago by parksuhong ▴ 10

These values make sense to me: the higher the enrichment score the better, and consequently higher enrichment scores come with lower p-values, because the p-value specifies the likelihood of obtaining the corresponding enrichment score by chance.

The enrichment score depends on the fold-change (or intensity values) of your genes and not on the overlap. Thus it makes sense that you can get a higher enrichment score with few genes. But it is hard to tell without knowing your data...

9.8 years ago by Manuel Landesfeind ★ 1.4k

Ram · Answer 1 · 2015-10-06

This most likely has to do with your sample size (effect size). One should be very careful when a small absolute number of genes is used in such an analysis. Roughly speaking, going from 1 to 2 is doubling by adding only 1; going from 10 to 20 is also doubling, but by adding 10. These are not the same, although you double in both cases. Maybe this will help: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/
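
To make the "doubling" point concrete, here is a rough Python sketch. It is not DAVID's actual code: it uses a plain one-sided hypergeometric test, whereas DAVID's EASE score is, as far as I understand the documentation, a more conservative modification of Fisher's exact test. The background size, list size, and term sizes below are made-up numbers for illustration only, and the function names are my own. The second function follows my reading of DAVID's description of the cluster enrichment score as the geometric mean of the member terms' p-values on a -log10 scale. Under that reading, a score of 1.23 corresponds to a typical member p-value of about 10^-1.23 ≈ 0.059, and a score of 0.05 to about 10^-0.05 ≈ 0.89, no matter how many genes the terms contain.

```python
from math import comb, log10


def enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    k: genes in your list annotated with the term
    n: size of your gene list
    K: genes in the background annotated with the term
    N: size of the background
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def cluster_enrichment_score(p_values):
    """Mean of -log10(p), i.e. -log10 of the geometric mean of the p-values.

    Assumption: this mirrors how the cluster score is described in DAVID's docs.
    """
    return sum(-log10(p) for p in p_values) / len(p_values)


# Hypothetical numbers: 20,000 background genes, 1,000 genes in the list.
N, n = 20000, 1000

# Both terms are hit twice as often as expected (2-fold enrichment):
p_small = enrichment_p(2, n, 20, N)    # 1 hit expected, 2 observed
p_large = enrichment_p(20, n, 200, N)  # 10 hits expected, 20 observed

print(f"2 hits vs 1 expected:   p = {p_small:.3g}")
print(f"20 hits vs 10 expected: p = {p_large:.3g}")
print(f"score for member p-values of 0.01 and 0.1: "
      f"{cluster_enrichment_score([0.01, 0.1]):.2f}")
```

The same fold enrichment gives a much smaller p-value when it is backed by more genes. The reverse also holds: a large term such as mitochondrion can contribute many genes to your list and still show no enrichment if that count is close to what its size in the background predicts.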