Why not just use the entire genome as a control, since that is the entire population? What is the use of introducing sampling error at this step?
There is absolutely no issue in using annotations from the whole genome. The definition of the population often depends on your experimental platform (say, all genes on a microarray, whole exomes in the case of exome sequencing, or the entire set of annotated genes in the case of genome-wide annotation enrichment). Enrichment calculations are commonly classified into three categories: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), and modular enrichment analysis (MEA). The basic difference between these three classes of enrichment algorithms is the way the enrichment p-values are calculated (see Heng et al.).
In the SEA-based approach, the annotation terms of a subset of genes are assessed one at a time against a list of background genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance, and individual terms that pass the p-value cutoff (e.g. p-value ≤ 0.05) are reported as enriched. FuncAssociate and Onto-Express are two SEA-based enrichment analysis tools.
GSEA approaches are similar, but consider all genes during the enrichment analysis instead of only the genes selected by a pre-defined threshold, as in the SEA approach. GSEA from the Broad Institute is an example of a GSEA-based tool.
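To make the "consider all genes" idea concrete, here is a minimal Python sketch of an unweighted running-sum score over a ranked gene list, in the spirit of the Kolmogorov-Smirnov-like statistic used by GSEA-style methods. It is a simplification and not the Broad tool's implementation: the function name `running_sum_score` and the toy gene IDs are made up for illustration, and real tools additionally weight hits by the ranking metric and assess significance by permutation.

```python
def running_sum_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: all genes, ordered from most to least "interesting"
                  (e.g. by differential expression).
    gene_set:     genes carrying the annotation being tested.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it entirely")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a gene in the set, step down otherwise; the walk ends at zero.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example (hypothetical gene IDs): the set sits at the top of the ranking.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = running_sum_score(ranked, {"g1", "g2"})
bottom_score = running_sum_score(ranked, {"g5", "g6"})
```

A set concentrated at the top of the ranking gives a large positive score, one concentrated at the bottom a large negative score, and no gene is discarded by a threshold beforehand.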
MEA-based programs like Ontologizer 2.0 and topGO use the relationships that exist between the annotations. These programs were reported to attain better sensitivity and specificity because they take GO term relationships into account.
The size of a control set - should it be 10X the size of the experimental set? What are some heuristics for choosing a size?
I haven't heard of a well-defined size for the control set. In an enrichment calculation you typically have a background population of X genes, Y of which carry a given annotation, and a perturbed set of x genes drawn from that population, y of which carry the annotation. You then use standard statistical tests with multiple testing correction (MTC) to derive the p-value.
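As a rough illustration of that setup, the sketch below computes an upper-tail hypergeometric p-value (equivalent to a one-sided Fisher's exact test) using only the Python standard library. It assumes Y and y count the genes carrying one particular annotation term in the background and in the perturbed set, respectively; the function name `enrichment_p_value` and the toy numbers are made up for illustration, and multiple testing correction across terms is not shown.

```python
from math import comb


def enrichment_p_value(X, Y, x, y):
    """P(K >= y) where K is the number of annotated genes in a random
    draw of x genes from a background of X genes, Y of them annotated."""
    if not (0 <= Y <= X and 0 <= x <= X and 0 <= y <= min(x, Y)):
        raise ValueError("counts are inconsistent with each other")

    total = comb(X, x)  # number of ways to draw x genes from the background
    # Sum the probability of seeing y, y+1, ... annotated genes in the draw.
    # comb() returns 0 for impossible combinations, so no extra bounds check is needed.
    tail = sum(comb(Y, k) * comb(X - Y, x - k) for k in range(y, min(x, Y) + 1))
    return tail / total


# Toy numbers: 10 background genes, 4 annotated; 3 perturbed genes, 2 annotated.
p = enrichment_p_value(X=10, Y=4, x=3, y=2)
print(round(p, 4))  # 0.3333
```

Note that the background size X enters the calculation directly, which is why the choice of population (whole genome, array content, exome) changes the p-value even when x and y stay the same.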
The appropriate statistic for comparing discrete annotational counts (Fisher's Exact Test, chi-square test, or glm)
Fisher's exact test and the chi-square test are often used. The statistical and algorithmic concepts are similar among the various enrichment calculation tools. For a detailed overview of GO-based enrichment calculation methods, see a review of 68 tools published in 2008. You will see minor-to-medium differences in the way the nodes in the GO DAGs are treated, how the statistics are computed, etc. Statistical methods used to derive the p-value include Fisher's exact test, the hypergeometric test, the binomial test, the χ2 test, or a combination of these methods.
PS. This is adapted from another answer of mine.
let's stick to NGS for this discussion. Microarrays have their own headaches.