Question

Genomic Control Sets For Tests Of Annotational Enrichment

2

Entering edit mode

14.2 years ago

Jeremy Leipzig 23k

When testing for some genomic annotational features (e.g. gene enrichment), often a control set of random sites is computer-generated for comparison.

The questions:

why not just use the entire genome as a control - since that is the entire population. What is the use of introducing sampling error at this step?
The size of a control set - should it be 10X the size of the experimental set? What are some heuristics for choosing a size?
The appropriate statistic for comparing discrete annotational counts (Fisher's Exact Test, chi-square test, or glm)

statistics • 5.2k views

ADD COMMENT • link updated 14.2 years ago by Khader Shameer 18k • written 14.2 years ago by Jeremy Leipzig 23k

0

Entering edit mode

let's stick to NGS for this discussion. Microarrays have their own headaches.

ADD REPLY • link 14.2 years ago by Jeremy Leipzig 23k

Ram · Answer 1 · 2011-05-03

**Why not just use the entire genome as a control - since that is the entire population. What is the use of introducing sampling error at this step?

There is absolutely no issue in using annotations from whole genome. Definition of the population often depends on your experimental platform (say all genes in a microarray, whole exomes in case of exome sequencing, entire set of genes with annotations in case of genome-wide annotation enrichment.) Based on the background, enrichment calculations are classified into 3 categories as singular enrichment analysis(SEA), gene set enrichment analysis (GSEA) and modular enrichment analysis (MEA). Basic difference between these three classes of enrichment algorithms are in the way the enrichment p-values are calculated. (see Heng et. al)

In SEA-based approach, annotations terms of subset of genes are assessed one at a time against a list of background genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance and individual terms beyond the p-value cut-of (P-value ≤ 0.05). FunctAssociate and Onto-express are two SEA based enrichment analysis tools.

GSEA approaches are similar, but consider all genes during the enrichment analysis, instead of a pre-defined threshold based genes as in the SEA approach. GSEA from broad is an example of GSEA based tool.

MEA based programs like Ontologizer 2.0 and topGO use the relationship that exist between the annotations. These programs were reported to attain better sensitivity and specificity due to the consideration of GO term relationships.

The size of a control set - should it be 10X the size of the experimental set? What are some heuristics for choosing a size?

I haven't heard of a well-defined size for the control set. In enrichment calculation you often have a background population (X) of genes with Y annotations and perturbed set of genes from the > population (x) with y annotations. You will be using standard statistical tests / MTC to derive the p-value.

The appropriate statistic for comparing discrete annotational counts (Fisher's Exact Test, chi-square test, or glm)

Fisher's Exact Test / Chi-square test are often used. Statistical / algorithmic concepts are similar among various enrichment calculation tools. For a detailed overview of GO based enrichment > calculation methods see a review on 68 tools published in 2008. You can see minor-to-medium level > differences in the way the nodes in GO DAGs are treated, computation of the statistics etc. Statistical methods to derive P-value includes Fisher’s exact test, hypergeometric function, binomial test, χ2 test or combination of these methods.

PS. This is adapted from one of my another answer

score 1 · Answer 2 · 2011-05-03

There are certainly competing views for how to answer these questions ... So I'll present my thoughts but take them with a grain of salt.

why not just use the entire genome as a control - since that is the entire population. What is the use of introducing sampling error at this step?

Because this may not be appropriate. For example in microarray experiments not every gene in the genome is measured. So the background should be all genes on the chip (everything that you could have measured). With modern chips this is not as big of a deal since most chips contain all genes but when dealing with historical data (or data from custom chips ... ie. chips which only measure "drug-able targets") its very important to know what the background is.

The size of a control set - should it be 10X the size of the experimental set? What are some heuristics for choosing a size?

I assume by "control set" you mean the "normal" samples. In order to calculate this you need some idea of the effect size that you're actually trying to measure. Obviously the smaller the effect the more samples you'll need. However, I have yet to see a genomics experiment which actually sets forth their reasoning for choosing the number of experimental-to-normal samples ... most are simply chosen based on budgetary reasons.

The appropriate statistic for comparing discrete annotational counts (Fisher's Exact Test, chi-square test, or glm)

I personally prefer hypergeometric or Fisher's Exact (depending on size constraints) however nowadays I don't actually calculate my own enrichment values. Keeping annotations up-to-date is a full time job. So I use the DAVID tool which calculates a hypergeo, fisher's test and an EASE score which uses some sort of multiple-testing correction.

Hope that helps.