Genomic Control Sets For Tests Of Annotational Enrichment

When testing for enrichment of genomic annotation features (e.g., gene enrichment), a control set of random sites is often computer-generated for comparison.

The questions:

  • Why not just use the entire genome as a control, since that is the entire population? What is the use of introducing sampling error at this step?
  • The size of a control set: should it be 10X the size of the experimental set? What are some heuristics for choosing a size?
  • What is the appropriate statistic for comparing discrete annotation counts (Fisher's exact test, chi-square test, or a GLM)?

Let's stick to NGS for this discussion. Microarrays have their own headaches.


**Why not just use the entire genome as a control, since that is the entire population? What is the use of introducing sampling error at this step?**

There is absolutely no issue in using annotations from the whole genome. The definition of the population often depends on your experimental platform (say, all genes on a microarray, the whole exome in the case of exome sequencing, or the entire set of annotated genes in the case of genome-wide annotation enrichment). Based on the background, enrichment calculations are classified into three categories: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), and modular enrichment analysis (MEA). The basic difference between these three classes of enrichment algorithms lies in the way the enrichment p-values are calculated (see Huang et al.).

In the SEA-based approach, annotation terms of a subset of genes are assessed one at a time against a list of background genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance, and individual terms beyond a p-value cut-off (e.g., P ≤ 0.05) are reported. FuncAssociate and Onto-Express are two SEA-based enrichment analysis tools.
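
To make the SEA calculation concrete, here is a minimal Python sketch of the p-value for a single annotation term; all counts are invented for illustration:

```python
# One-term SEA-style test: probability of observing >= k annotated genes
# in the study set by chance, given the background frequency of the term.
from scipy.stats import hypergeom

N = 10000  # genes in the background population
K = 300    # background genes carrying the annotation term
n = 200    # genes in the study (perturbed) set
k = 15     # study genes carrying the term

expected = n * K / N              # frequency expected by chance
p = hypergeom.sf(k - 1, N, K, n)  # upper tail: P(X >= k)
print(f"observed {k}, expected {expected:.1f}, p = {p:.3g}")
```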

GSEA approaches are similar, but consider all genes during the enrichment analysis instead of a pre-defined, threshold-based subset of genes as in the SEA approach. GSEA from the Broad Institute is an example of a GSEA-based tool.
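
For intuition, here is a deliberately simplified, unweighted sketch of GSEA's running-sum statistic; the real method weights hits by the ranking metric and assesses significance by permutation, and the gene names here are hypothetical:

```python
# Unweighted running-sum (KS-like) enrichment score over a ranked gene list:
# step up at each gene-set hit, step down at each miss, and report the
# maximum deviation from zero.
def enrichment_score(ranked_genes, gene_set):
    n_hit = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = [f"g{i}" for i in range(1, 101)]  # genes ranked by differential expression
print(enrichment_score(ranked, {"g1", "g3", "g5", "g70"}))
```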

MEA-based programs like Ontologizer 2.0 and topGO use the relationships that exist between the annotations. These programs were reported to attain better sensitivity and specificity because they take GO term relationships into account.

**The size of a control set: should it be 10X the size of the experimental set? What are some heuristics for choosing a size?**

I haven't heard of a well-defined size for the control set. In an enrichment calculation you often have a background population (X) of genes with Y annotations and a perturbed set of genes from the population (x) with y annotations. You then use standard statistical tests and multiple-testing correction (MTC) to derive the p-values.
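
In code, that step looks roughly like this: one hypergeometric test per term, followed by Benjamini-Hochberg correction via statsmodels (all counts invented):

```python
# Per-term hypergeometric p-values followed by Benjamini-Hochberg FDR.
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

N, n = 10000, 200  # background size (X), study-set size (x)
terms = {"termA": (300, 15), "termB": (50, 1), "termC": (1200, 40)}  # term: (Y, y)

pvals = [hypergeom.sf(y - 1, N, Y, n) for Y, y in terms.values()]
_, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for name, p, q in zip(terms, pvals, p_adj):
    print(f"{name}: p = {p:.3g}, BH-adjusted = {q:.3g}")
```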

**What is the appropriate statistic for comparing discrete annotation counts (Fisher's exact test, chi-square test, or a GLM)?**

Fisher's exact test and the chi-square test are often used. Statistical and algorithmic concepts are similar among the various enrichment calculation tools. For a detailed overview of GO-based enrichment calculation methods, see the review of 68 tools published in 2008. You can see minor-to-medium differences in the way the nodes in the GO DAG are treated, in the computation of the statistics, etc. Statistical methods to derive the p-value include Fisher's exact test, the hypergeometric distribution, the binomial test, the χ2 test, or a combination of these methods.
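
On a single 2x2 table the methods are easy to compare; a quick scipy sketch with made-up counts (note that the one-sided Fisher p-value is exactly the hypergeometric tail):

```python
# The same made-up 2x2 table run through the common enrichment statistics.
from scipy import stats

k, n = 40, 200     # annotated genes in the study set / study-set size
K, N = 300, 10000  # annotated genes in the background / background size

table = [[k, n - k], [K - k, N - K - (n - k)]]

_, p_fisher = stats.fisher_exact(table, alternative="greater")
chi2, p_chi2, _, _ = stats.chi2_contingency(table, correction=False)
p_binom = stats.binomtest(k, n, p=K / N, alternative="greater").pvalue
p_hyper = stats.hypergeom.sf(k - 1, N, K, n)  # identical to one-sided Fisher

print(f"Fisher: {p_fisher:.3g}  chi2: {p_chi2:.3g}  "
      f"binomial: {p_binom:.3g}  hypergeometric: {p_hyper:.3g}")
```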

PS: This is adapted from another answer of mine.

Do the statistics change when you are using an entire known "population" of features (i.e., all 3 billion genomic sites, all genes) vs. a sample? Or is it considered just a large sample?

Yes. I have done some analysis using different levels of annotation (GO as well as non-DAG annotations like protein domains, disease, etc.; unpublished data), and I noticed that the p-value deviates depending on the definition of the background population.
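
A toy illustration of that deviation: the same study set tested against two different background definitions (all counts invented):

```python
# Same study set, two background definitions: the p-value shifts with the
# background even though the observed counts do not change.
from scipy.stats import fisher_exact

n, k = 200, 30  # study-set size, annotated genes in it

backgrounds = {"whole genome": (20000, 1200), "assayed genes only": (8000, 900)}
for label, (N, K) in backgrounds.items():
    table = [[k, n - k], [K - k, N - K - (n - k)]]
    _, p = fisher_exact(table, alternative="greater")
    print(f"background = {label}: p = {p:.3g}")
```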

Let's stick to the basic types of genomic annotation: genes, CpG islands, repeats. If 60% of my reads fall into genes (TUs) and we know 33% of the genome is in TUs, why should I bother with a sampling of control sites?

If you are interested in finding the enrichment of genes, CpG islands, or repeats, you only need a background population with the annotations and a perturbed subset of the population that you would like to analyze for enrichment with respect to the background. It is not clear to me why you need to do explicit sampling of control sites in enrichment analysis. Also, I am not sure what you mean by TUs here.

TUs are transcription units (genes).

Thanks Jeremy. I would like to point you to the statistical test performed in GREAT too; see the section on the binomial test at http://great.stanford.edu/help/index.php/Statistics#What_is_the_binomial_test_formally.3F. That scenario sounds similar to yours: we only need the population in base pairs, the subset of perturbed base pairs, and the annotations. I am still not getting why you need control data in the case of annotation enrichment; maybe I am missing something important.
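
For the 60%-of-reads-in-TUs example above, a GREAT-style binomial test needs no sampled control set at all; a one-line sketch (the read count of 1000 is invented; the 60%/33% figures are from this thread):

```python
# Binomial test in the spirit of GREAT: are 600 of 1000 reads in TUs more
# than expected when 33% of the genome lies in TUs?
from scipy.stats import binomtest

result = binomtest(600, n=1000, p=0.33, alternative="greater")
print(f"p = {result.pvalue:.3g}")
```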

There are certainly competing views on how to answer these questions, so I'll present my thoughts, but take them with a grain of salt.

  • Why not just use the entire genome as a control, since that is the entire population? What is the use of introducing sampling error at this step?

Because this may not be appropriate. For example, in microarray experiments not every gene in the genome is measured, so the background should be all genes on the chip (everything that you could have measured). With modern chips this is not as big of a deal, since most chips contain all genes, but when dealing with historical data (or data from custom chips, i.e. chips which only measure "drug-able targets") it's very important to know what the background is.

  • The size of a control set: should it be 10X the size of the experimental set? What are some heuristics for choosing a size?

I assume by "control set" you mean the "normal" samples. In order to calculate this you need some idea of the effect size that you're actually trying to measure. Obviously, the smaller the effect, the more samples you'll need. However, I have yet to see a genomics experiment that actually sets forth its reasoning for choosing the number of experimental-to-normal samples; most counts are simply chosen for budgetary reasons.
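
If you do want to reason about it rather than the budget, a rough power calculation along these lines helps; the 0.33 to 0.40 shift in proportion is a hypothetical effect size:

```python
# Back-of-the-envelope sample size for detecting a shift in a proportion
# at alpha = 0.05 with 80% power, via statsmodels.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.40, 0.33)  # Cohen's h for the two proportions
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8,
                                 alternative="larger")
print(f"~{n:.0f} observations per group")
```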

  • What is the appropriate statistic for comparing discrete annotation counts (Fisher's exact test, chi-square test, or a GLM)?

I personally prefer the hypergeometric test or Fisher's exact test (depending on size constraints); however, nowadays I don't actually calculate my own enrichment values, since keeping annotations up-to-date is a full-time job. Instead I use the DAVID tool, which reports a hypergeometric p-value, a Fisher's exact p-value, and an EASE score (a deliberately conservative variant of Fisher's exact test).
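
For reference, one common formulation of the EASE score is Fisher's exact test with one gene removed from the study-set overlap, which is what makes it conservative; a sketch with invented counts:

```python
# EASE-style jackknifed Fisher's exact test: subtract one gene from the
# overlap cell before computing the one-sided p-value.
from scipy.stats import fisher_exact

k, n, K, N = 15, 200, 300, 10000  # overlap, study set, term total, background

fisher_table = [[k, n - k], [K - k, N - K - (n - k)]]
ease_table = [[k - 1, n - k], [K - k, N - K - (n - k)]]  # penalized overlap

print(f"Fisher p = {fisher_exact(fisher_table, alternative='greater')[1]:.3g}")
print(f"EASE   p = {fisher_exact(ease_table, alternative='greater')[1]:.3g}")
```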

Hope that helps.

The control set I am talking about is a computer-generated random set of genomic sites, so it is basically trivial to generate 1 million or 1 billion of them.

Oh... if you mean that type of control, then I would say the only limit is the computer time you're willing to dedicate. Just be careful: the lowest p-value you can resolve is 1/num-reps.
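
A sketch of that whole procedure, treating random site placement as independent Bernoulli draws (real control sets would respect mappability, chromosome structure, etc.; the read fraction and site count are invented, the 60%/33% figures come from this thread):

```python
# Empirical p-value from computer-generated control sets. With R replicates
# the smallest resolvable p-value is about 1/R.
import random

obs_frac = 0.60      # fraction of observed sites falling in genes
genome_frac = 0.33   # fraction of the genome covered by genes
n_sites, reps = 1000, 10000

exceed = 0
for _ in range(reps):
    rand = sum(random.random() < genome_frac for _ in range(n_sites)) / n_sites
    exceed += rand >= obs_frac

p = (exceed + 1) / (reps + 1)  # add-one estimator; p never drops below 1/(reps+1)
print(f"empirical p = {p:.4g}")
```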
