Question

Gene Ontology Categories

7

14.0 years ago

alittleboy ▴ 220

I am new to gene ontology (GO) analysis and need help with the following question. We can use the standard hypergeometric method to find the GO categories that rank at the very top (say, the top 10), or, as in the paper by Young et al. (2010) in Genome Biology, use GOseq to identify the top-ranked categories. My question is: what are the criteria for ranking these categories? Are they based on p-values? I would appreciate it if someone could explain GO analysis briefly or point to other sources for reference. Thanks!

gene statistics • 16k views

ADD COMMENT • link updated 14.0 years ago by Khader Shameer 18k • written 14.0 years ago by alittleboy ▴ 220

score 16 · Answer 1 · 2011-05-02

Before addressing your specific question, I would like to provide a short overview on Gene Ontology based enrichment analysis:

To perform biological enrichment analysis using ontologies you need the following data:

1. A list of genes perturbed in an experiment (e.g., microarray, next-generation sequencing, proteomics)

2. A background list of genes for your study (the list from which the perturbed genes were derived, for example all genes on the microarray or all genes in the genome)

3. An ontology (in this case, gene_ontology)

4. A Gene Ontology association file (this file lists the GO terms assigned to the genes in lists 1 and 2)

Note: there are several well-defined biological ontologies, but you may not find corresponding association data for all of them. For a list of available GO association data, see GOA.

Enrichment analysis:

Enrichment calculations are classified into three categories by Huang et al.: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA) and modular enrichment analysis (MEA). The basic difference between these three classes of enrichment algorithms is the way the enrichment p-values are calculated.

In the SEA-based approach, the annotation terms of a subset of genes are assessed one at a time against a list of background genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance, and individual terms that pass the p-value cut-off (e.g., P-value ≤ 0.05) are reported as enriched. FuncAssociate and Onto-Express are two SEA-based enrichment analysis tools.
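As a rough illustration of the SEA idea for a single term, here is a minimal sketch of the hypergeometric upper-tail p-value using only the Python standard library. The function name and the argument names (`n_background`, `n_annotated`, `n_study`, `n_study_annotated`) are my own choices for illustration, not the API of any of the tools mentioned.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_study, n_study_annotated):
    """P(X >= n_study_annotated) for one term.

    n_background:      genes in the background list
    n_annotated:       background genes annotated with the term
    n_study:           genes in the perturbed (study) list
    n_study_annotated: study genes annotated with the term
    """
    total = comb(n_background, n_study)
    upper = min(n_annotated, n_study)
    # Sum the probability of seeing the observed overlap or a larger one.
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_study - i)
        for i in range(n_study_annotated, upper + 1)
    )
    return tail / total

# Toy numbers: 20 background genes, 5 carry the term,
# 4 genes in the study list, 2 of which carry the term.
p = hypergeom_enrichment_p(20, 5, 4, 2)
print(round(p, 4))
```

A small p-value means the term shows up in the study list more often than expected by chance given the background.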

GSEA approaches are similar, but consider all genes during the enrichment analysis, rather than only the genes that pass a pre-defined threshold as in the SEA approach. GSEA from the Broad Institute is an example of a GSEA-based tool.
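To make the "consider all genes" point concrete, here is a simplified sketch of a running-sum enrichment score over a full ranked gene list. It is an unweighted toy version written for illustration only: the Broad GSEA tool weights hits by their correlation with the phenotype and assesses significance by permutation, neither of which is done here. The function name and gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score over a ranked list of all genes.

    Walk down the ranked list; step up at genes in the set, step down
    otherwise. Return the running sum at its largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical gene labels, ranked from most up- to most down-regulated.
ranked = ["a", "b", "c", "d"]
print(enrichment_score(ranked, {"a", "b"}))  # set sits at the top of the ranking
print(enrichment_score(ranked, {"c", "d"}))  # set sits at the bottom
```

A strongly positive score means the set's genes cluster near the top of the ranking, a strongly negative score means they cluster near the bottom.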

MEA-based programs like Ontologizer 2.0 and topGO use the relationships that exist between the annotations. These programs were reported to attain better sensitivity and specificity because they take GO term relationships into account.

These tools are based on similar statistical and algorithmic concepts. See a review of 68 tools published in 2008 here; it describes minor-to-medium differences in how the nodes are treated, how the statistics are computed, and so on. Statistical methods used to derive the P-value include Fisher's exact test, the hypergeometric test, the binomial test, the χ2 test, or a combination of these methods.

You can use one of the R packages, servers, or command-line tools for performing such analysis. See the list of GO-based tools compiled by the AmiGO team here.

Now to your specific question: Q: what are the criteria for ranking these categories? Are they based on p-values?

A: Yes. They are P-value based. See section on SEA, GSEA and MEA for various methods to derive the P-value.

For a detailed overview of the concepts discussed in this answer see the following articles 1, 2, 3, 4, 5, 6

Ram · Answer 2 · 2011-05-01

5

14.0 years ago

David Quigley 11k

Typically a gene ontology enrichment analysis tests each category in the ontology using a statistic such as the hypergeometric test you mention. Results would then typically be ranked by the strength of the statistic, translated into a P value. Some tools attempt to provide more useful results by looking for results that are significant but farther from the root of the tree, working from the idea that if two results are called significant and one is more specific, then the more specific one will be more helpful than knowing that a more general term is enriched. If you're using a P value, your package should correct for the number of tests performed. There's a straightforward overview of this in the user guide for BiNGO, a GO enrichment tool that works as a plug-in for Cytoscape. See also this Nature Reviews Genetics article for some cautionary information.
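For what "rank by P value and correct for the number of tests" can look like in practice, here is a minimal sketch. It uses the Benjamini-Hochberg procedure as the correction, which is my assumption for illustration; individual packages may use Bonferroni or something else. The function names and the term labels are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def rank_categories(pvalue_by_term):
    """Return (term, raw p, adjusted p) tuples, most significant first."""
    terms = list(pvalue_by_term)
    raw = [pvalue_by_term[t] for t in terms]
    adj = benjamini_hochberg(raw)
    return sorted(zip(terms, raw, adj), key=lambda row: row[1])

# Hypothetical per-term p-values from an enrichment test.
example = {"term_A": 0.01, "term_B": 0.04, "term_C": 0.03, "term_D": 0.005}
for term, p_raw, p_adj in rank_categories(example):
    print(term, p_raw, round(p_adj, 4))
```

The "top 10 categories" would then simply be the first ten rows, ideally restricted to those whose adjusted p-value is below your chosen cut-off.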

ADD COMMENT • link 14.0 years ago by David Quigley 11k

1

A gene can be annotated to many categories, both because of the GO structure and because one gene can do many things. You test by category, not by gene. If S is your set of genes, !S is every other gene in the annotation, and CAT is the category, the 2x2 table would be "S in CAT, !S in CAT, S not in CAT, !S not in CAT." Intuitively, this tests whether the proportion of S in CAT is different from the proportion of everything else in CAT. Usually you test for enrichment, caring only whether the proportion of S in CAT is greater than that of !S.
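Here is a small sketch of how the per-category tables can be built when genes map to many categories. It assumes `gene_to_terms` already includes annotations propagated up the GO hierarchy; the function name, gene IDs, and term labels are made up for illustration.

```python
def tables_by_category(study_genes, background_genes, gene_to_terms):
    """Build one 2x2 table per category.

    Each table is (S in CAT, !S in CAT, S not in CAT, !S not in CAT).
    A gene annotated to several categories is counted once in each of
    their tables, so no one-to-one gene/category mapping is needed.
    """
    background = set(background_genes)
    study = set(study_genes) & background
    rest = background - study

    # Invert the gene -> terms mapping into term -> genes.
    term_to_genes = {}
    for gene in background:
        for term in gene_to_terms.get(gene, ()):
            term_to_genes.setdefault(term, set()).add(gene)

    tables = {}
    for term, members in term_to_genes.items():
        s_in = len(study & members)
        rest_in = len(rest & members)
        tables[term] = (s_in, rest_in, len(study) - s_in, len(rest) - rest_in)
    return tables

# Toy data: g1 belongs to two categories; g5 and g6 have no annotation.
annotations = {"g1": {"T1", "T2"}, "g2": {"T1"}, "g3": {"T2"}, "g4": {"T2"}, "g5": set()}
background = ["g1", "g2", "g3", "g4", "g5", "g6"]
tables = tables_by_category(["g1", "g2"], background, annotations)
print(tables["T1"])
print(tables["T2"])
```

Each table can then be passed to a one-sided Fisher's exact test, one test per category.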

ADD REPLY • link 14.0 years ago by David Quigley 11k

0

To add to David's answer: you usually use Fisher's exact test (which is based on the hypergeometric distribution): http://en.wikipedia.org/wiki/Fisher%27s_exact_test

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.0 years ago by Pablo ★ 1.9k

0

Thank you Dave and Pablo for the detailed explanation! Now I understand GO enrichment analysis better, but I still have a question: the genes and GO categories are not one-to-one because of the GO hierarchy structure. If I understand correctly, Fisher's exact test requires a 2x2 contingency table, and I imagine the table should have DE/non-DE for rows and category_i/all other categories for columns. If so, how can we build a table without assuming a one-to-one correspondence between genes and categories? Please correct me if I misunderstand the method. Thank you!

ADD REPLY • link 14.0 years ago by alittleboy ▴ 220

0

Thank you Dave for the clarification! It makes more sense to me now.

ADD REPLY • link 14.0 years ago by alittleboy ▴ 220

score 2 · Answer 3 · 2011-05-02

2

14.0 years ago

Chris Evelo 10k

Before you write your own tool it might be a good idea to check what is already out there. There are a great many GO approaches, tools and algorithms. This BioStar question describes some of these.

Apart from looking further from the root of the tree (as David mentioned), you might also want to take into account that effects you find in specific terms recur in the larger parent classes. You might not want that, since they have already been accounted for, in which case you should do pruning. The GO_Elite tool that I mentioned in the question above does just that, and it is open source, so you could also use it as a starting point if you want to do even fancier stuff.

ADD COMMENT • link 14.0 years ago by Chris Evelo 10k

0

Thank you Chris! I will look into the GO_Elite tool you mentioned.

ADD REPLY • link 14.0 years ago by alittleboy ▴ 220

0

Couldn't edit my own (old) post. Wanted to add that a GO-Elite paper has now been published. It is at: http://dx.doi.org/10.1093/bioinformatics/bts366

ADD REPLY • link 12.8 years ago by Chris Evelo 10k