goverlap

Question

Tool:goverlap: modern ontology term enrichment analyzer (DAVID/GOstats for the command line)

4

Entering edit mode

9.2 years ago

Endre Bakken Stovner ▴ 970

goverlap

goverlap is a modern ontology term enrichment analyzer.

While there exist many similar tools - such as the R library GOstats or the immensely popular online tool DAVID - goverlap contains an integrated package not found elsewhere. It

supports two input lists to find whether there are significant differences in term enrichment between experiments
contains several ontologies, and can easily be extended with more
has a command line interface, which makes it trivial to analyze and compare any number of files
does not require any setup or manual downloads
is open-source

History

goverlap is a command line reimplementation of the now defunct web tool eGOn (prev. GeneTools) described in the paper GeneTools - application for functional annotation and statistical hypothesis testing.

High-level overview

The result of biological studies is often a list of genes that are found to be differentially expressed in the experimental condition. But to understand your results, a list of non-sensical gene names is of little help. You want to know what happened in the experiment.

Luckily, for most genes, several of their biological functions are known. This knowledge exists in annotations that map gene names to functions.

This makes it possible to turn a list of genes (your results) into the biological functions that were affected in the experimental condition.

In short, you go from gobbeldygook like this:

Tubb2b
Tubb5
Crmp1
# several hundred more gene names...

to something a biologist can understand, like

Neuroactive ligand-receptor interaction or Calcium signaling pathway.

Detailed explanation

There are tens of thousands of categories of biological functions and processes, written down in classification schemes called ontologies. Three randomly drawn examples of such categories are:

Generation of a long process of a CNS neuron, amine transport and antigen receptor-mediated signaling pathway

These all have known sets of genes associated with them.

For each such category, goverlap finds the number of genes n in your input gene list that is represented in the list of genes associated with that ontological category. Then it computes the probability that n genes from your gene list would be found associated with that ontological category by chance. If this probability is low, your genes are likely to be involved in whatever function that ontology category represents.

How goverlap tells you this is shown in the example output below:

go_root ontology    a_hpgm_go_fdr   a_odds  a   a_I_go  go_subtree  U   a_hpgm_go   go_annotation
GO:0005515  MF  2.71e-55   1.05   1281    808 9038    21941   1.8e-59   protein binding
GO:0005488  MF  1.61e-52   1.78   1281    998 12610   21941   2.2e-56   binding
...

We see the corrected p-value (third column) is very low for both the shown categories. This means that your input genes are very likely to be involved in protein binding and the binding categories (Note that the actual list would contain thousands of more rows, one for each ontology category.)

An explanation of each column in the output:

go_root is the category for which we are investigating possible over- or under-representation
ontology is the classification scheme in which the category exists
a_hpgm_go_fdr is the value a_hpgm_go corrected for multiple testing with Benjamini-Hochberg
a_odds is the odds ratio which tells the degree and magnitude of over- or under-representation
a is the number of genes in the input gene list
a_I_go is the number of genes both in the input and associated with the category
go_subtree is the number of genes associated with the category
U is the total number of all genes used in the experiment
a_hpgm_go is the p-value for the probability of the input genes being overrepresented in the category, computed with the hypergeometric test
go_annotation is a description of the category

Two input files

goverlap accepts two input gene lists. In this case it computes the statistics listed above for each input file. In addition, it also computes the probability that there is a statistically significant difference in the degree the gene lists overlap with each ontology category.

The two-list output looks like this:

go_root ontology    chi_sq_fdr  a_hpgm_go_fdr   a_odds  b_hpgm_go_fdr   b_odds  a_I_b_I_go  a_M_b_I_go  b_M_a_I_go  go_subtree  U   a_I_b   a_M_b   b_M_a   chi_sq  test_obs    b_hpgm_go   a_hpgm_go   go_annotation
GO:0034645  BP  1.4e-08   8.7e-49   0.29    8.2e-09   0.3  104 744 53  3645    21941   4238    2957    277 6.1e-11   42.79   1.1e-10   4.91e-51   cellular macromolecule biosynthetic process
GO:0048869  BP  3.5e-08   3.39e-84   0.28  3.49e-22    0.29  105 739 67  3169    21941   4238    2957    277 1.6e-10   40.9   7.2e-25   6.8e-87   cellular developmental process
...

An explanation of each column in the output that does not exist in the single list case:

chi_sq_fdr is the value chi_sq corrected for multiple testing
a_I_b_I_go is the number of genes in the intersection of gene list a and b and the ontology category
a_M_b_I_go is the number of genes in the intersection of gene list a and the ontology category, but not in gene list b
b_M_a_I_go is the number of genes in the intersection of gene list b and the ontology category, but not in gene list a
a_M_b is the number of genes in gene list a, but not in gene list b
b_M_a is the number of genes in gene list b, but not in gene list a
chi_sq is the probability that genes from the category are equally likely to come from list A or B
test_obs is the test statistic

Note that the statistical test used for two-list comparisons is different, depending on their set-theoretical relationship. There are four different cases:

The lists are overlapping

In this case the Leisering, Alonzo and Pepe test (LAP) is used, since Fisher's exact is inappropriate when the categories are not mutually exclusive.

The lists are mututally exclusive

In this case Fisher's Exact is used.

One list is a proper subset of the other

This is an exceedingly rare case, where Fisher's exact can be used (but this is not implemented yet.) One list being a proper subset of another is expected only to happen in artificial situations, where the lists represent the same results, only with different p-value cutoffs.

When Fisher's Exact test is used, the columns chi_sq_fdr, chi_sq, test_obs are replaced withfisher_fdr and fisher.

The lists are exactly equal

In the case of equal lists, no comparison is done and goverlap executes as if only one list was input.

Install

$ pip install goverlap

Command line interface

goverlap

goverlap is a modern ontology term enrichment analyzer
(Visit github.com/endrebak/goverlap for examples, docs and help.)

Usage:
    goverlap --genes-a=A [--genes-b=B] [--experiment=EXP] [--species=SPE] [--ontologies=ONT] [--limit=LIM] [--ncpus=CPU] [--verbosity=VRB]
    goverlap --help

Arguments:
    -a A --genes-a=A         newline-separated list of the genes in experiment A

Options:
    -h --help                show this message
    -b B --genes-b=B         newline-separated list of the genes in experiment B
    -s SPE --species=SPE     species to do analysis on (human, mouse or rat) [default: human]
    -o ONT --ontologies=ONT  comma-separated list of ontologies [default: CC,MF,BP,KEGG]
    -l LIM --limit=LIM       remove categories where the number of genes are more than LIM percent of all genes [default: 0.25]
    -x EXP --experiment=EXP  newline-separated list of all the genes in the experiment
    -v VRB --verbosity VRB   options: quiet, info, debug [default: info]
    -n CPU --ncpus=CPU       number of cores used [default: 1]

Note:
    The gene names need to be of the kind called "external_gene_name"/"associated gene name" in BioMart.
    If they are in any other format, you need to use a BioMart id converter to convert them to this format.
    (see BioMart.org or github.com/endrebak/biomartian)

Example

# input files (these can be found in the `examples` directory):
$ head -3 examples/Simes_Binner_ECIIvsDeep.txt
Ablim2
Elp5
Vmac
$ head -3 examples/FeatureCount_ECIIvsDeep.txt
Slc12a4
Lcat
Tle4
$ goverlap -a examples/FeatureCount_ECIIvsDeep.txt \ # first input list
           -b examples/Simes_Binner_ECIIvsDeep.txt \ # second input list
           -s rat \ # the species name; used for gene and ontology lookups
           -v debug \ # how much output about the running process to print to stderr
           -n 30 -o CC,KEGG \ # which ontologies to use
           -l 0.10 \ # remove any ontology category where the number of associated genes is
                   \ # more than 10% of the total number of genes.
           -x examples/experiment_genes.txt # restrict the genes considered to this list

This produces the output

# /local/home/endrebak/anaconda/bin/goverlap -a examples/FeatureCount_ECIIvsDeep.txt -b examples/Simes_Binner_ECIIvsDeep.txt -s rat -v debug -n 30 -o CC -l 0.1 -x examples/experiment_genes.txt
go_root ontology    chi_sq_fdr  a_hpgm_go_fdr   a_odds  b_hpgm_go_fdr   b_odds  a_I_b_I_go  a_M_b_I_go  b_M_a_I_go  go_subtree  U   a_I_b   a_M_b   b_M_a   chi_sq  test_obs    b_hpgm_go   a_hpgm_go   go_annotation
GO:0030425  CC  1.0 0.001    0.03 4.38e-11   0.05 184 84  7   369 15945   6170    4073    131 0.8  0.06  4.18e-13   4.04e-05   dendrite
GO:0000813  CC  1.0 1.0 0.0001   1.0 0.0 1   1   0   8   15945   6170    4073    131 0.9  5.3e-08   1.0 0.9  ESCRT I complex
GO:0034098  CC  1.0 1.0 0.0003    1.0 0.0004   2   1   0   6   1594    6170    4073    131 0.99  1.05e-07   0.57    0.7  VCP-NPL4-UFD1 AAA ATPase complex
GO:0034666  CC  1.0 0.7  0.0003    0.16  0.0009   2   0   0   2   15945   6170    4073    131 1.0 0.0 0.02 0.14  integrin alpha2-beta1 complex
GO:0033256  CC  1.0 0.15  0.001   0.34  0.001    6   3   0   7   15945   6170    4073    131 0.9   2.8e-06   0.05 0.01 I-kappaB/NF-kappaB complex
GO:0000308  CC  1.0 1.0 0.0002   0.9  0.0005   1   0   0   2   15945   6170    4073    131 1.0 0.0 0.2  0.6  cytoplasmic cyclin-dependent protein kinase holoenzyme complex
GO:0044308  CC  1.0 1.0 0.0002   1.0 0.0 1   1   0   2   15945   6170    4073    131 0.9  5.2e-08   1.0 0.6  axonal spine
GO:0032591  CC  1.0 0.3  0.0008   0.3  0.001    5   2   0   6   15945   6170    4073    131 0.9  1.05e-06   0.03 0.03  dendritic spine membrane
...

Requirements

Tested on both python versions 2.7 and 3.4.

All the required libraries are installed automatically when fetching goverlap from PyPI.

pip install goverlap # also installs pandas, numpy/scipy, joblib, godb, biomartian, kg, ebs

To run the tests you also need py.test.

Tests

sh tests/run_unit_tests.sh and sh tests/run_integration_tests.sh.

Report any failing tests on the issues page.

Papers or studies using goverlap

Several high-profile papers to be added (will probably take years, though.)

term-enrichment • 4.9k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Endre Bakken Stovner ▴ 970

1

Entering edit mode

To improve the readibility of the examples in this post I rounded all the decimal numbers to one or two digits. If you don't like it I can revert it back to the previous version.

ADD REPLY • link 9.2 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2015-09-02

1

Entering edit mode

9.2 years ago

Giovanni M Dall'Olio 28k

This is wonderful, congratulations!

I have two technical questions:

1. where do the GO data comes from? For example, some R packages use the AnnotationDBI dataset from Bioconductor, so their data is always synchronized to a recent version. How do you deal with it in python? Does goverlap includes a snapshot of GO, or the data is downloaded every time?

2. Are you going to implement any plots to visualize the enrichment in the future?

ADD COMMENT • link 9.2 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

I am using godb which downloads and parses the obo file from the gene ontology website. It is cached eternally, but if you want to use a new version just delete ~/.godb where the downloaded and cached data is stored (See https://github.com/endrebak/godb ) and it will be downloaded again.

All the data is downloaded (biomart, kegg, go) and then cached, so if you suspect the data is stale, just delete the respective cache directory.
Might do it if people have ideas for plots that might be interesting and easy to make (This really something R is better at, but would be interesting to try out https://github.com/mwaskom/seaborn ).

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Endre Bakken Stovner ▴ 970

0

Entering edit mode

Thank you very much! godb is a very interesting project, I had missed it. Maybe you can add a note in the documentation.

For the plots, I like the ones implemented in DOSE/clusterProfiler

ADD REPLY • link 9.2 years ago by Giovanni M Dall'Olio 28k