goverlap
goverlap is a modern ontology term enrichment analyzer.
While there exist many similar tools - such as the R library GOstats or the immensely popular online tool DAVID - goverlap contains an integrated package not found elsewhere. It
- supports two input lists to find whether there are significant differences in term enrichment between experiments
- contains several ontologies, and can easily be extended with more
- has a command line interface, which makes it trivial to analyze and compare any number of files
- does not require any setup or manual downloads
- is open-source
History
goverlap is a command line reimplementation of the now defunct web tool eGOn
(prev. GeneTools
) described in the paper GeneTools - application for functional annotation and statistical hypothesis testing.
High-level overview
The result of biological studies is often a list of genes that are found to be differentially expressed in the experimental condition. But to understand your results, a list of non-sensical gene names is of little help. You want to know what happened in the experiment.
Luckily, for most genes, several of their biological functions are known. This knowledge exists in annotations that map gene names to functions.
This makes it possible to turn a list of genes (your results) into the biological functions that were affected in the experimental condition.
In short, you go from gobbeldygook like this:
Tubb2b
Tubb5
Crmp1
# several hundred more gene names...
to something a biologist can understand, like
Neuroactive ligand-receptor interaction
or Calcium signaling pathway
.
Detailed explanation
There are tens of thousands of categories of biological functions and processes, written down in classification schemes called ontologies. Three randomly drawn examples of such categories are:
Generation of a long process of a CNS neuron
, amine transport
and antigen receptor-mediated signaling pathway
These all have known sets of genes associated with them.
For each such category, goverlap finds the number of genes n
in your input gene list that is represented in the list of genes associated with that ontological category. Then it computes the probability that n
genes from your gene list would be found associated with that ontological category by chance. If this probability is low, your genes are likely to be involved in whatever function that ontology category represents.
How goverlap tells you this is shown in the example output below:
go_root ontology a_hpgm_go_fdr a_odds a a_I_go go_subtree U a_hpgm_go go_annotation
GO:0005515 MF 2.71e-55 1.05 1281 808 9038 21941 1.8e-59 protein binding
GO:0005488 MF 1.61e-52 1.78 1281 998 12610 21941 2.2e-56 binding
...
We see the corrected p-value (third column) is very low for both the shown categories. This means that your input genes are very likely to be involved in protein binding
and the binding
categories (Note that the actual list would contain thousands of more rows, one for each ontology category.)
An explanation of each column in the output:
go_root
is the category for which we are investigating possible over- or under-representationontology
is the classification scheme in which the category existsa_hpgm_go_fdr
is the valuea_hpgm_go
corrected for multiple testing withBenjamini-Hochberg
a_odds
is the odds ratio which tells the degree and magnitude of over- or under-representationa
is the number of genes in the input gene lista_I_go
is the number of genes both in the input and associated with the categorygo_subtree
is the number of genes associated with the categoryU
is the total number of all genes used in the experimenta_hpgm_go
is the p-value for the probability of the input genes being overrepresented in the category, computed with the hypergeometric testgo_annotation
is a description of the category
Two input files
goverlap accepts two input gene lists. In this case it computes the statistics listed above for each input file. In addition, it also computes the probability that there is a statistically significant difference in the degree the gene lists overlap with each ontology category.
The two-list output looks like this:
go_root ontology chi_sq_fdr a_hpgm_go_fdr a_odds b_hpgm_go_fdr b_odds a_I_b_I_go a_M_b_I_go b_M_a_I_go go_subtree U a_I_b a_M_b b_M_a chi_sq test_obs b_hpgm_go a_hpgm_go go_annotation
GO:0034645 BP 1.4e-08 8.7e-49 0.29 8.2e-09 0.3 104 744 53 3645 21941 4238 2957 277 6.1e-11 42.79 1.1e-10 4.91e-51 cellular macromolecule biosynthetic process
GO:0048869 BP 3.5e-08 3.39e-84 0.28 3.49e-22 0.29 105 739 67 3169 21941 4238 2957 277 1.6e-10 40.9 7.2e-25 6.8e-87 cellular developmental process
...
An explanation of each column in the output that does not exist in the single list case:
chi_sq_fdr
is the valuechi_sq
corrected for multiple testinga_I_b_I_go
is the number of genes in the intersection of gene list a and b and the ontology categorya_M_b_I_go
is the number of genes in the intersection of gene list a and the ontology category, but not in gene list bb_M_a_I_go
is the number of genes in the intersection of gene list b and the ontology category, but not in gene list aa_M_b
is the number of genes in gene list a, but not in gene list bb_M_a
is the number of genes in gene list b, but not in gene list achi_sq
is the probability that genes from the category are equally likely to come from list A or Btest_obs
is the test statistic
Note that the statistical test used for two-list comparisons is different, depending on their set-theoretical relationship. There are four different cases:
The lists are overlapping
In this case the Leisering, Alonzo and Pepe test (LAP) is used, since Fisher's exact is inappropriate when the categories are not mutually exclusive.
The lists are mututally exclusive
In this case Fisher's Exact is used.
One list is a proper subset of the other
This is an exceedingly rare case, where Fisher's exact can be used (but this is not implemented yet.) One list being a proper subset of another is expected only to happen in artificial situations, where the lists represent the same results, only with different p-value cutoffs.
When Fisher's Exact test is used, the columns chi_sq_fdr
, chi_sq
, test_obs
are replaced withfisher_fdr
and fisher
.
The lists are exactly equal
In the case of equal lists, no comparison is done and goverlap executes as if only one list was input.
Install
$ pip install goverlap
Command line interface
goverlap
goverlap is a modern ontology term enrichment analyzer
(Visit github.com/endrebak/goverlap for examples, docs and help.)
Usage:
goverlap --genes-a=A [--genes-b=B] [--experiment=EXP] [--species=SPE] [--ontologies=ONT] [--limit=LIM] [--ncpus=CPU] [--verbosity=VRB]
goverlap --help
Arguments:
-a A --genes-a=A newline-separated list of the genes in experiment A
Options:
-h --help show this message
-b B --genes-b=B newline-separated list of the genes in experiment B
-s SPE --species=SPE species to do analysis on (human, mouse or rat) [default: human]
-o ONT --ontologies=ONT comma-separated list of ontologies [default: CC,MF,BP,KEGG]
-l LIM --limit=LIM remove categories where the number of genes are more than LIM percent of all genes [default: 0.25]
-x EXP --experiment=EXP newline-separated list of all the genes in the experiment
-v VRB --verbosity VRB options: quiet, info, debug [default: info]
-n CPU --ncpus=CPU number of cores used [default: 1]
Note:
The gene names need to be of the kind called "external_gene_name"/"associated gene name" in BioMart.
If they are in any other format, you need to use a BioMart id converter to convert them to this format.
(see BioMart.org or github.com/endrebak/biomartian)
Example
# input files (these can be found in the `examples` directory):
$ head -3 examples/Simes_Binner_ECIIvsDeep.txt
Ablim2
Elp5
Vmac
$ head -3 examples/FeatureCount_ECIIvsDeep.txt
Slc12a4
Lcat
Tle4
$ goverlap -a examples/FeatureCount_ECIIvsDeep.txt \ # first input list
-b examples/Simes_Binner_ECIIvsDeep.txt \ # second input list
-s rat \ # the species name; used for gene and ontology lookups
-v debug \ # how much output about the running process to print to stderr
-n 30 -o CC,KEGG \ # which ontologies to use
-l 0.10 \ # remove any ontology category where the number of associated genes is
\ # more than 10% of the total number of genes.
-x examples/experiment_genes.txt # restrict the genes considered to this list
This produces the output
# /local/home/endrebak/anaconda/bin/goverlap -a examples/FeatureCount_ECIIvsDeep.txt -b examples/Simes_Binner_ECIIvsDeep.txt -s rat -v debug -n 30 -o CC -l 0.1 -x examples/experiment_genes.txt
go_root ontology chi_sq_fdr a_hpgm_go_fdr a_odds b_hpgm_go_fdr b_odds a_I_b_I_go a_M_b_I_go b_M_a_I_go go_subtree U a_I_b a_M_b b_M_a chi_sq test_obs b_hpgm_go a_hpgm_go go_annotation
GO:0030425 CC 1.0 0.001 0.03 4.38e-11 0.05 184 84 7 369 15945 6170 4073 131 0.8 0.06 4.18e-13 4.04e-05 dendrite
GO:0000813 CC 1.0 1.0 0.0001 1.0 0.0 1 1 0 8 15945 6170 4073 131 0.9 5.3e-08 1.0 0.9 ESCRT I complex
GO:0034098 CC 1.0 1.0 0.0003 1.0 0.0004 2 1 0 6 1594 6170 4073 131 0.99 1.05e-07 0.57 0.7 VCP-NPL4-UFD1 AAA ATPase complex
GO:0034666 CC 1.0 0.7 0.0003 0.16 0.0009 2 0 0 2 15945 6170 4073 131 1.0 0.0 0.02 0.14 integrin alpha2-beta1 complex
GO:0033256 CC 1.0 0.15 0.001 0.34 0.001 6 3 0 7 15945 6170 4073 131 0.9 2.8e-06 0.05 0.01 I-kappaB/NF-kappaB complex
GO:0000308 CC 1.0 1.0 0.0002 0.9 0.0005 1 0 0 2 15945 6170 4073 131 1.0 0.0 0.2 0.6 cytoplasmic cyclin-dependent protein kinase holoenzyme complex
GO:0044308 CC 1.0 1.0 0.0002 1.0 0.0 1 1 0 2 15945 6170 4073 131 0.9 5.2e-08 1.0 0.6 axonal spine
GO:0032591 CC 1.0 0.3 0.0008 0.3 0.001 5 2 0 6 15945 6170 4073 131 0.9 1.05e-06 0.03 0.03 dendritic spine membrane
...
Requirements
Tested on both python
versions 2.7
and 3.4
.
All the required libraries are installed automatically when fetching goverlap from PyPI.
pip install goverlap # also installs pandas, numpy/scipy, joblib, godb, biomartian, kg, ebs
To run the tests you also need py.test
.
Tests
sh tests/run_unit_tests.sh
and sh tests/run_integration_tests.sh
.
Report any failing tests on the issues page.
Papers or studies using goverlap
Several high-profile papers to be added (will probably take years, though.)
To improve the readibility of the examples in this post I rounded all the decimal numbers to one or two digits. If you don't like it I can revert it back to the previous version.