Question

Text Mining Based Functional Enrichment Of Gene Lists

9

14.0 years ago

Casey Bergman 18k

Following on from a recent question posed by Khader and to help answer a recent query from a wet-lab researcher, I wanted to know more about which methods are available to conduct text mining based functional enrichment on a gene list or a set of abstracts derived from a gene list.

The goal of such an analysis would be to find a set of terms or concepts that are enriched in the gene list, similar to GO terms. I am aware that this problem has previously attracted some interest in the context of gene expression microarrays, but it is clearly relevant for ChIP and GWAS studies as well.

So far I have dug out a few systems like MILANO and BeeSpace Genelist Analyzer, but I am aware this list is horribly incomplete. There appears to be a recent review on the topic, but I can't access it because of a paywall. Any advice on additional tools and their requirements/limitations/performance would be much appreciated.
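To make the goal concrete, here is a minimal sketch of the kind of enrichment test described above. It is not taken from any of the tools named in this thread. It assumes that terms have already been extracted from the abstracts linked to each gene, and the names `gene_terms`, `enriched_terms` and `hypergeom_sf` are hypothetical. Each term is scored with a one-sided hypergeometric test against the background gene set, with no multiple-testing correction.

```python
from collections import Counter
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_terms(gene_terms, gene_list):
    """Return (term, p-value) pairs sorted by ascending p-value.

    gene_terms: dict mapping every background gene to the set of terms
                found in its abstracts (assumed to be pre-extracted).
    gene_list:  unique genes of interest, all present in gene_terms.
    """
    # Number of background genes mentioning each term.
    background = Counter(t for terms in gene_terms.values() for t in terms)
    # Number of genes in the input list mentioning each term.
    selected = Counter(t for g in gene_list for t in gene_terms[g])
    N, n = len(gene_terms), len(gene_list)
    results = [(t, hypergeom_sf(k, N, background[t], n)) for t, k in selected.items()]
    return sorted(results, key=lambda pair: pair[1])
```

In practice the hard part is the term extraction and gene name normalization, which this sketch takes as given.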

genomics text gene functional enrichment • 6.3k views

ADD COMMENT • link updated 11.4 years ago by Biostar 20 • written 14.0 years ago by Casey Bergman 18k

score 5 · Answer 1 · 2011-01-05

5

14.0 years ago

Lars Juhl Jensen 11k

I would suggest that you take a look at the Martini server, which is described in this paper. As far as I can see, what you want to do is exactly what Martini was designed for.

Full disclosure: I was involved in the development of this tool.

ADD COMMENT • link 14.0 years ago by Lars Juhl Jensen 11k

1

MARTINI looks great: it takes two gene lists and compares the enrichment of terms in the linked Entrez Gene references of each set relative to the other. This system looks like it fits the bill almost perfectly (the only catch is that you need two lists rather than one), and it has an API as well. +1 and many thanks.
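For readers wondering what a two-list comparison looks like in principle, here is a rough sketch. It is not MARTINI's actual algorithm or API. It assumes each list is a dict from gene ID to the set of terms found in that gene's linked references, and it scores each term with a two-sided Fisher's exact test; `compare_term_usage` and `fisher_exact_two_sided` are hypothetical names.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of a table with x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def compare_term_usage(list_a, list_b):
    """Return (term, count_in_a, count_in_b, p_value) rows sorted by p-value.

    list_a, list_b: dicts mapping gene ID -> set of terms (assumed input).
    """
    terms = set().union(*list_a.values(), *list_b.values())
    rows = []
    for term in terms:
        a = sum(term in t for t in list_a.values())  # genes in A with the term
        c = sum(term in t for t in list_b.values())  # genes in B with the term
        p = fisher_exact_two_sided(a, len(list_a) - a, c, len(list_b) - c)
        rows.append((term, a, c, p))
    return sorted(rows, key=lambda r: r[3])
```

If you only have one list, the second list could be a background set such as all annotated genes, which brings this back to a single-list enrichment test.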

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

Yes, I recall hearing about Martini at a conference and came away impressed. BioStar Q&As jog the memory...

ADD REPLY • link 14.0 years ago by Larry_Parnell 16k

score 4 · Answer 2 · 2011-01-05

4

14.0 years ago

Mary 11k

Here's a couple we like: iHOP and XplorMed.

I also like GRAIL (a new grail, not the old one). There's a more detailed look at GRAIL, and some references to other tools that you might want to look at in a longer post here.

EDIT as afterthought: I should probably also mention Textpresso. But one of the issues with that is that it relies on the set of papers that are loaded in, and if your species isn't one of the collections it wouldn't be as easy to use out-of-the-box. On the other hand, if your species or topic does have a collection, results might be more relevant.

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 14.0 years ago by Mary 11k

0

Yes, iHop is good.

ADD REPLY • link 14.0 years ago by Larry_Parnell 16k

0

XplorMed just misses the mark: although it uses text mining to find associations between terms in a set of abstracts, it relies on abstracts from a PubMed query or on references linked to a database entry, and cannot take a gene list as input.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

iHop uses text mining to tag and normalize gene names and provides a search interface to extract sentences and build gene networks, but it is neither a term extraction text mining tool, nor can it take a gene list as input.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

Textpresso does not take a gene list as input and (on pre-defined corpora available at their website) only tags text using fixed ontologies, rather than discovering which terms are enriched in the document set.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

GRAIL takes as input a list of Entrez gene IDs, HapMap SNP IDs or genomic regions containing one or more genes, and ranks each gene in each region based on the similarity of terms in the abstracts associated with each gene. The goal of the system is to prioritize the best candidate genes underlying a common (disease) process; however, it does provide a list of the keywords that are most informative for the ranking. So +1 for GRAIL.
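As a rough illustration of ranking genes by the similarity of their abstract terms, here is a toy sketch. It is not GRAIL's actual statistic. It assumes plain word counts and cosine similarity, and scores each gene by its mean similarity to the genes in all other regions; `rank_genes`, `regions` and `gene_text` are hypothetical names.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_genes(regions, gene_text):
    """Rank the genes in each region by mean similarity to genes elsewhere.

    regions:   dict mapping region name -> list of gene IDs (assumed input).
    gene_text: dict mapping gene ID -> concatenated abstract text.
    """
    # Naive bag-of-words vectors; a real system would filter and weight terms.
    vectors = {g: Counter(text.lower().split()) for g, text in gene_text.items()}
    ranking = {}
    for region, genes in regions.items():
        others = [g for r, gs in regions.items() if r != region for g in gs]
        scores = {
            g: sum(cosine(vectors[g], vectors[o]) for o in others) / len(others) if others else 0.0
            for g in genes
        }
        ranking[region] = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranking
```

The shared terms that drive the top scores play roughly the role of the informative keywords mentioned above.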

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

See my comments on Larry's post about GRAIL, which does seem to fit the bill. +1 and many thanks.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

0

Oh, part of the XplorMed interface is gone... Look at the documentation (http://www.ogic.ca/projects/xplormed/example/index.html): the pink entry point used to take IDs of various types. Huh. We'll have to write to them and see where the other pieces went.

ADD REPLY • link 14.0 years ago by Mary 11k

score 3 · Answer 3 · 2011-01-05

3

14.0 years ago

Sudeep ★ 1.7k

Perhaps you could also try AliBaba by the University of Berlin or Whatizit by the Rebholz-Schuhmann group at EBI.

ADD COMMENT • link 14.0 years ago by Sudeep ★ 1.7k

0

Thanks for the suggestions. I was aware of both systems in other contexts, but just had a closer look to see if they fit the bill. AliBaba takes a PubMed query, not a gene list; it does extract terms and concepts, but does not perform enrichment analysis. It also looks like the AliBaba site is currently down. Whatizit allows UniProt IDs to be submitted, but no other gene IDs, and will tag/normalize entities, but will not assess their enrichment in your input gene set. Based on your responses, I've tried to clarify my question a bit more.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

score 2 · Answer 4 · 2011-01-05

2

14.0 years ago

Larry_Parnell 16k

Another option is GRAIL by Raychaudhuri S, Daly M, et al. (2009, "Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions", PLoS Genetics).

(Mary and I are answering simultaneously!)

ADD COMMENT • link 14.0 years ago by Larry_Parnell 16k

0

GRAIL takes as input a list of Entrez gene IDs, HapMap SNP IDs or genomic regions containing one or more genes, and ranks each gene in each region based on the similarity of terms in the abstracts associated with each gene. The goal of the system is to prioritize the best candidate genes underlying a common (disease) process; however, it does provide a list of the keywords that are most informative for the ranking. So +1 for GRAIL.

ADD REPLY • link 14.0 years ago by Casey Bergman 18k

score 0 · Answer 5 · 2011-01-06

0

14.0 years ago

Puthier ▴ 250

GREAT

ADD COMMENT • link 14.0 years ago by Puthier ▴ 250

1

GREAT is not a text mining tool, nor does it accept a list of genes. GREAT takes an input set of genomic coordinates, associates them with neighboring genes, and tests for enrichment of terms from a pre-defined ontology, rather than discovering those terms de novo by text mining.
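To illustrate the coordinate-to-gene step, here is a deliberately simplified sketch that assigns a position to the gene with the nearest transcription start site on the same chromosome. This is an assumption for illustration only, not GREAT's actual association rule, and `nearest_gene` and `tss_sorted` are hypothetical names.

```python
from bisect import bisect_left


def nearest_gene(position, tss_sorted):
    """Return the gene whose TSS is closest to position.

    tss_sorted: non-empty list of (tss_position, gene) tuples for one
                chromosome, sorted by tss_position (assumed input).
    """
    positions = [p for p, _ in tss_sorted]
    i = bisect_left(positions, position)
    # Only the TSS just before and just after the position can be nearest.
    candidates = tss_sorted[max(0, i - 1): i + 1]
    return min(candidates, key=lambda pg: abs(pg[0] - position))[1]
```

Once each coordinate is mapped to a gene, the enrichment itself is computed over ontology annotations, which is exactly why it does not discover new terms.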

ADD REPLY • link 14.0 years ago by Casey Bergman 18k