Following on from a recent question posed by Khader and to help answer a recent query from a wet-lab researcher, I wanted to know more about which methods are available to conduct text mining based functional enrichment on a gene list or a set of abstracts derived from a gene list.
The goal of such an analysis would be to find a set of terms or concepts that are enriched in the gene list, similar to GO terms. I am aware that this problem has previously attracted some interest in the context of gene expression microarrays, but it is clearly relevant for ChIP and GWAS studies as well.
So far I have dug out a few systems like MILANO and BeeSpace Genelist Analyzer, but I am aware this list is horribly incomplete. There appears to be a recent review on the topic, but I can't access it because of a paywall. Any advice on additional tools and their requirements/limitations/performance would be much appreciated.
I would suggest that you take a look at the Martini server, which is described in this paper. As far as I can see, what you want to do is exactly what Martini was designed for.
Full disclosure: I was involved in the development of this tool.
MARTINI looks great: it takes two gene lists and compares the enrichment of terms in linked Entrez Gene references in both sets relative to one another. This system looks like it fits the bill almost perfectly (the only catch is that you need two lists rather than one) and has an API as well. +1 and many thanks.
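For anyone wondering what a two-list term comparison looks like in practice, here is a minimal sketch. It is not Martini's actual method or API, just a generic one-sided Fisher's exact test per term. The function names and the `terms_by_gene` mapping (gene ID to the set of terms found in its linked abstracts) are hypothetical; building that mapping is the text-mining step and is assumed to have been done already.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability of seeing at least `a` genes with the term
    in the first list under the hypergeometric null.
    """
    row1 = a + b          # size of list A
    col1 = a + c          # genes (in either list) that carry the term
    n = a + b + c + d     # all genes in both lists
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


def compare_term_enrichment(terms_by_gene, list_a, list_b):
    """Rank terms by over-representation in list_a relative to list_b.

    terms_by_gene is a hypothetical dict: gene ID -> set of terms.
    Returns (term, count_in_a, count_in_b, p_value) tuples, smallest p first.
    """
    genes = list(list_a) + list(list_b)
    all_terms = set().union(*(terms_by_gene.get(g, set()) for g in genes))
    results = []
    for term in sorted(all_terms):
        a = sum(term in terms_by_gene.get(g, set()) for g in list_a)
        c = sum(term in terms_by_gene.get(g, set()) for g in list_b)
        b = len(list_a) - a
        d = len(list_b) - c
        results.append((term, a, c, fisher_greater(a, b, c, d)))
    return sorted(results, key=lambda r: r[3])
```

With real gene lists you would also want a multiple-testing correction, since one test is run per term.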
I also like GRAIL (a new grail, not the old one). There's a more detailed look at GRAIL, along with some references to other tools that you might want to look at, in a longer post here.
EDIT as afterthought: I should probably also mention Textpresso. But one of the issues with that is that it relies on the set of papers that are loaded in, and if your species isn't one of the collections it wouldn't be as easy to use out-of-the-box. On the other hand, if your species or topic does have a collection, results might be more relevant.
(written 13.9 years ago by Mary; updated 5.2 years ago by Ram)
XplorMed just misses the mark: it finds associations between terms in a set of abstracts using text mining, but it relies on abstracts from a PubMed query or references linked to a database entry, and cannot take a gene list as input.
iHOP uses text mining to tag and normalize gene names and provides a search interface to extract sentences and build gene networks, but it is neither a term extraction text mining tool, nor can it take a gene list as input.
Textpresso does not take a gene list as input and (on pre-defined corpora available at their website) only tags text using fixed ontologies, rather than discovering which terms are enriched in the document set.
GRAIL takes as input a list of Entrez gene IDs, HapMap SNP IDs or genomic regions containing one or more genes, and ranks each gene in each region based on similarity of terms in abstracts associated with each gene. The goal of the system is to prioritize the best candidate genes underlying a common (disease) process; however, it does provide a list of the keywords that are most informative for the ranking. So +1 for GRAIL.
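To make the "rank genes by similarity of terms in their abstracts" idea concrete, here is a rough sketch. It is not GRAIL's actual statistic, just plain cosine similarity between per-gene term-count vectors. The function names and the `term_counts` input (gene ID to a mapping of term to count, taken from that gene's abstracts) are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two term-count mappings."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank_genes(term_counts):
    """Score each gene by its mean similarity to every other gene.

    term_counts is a hypothetical dict: gene ID -> {term: count}.
    Returns (gene, score) pairs, most similar-to-the-rest first.
    """
    scores = {}
    for gene, vec in term_counts.items():
        sims = [cosine(vec, other)
                for g, other in term_counts.items() if g != gene]
        scores[gene] = sum(sims) / len(sims) if sims else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def shared_keywords(term_counts, top_n=5):
    """Terms seen for more than one gene, most widely shared first."""
    genes_per_term = Counter(t for vec in term_counts.values() for t in vec)
    return [t for t, n in genes_per_term.most_common(top_n) if n > 1]
```

A gene whose abstracts share vocabulary with the rest of the list floats to the top, and the shared terms give a crude version of the "informative keywords" output.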
Oh, part of the XplorMed interface is gone... Look at the documentation: http://www.ogic.ca/projects/xplormed/example/index.html. The pink entry point used to take IDs of various types. Huh. We'll have to write to them and see where the other pieces went.
Thanks for the suggestions. I was aware of both systems in other contexts, but just had a closer look to see if they fit the bill. AliBaba takes a PubMed query, not a gene list, and does extract terms and concepts, but does not perform enrichment analysis. It also looks like the AliBaba site is currently down. Whatizit allows UniProt IDs to be submitted, but no other gene IDs, and will tag/normalize entities, but not assess their enrichment in your input gene set. Based on your responses, I've tried to clarify my question a bit more.
Another option is GRAIL by Raychaudhuri S, Daly M, et al. (2009, "Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions", PLoS Genetics).
GREAT is not a text mining tool, nor does it accept a list of genes. GREAT takes an input set of genomic coordinates that are associated with neighboring genes in order to find enrichment of terms in a pre-defined ontology, rather than discovering those terms de novo by text mining.
Yes, I recall hearing about Martini at a conference and came away impressed. BioStar Q&As jog the memory...