Gene-Name Recognition Webservice Or Downloadable Tool
13.7 years ago
Will 4.6k

I'm looking for a tool (either webservice or downloadable code) which can identify and disambiguate gene mentions in free text. I need the disambiguation to attribute the gene mention to a database ID (any database will do, I can convert between them). I need to annotate genes from multiple species (many are non-human).

I've been using the Whatizit tool from EBI but it's having difficulty detecting the organism ... If you want to take a look, I've got my analysis code on GitHub.

Thanks

gene text • 4.3k views

@brent: I saw that post but doing simple string matching with just gene synonyms is very ineffective.

Also look at this question.

13.7 years ago
Joachim ★ 2.9k

You can try GNAT, http://cbioc.eas.asu.edu/gnat/, which is one of many gene name recognition tools. It is tuned for efficiency, so you can go through a lot of documents in a very short time. It can also interface with the species recognition tool LINNAEUS, http://linnaeus.sourceforge.net/, which can guide GNAT to pick the right species for each gene mention.

There is also a website that runs a competition between recognition tools: http://www.biocreative.org/. If you manage to wade through their extremely horrible website, you should find a very rich and detailed analysis of the quality of more than a dozen gene name recognition tools.

I know of biocreative ... and the code is usually good. But I find that they usually require a pre-tagged corpus and they don't always include one in the distribution.

I'll try out GNAT though, thanks for the link.

Just tried the webserver; it doesn't seem to actually work (it just gives an unhelpful error message). Also, since I need to annotate many species (i.e. all bacterial species), this won't really work. I do like the idea of merging LINNAEUS with the Whatizit tool to help disambiguate.
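
To make the "use species mentions to disambiguate" idea concrete, here is a minimal Python sketch. It is not how GNAT, LINNAEUS or Whatizit do it; it only assumes you already have, from whatever taggers you use, the character offset of each gene mention, its candidate (taxon ID, gene ID) pairs, and the offsets of the species mentions in the same text. The function name and data shapes are hypothetical.

```python
def pick_species(gene_offset, candidates, species_mentions):
    """Choose among candidate (taxon_id, gene_id) pairs for one gene mention.

    `species_mentions` is a list of (offset, taxon_id) pairs found in the
    same text. The candidate whose species is mentioned closest to the gene
    wins; returns None when no candidate species appears in the text.
    """
    best = None
    best_distance = None
    for taxon_id, gene_id in candidates:
        # Distances from the gene mention to every mention of this species.
        distances = [
            abs(offset - gene_offset)
            for offset, taxon in species_mentions
            if taxon == taxon_id
        ]
        if not distances:
            continue  # this species is never mentioned, skip the candidate
        nearest = min(distances)
        if best_distance is None or nearest < best_distance:
            best, best_distance = (taxon_id, gene_id), nearest
    return best
```

Nearest-mention is only a heuristic; papers that discuss several species in the same sentence will still trip it up.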

GNAT is actually a standalone tool, so I would ignore the web service. At my previous lab, we used the GNAT+LINNAEUS combo over MEDLINE for the 16 default species. It should be no problem to switch to different sets of species though, because that's what the tools are written for. It is more a question of whether you have enough memory for the gene/species dictionaries you would like to load. You might need to divide and conquer if you cannot load all dictionaries in one go...
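
As a rough illustration of the divide-and-conquer idea (not GNAT's actual interface), the Python sketch below tags the same documents against a few species dictionaries at a time. `load_dictionary` is a hypothetical loader you would supply yourself, returning a {gene_name: gene_id} mapping for one species, and the token lookup is deliberately naive.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def annotate_in_batches(documents, species_ids, load_dictionary, batch_size):
    """Tag documents against species dictionaries, a few species at a time.

    `documents` maps doc_id -> text. `load_dictionary(species_id)` is a
    hypothetical loader returning {gene_name: gene_id}. At most `batch_size`
    dictionaries are held in memory at once.
    """
    hits = {doc_id: [] for doc_id in documents}
    for batch in batched(species_ids, batch_size):
        # Rebinding this name each round lets the previous batch be freed.
        dictionaries = {sp: load_dictionary(sp) for sp in batch}
        for doc_id, text in documents.items():
            tokens = [tok.strip(".,;()") for tok in text.split()]
            for sp, names in dictionaries.items():
                for word in tokens:
                    if word in names:
                        hits[doc_id].append((word, sp, names[word]))
    return hits
```

The cost is one pass over the corpus per batch, so the batch size is a straight trade of memory against run time.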

Okay, I'll take a closer look at the stand-alone version.

We have the same experience. It is easier (although not without problems) to get the right gene than it is to get the right species, especially if there is more than one species in the paper. I would like to hear your experiences with LINNAEUS.

Hi Chris! LINNAEUS is fast, but it does only simple string matching. This works all right because Latin species names are unlikely to be confused with anything else. Of course, this approach fails for common species names. You can have a look at the string matching performance between LINNAEUS and a 172-line Ruby script that I have written: http://joachimbaran.wordpress.com/2010/11/01/named-entity-recognition-mesh-in-medline/
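
For readers who have not seen dictionary-based matching, here is a minimal Python sketch of the general idea. It is not LINNAEUS's implementation and not the Ruby script linked above; the three-entry dictionary is a toy (the values are NCBI Taxonomy IDs) and the function names are made up for illustration.

```python
import re

# Toy dictionary: lower-cased species name -> NCBI Taxonomy ID.
SPECIES = {
    "escherichia coli": "562",
    "homo sapiens": "9606",
    "mus musculus": "10090",
}


def build_matcher(names):
    """Compile one case-insensitive regex matching any of the given names."""
    # Longest names first, so a longer name wins over a shorter prefix.
    alternatives = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in alternatives) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_species(text, names=SPECIES):
    """Return (offset, matched text, taxon ID) for each species mention."""
    matcher = build_matcher(names)
    return [
        (m.start(), m.group(0), names[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]
```

Adding a common name such as "mouse" to the dictionary would make the same code fire on every ordinary use of the word, which is the failure mode described above.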

13.7 years ago

We have been doing some large-scale gene name recognition/normalization work in the past couple of years and chose to use Jörg Hakenberg's GNAT for our projects. The link Joachim points to is out of date and that webservice is no longer in use, but there is now a distribution of GNAT on SourceForge that you can play with. We are also now hosting a webservice for GNAT, which is still in development and is currently only serving gene and species entity recognition for human, mouse and Drosophila. Feedback on how this service is running would be much appreciated. In general, GNAT's main limitation is that it requires memory-intensive species-specific gene name dictionaries, so if you are trying to normalize gene names simultaneously for a large number of species, you'll need a high-memory machine.

Another system we are working with now is GeneTuKit, which by some measures was the top-ranking system in the latest BioCreative challenge. A distribution of GeneTuKit can be found here. From what I understand from the student doing this work in my lab (Martin Gerner), this system is running well, but you may want to contact the authors or him for more details.

Finally, from your GitHub, it seems that your ultimate aim is mutation detection. Apologies if you are aware of this already, but there are several mutation detection text mining systems that have already been developed that you may want to take a look at first or compare your system to.

I went through the papers you've mentioned before starting this project (I hate re-inventing the wheel) ... but for various reasons I have gotten only one to work properly (MutationFinder from Caporaso et al., for mutation detection). The MEMA paper only provides a methodology (the link in the paper only refers to a database of extracted mentions and not actual code or a webservice). CoagMB and MuGeX don't actually have code attached and their webservices can't accept arbitrary text. And VTag has suffered from link-death.

If you have working code for any of them, I'd be glad to see it.

I'm not really trying for a killer app ... this is just something to make other parts of my literature search more complete and "unbiased".

Nope, we've never tried these systems in the lab ourselves, so thanks for the status report. Sounds like there is still some room for the killer app!
