How To Find Best Matching Pmid Given A Blob Of Text
13.4 years ago

I've recently been looking into the text corpora of the BioCreative I Challenge for a project using gene name normalization text mining methods, and found that (unfortunately) the training and testing data consist of abstracts that are labelled by internal BioCreative IDs (e.g. fly00001training.txt, fly00001testing.txt) instead of standard PMIDs.

This problem raises the more general question of how to find the best matching PMID in PubMed given a blob of text like the following:

Dorsal-ventral patterning within the ectodermal and mesodermal germ layers of Drosophila and Xenopus embryos is specified by a system of genes that has been conserved over 500 million years of evolution. In both organisms, the activity of the TGF-beta family member DPP/BMP4 is antagonized by SOG/CHORDIN. A second Xenopus gene, noggin, has a similar biological activity to chordin. Analysis of the action of these genes indicate that Spemann's organizer promotes dorsal cell fates in Xenopus by antagonizing a ventralizing signal encoded by the Bmp4 gene.

Entering this into the PubMed search interface pulls up only one PMID (8791529), which is an exact match to the text, and clearly the correct answer. But I've had no luck with using standard eutils queries or the JANE API to do this because they are choking on common "stop" words in different ways.

A solution that uses a remote web service would be preferred, since there are only a few hundred BC I abstracts to map to PMIDs.

Many thanks, Casey

text pubmed • 3.0k views

For those interested in this particular problem, @Nathan Harmston has kindly provided a look-up table between BC I IDs and PMIDs here.

13.4 years ago

But I've had no luck with using standard eutils queries

Casey, are you sure about NCBI-eUtils? I got only one result too with eSearch (PMID:8791529)...

curl -L http://goo.gl/4npgC

<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
  <Count>1</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>8791529</Id>
  </IdList>
(...)

OK, this looks very promising. The difference between your query and mine was that yours wraps the blob in "double quotes". I've tried a couple of other, longer abstracts and I'm getting a "Bad Gateway!" error, which goes away when I truncate the abstract text. The truncated text gets the correct PMID, so it may be a matter of finding the upper limit on the length of the string that can be passed to eutils and wrapping it in double quotes. Many thanks!
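A minimal sketch of the truncate-and-quote idea described above. The function name `quoted_query` and the `max_chars` default are illustrative assumptions: the thread does not establish what the real length limit is, so the value would have to be found by trial.

```python
def quoted_query(text, max_chars=300):
    """Truncate text at a word boundary and wrap it in double quotes.

    max_chars is a guess, not a documented E-utilities limit.
    """
    kept = []
    length = 0
    # split() also collapses line breaks and repeated spaces in the abstract.
    for word in text.split():
        # One extra character for the joining space, except before the first word.
        extra = len(word) + (1 if kept else 0)
        if length + extra > max_chars:
            break
        kept.append(word)
        length += extra
    # An embedded double quote would close the phrase early, so drop them.
    phrase = " ".join(kept).replace('"', "")
    return '"' + phrase + '"'


abstract = ("Dorsal-ventral patterning within the ectodermal and mesodermal "
            "germ layers of Drosophila and Xenopus embryos is specified by a "
            "system of genes that has been conserved over 500 million years "
            "of evolution.")
print(quoted_query(abstract, max_chars=80))
```

Cutting at a word boundary rather than mid-word keeps the phrase an exact prefix of the abstract, which is what a quoted phrase search needs.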


Depending on which type of query you are using, this could also be a URL-encoding issue rather than a character limit.
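To rule out the encoding explanation, the query string can be built with the standard library so that quotes, spaces and slashes are percent-encoded. This is only a sketch: the helper name `esearch_url` is made up for illustration, and whether proper encoding makes the "Bad Gateway!" error go away is not something this thread confirms.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=10):
    """Return an ESearch URL for PubMed with the term safely URL-encoded."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    # urlencode percent-encodes quotes and slashes and turns spaces into '+'.
    return ESEARCH + "?" + urlencode(params)


term = ('"the activity of the TGF-beta family member DPP/BMP4 '
        'is antagonized by SOG/CHORDIN"')
url = esearch_url(term)
print(url)

# Decoding the query string again should give back the original term.
decoded = parse_qs(urlsplit(url).query)["term"][0]
print(decoded == term)
```

The URL can then be fetched with curl or any HTTP client; no request is made by the snippet itself.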

13.3 years ago
Yogesh Pandit ▴ 520

BioPython supports this

from Bio import Entrez

# NCBI asks for a contact address with each request; replace this placeholder.
Entrez.email = "you@example.org"

# Search PubMed using the whole abstract as the query term.
handle = Entrez.esearch(db="pubmed", retmax=10, term="Dorsal-ventral patterning within the ectodermal and mesodermal germ layers of Drosophila and Xenopus embryos is specified by a system of genes that has been conserved over 500 million years of evolution. In both organisms, the activity of the TGF-beta family member DPP/BMP4 is antagonized by SOG/CHORDIN. A second Xenopus gene, noggin, has a similar biological activity to chordin. Analysis of the action of these genes indicate that Spemann's organizer promotes dorsal cell fates in Xenopus by antagonizing a ventralizing signal encoded by the Bmp4 gene.")
record = Entrez.read(handle)
handle.close()
print(record["Count"])
print(record["IdList"])

The output you get is

1
['8791529']
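Since the original question involves a few hundred abstracts, the single lookup could be wrapped in a loop along these lines. This is a rough sketch, not a tested pipeline: `map_abstracts` and `entrez_search` are hypothetical helper names, the e-mail address is a placeholder, and the pause is chosen to stay under NCBI's limit of three requests per second for clients without an API key.

```python
import time


def map_abstracts(abstracts, search, pause=0.4):
    """Map each abstract name to a PMID, or None if the hit is not unique.

    abstracts: dict of name -> abstract text
    search:    callable taking the text and returning a list of PMIDs
    """
    mapping = {}
    for name, text in abstracts.items():
        ids = search(text)
        # Accept only unambiguous matches; zero or several hits need a manual look.
        mapping[name] = ids[0] if len(ids) == 1 else None
        # Be polite to the remote service between requests.
        time.sleep(pause)
    return mapping


def entrez_search(text):
    """Search PubMed through Biopython and return the list of matching PMIDs."""
    from Bio import Entrez  # imported here so map_abstracts works without Biopython
    Entrez.email = "you@example.org"  # placeholder, replace with a real address
    handle = Entrez.esearch(db="pubmed", retmax=10, term=text)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])
```

With the abstracts loaded into a dict keyed by file name, a call would look like `map_abstracts(abstracts, entrez_search)`; entries that come back as None are the ones to check by hand.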