Question

Examples Of Full-Text Mining On The PubMed Central Open Access Subset

20

13.2 years ago

Casey Bergman 18k

In light of the recent 10-year anniversaries of PubMed Central and the PLoS Open Letter, I've been thinking about whether the promise of Open Access (OA) publishing has actually paid off in terms of high-throughput text-mining on the entire OA corpus. Given Peter Binfield's predictions that by 2016 ~50% of the literature could be published in OA journals, the ability to turn this promise into a reality is not far off. Nevertheless, I find it hard to cite examples where text-mining has actually been applied to the entire PMC corpus, which is the basis of my question:

Can you suggest papers that use full-text mining on all of the Open Access subset of PubMed Central?

We've done some work in this area and I am aware of some relevant papers, e.g.:

Annotating genes and genomes with DNA sequences extracted from biomedical articles. http://www.ncbi.nlm.nih.gov/pubmed/21325301
Figure text extraction in biomedical literature. http://www.ncbi.nlm.nih.gov/pubmed/21249186
LINNAEUS: a species name identification system for biomedical literature. http://www.ncbi.nlm.nih.gov/pubmed/20149233
Systematic Characterizations of Text Similarity in Full Text Biomedical Publications http://www.ncbi.nlm.nih.gov/pubmed/20856807
Yale Image Finder (YIF): a new search engine for retrieving biomedical images. http://www.ncbi.nlm.nih.gov/pubmed/18614584
BioLit: integrating biological literature with databases. http://www.ncbi.nlm.nih.gov/pubmed/18515836
Figure mining for biomedical research. http://www.ncbi.nlm.nih.gov/pubmed/19439564

Other suggestions are most welcome to fill gaps in my knowledge in this area.

I'll select the accepted answer for providing the most additional examples, or for anyone who can find an example (not listed above) that performs full-text mining on PMC OA supplemental files.

EDIT 23 SEP 2011

I've also received a few responses to this question by cross-posting on the bioNLP mailing list:

Author keywords in biomedical journal articles. http://www.ncbi.nlm.nih.gov/pubmed/21347036 (credit to Aurélie Névéol)
UKPMC: a full text article resource for the life sciences. http://www.ncbi.nlm.nih.gov/pubmed/21062818 (credit to C.J. Rupp)

EDIT 5 OCT 2011

One more from the conference proceedings literature:

An exploration of mining gene expression mentions and their anatomical locations from biomedical text http://dl.acm.org/citation.cfm?id=1869970 (credit to Martin Gerner)

EDIT 30 OCT 2011

Extraction of data deposition statements from the literature: a method for automatically tracking research results. http://www.ncbi.nlm.nih.gov/pubmed/21998156 (credit to Google)
BioNOT: A searchable database of biomedical negated sentences. http://www.ncbi.nlm.nih.gov/pubmed/22032181 (credit to BMC TOC)

EDIT 7 NOV 2011

Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. http://www.ncbi.nlm.nih.gov/pubmed/18229722 (credit to @maximilianh)

pubmed papers full-text • 12k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 13.2 years ago by Casey Bergman 18k

0

Hmm, this paper predates PMC OA: http://www.biomedcentral.com/1471-2105/4/20

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.2 years ago by Michael Kuhn 5.0k

0

@Michael - yes there are a number of papers like this that use a selection of full-text articles from journal X, but not the entirety of PMC OA.

ADD REPLY • link 13.2 years ago by Casey Bergman 18k

Ram · Answer 1 · 2011-09-21

6

13.2 years ago

Anna Divoli ▴ 60

Hello Casey,

We also used OA for BioText:

BioText Search Engine: beyond abstract search. http://www.ncbi.nlm.nih.gov/pubmed/17545178

Regards, Anna

ADD COMMENT • link 13.2 years ago by Anna Divoli ▴ 60

0

Hi Anna, this is just the kind of paper I'm looking for. +1 and welcome to BioStar.

ADD REPLY • link 13.2 years ago by Casey Bergman 18k

0

Hello Anna! BioText does not work on the full OA subset, is that right? On the website it says you provide access to more than 300 journals. According to this list, http://www.ncbi.nlm.nih.gov/pmc/journals/?filter=t1#csvfile, there should be more than 1300 journals, though. Looking at the directories created by a PubMed Central OA download, I see even more directories (which contain supplementals, though).

Could you please clarify this? Thanks.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.2 years ago by Joachim ★ 2.9k

0

Hi Joachim, At the time of the system's creation we used all available OA biomedical papers, thus demonstrating what one could do if all journals were OA, i.e., search the full text including figures and tables. The project was funded by a grant that has since ended, so although the site is maintained, it is not being updated. Regards, Anna

ADD REPLY • link 13.2 years ago by Anna Divoli ▴ 60

0

Ah. Okay. The usual problem. :) Thanks, Anna.

ADD REPLY • link 13.2 years ago by Joachim ★ 2.9k

score 5 · Answer 2 · 2011-09-20

5

13.2 years ago

Heather Piwowar ▴ 380

I've done some NLP work that leveraged the OA PMC subset. The particular application below used only a subset of PMC OA papers -- those that matched given title/abstract keywords -- but is generally applicable to the whole subset.

Using open access literature to guide full-text query formulation. http://precedings.nature.com/documents/4267/version/2

The results of the analysis were used in a recent paper: Heather A Piwowar (2011) Who Shares? Who Doesn't? Factors Associated with Openly Archiving Raw Research Data. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018657
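For readers who want to see what selecting papers by title/abstract keywords can look like, here is a minimal Python sketch. It is not the code from the Precedings paper: the record fields (`pmcid`, `title`, `abstract`), the placeholder IDs, and the sample keyword are all hypothetical, and it assumes the article metadata has already been parsed into dictionaries.

```python
def matches_keywords(article, keywords):
    """True if any keyword occurs in the article's title or abstract (case-insensitive)."""
    text = f"{article.get('title', '')} {article.get('abstract', '')}".lower()
    return any(keyword.lower() in text for keyword in keywords)


def select_articles(articles, keywords):
    """Return the IDs of all articles whose title/abstract match at least one keyword."""
    return [a["pmcid"] for a in articles if matches_keywords(a, keywords)]


# Hypothetical records; a real run would iterate over parsed PMC OA metadata.
records = [
    {"pmcid": "PMC0000001", "title": "Microarray analysis of gene expression",
     "abstract": "We deposited the raw data in a public repository."},
    {"pmcid": "PMC0000002", "title": "A survey of species names in the literature",
     "abstract": "No expression data were generated."},
]

selected = select_articles(records, ["microarray"])
print(selected)
```

Widening or narrowing the keyword list changes how much of the subset is selected, which is the sense in which such an approach scales to a larger portion of PMC OA.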

ADD COMMENT • link 13.2 years ago by Heather Piwowar ▴ 380

0

Thanks for these Heather, but as I tried to make clear above, I'm looking for large-scale applications over the entirety of PMC OA.

ADD REPLY • link 13.2 years ago by Casey Bergman 18k

0

ok! yes, that was clear... thought it might be relevant because the code can be applied to an arbitrary and large subset of the PMC OA by changing the query parameters, and thereby help make your case that more is better. But I get it isn't what you want. Cheers.

ADD REPLY • link 13.2 years ago by Heather Piwowar ▴ 380

score 5 · Answer 3 · 2011-09-22

5

13.2 years ago

Phil Bourne ▴ 50

A. Prlic, M.A. Martinez, B.T. Yukich, D. Dimitropoulos, B. Beran, P.W. Rose, P.E. Bourne, J.L. Fink (2010) Integration of Open Access Literature into the RCSB Protein Data Bank Using BioLit. BMC Bioinformatics 11:220 is an example where we associate PMC content with database content through mining for database identifiers (trivial, I suppose, by NLP standards). There are RESTful services that allow others to do the same.

Phil Bourne
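As a rough illustration of what mining full text for database identifiers can involve (this is not BioLit's actual implementation), the Python sketch below pulls PDB-style IDs out of a passage. The pattern is an assumption: it treats a classic PDB ID as a digit 1-9 followed by three alphanumeric characters, and it only accepts candidates that directly follow an explicit "PDB" cue so that years and similar four-character tokens are not picked up. The sample sentence and IDs are made up.

```python
import re

# Assumption: classic PDB IDs are one digit (1-9) plus three alphanumerics,
# and only candidates introduced by a "PDB" cue are trusted.
PDB_MENTION = re.compile(
    r"\bPDB(?:\s+(?:ID|code|entry))?[\s:]+([1-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def find_pdb_ids(text):
    """Return the sorted, upper-cased set of PDB-like IDs mentioned in text."""
    return sorted({m.group(1).upper() for m in PDB_MENTION.finditer(text)})


# Made-up sentence for illustration only.
sample = "The structure (PDB ID: 1abc) was compared with PDB entry 2XYZ in 2011."
print(find_pdb_ids(sample))
```

Requiring the cue word trades recall for precision; a production system would also need to validate candidates against the database itself.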

ADD COMMENT • link 13.2 years ago by Phil Bourne ▴ 50

0

Thanks for this Phil. I had the original BioLit paper on the list above, but this newer application hadn't been tagged properly in my citeulike library. +1 and welcome to BioStar -- it's great to have a pioneer and leader in the field contributing to this community!

ADD REPLY • link 13.2 years ago by Casey Bergman 18k

Ram · Answer 4 · 2011-09-20

2

13.2 years ago

Joachim ★ 2.9k

Hello Casey!

You might be interested in this blog post: http://joachimbaran.wordpress.com/2011/07/25/opacmo-new/

There I describe a text-mining solution that links PMC's OA subset to genes, species, diseases, cellular components, biological processes and molecular functions. The service is currently being bootstrapped, but since the blog post I have already imported the complete PMC text-mining run into the database and I am now only working on the last query optimisations.

I will write a blog-post about the first release of opacmo soon, but I can already say that the preliminary numbers look like this:

Publications           144,759
Linked Entrez genes    200,050
Linked species          11,032
Linked ontology terms    9,559
  from gene ontology     6,495
  from disease ontology  3,064

Joachim

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.2 years ago by Joachim ★ 2.9k

0

Thanks for these Joachim, but I am looking for peer-reviewed journal articles.

ADD REPLY • link 13.2 years ago by Casey Bergman 18k

0

Of course, sorry about that. I will get in touch with you when an opacmo paper gets accepted, which will probably be next year. Perhaps we can then compare our approaches, since you seem to be working in the same area. Let me know if you would like to share something. You can find opacmo at https://github.com/joejimbo/opacmo (I will submit a change to deal with the new PubMed Central filename formatting this weekend).

ADD REPLY • link 13.2 years ago by Joachim ★ 2.9k

Ram · Answer 5 · 2011-11-28

0

13.0 years ago

Sudeep ★ 1.7k

This answer might be a little late, but have you looked into the BioCreative III proceedings? There, the gene name normalization task was based entirely on OA articles.

ADD COMMENT • link 13.0 years ago by Sudeep ★ 1.7k

0

BCIII was restricted to small training and testing sets of <<1000 documents (http://www.biomedcentral.com/1471-2105/12/S8/S1), not the entirety of the PMC OA subset.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.0 years ago by Casey Bergman 18k