Examples Of Full-Text Mining On The PubMed Central Open Access Subset
13.2 years ago

In light of the recent 10-year anniversaries of PubMed Central and the PLoS Open Letter, I've been thinking about whether the promise of Open Access (OA) publishing has actually paid off in terms of high-throughput text-mining on the entire OA corpus. Given Peter Binfield's prediction that by 2016 ~50% of the literature could be published in OA journals, the chance to turn this promise into reality is not far off. Nevertheless, I find it hard to cite examples where text-mining has actually been applied to the entire PMC corpus, which is the basis of my question:

Can you suggest papers that use full-text mining on the entire Open Access subset of PubMed Central?

We've done some work in this area and I am aware of some relevant papers, e.g.:

Other suggestions are most welcome to fill gaps in my knowledge in this area.

I'll select the accepted answer for providing the most additional examples, or for anyone who can find an example (not listed above) that performs full-text mining on PMC OA supplemental files.

EDIT 23 SEP 2011

I've also received a few responses to this question by cross-posting on the bioNLP mailing list:

EDIT 5 OCT 2011

One more from the conference proceedings literature:

EDIT 30 OCT 2011

EDIT 7 NOV 2011

pubmed papers full-text • 12k views

Hmm, this paper predates PMC OA: http://www.biomedcentral.com/1471-2105/4/20


@Michael - yes there are a number of papers like this that use a selection of full-text articles from journal X, but not the entirety of PMC OA.

13.2 years ago
Anna Divoli ▴ 60

Hello Casey,

We also used OA for BioText:

BioText Search Engine: beyond abstract search. http://www.ncbi.nlm.nih.gov/pubmed/17545178

Regards, Anna


Hi Anna, this is just the kind of paper I'm looking for. +1 and welcome to BioStar.


Hello Anna! BioText does not work on the full OA subset, is that right? The website says you provide access to more than 300 journals. According to this list, http://www.ncbi.nlm.nih.gov/pmc/journals/?filter=t1#csvfile, there should be more than 1300 journals, though. Looking at the directories created by a PubMed Central OA download, I see even more directories (which contain supplementals, though).

Could you please clarify this? Thanks.


Hi Joachim, at the time of the system's creation we used all available OA biomedical papers, thus demonstrating what one could do if all journals were OA, i.e., allowing search in full text, including figures and tables. It was a project funded by a grant that has since ended, and although the site is maintained, it is not being updated. Regards, Anna


Ah. Okay. The usual problem. :) Thanks, Anna.

13.2 years ago

I've done some NLP work that leveraged the OA PMC subset. The particular application below used only a subset of PMC OA papers -- those that matched given title/abstract keywords -- but is generally applicable to the whole subset.

Using open access literature to guide full-text query formulation. http://precedings.nature.com/documents/4267/version/2

The results of the analysis were used in a recent paper: Heather A Piwowar (2011) Who Shares? Who Doesn't? Factors Associated with Openly Archiving Raw Research Data. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018657
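
The keyword-based selection of a PMC OA subset described above can be sketched roughly as follows. This is only an illustration and not the code from the Precedings paper: the record layout (dicts with `pmcid`, `title` and `abstract` keys), the sample records and the keyword list are all assumptions made for the example.

```python
def matches_keywords(article, keywords):
    """Return True if any keyword occurs (case-insensitively) in the title or abstract."""
    # Hypothetical record layout: a dict with optional "title" and "abstract" strings.
    haystack = (article.get("title", "") + " " + article.get("abstract", "")).lower()
    return any(keyword.lower() in haystack for keyword in keywords)


def select_subset(articles, keywords):
    """Keep only the articles whose title or abstract mentions at least one keyword."""
    return [article for article in articles if matches_keywords(article, keywords)]


# Made-up records standing in for parsed PMC OA metadata.
articles = [
    {"pmcid": "PMC0000001", "title": "A microarray study of gene expression", "abstract": ""},
    {"pmcid": "PMC0000002", "title": "Field notes on bird song", "abstract": "No arrays here."},
    {"pmcid": "PMC0000003", "title": "Data sharing", "abstract": "We deposited Microarray data."},
]

subset = select_subset(articles, ["microarray"])
selected_ids = [article["pmcid"] for article in subset]
```

Passing a broader keyword list (or skipping the filter altogether) selects a larger share of the corpus, which is the sense in which the approach generalises to the whole subset.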


Thanks for these Heather, but as I tried to make clear above, I'm looking for large-scale applications over the entirety of PMC OA.


OK! Yes, that was clear... I thought it might be relevant because the code can be applied to an arbitrary and large subset of the PMC OA by changing the query parameters, and thereby help make your case that more is better. But I get that it isn't what you want. Cheers.

13.2 years ago
Phil Bourne ▴ 50

A. Prlic, M.A. Martinez, B.T. Yukich, D. Dimitropoulos, B. Beran, P.W. Rose, P.E. Bourne, J.L. Fink (2010) Integration of Open Access Literature into the RCSB Protein Data Bank Using BioLit. BMC Bioinformatics 11:220 is an example where we associate PMC content with database content by mining for database identifiers (trivial, I suppose, by NLP standards). There are RESTful services that allow others to do the same.

Phil Bourne
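
For readers unfamiliar with this kind of identifier mining, a minimal sketch of the idea is shown below. It is not the BioLit implementation. It assumes the classic four-character PDB identifier format (a digit from 1 to 9 followed by three alphanumeric characters) and, to avoid matching years such as 2011, only accepts candidates preceded by a cue such as "PDB", "PDB ID" or "PDB code". The function name and the cue words are choices made for this example.

```python
import re

# Assumed pattern: a "PDB" cue, an optional "ID"/"code"/"entry" word, then a
# four-character identifier starting with a digit 1-9.
PDB_ID = re.compile(
    r"\bPDB(?:\s+(?:ID|code|entry))?[:\s]+([1-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def find_pdb_ids(text):
    """Return the unique PDB identifiers cued in text, upper-cased, in order of first appearance."""
    found = []
    for match in PDB_ID.finditer(text):
        pdb_id = match.group(1).upper()
        if pdb_id not in found:
            found.append(pdb_id)
    return found


sentence = (
    "The structure (PDB ID: 1TIM) was compared with PDB 4hhb "
    "and PDB code 1tim in a 2011 study."
)
ids = find_pdb_ids(sentence)
```

Requiring the cue trades recall for precision: bare identifiers are missed, and a phrase such as "PDB 2011" would still be a false positive, so a real pipeline would also check each candidate against the list of existing database entries.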


Thanks for this Phil. I had the original BioLit paper on the list above, but this newer application hadn't been tagged properly in my citeulike library. +1 and welcome to BioStar -- it's great to have a pioneer and leader in the field contributing to this community!

13.2 years ago
Joachim ★ 2.9k

Hello Casey!

You might be interested in this blog post: http://joachimbaran.wordpress.com/2011/07/25/opacmo-new/

There I describe a text-mining solution that links PMC's OA subset to genes, species, diseases, cellular components, biological processes and molecular functions. The service is currently being bootstrapped, but since the blog post I have already imported the complete PMC text-mining run into the database and I am only working on the final query optimisations now.

I will write a blog-post about the first release of opacmo soon, but I can already say that the preliminary numbers look like this:

Publications           144,759
Linked Entrez genes    200,050
Linked species          11,032
Linked ontology terms    9,559
  from gene ontology     6,495
  from disease ontology  3,064

Joachim


Thanks for these Joachim, but I am looking for peer-reviewed journal articles.



Of course, sorry about that. I will get in touch with you when an opacmo paper gets accepted, which will probably be next year. Perhaps we can then compare our approaches, since you seem to be working in the same area. Let me know if you would like to share something. You can find opacmo at https://github.com/joejimbo/opacmo (I will submit a change to deal with the new PubMed Central filename formatting this weekend).

13.0 years ago
Sudeep ★ 1.7k

This answer might be a little late, but have you looked into the BioCreative III proceedings? There, the gene name normalization task was based entirely on OA articles.


BCIII was restricted to small training and test sets of far fewer than 1000 documents (http://www.biomedcentral.com/1471-2105/12/S8/S1), not the entirety of the PMC OA subset.
