How Do People Go About PubMed Text Mining?
6
29
13.4 years ago
Lyco ★ 2.3k

I am asking this purely out of curiosity (I have no plans to actually do this), but I wonder how the various groups who analyse PubMed abstracts for co-citation or gene interactions (or anything else) actually access the data?

Do they run a lot of batch PubMed searches and parse the results, do they access the PubMed database via some kind of web-service API, or is there a way of downloading the entire corpus and doing the analyses locally?

Assuming that there are several possibilities, what would be the preferred way?

pubmed text • 38k views
24
13.4 years ago

I think the answer depends on the scale of your problem. If you want to analyze hundreds or thousands of documents, then use the NCBI E-utilities to fetch documents from PubMed. If you have to do hardcore text/data mining on millions of documents, you'll need to get a local copy of MEDLINE and PubMed Central. For MEDLINE, this involves getting a license. For PubMed Central, you can download the Open Access subset without a license via FTP.
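
As a rough illustration of the E-utilities route (the query term, batch size and sleep interval below are just placeholders, and for serious use NCBI asks you to identify yourself with an email/API key):

```python
import time
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def fetch_abstracts(query, retmax=100):
    # esearch: get PMIDs matching the query
    r = requests.get(f"{EUTILS}/esearch.fcgi",
                     params={"db": "pubmed", "term": query,
                             "retmax": retmax, "retmode": "json"})
    pmids = r.json()["esearchresult"]["idlist"]

    # efetch: pull the records as plain-text abstracts (XML is also available)
    abstracts = []
    for i in range(0, len(pmids), 50):           # fetch in batches of 50 IDs
        batch = ",".join(pmids[i:i + 50])
        r = requests.get(f"{EUTILS}/efetch.fcgi",
                         params={"db": "pubmed", "id": batch,
                                 "rettype": "abstract", "retmode": "text"})
        abstracts.append(r.text)
        time.sleep(0.4)                          # stay under NCBI's rate limit
    return abstracts

print(fetch_abstracts("glutathione S-transferase")[0][:500])
```

Biopython's Bio.Entrez module wraps the same endpoints if you prefer not to build the URLs by hand.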

We use local copies of MEDLINE and PMC for our text-mining work, and access the text either through SQL databases (MEDLINE, PMC) or sometimes the filesystem (PMC with supplements). MEDLINE is fairly straightforward to work with, although a common gotcha is that many citations have no abstracts, so you need to account for this in any quantification of your results. PMC is much more difficult, since the text comes as XML, plain text and PDF, and the supplemental files come in a bewildering diversity of forms. A PMC gotcha is that not all PMC documents are in PubMed, and quantification must extrapolate from the roughly 1% of the literature that is in PMC OA to the totality of PubMed.

There are published methods for transforming MEDLINE into a SQL database, but they are likely out of date. The best current method to parse MEDLINE into a SQL DB would probably be a good question to post on BioStar (code golf, anyone?). Lars has reformatted PMC OA here, and it would probably be good to get his advice/code on how best to do this.
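
There is no single canonical recipe, but as a rough sketch (standard library only; the element names follow the PubMed/MEDLINE XML format, the table layout is just an example), the transformation can be as simple as:

```python
import sqlite3
import xml.etree.ElementTree as ET

def load_medline_xml(xml_path, db_path="medline.sqlite"):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS citations (
                       pmid INTEGER PRIMARY KEY,
                       title TEXT,
                       abstract TEXT)""")

    tree = ET.parse(xml_path)
    for article in tree.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        title = article.findtext(".//ArticleTitle")
        # Many citations have no abstract at all -- store NULL in that case
        parts = [el.text or "" for el in article.iter("AbstractText")]
        abstract = " ".join(parts) if parts else None
        con.execute("INSERT OR REPLACE INTO citations VALUES (?, ?, ?)",
                    (pmid, title, abstract))
    con.commit()
    con.close()

load_medline_xml("medline_sample.xml")
```

For the full MEDLINE baseline you would stream the gzipped files with ET.iterparse rather than loading each file whole, but the shape of the solution is the same.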

5

With regard to databases, it's very easy to parse PubMed XML into a hash and just drop it into a document-oriented database, such as MongoDB. This is the approach I used for my PubMed retractions project: https://github.com/neilfws/PubMed.
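
Neil's project is linked above; purely as an illustration of that hash-and-insert pattern (in Python here, with made-up field choices and collection names):

```python
import xml.etree.ElementTree as ET
from pymongo import MongoClient

def pubmed_xml_to_docs(xml_path):
    # Turn each PubmedArticle element into a plain dict ("hash")
    for article in ET.parse(xml_path).iter("PubmedArticle"):
        yield {
            "_id": article.findtext(".//PMID"),
            "title": article.findtext(".//ArticleTitle"),
            "abstract": article.findtext(".//AbstractText"),
            "journal": article.findtext(".//Journal/Title"),
        }

client = MongoClient()                       # local MongoDB on the default port
collection = client["pubmed"]["citations"]   # hypothetical db/collection names
for doc in pubmed_xml_to_docs("pubmed_sample.xml"):
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```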

1

Great, thanks for the thoughtful answer. I learned a lot.

4
13.4 years ago
Scott Cain ▴ 770

You could try Textpresso:

http://gmod.org/wiki/Textpresso
http://www.textpresso.org/

which is a tool for analyzing whatever corpus you feed it. It knows about biological terms, so you can search for things like "gene A suppresses gene B"; it will do a semantic search of the corpus and return the full sentences that support the result. You can look at a running example for E. coli here:

http://ecocyc.org/ecocyc/textpresso.shtml

and an older version for C. elegans here:

http://www.textpresso.org/celegans/

3
13.4 years ago

We actually published what we did here. You can also do less complicated but interesting things, like what we did here (sorry, not open access).

Updated in response to the comments.

We downloaded content from PubMed (using the license for full text, although the abstracts alone would actually have done). Our approach is really content-directed; that is central to the way we built the corpus.

We first tokenised the text (you could say broke it into words) and then combined words into multi-word tokens that are lemmas. These are meaningful terms: glutathione S-transferase would be one token, not two or three, and a lemma should cover all synonyms. Being able to find the lemmas is one of the reasons semantic-web approaches like the concept web are so important. We then counted the occurrence of those lemmas in each individual abstract.
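
A toy version of that tokenise-and-count step might look like this; the two-entry lemma table just stands in for a real lemma dictionary, which is the genuinely hard part:

```python
import re
from collections import Counter

# Hypothetical stand-in for a real lemma dictionary: every synonym/spelling
# maps to one canonical multi-word token.
LEMMAS = {
    "glutathione s-transferase": "glutathione_s-transferase",
    "gst": "glutathione_s-transferase",   # naive substring match, toy only
}

def tokenise(abstract):
    text = abstract.lower()
    # Collapse known multi-word terms into single tokens before splitting
    for phrase, lemma in LEMMAS.items():
        text = text.replace(phrase, lemma)
    return re.findall(r"[a-z0-9_\-]+", text)

def token_counts(abstract):
    return Counter(tokenise(abstract))

print(token_counts("GST (glutathione S-transferase) activity was measured."))
```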

Next we used a set of publications that we knew to be relevant for the topic we were after (carotenoids); this initial corpus was assembled by asking experts and by taking the references from the existing pathway. We counted all tokens in that set of texts and, using the counts mentioned above, created a vector of tokens that occurred typically in that corpus. In essence, the difference between that vector and the vector describing all of PubMed determines what is specific for your start corpus (the texts known to be about your topic). After that we compared the corpus vector with the individual vector of every abstract in PubMed, to find the abstracts whose descriptive token vector was close to the one describing our start corpus; matching texts were added to the corpus. In other words, we added to our corpus the papers that contained the same distribution of words as the ones we already had. So we did not use all of PubMed, only the relevant papers, but we found the relevant ones using an automated procedure.
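
Sketched in code, with invented names, an arbitrary similarity measure (cosine) and an arbitrary threshold, the corpus-expansion idea looks roughly like this; the actual weighting in the paper may well differ:

```python
import math
from collections import Counter

def freq_vector(counters):
    """Merge per-abstract token counts into one relative-frequency vector."""
    total = Counter()
    for c in counters:
        total.update(c)
    n = sum(total.values()) or 1
    return {tok: cnt / n for tok, cnt in total.items()}

def specific_vector(seed_vec, background_vec):
    """Keep what is over-represented in the seed corpus vs. all of PubMed."""
    return {t: f - background_vec.get(t, 0.0)
            for t, f in seed_vec.items() if f > background_vec.get(t, 0.0)}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def expand_corpus(seed_counts, background_counts, all_counts, threshold=0.3):
    target = specific_vector(freq_vector(seed_counts),
                             freq_vector(background_counts))
    # Add every abstract whose token profile is close enough to the target
    return [pmid for pmid, counts in all_counts.items()
            if cosine(target, freq_vector([counts])) >= threshold]
```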

Next we used these same lists of lemmatized tokens to find terms that were over-represented in our new, extended corpus and that we did not already have in the pathway we wanted to extend, and had the results judged by experts.
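
That final ranking could be as crude as a frequency ratio against the PubMed background (reusing freq_vector from the sketch above; the cut-off is again arbitrary), with anything high-scoring that is not already in the pathway going to the expert reviewers:

```python
def over_represented(corpus_vec, background_vec, min_ratio=5.0):
    """Rank tokens that occur far more often in the corpus than in PubMed."""
    scored = [(tok, freq / background_vec.get(tok, 1e-9))
              for tok, freq in corpus_vec.items()]
    return sorted((t for t in scored if t[1] >= min_ratio),
                  key=lambda x: x[1], reverse=True)
```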

1

Wait, you mean I have to read a paper?? Just kidding, thanks a lot!

0

OK, I checked your paper and see that you used the 'do the PubMed search first, analyse the text later' method, because you were only interested in a particular topic. Would you do the same if you were to parse e.g. "all regulatory interactions between human proteins"? By the way, I love the term 'lemmatized token', even if I have no idea what it is.

0

I have added a short description of the method to the text. As you can see, the method starts with a small corpus specific to the domain, so it really makes no sense to apply it to very broad questions.

3
13.4 years ago
Pablo Pareja ★ 1.6k

Hi!

If you're interested in PubMed/citation information around proteins, you can have a look at the Bio4j open-source project.

Regarding citation information, you can find the info UniProt (Swiss-Prot + TrEMBL) provides, which basically covers:

  • Articles
  • Online articles
  • Thesis
  • Books
  • Submissions
  • Unpublished observations

Here you can see a model of the entities implemented regarding citations (if you click on the shapes/links you'll be redirected to the corresponding classes).

Besides, since much more information is included in Bio4j and everything's linked together (it's a graph DB), you can take advantage of any extra information connected to any of the entities involved in your query/study.

Cheers,
Pablo

3
13.4 years ago
Gareth Palidwor ★ 1.6k

I recommend contacting NCBI for a copy of the XML data rather than screen scraping the site. The license is quite reasonable and quick to get for academic use.

To do their analysis, most groups I've seen grind through the XML with scripts to extract/preprocess what they need.

Recently I've done some messing about with Apache Lucene, indexing the data for lightning-fast searching and extraction. Lucene is very fast at text searching (see, for example, http://www.ogic.ca/mltrends/).
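
Lucene itself is a Java library; purely as a Python stand-in for the same index-then-query pattern, here is roughly what it looks like with the Whoosh library (the field names, sample document and query are made up):

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

schema = Schema(pmid=ID(stored=True, unique=True),
                title=TEXT(stored=True),
                abstract=TEXT)

os.makedirs("pubmed_index", exist_ok=True)
ix = index.create_in("pubmed_index", schema)

writer = ix.writer()
writer.add_document(pmid="12345678",
                    title="A hypothetical citation",
                    abstract="Glutathione S-transferase activity in E. coli ...")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("abstract", ix.schema).parse("glutathione AND transferase")
    for hit in searcher.search(query, limit=10):
        print(hit["pmid"], hit["title"])
```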

1

I've heard of people using BioRuby specifically to do this.
