Web Tool That Converts A Pubmed Query Into A Wordle Of The Abstracts
14.3 years ago
Andrew Su 4.9k

I would like to use a tool that will take any valid PubMed query, extract the abstracts, and create a Wordle diagram. Ideally it would be a one-click operation that automatically posts to the Wordle site, but simply outputting text that can be copied and pasted would be okay too.

Does such a tool exist? If not, what would be your strategy for implementing it? In that case I will offer a bounty of 150 points to anyone who implements it (awarded to the best working solution if several are offered). Ideally the solution would be hosted on Google App Engine, but code in a public code repository would be acceptable.

EDIT: Sorry, got impatient and implemented it myself at http://pubmed2wordle.appspot.com/. The winning answer goes to Lars for the outline and especially for pointing out the Wordle advanced link. I also put the code on Google Code, in case anyone else wants to work on any of the potential enhancements.

pubmed • 11k views
14.3 years ago

If you can live with the Wordle being based on only the first 200 abstracts (for example) returned by the PubMed query, it should not be so difficult to do. What I would do is the following:

  1. Use the NCBI eutils ESearch method to retrieve a list of PMIDs that match your query.
  2. Use the NCBI eutils EFetch method to retrieve the abstracts for this set of PMIDs.
  3. Concatenate the abstracts and use an HTTP POST request to submit it to Wordle using its advanced interface.

This solution could be implemented on Google App Engine without too much trouble.
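A minimal Python sketch of steps 1 and 2 above, assuming the standard E-utilities `esearch.fcgi` and `efetch.fcgi` endpoints and their XML output. The helper names (`esearch_url`, `parse_abstracts`, `abstracts_for_query`, and so on) are made up for illustration, and this is not the code behind pubmed2wordle. Step 3, the POST to Wordle, is left out because the form fields of the advanced interface are not described in this thread.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query, retmax=200):
    """Build the ESearch URL that returns up to `retmax` PMIDs for a query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return EUTILS + "/esearch.fcgi?" + urllib.parse.urlencode(params)


def efetch_url(pmids):
    """Build the EFetch URL that returns the records for a list of PMIDs as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def parse_pmids(esearch_xml):
    """Extract the PMIDs from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [elem.text for elem in root.findall("./IdList/Id")]


def parse_abstracts(efetch_xml):
    """Extract the text of every AbstractText element, including nested markup."""
    root = ET.fromstring(efetch_xml)
    return ["".join(elem.itertext()).strip() for elem in root.iter("AbstractText")]


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def abstracts_for_query(query, retmax=200):
    """Steps 1 and 2: search, fetch, and concatenate the abstracts."""
    pmids = parse_pmids(http_get(esearch_url(query, retmax)))
    if not pmids:
        return ""
    return "\n".join(parse_abstracts(http_get(efetch_url(pmids))))
```

Calling `abstracts_for_query("some query")` would give one block of text that can be pasted into Wordle by hand, which covers the copy-and-paste fallback mentioned in the question.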

If you want to make a Wordle cloud that is based on all PubMed abstracts that match your query, it is much harder, since you cannot count on being able to retrieve that many abstracts via NCBI eutils, and since the total amount of text could be too much to submit to Wordle. What I would do in that case would be:

  1. Again use NCBI eutils ESearch to retrieve the (possibly long) list of PMIDs.
  2. Retrieve the abstracts from a local, indexed copy of Medline.
  3. Calculate all the word frequencies myself in a hash table.
  4. Filter down the set of counts to include only the N most frequent words.
  5. Submit the resulting counts to Wordle using its advanced interface.

This solution would obviously be far more work to implement and would require that you maintain a local mirror of Medline. Due to the amount of data involved, I don't see this solution running on Google App Engine.
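Steps 3 and 4 of this second outline are small on their own. Here is a rough Python sketch, assuming the abstracts are already available as a list of strings. The function names and the tokenizing regex are illustrative choices, and the `word:count` output format is my assumption about what the Wordle advanced form accepts, so check it against the form before relying on it.

```python
import re
from collections import Counter


def top_word_counts(abstracts, n=150):
    """Count words across all abstracts and keep the n most frequent."""
    counts = Counter()
    for text in abstracts:
        # Lowercase and split into simple word tokens; Counter is the hash table.
        counts.update(re.findall(r"[a-z][a-z0-9'-]*", text.lower()))
    return counts.most_common(n)


def format_counts(pairs):
    """Render (word, count) pairs one per line as word:count (assumed format)."""
    return "\n".join(f"{word}:{count}" for word, count in pairs)
```

Because only the N retained counts are submitted, the size of the upload no longer grows with the number of abstracts.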


It also took me a little while to spot the advanced link. Very nice implementation - you might be able to work around the GAE timeout issues by downloading abstracts in a few chunks. GAE has a timeout of 10 seconds on HTTP requests, so chopping the data transfer into several smaller requests should work.
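The chunking idea could look roughly like this in Python. It is only a sketch: `fetch_batch` stands for whatever function downloads the abstracts for one list of PMIDs, and both that name and the default batch size of 50 are hypothetical, to be tuned against the timeout.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(pmids, fetch_batch, size=50):
    """Call fetch_batch once per chunk so each HTTP request stays small."""
    abstracts = []
    for batch in chunked(pmids, size):
        abstracts.extend(fetch_batch(batch))
    return abstracts
```

Each call to `fetch_batch` is then a separate, shorter HTTP request.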


I'm quite confident. You are right that the request to GAE itself also times out, but that limit is 30 seconds, whereas HTTP requests made from within GAE time out after just 10 seconds. So you should be able to handle more abstracts by cutting the download into chunks, but only by a factor of 2-3.


Lars, you were indeed right. http://pubmed2wordle.appspot.com/ looks like it handles up to 500 pubs quite reliably now. Probably it could be increased a bit more by tuning the number vs size of requests, but 500 seems pretty good to me. Anyway, upvotes to you all around!


The advanced link at Wordle.net -- hidden in plain sight! And to think I was just digging around in the JS source trying to figure out how I might hack it...


Okay, every once in a while I have to convince myself that I can program. So check it out: http://pubmed2wordle.appspot.com. (It doesn't work for huge queries, due to GAE timeout issues I think; executing on my localhost works fine...)


Incidentally, I tried to make one of the example queries as a hat tip to Lars, but sadly (or not) he's got too many publications...


Hmm, how confident are you of that? I assumed GAE times out on the parent request (from you to GAE), so splitting up the child requests (from GAE to eutils) wouldn't help. But my assumption could be wrong...


Thank you - I'll go play with your tool myself now :)

14.3 years ago

As an alternative to the advanced form, you can download a command-line version of the Wordle engine from IBM, and then generate the tag-clouds on your own server.

You can also supply a file of additional stopwords that you don't want to appear in the pictures.
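If you compute the word counts yourself before handing them to any renderer, the same stopword idea is easy to apply up front. A small Python sketch follows, assuming a stopword file with one word per line and `#` for comment lines. That file layout and the function names are my own choices for illustration, not the format the IBM tool expects.

```python
def load_stopwords(lines):
    """Build a set of lowercase stopwords from lines, skipping blanks and # comments."""
    return {
        line.strip().lower()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def drop_stopwords(counts, stopwords):
    """Return a copy of a word -> count mapping without the stopwords."""
    return {word: n for word, n in counts.items() if word.lower() not in stopwords}
```

For example, `drop_stopwords(counts, load_stopwords(open("stop.txt")))` would remove the listed words before the cloud is drawn, where `stop.txt` is a placeholder file name.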


Wow, fantastic! Interesting that there are no reciprocal links between the two sites. Anyway, I will definitely give this a try!

14.3 years ago
Treylathe ▴ 950

Well, this isn't exactly what you are asking for, but perhaps it is a step in the right direction? XplorMed takes a PubMed query and gives you the main associations between the words in groups of abstracts. So you'd get the words most used and associated across the abstracts in the query. Perhaps it's one step from there to creating a Wordle?

Edited to add: though, now that I look at it, I'm not sure that will work right for your request, or maybe it will take a couple of steps? :D


Yeah, not exactly as simply interpretable as a wordle, but XplorMed is actually pretty cool. Thanks for the link...
