I am asking this purely out of curiosity (I have no plans to actually do this), but I wonder how the various groups who analyse PubMed abstracts for co-citation or gene interaction (or anything else) actually access the data?
Do they run a lot of batch PubMed searches and parse the results, do they access the PubMed database via some kind of web-service API, or is there a way of downloading the entire corpus and doing the analyses locally?
Assuming that there are several possibilities, what would be the preferred way?
With regard to databases, it's very easy to parse PubMed XML into a hash and just drop it into a document-oriented database, such as MongoDB. This is the approach I used for my PubMed retractions project: https://github.com/neilfws/PubMed.
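As a rough illustration of that approach (not code from the linked project), here is a Python sketch that flattens one PubMed-style XML record into a dict using only the standard library. The sample record and field names are invented for the example; the MongoDB step via `pymongo` is shown commented out, since it assumes a running server.

```python
import xml.etree.ElementTree as ET

# A minimal PubMed-style XML fragment (illustrative, not a real record).
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>An example article</ArticleTitle>
      <Journal><Title>Journal of Examples</Title></Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""

def article_to_dict(xml_text):
    """Flatten the fields of interest from one PubmedArticle into a dict."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext(".//PMID"),
        "title": root.findtext(".//ArticleTitle"),
        "journal": root.findtext(".//Journal/Title"),
    }

doc = article_to_dict(SAMPLE)

# With a MongoDB server running, the dict drops straight in, e.g.:
# from pymongo import MongoClient
# MongoClient().pubmed.articles.insert_one(doc)
```

Because the record is already a plain dict, no schema design is needed up front, which is the main attraction of a document store here.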
Great, thanks for the thoughtful answer. I learned a lot.