What Taverna 2 workflow can I use to run a query against the Open Access subset of PubMed and return up to 1500 papers for further processing, including text mining? If no complete workflow is available, I am interested in workflows that do similar things, particularly if they are hosted on MyExperiment.org. I am happy if it involves new plugins. I'm also happy with solutions that use XSLT and BeanShell scripts.
Update: the bounty goes to the most functional (open source) Taverna BeanShell script.
I do not know of a Taverna workflow that does this already, but you can easily retrieve an XML document listing the identifiers in the PMC open-access set (pmc-open) via:
http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListIdentifiers&metadataPrefix=pmc&set=pmc-open
From that XML document you can then fetch the individual records with further queries, such as: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:17827&metadataPrefix=pmc
More about PMC-OAI at: http://www.ncbi.nlm.nih.gov/pmc/about/oai.html
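As a rough illustration of the two steps above, here is a minimal Python sketch that pages through ListIdentifiers and builds the matching GetRecord URLs. It is not a Taverna workflow, and the same logic would need porting to BeanShell to run inside one. The function names, the 1500 cap, and the skipping of deleted records are my own assumptions; the base URL is the one quoted above and may have moved since. The element names (header, identifier, resumptionToken) and the namespace come from the OAI-PMH 2.0 protocol.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Endpoint as quoted in this answer; it may have changed since (assumption).
BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
MAX_PAPERS = 1500  # cap taken from the question


def list_identifiers_url(resumption_token=None):
    """URL for the first ListIdentifiers page, or for a follow-up page."""
    if resumption_token:
        # OAI-PMH: a resumptionToken must be the only argument besides the verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": "pmc", "set": "pmc-open"}
    return BASE_URL + "?" + urllib.parse.urlencode(params, safe=":")


def get_record_url(identifier):
    """URL that fetches one full record in the 'pmc' metadata format."""
    params = {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": "pmc"}
    return BASE_URL + "?" + urllib.parse.urlencode(params, safe=":")


def parse_page(xml_data):
    """Return (identifiers, resumption_token_or_None) from one response page."""
    root = ET.fromstring(xml_data)
    ids = []
    for header in root.findall(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # skip records the repository marks as deleted
        ident = header.find("oai:identifier", OAI_NS)
        if ident is not None and ident.text:
            ids.append(ident.text.strip())
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return ids, token


def http_fetch(url):
    """Download a URL and return the raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def collect_identifiers(fetch=http_fetch, limit=MAX_PAPERS):
    """Follow resumption tokens until 'limit' identifiers are collected."""
    ids, token = [], None
    while len(ids) < limit:
        page_ids, token = parse_page(fetch(list_identifiers_url(token)))
        ids.extend(page_ids)
        if not token or not page_ids:
            break  # last page reached
    return ids[:limit]


# Example use (performs network requests, so it is left commented out):
# for ident in collect_identifiers():
#     print(get_record_url(ident))
```

Passing the download function in as `fetch` keeps the paging logic testable offline. Note that this takes the first 1500 identifiers of the whole set rather than the results of a keyword query, so any filtering by topic would still have to happen on the fetched records.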
Hope this helps a bit, even though I cannot answer your original question.
Would you be happy to use UK PubMed Central (UKPMC) instead of PubMed? http://ukpmc.ac.uk/ UKPMC indexes only open-access documents and, unlike PubMed, it also indexes PhD theses published in the UK. I am not sure I will have time to prepare a Taverna workflow for this, but if you look at the page it should not be difficult.
Absolutely, either is fine!