Anyone know how to extract full-text articles from PubMed Central in some structured format? The page on the PMC OAI Service seems relevant, but I can't figure out how to actually use it...
You can download everything from the PMC FTP site; however, it is organized in a somewhat inconvenient manner. I have a mirror of the PMC Open Access subset, which is automatically synchronized with the main FTP site every week. You can find my mirror at http://pmc.jensenlab.org/
The mirror is organized differently than the main site. This enables you to access publications by PMCID or PMID and choose to download either the XML text files (.nxml) or the complete article archives (.tar.gz) using simple URLs:
You can use wget to download the entire collection of the XML text files (.nxml) or the complete article archives (.tar.gz):
wget --accept=nxml --mirror http://pmc.jensenlab.org/pmcid
wget --accept=tar.gz --mirror http://pmc.jensenlab.org/pmcid
Maybe getting them via FTP would be more straightforward for the Open Access Subset?
I actually have a set of Python functions for retrieving them if you know the PMID or have a search term.
It's not broken out into its own module, but you can see the code here: http://github.com/JudoWill/pyMutF/blob/master/DistAnnot/PubmedUtils.py
If you have the PMCIDs then you can use GetXMLfromList(ID_LIST, db = 'pmc')
It creates a semaphore to keep you under the NCBI request limit.
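For anyone who just wants the general idea without pulling in the whole module, here is a rough standalone sketch. It is not the code from the linked repository: the names Throttle, efetch_url and fetch_xml are made up for illustration, it goes through the public E-utilities EFetch endpoint, and the 0.34 second spacing assumes the limit of about three requests per second that applies without an API key. Check the current NCBI usage guidelines before relying on that number.

```python
import threading
import time
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


class Throttle:
    """Space calls at least `min_interval` seconds apart, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last = None  # time of the previous call, if any

    def wait(self):
        with self._lock:
            if self._last is not None:
                delay = self.min_interval - (time.monotonic() - self._last)
                if delay > 0:
                    time.sleep(delay)
            self._last = time.monotonic()


def strip_pmc_prefix(pmcid):
    """Turn 'PMC13900' or 13900 into the bare numeric string '13900'."""
    text = str(pmcid).strip()
    return text[3:] if text.upper().startswith("PMC") else text


def efetch_url(pmcids, db="pmc"):
    """Build one EFetch URL for a batch of IDs."""
    ids = ",".join(strip_pmc_prefix(i) for i in pmcids)
    return EFETCH_URL + "?" + urllib.parse.urlencode({"db": db, "id": ids}, safe=",")


def fetch_xml(pmcids, throttle, db="pmc"):
    """Wait for the throttle, then download the XML for the batch."""
    throttle.wait()
    with urllib.request.urlopen(efetch_url(pmcids, db)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    throttle = Throttle(0.34)  # assumed: about three requests per second
    print(fetch_xml(["PMC13900"], throttle)[:200])
```

Sharing one Throttle object between all callers is what keeps the overall request rate down, which is the same job the semaphore does in the original functions.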
Hope that helps,
Will
I agree this page is useless. I believe it assumes you are familiar with the OAI-PMH (The Open Archives Initiative Protocol for Metadata Harvesting). Putting pieces together from this resource, some limited documentation about the PMC-OAI at the UKPMC website, and a blog post from Chemspider that provides some examples of how to call the PMC-OAI service, I've been able to summarize the following example calls:
Get PMC records in the OA subset: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open
Get PMC identifiers in the OA subset: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListIdentifiers&metadataPrefix=pmc&set=pmc-open
Get an individual record using a PMC ID: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:13900
Get records from a specific date: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&from=2007-10-01&metadataPrefix=pmc
Get records from a range of dates: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open&from=2007-11-17&until=2007-11-18
My understanding is that the list of actions that can be applied to the PMC-OAI service should be the same as the general OAI-PMH service: http://www.openarchives.org/OAI/openarchivesprotocol.html#ProtocolMessages
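To make the calls above scriptable, here is a minimal Python sketch. It assumes the base URL from the examples is still the right endpoint and that the service follows the standard OAI-PMH paging rule, where a long list comes back in pages and each page carries a resumptionToken to request the next one. The function names (oai_url, resumption_token, harvest) are my own, not part of any PMC library.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"  # taken from the examples above
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # namespace defined by the OAI-PMH spec


def oai_url(verb, **params):
    """Build a request URL; pass from_= because 'from' is a Python keyword."""
    query = {"verb": verb}
    for key, value in params.items():
        query[key.rstrip("_")] = value
    return OAI_BASE + "?" + urllib.parse.urlencode(query, safe=":")


def resumption_token(response_xml):
    """Return the token for the next page, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    node = root.find(f".//{OAI_NS}resumptionToken")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def harvest(verb="ListRecords", **params):
    """Yield each raw XML page of a list request, following resumption tokens."""
    url = oai_url(verb, **params)
    while url:
        with urllib.request.urlopen(url) as resp:
            page = resp.read()
        yield page
        token = resumption_token(page)
        # Per the protocol, a follow-up request carries only verb + resumptionToken.
        url = oai_url(verb, resumptionToken=token) if token else None


if __name__ == "__main__":
    pages = harvest(metadataPrefix="pmc", set="pmc-open",
                    from_="2007-11-17", until="2007-11-18")
    print(next(pages)[:300])
```

Note that harvest is a generator, so nothing is downloaded until you start iterating, and you can stop after a few pages while testing.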
From the PMC "help"-desk: Q: "Is there any documentation at NCBI beyond this page: http://www.ncbi.nlm.nih.gov/pmc/tools/oai/". A: "No, there is no other documentation. You need to read it on http://www.openarchives.org/ if you are not familiar with OAI."
If JSON or XML is what you're looking for, look at the BioC format, where both JSON and XML are available for download:
https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/
The corpus contains the PubMed Central (PMC) Open Access articles and is described in an accompanying publication (see link). The API accepts both PubMed IDs and PMC IDs.
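As a rough illustration of working with the JSON flavour, here is a small Python sketch. Two things in it are assumptions to verify against the page linked above: the RESTful URL pattern in BIOC_BASE and bioc_url, and the layout of the JSON (documents containing passages, each with a text field and an infons dictionary holding a section_type). The function names are hypothetical.

```python
import json
import urllib.request

# Assumed URL pattern; confirm it on the BioC-PMC API page before use.
BIOC_BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"


def bioc_url(article_id, fmt="json", encoding="unicode"):
    """Build a request URL for one article, given a PMC ID or PubMed ID."""
    return f"{BIOC_BASE}/BioC_{fmt}/{article_id}/{encoding}"


def passages(bioc):
    """Yield (section_type, text) pairs from a parsed BioC JSON response.

    Accepts either a single collection or a list of collections, since
    the top-level shape is an assumption here.
    """
    collections = bioc if isinstance(bioc, list) else [bioc]
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                section = passage.get("infons", {}).get("section_type", "")
                yield section, passage.get("text", "")


def fetch_passages(article_id):
    """Download one article as BioC JSON and return its passages as a list."""
    with urllib.request.urlopen(bioc_url(article_id)) as resp:
        return list(passages(json.load(resp)))


if __name__ == "__main__":
    for section, text in fetch_passages("PMC13900"):
        print(section, text[:80], sep="\t")
```

Only articles in the Open Access subset are available this way, so expect an error or an empty response for anything outside it.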
Thanks all for the fantastic ideas and answers. All get up-votes, but Lars wins for solving my problem in a better way than I asked my question...