Full Text Retrieval From PubMed Central
5
9
14.3 years ago
Andrew Su 4.9k

Anyone know how to extract full-text articles from PubMed Central in some structured format? The page on the PMC OAI Service seems relevant, but I can't figure out how to actually use it...

pubmed api literature text • 15k views
ADD COMMENT
0

Thanks all for the fantastic ideas and answers. All get up-votes, but Lars wins for solving my problem in a better way than I asked my question...

ADD REPLY
12
14.3 years ago

You can download everything from the PMC FTP site; however, it is organized in a somewhat inconvenient manner. I have a mirror of the PMC Open Access subset, which is automatically synchronized with the main FTP site on a weekly basis. You can find my mirror at http://pmc.jensenlab.org/

The mirror is organized differently than the main site. This enables you to access publications by PMCID or PMID and choose to download either the XML text files (.nxml) or the complete article archives (.tar.gz) using simple URLs:

You can use wget to download the entire collection of the XML text files (.nxml) or the complete article archives (.tar.gz):

wget --accept=nxml --mirror http://pmc.jensenlab.org/pmcid

wget --accept=tar.gz --mirror http://pmc.jensenlab.org/pmcid

ADD COMMENT
0

The above site is broken. Is there a problem, or has the site been taken down?

ADD REPLY
0

No, just having problems with Apache crashing. I just restarted it, but it keeps happening :-/

ADD REPLY
4
14.3 years ago
User 59 13k

Maybe getting them via FTP would be more straightforward for the Open Access Subset?

http://www.ncbi.nlm.nih.gov/pmc/about/ftp.html

ADD COMMENT
1

But if you're just after the full text, go further down the page, where there are links to the four tgz archives that contain just the XML for data mining.

ADD REPLY
0

Ah, if only one could do a parallel GET. Can you?

ADD REPLY
3
14.3 years ago
Will 4.6k

I actually have a set of Python functions for retrieving them if you know the PMID or have a search term.

It's not broken out into its own function, but you can see the code here: http://github.com/JudoWill/pyMutF/blob/master/DistAnnot/PubmedUtils.py

If you have the PMCIDs, then you can use GetXMLfromList(ID_LIST, db = 'pmc'). It creates a semaphore to keep you under the NCBI request limit.

Hope that helps,

Will
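For readers who cannot reach the linked code, here is a minimal sketch of the same idea: build an NCBI E-utilities efetch URL for a batch of PMC IDs and space out requests. It is not Will's implementation. The names efetch_url and Throttle are hypothetical, a lock-based throttle is swapped in for the semaphore he mentions, and the 0.34 second default assumes NCBI's limit of roughly three requests per second without an API key.

```python
import threading
import time
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="pmc"):
    """Build an efetch URL for a batch of IDs (hypothetical helper)."""
    query = {"db": db, "id": ",".join(str(i) for i in ids)}
    # Keep commas literal so the URL stays readable.
    return EFETCH + "?" + urlencode(query, safe=",")


class Throttle:
    """Space calls at least `min_interval` seconds apart, across threads."""

    def __init__(self, min_interval=0.34, clock=time.monotonic, sleep=time.sleep):
        # 0.34 s is an assumption based on about 3 requests/second.
        self._lock = threading.Lock()
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0

    def wait(self):
        """Block until the next request is allowed."""
        with self._lock:
            now = self._clock()
            delay = self._next_ok - now
            if delay > 0:
                self._sleep(delay)
                now += delay
            self._next_ok = now + self._min_interval
```

Call throttle.wait() before each download (for example with urllib.request.urlopen(efetch_url(id_batch))) so that several worker threads share one request budget.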

ADD COMMENT
0

You can also use SearchPUBMED to run an arbitrary query and get back a list of IDs.

ADD REPLY
3
13.1 years ago

I agree this page is useless. I believe it assumes you are familiar with the OAI-PMH (The Open Archives Initiative Protocol for Metadata Harvesting). Putting pieces together from this resource, some limited documentation about the PMC-OAI at the UKPMC website, and a blog post from Chemspider that provides some examples of how to call the PMC-OAI service, I've been able to summarize the following example calls:

Get PMC records in the OA subset: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open

Get PMC identifiers in the OA subset: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListIdentifiers&metadataPrefix=pmc&set=pmc-open

Get an individual record using a PMC ID: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:13900

Get records from a specific date onward: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&from=2007-10-01&metadataPrefix=pmc

Get records from a range of dates: http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open&from=2007-11-17&until=2007-11-18

My understanding is that the list of actions that can be applied to the PMC-OAI service should be the same as the general OAI-PMH service: http://www.openarchives.org/OAI/openarchivesprotocol.html#ProtocolMessages
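The example calls above can be generated programmatically. The sketch below builds request URLs against the base address quoted in this post (which may have moved since) and extracts the resumptionToken that OAI-PMH list responses use for paging. The helper names oai_url and resumption_token are hypothetical, and nothing here is specific to PMC beyond the base URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL as quoted in the post; treat it as an assumption today.
OAI_BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def oai_url(verb, **params):
    """Build an OAI-PMH request URL.

    'from' is a Python keyword, so pass it as from_; the trailing
    underscore is stripped before the query string is built.
    """
    query = {"verb": verb}
    for key, value in params.items():
        query[key.rstrip("_")] = value
    # Keep ':' literal so identifiers like oai:host:id stay readable.
    return OAI_BASE + "?" + urlencode(query, safe=":")


def resumption_token(response_xml):
    """Return the paging token, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    node = root.find(".//oai:resumptionToken", OAI_NS)
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()
```

To page through a long list, request oai_url("ListRecords", metadataPrefix="pmc", set="pmc-open") first, then keep requesting oai_url("ListRecords", resumptionToken=token) until resumption_token returns None. Per the OAI-PMH specification, a resumed request carries only the verb and the token.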

ADD COMMENT
0

From the PMC "help"-desk: Q: "Is there any documentation at NCBI beyond this page: http://www.ncbi.nlm.nih.gov/pmc/tools/oai/". A: "No, there is no other documentation. You need to read it on http://www.openarchives.org/ if you are not familiar with OAI."

ADD REPLY
0
24 months ago

If JSON or XML is what you're looking for, look at the BioC format, where both JSON and XML are available for download: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/

The corpus contains the PubMed Central (PMC) Open Access articles and is described in an accompanying publication (see link). The API accepts both PubMed IDs and PMC IDs.
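As a rough sketch of how the BioC output might be consumed: the URL pattern and the section_type key below are assumptions recalled from the linked API page, not something stated in this thread, so check them against that page before relying on them. The helper names bioc_url and passages_by_section are hypothetical. The documents/passages/text/infons layout follows the general BioC JSON structure.

```python
# Assumed REST endpoint; verify against the BioC-PMC API page linked above.
BIOC_BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"


def bioc_url(article_id, fmt="json", encoding="unicode"):
    """Build a BioC request URL for a PubMed ID or PMC ID (assumed pattern)."""
    return f"{BIOC_BASE}/BioC_{fmt}/{article_id}/{encoding}"


def passages_by_section(collection):
    """Group passage text by section from a parsed BioC JSON collection.

    `collection` is a dict with a "documents" list, each document holding
    a "passages" list. The "section_type" infon is an assumption; passages
    without it are filed under "UNKNOWN".
    """
    sections = {}
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            key = passage.get("infons", {}).get("section_type", "UNKNOWN")
            sections.setdefault(key, []).append(passage.get("text", ""))
    return sections
```

Parse the downloaded response with json.loads and pass the resulting collection dict to passages_by_section to pull out, say, only the abstract or methods text.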

ADD COMMENT
