Question

PMID multiple pdfs download using R

22 months ago

Confused_human ▴ 30

Hello everyone,

I ran a MeSH search in PubMed for certain keywords and obtained the PMIDs of the results. Now I want to download all the PDFs for those PMIDs using R.

I have used this R script:

# Install and load necessary packages
install.packages(c("rentrez", "rcrossref", "xml2"))
library(rentrez)
library(rcrossref)
library(xml2)

# Function to get DOI from PubMed ID
get_doi <- function(pmid) {
  query <- entrez_fetch(db = "pubmed", id = pmid, rettype = "medline", parsed = TRUE)
  doi <- xpathSApply(query, "//PubmedArticleSet/PubmedArticle/PubmedData/ArticleIdList/ArticleId[@IdType='doi']", xmlValue)
  if (length(doi) > 0) {
    return(doi)
  } else {
    return(NULL)
  }
}

# Function to download PDF to a local folder
download_pdf <- function(doi, folder) {
  pdf_link <- cr_works(doi = doi)$data$links[
    cr_works(doi = doi)$data$content.type == "application/pdf"
  ]

  if (!is.null(pdf_link)) {
    filename <- paste0(folder, "/", doi_to_filename(doi, ".pdf"))
    download.file(pdf_link$url, destfile = filename, mode = "wb")
  }
}

# Main function to download PDFs
download_pdfs <- function(pmids, folder = "pdf_folder") {
  # Create folder if it doesn't exist
  if (!file.exists(folder)) {
    dir.create(folder)
  }

  for (pmid in pmids) {
    doi <- get_doi(pmid)

    if (!is.null(doi)) {
      download_pdf(doi, folder)
    }
  }
}

# Download PDFs for the given PubMed IDs
download_pdfs(pmids)

But I am getting an empty folder and this error:

download_pdfs(pmids)
Error in entrez_fetch(db = "pubmed", id = pmid, rettype = "medline", parsed = TRUE) :
At present, entrez_fetch can only parse XML records, got medline

Please help me with this, or let me know if it can be done some other way.

Thank you

R PubMed • 1.3k views

updated 22 months ago by Ram 45k • written 22 months ago by Confused_human ▴ 30

I have used this script:

#!/usr/bin/env bash

Link="http://www.ncbi.nlm.nih.gov/pubmed/"
PMCLink="http://www.ncbi.nlm.nih.gov/pmc/articles/"
ID=(12137684
15108869
10795469
12449267
12843113
10795470
12481673
12091293
11916176
11874891
11696844
10952713
10968596
12185014
12391006
12741897
11992146
12047535
12727374
12648317
12618674
12734296
10684911
11760852
11298126
12010975
12598511)

for f in "${ID[@]}";
do
  # Look up the first PMC ID mentioned on the PubMed page for this PMID
  PMCID=$(wget --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
    -l1 --no-parent "${Link}${f}" -O - 2>/dev/null | grep -Po 'PMC\d+' | head -n 1)
  if [ -n "$PMCID" ]; then
    wget --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
      -l1 --no-parent -A.pdf "${PMCLink}${PMCID}/pdf/" -O "${f}.pdf" 2>/dev/null
  else
    echo "No PMC ID for $f"
  fi

done

It downloaded 14 PDFs, and for the remaining ones I got this error:

No PMC ID for 10795469
No PMC ID for 12449267
No PMC ID for 10795470
No PMC ID for 12481673
No PMC ID for 11696844
No PMC ID for 12185014
No PMC ID for 12741897
No PMC ID for 11992146
No PMC ID for 12047535
No PMC ID for 12727374
No PMC ID for 12618674
No PMC ID for 10684911
No PMC ID for 11760852

How can I resolve this issue?

updated 22 months ago by GenoMax 153k • written 22 months ago by Confused_human ▴ 30

The message is clear:

No PMC ID for 11760852

Those PMID numbers are not PMC records, so they don't have PMC IDs. That's probably why you can't download them.
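
If it helps, the PMIDs can be sorted into "has a PMC record" and "no PMC record" before attempting any download, so the second group can be followed up separately. What follows is only a minimal sketch, not a tested pipeline: it reuses the PubMed URL and user agent from the script above, the function names and the output files `with_pmc.tsv` and `without_pmc.txt` are made up for illustration, and taking the first `PMC` identifier found on the page is a heuristic that could in principle pick up an ID belonging to a different article.

```shell
#!/usr/bin/env bash
# Sketch: classify PMIDs by whether their PubMed page mentions a PMC ID.
# Function and file names here are hypothetical, chosen for illustration.

UA="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"

# Print the first PMC ID found on stdin, or nothing if there is none.
first_pmcid() {
  grep -Eo 'PMC[0-9]+' | head -n 1 || true
}

# Fetch the PubMed page for one PMID (same URL scheme as the script above).
fetch_pubmed_page() {
  wget --user-agent="$UA" -q -O - "http://www.ncbi.nlm.nih.gov/pubmed/$1"
}

# Write "PMID<TAB>PMCID" to with_pmc.tsv, and bare PMIDs to without_pmc.txt.
classify_pmids() {
  local pmid pmcid
  : > with_pmc.tsv
  : > without_pmc.txt
  for pmid in "$@"; do
    pmcid=$(fetch_pubmed_page "$pmid" | first_pmcid)
    if [ -n "$pmcid" ]; then
      printf '%s\t%s\n' "$pmid" "$pmcid" >> with_pmc.tsv
    else
      printf '%s\n' "$pmid" >> without_pmc.txt
    fi
    sleep 1  # pause between requests to avoid hammering the server
  done
}

# Run only when PMIDs are passed on the command line.
if [ "$#" -gt 0 ]; then
  classify_pmids "$@"
fi
```

Saved under a name of your choice (say `classify_pmids.sh`), it could be run as `bash classify_pmids.sh 12137684 10795469`. The PMIDs that end up in `without_pmc.txt` are the ones that cannot be fetched from PMC this way.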

22 months ago by Mensur Dlakic ★ 29k

That is a PMID, but there is no free full text available directly from NCBI. The OP needs to visit the publisher's site; it looks like the free full text is available there.

22 months ago by GenoMax 153k