Question

How to download lakhs of abstracts in batch for a text-mining study

4.3 years ago

rohitsatyam102 ▴ 920

Hi everyone!!

I am new to text mining in bioinformatics. I have been learning to use the pubmed.mineR package, which works on a multi-abstract text file. I have read a lot about the RISmed and easyPubMed R packages, but both seem to have drawbacks when their output has to be fed into pubmed.mineR.

pubmed.mineR reads a text file containing multiple abstracts in PubMed's "abstract" format (one of several PubMed formats, XML being another). easyPubMed can produce this format, while RISmed apparently cannot. On the other hand, easyPubMed retrieves only 5,000 abstracts at a time, while RISmed has no such upper limit.

I picked a cancer type at random, "oral cancer", and it returned more than 1 lakh (100,000) PMIDs. RISmed retrieved all the abstracts successfully (1.1 GB of data), but its output is incompatible with pubmed.mineR. easyPubMed produces compatible output, but it is capped by the retrieval limit because it uses the PubMed API in the backend.

Is there a command-line way to retrieve all ~1 lakh abstracts in "abstract" format from PubMed? Since the June 2020 update, the website itself limits downloads to 10,000 abstracts at a time. Here is the short code I used with easyPubMed:

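library("easyPubMed")
search_topic <- 'oral cancer'
# Run the search; the result includes the total number of hits
my_entrez_id <- get_pubmed_ids(search_topic)
my_entrez_id$Count
# Open the help page for the fetch function
?fetch_pubmed_data
# Ask for every record in "abstract" format; retmax here is far above the 5,000-per-call limit
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, retmax = 142000, format = "abstract")
# Save the returned text to disk for pubmed.mineR
writeLines(my_abstracts_txt, con = "oral_cancer.txt")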
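A per-call cap like the 5,000-record limit described above is usually handled by fetching in batches, each starting at a different offset. The sketch below only works out the offsets; `batch_offsets` is a hypothetical helper, and it assumes your easyPubMed version accepts a start offset (such as a `retstart` argument to `fetch_pubmed_data()`), which should be checked against its help page.

```shell
#!/bin/sh
# Print the start offset of every batch needed to cover `total` records
# when one request returns at most `batch` records.
# batch_offsets is an illustrative helper, not part of easyPubMed or EDirect.
batch_offsets() {
    total=$1
    batch=$2
    offset=0
    while [ "$offset" -lt "$total" ]; do
        echo "$offset"
        offset=$((offset + batch))
    done
}

# 142000 records in batches of 5000: offsets 0, 5000, ..., 140000
batch_offsets 142000 5000 | wc -l   # number of requests needed (29)
```

Each printed offset would correspond to one fetch call, with the results appended to the same output file.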

R gene • 2.0k views

ADD COMMENT • link 4.3 years ago by rohitsatyam102 ▴ 920

4.3 years ago

rohitsatyam102 ▴ 920

Sharing some initial sources that I referred to, so that everything stays in one place: Sources Referred

ADD COMMENT • link 4.3 years ago by rohitsatyam102 ▴ 920

score 3 · Accepted Answer · 2020-09-01

4.3 years ago

GenoMax 148k

You can use Entrez Direct (EDirect); the example output shown here is truncated. Be sure to sign up for an NCBI API key and use it (NCBI_API_KEY).

$ esearch -db pubmed -query "Oral cancer" | efetch -format abstract >> some_file

Am J Otolaryngol. 2020 Aug 16;41(6):102685. doi: 10.1016/j.amjoto.2020.102685. [Epub ahead of print]

HPV vaccination practices and attitudes among primary care physicians since FDA approval to age 45.

Petrusek J(1), Thorpe E(2), Britt CJ(2).

Author information: (1)Department of Otolaryngology, Loyola University Medical Center, Maywood, IL 60153, United States of America. Electronic address: Jeffrey.Petrusek@lumc.edu. (2)Department of Otolaryngology, Loyola University Medical Center, Maywood, IL 60153, United States of America.

PURPOSE: The aim of this study was to examine HPV vaccine administration practices since FDA approval to age 45 and assess knowledge regarding HPV and its association with oropharyngeal cancer. METHODS: A survey was distributed to 86 primary care physicians at Loyola University Medical Center. The survey contained 11 questions designed to capture HPV vaccination practices, knowledge of FDA approval, and barriers to vaccination. RESULTS: 46 (53%) physicians completed the survey and 45 responses were included. Among respondents who treat males ages 9-21 and females ages 9-26, the

ADD COMMENT • link 4.3 years ago by GenoMax 148k

Wow, this is such an effortless way of doing it. However, I noticed that the numbering of the abstracts starts again from 1 after every 100 abstracts. I hope that won't cause any trouble during the analysis!
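If the restarting numbers ever do get in the way, the records can be renumbered after download. This is only a rough sketch: `renumber_records` is a hypothetical helper, and it assumes that every record begins with a line of the form "N. " directly after a blank line (or at the very top of the file). An abstract paragraph that happens to start the same way would be renumbered too, so spot-check the result.

```shell
#!/bin/sh
# Renumber record headers consecutively in an "abstract"-format text file.
# Assumption: a record header is a line starting with "<digits>. " that is
# either the first line or directly follows an empty line.
renumber_records() {
    awk '
        (NR == 1 || prev == "") && /^[0-9]+\. / {
            n++                           # running record counter
            sub(/^[0-9]+\. /, n ". ")     # swap the old number for the new one
        }
        { print; prev = $0 }              # remember the line for the next test
    ' "$1"
}

# Usage (file names are placeholders):
#   renumber_records some_file > some_file.renumbered
```

Writing to a new file rather than editing in place keeps the original download intact for comparison.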

Thanks a lot for the solution.

ADD REPLY • link 4.3 years ago by rohitsatyam102 ▴ 920

Regarding "(example output truncated)":

I changed

> some_file

to

>> some_file

Appending helps, I think. Is that the truncation you were referring to?

ADD REPLY • link 4.3 years ago by rohitsatyam102 ▴ 920

I was referring to the truncation of the example output. I have updated my answer to append data to the file.
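For anyone unsure about the difference between the two redirections, here is a small self-contained sketch. It uses a throwaway temporary file in place of some_file, and the "batch" strings are placeholders rather than real PubMed output.

```shell
#!/bin/sh
# Compare '>' (truncate, then write) with '>>' (append).
demo_file=$(mktemp)   # temporary stand-in for some_file

echo "batch 1" >  "$demo_file"    # '>' creates the file or empties it first
echo "batch 2" >  "$demo_file"    # the earlier line is lost
overwrite_lines=$(wc -l < "$demo_file" | tr -d ' ')

echo "batch 1" >  "$demo_file"
echo "batch 2" >> "$demo_file"    # '>>' keeps existing content and adds to it
append_lines=$(wc -l < "$demo_file" | tr -d ' ')

echo "lines after overwrite: $overwrite_lines, after append: $append_lines"
rm -f "$demo_file"
```

One consequence of `>>` is that re-running the same download adds a second copy of every record, so delete or rename the output file before repeating a query.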

ADD REPLY • link 4.3 years ago by GenoMax 148k

Wonderful. Thanks for your time, I really appreciate it. From now on I will start looking outside R for solutions.

ADD REPLY • link 4.3 years ago by rohitsatyam102 ▴ 920

"Be sure to sign-up for and use NCBI_API_KEY."

I am using the Entrez Direct package installed through conda. It does not seem to ask for any such key.

To check how many abstracts were fetched:

grep 'PMID' abstract.txt | wc -l
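A slightly stricter variant of this count is sketched below. It assumes that each record in the "abstract" format carries exactly one line beginning with "PMID: "; `count_pmid_records` is a hypothetical helper, and the two sample records are made-up placeholders, not real PubMed entries.

```shell
#!/bin/sh
# Count records by counting lines that START with "PMID: ", so that the
# string "PMID" appearing inside an abstract is not counted.
count_pmid_records() {
    grep -c '^PMID: ' "$1"
}

# Placeholder data: two fake records, one mentioning PMID in its text.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1. Example Journal. 2020;1(1):1-2.

First placeholder title.

This abstract mentions PMID in running text.

PMID: 11111111

2. Example Journal. 2020;1(1):3-4.

Second placeholder title.

PMID: 22222222 [Indexed for MEDLINE]
EOF

count_pmid_records "$sample"     # anchored count: the two records
grep 'PMID' "$sample" | wc -l    # unanchored count: three lines, one a false hit
rm -f "$sample"
```

Comparing the anchored count with the hit count reported by the search is a quick way to spot an incomplete download.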

ADD REPLY • link 4.3 years ago by rohitsatyam102 ▴ 920

It won't ask. If you don't use a key, NCBI will start throttling your queries, as described in the link I provided above.
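EDirect reads the key from the NCBI_API_KEY environment variable, and NCBI's documented limits are 3 requests per second without a key and 10 per second with one. The sketch below just reports which limit applies in the current shell; `ncbi_rate_limit` is a hypothetical helper and the key value shown in the comment is a placeholder.

```shell
#!/bin/sh
# Report the E-utilities request rate that applies, depending on whether
# NCBI_API_KEY is set in the environment.
# ncbi_rate_limit is an illustrative helper, not an EDirect command.
ncbi_rate_limit() {
    if [ -n "${NCBI_API_KEY:-}" ]; then
        echo 10    # requests per second with an API key
    else
        echo 3     # requests per second without one
    fi
}

# The key is typically exported once, for example in ~/.bashrc:
#   export NCBI_API_KEY=your_key_here   # placeholder, not a real key
echo "Allowed requests per second: $(ncbi_rate_limit)"
```

Running this before a large download is a cheap check that the key is actually visible to the shell that will run esearch and efetch.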

ADD REPLY • link 4.3 years ago by GenoMax 148k

I cannot access this paper. Can you share the author's copy, if available? My email ID is rohitsatyam102@gmail.com

ADD REPLY • link 4.3 years ago by rohitsatyam102 ▴ 920

"[Epub ahead of print]"

That paper has not been published yet. You will find that journals send paper listings to PubMed ahead of actual publication.

ADD REPLY • link 4.3 years ago by GenoMax 148k