Question

How to retrieve the PMC ID and its corresponding abstract from the PMC Open Access Subset?

10.7 years ago

zhenyisong ▴ 160

I am trying to do text classification on PubMed abstracts. I downloaded the data from the PMC Open Access Subset. My initial plan is to write a Perl script that stores each abstract as a txt file with the PMC ID or PubMed ID as the file name, which would make it easier to use GNAT in the subsequent analysis. A hash that uses the PMC ID as the key and the abstract as the value would be one option. My question is how to write a regular expression to extract this information from the nxml data. I read the DTD file and found that each article may have at least one abstract and may have zero or one PMC ID. If I have to discard articles without a PMC ID, how should I do that? If you have a better suggestion, please let me know.

I hope my question is clear.

pubmed perl • 3.8k views

updated 3.0 years ago by Ram 44k • written 10.7 years ago by zhenyisong ▴ 160

Ram · Answer 1 · 2014-04-08

10.6 years ago

5heikki 11k

I'd go with Entrez Direct

ADD COMMENT • link 10.6 years ago by 5heikki 11k

This program is very convenient for me to extract batches of files from NCBI. However, my need is to parse the local XML file. At least from their examples, I could not solve my problem. Or would you please give me a handy tip to complete a piece of the script to parse all nxml files using Entrez Direct, so the result file contains the PMID and the corresponding abstract?

10.4 years ago by zhenyisong ▴ 160

Just read the documentation. Here's something I did a few days ago. The parsing step is highlighted; adapt it to your needs. To see the available elements, drop the xtract pipe.

# Read one GI per line and print it with its organism lineage, tab-separated
while read -r line
do
lineage=$(efetch -db protein -id "$line" -format xml | xtract -element OrgName_lineage)
#________________________________________^-------------------------------------------^
printf '%s\t%s\n' "$line" "$lineage"
done < 1154_GIs.txt
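
For nxml files that are already on disk, no network fetch is needed. As an alternative to Entrez Direct, here is a minimal Python sketch that walks a directory and writes one tab-separated line per article with the PMID and the abstract. The names `pmid_and_abstract` and `write_table` are made up for illustration, and the element paths assume the usual PMC layout, with `<article-id pub-id-type="pmid">` and `<abstract>` inside `<article-meta>`; check them against a few of your own files.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def pmid_and_abstract(nxml_path):
    """Return (pmid, abstract_text) for one nxml file; empty strings if absent."""
    root = ET.parse(nxml_path).getroot()
    # Assumed location of the PubMed ID in the article metadata.
    pmid = root.findtext(".//article-meta/article-id[@pub-id-type='pmid']")
    abstract = root.find(".//article-meta/abstract")
    text = ""
    if abstract is not None:
        # itertext() descends into nested tags such as <p> or <italic>;
        # split/join collapses line breaks and repeated spaces.
        # Caveat: blocks with no whitespace between their tags get glued together.
        text = " ".join("".join(abstract.itertext()).split())
    return (pmid or "").strip(), text


def write_table(nxml_dir, out):
    """Write 'PMID<TAB>abstract' lines for every .nxml file under nxml_dir."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for path in sorted(Path(nxml_dir).rglob("*.nxml")):
        pmid, text = pmid_and_abstract(path)
        if pmid and text:  # skip articles missing either piece
            writer.writerow([pmid, text])
```

For example, `with open("abstracts.tsv", "w", encoding="utf-8") as out: write_table("nxml_dir", out)` would write the table to a file; both paths are placeholders.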

updated 3.0 years ago by Ram 44k • written 10.4 years ago by 5heikki 11k

Ram · Answer 2 · 2014-04-08

10.6 years ago

Pierre Lindenbaum 164k

You could loop over the *.nxml files in the tar.gz and extract the data with xmllint + XPath:

# List the .nxml members of the archive, then print the PMID and the abstract text of each
tar tzf file.tar.gz | grep -E '\.nxml$' | while read -r P
do
     tar xOzf file.tar.gz "${P}" | xmllint --xpath "//article-meta/article-id[@pub-id-type='pmid']/text()" -
     echo    # xmllint does not end its output with a newline
     tar xOzf file.tar.gz "${P}" | xmllint --xpath "//abstract//text()" -
     echo
done
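
The same extraction can be sketched in Python, which also gives the hash (a dict) keyed by PMC ID that the question describes and skips articles without one. This is a minimal sketch rather than a finished pipeline: the function names `parse_article` and `abstracts_from_tar` are hypothetical, and it assumes the PMC ID is stored as `<article-id pub-id-type="pmc">` inside `<article-meta>`. Check a few of your nxml files, since that attribute value may differ between releases.

```python
import tarfile
import xml.etree.ElementTree as ET


def parse_article(xml_bytes, id_type="pmc"):
    """Return (article_id, abstract_text); either is None when absent."""
    root = ET.fromstring(xml_bytes)
    # id_type is the assumed pub-id-type value, e.g. "pmc" or "pmid".
    id_node = root.find(f".//article-meta/article-id[@pub-id-type='{id_type}']")
    abs_node = root.find(".//article-meta/abstract")
    article_id = None
    if id_node is not None and id_node.text:
        article_id = id_node.text.strip()
    abstract = None
    if abs_node is not None:
        # Gather text from nested tags and collapse runs of whitespace.
        abstract = " ".join("".join(abs_node.itertext()).split())
    return article_id, abstract


def abstracts_from_tar(tar_path, id_type="pmc"):
    """Map article ID -> abstract for every .nxml member of a tar.gz archive."""
    result = {}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".nxml")):
                continue
            data = tar.extractfile(member).read()
            article_id, abstract = parse_article(data, id_type)
            if article_id is None or not abstract:
                continue  # discard articles lacking the ID or an abstract
            result[article_id] = abstract
    return result
```

Calling `abstracts_from_tar("file.tar.gz")` returns the dict, and writing each value to a file named after its key is then a short loop. Passing `id_type="pmid"` keys the dict by PubMed ID instead.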

updated 3.0 years ago by Ram 44k • written 10.6 years ago by Pierre Lindenbaum 164k

It takes time to digest this script.

10.4 years ago by zhenyisong ▴ 160