How to Retrieve the PMC ID and Its Corresponding Abstract from the PMC Open Access Subset?
10.7 years ago
zhenyisong ▴ 160

I am trying to do text classification on PubMed abstracts. I downloaded the data from the PMC Open Access Subset. My initial plan is to write a Perl script that stores each abstract as a text file named after its PMC ID or PubMed ID, which would make it easier to use GNAT in the subsequent analysis. A hash with the PMC ID as key and the abstract as value would be one option. My question is how to write a regular expression to extract this information from the nxml data. I read the DTD file and found that each article may have one or more abstracts and zero or one PMC ID. How should I discard the articles that have no PMC ID? If you have a better suggestion, please let me know.

I hope my question is clear.

pubmed perl • 3.8k views
10.6 years ago
5heikki 11k

I'd go with Entrez Direct


This program is very convenient for fetching batches of files from NCBI. However, what I need is to parse local XML files, and I could not solve my problem from their examples. Could you give me a tip on how to write a script that parses all the nxml files with Entrez Direct, so that the result file contains each PMID and the corresponding abstract?


Just read the documentation. Here's something I did a few days ago. The parsing step is highlighted; adapt it to your needs. To see the available elements, drop the xtract pipe.

while read line
do
lineage=$(efetch -db protein -id $line -format xml | xtract -element OrgName_lineage)
#______________________________________^-------------------------------------------^
printf '%s\t%s\n' "$line" "$lineage"
done<1154_GIs.txt
10.6 years ago

You could loop over the *.nxml files in the tar.gz and extract the data with xmllint and XPath:

# List the archive members, keep only the .nxml files, and process each one.
tar tzf file.tar.gz | grep -E '\.nxml$' | while read -r P
do
     # PubMed ID of the article
     tar xOzf file.tar.gz "${P}" | xmllint --xpath "//article-meta/article-id[@pub-id-type='pmid']/text()" -
     # Text of the abstract(s)
     tar xOzf file.tar.gz "${P}" | xmllint --xpath "//abstract//text()" -
done
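
The question also asks for one text file per abstract, named by PMC ID, with articles lacking a PMC ID discarded. A minimal sketch of that variant follows, for an archive that has already been unpacked. It assumes xmllint is installed and that the PMC ID is stored in an article-id element with pub-id-type='pmc', which may differ in your files. The function name extract_abstracts and the directory arguments are made up for illustration, and only the first abstract of each article is kept.

```shell
# Sketch only: write <outdir>/PMC<id>.txt for every .nxml file under <indir>
# that has both a PMC ID and an abstract. extract_abstracts is a hypothetical
# helper name; the pub-id-type value 'pmc' is an assumption about the data.
extract_abstracts() {
    ea_indir=$1
    ea_outdir=$2
    mkdir -p "$ea_outdir"
    find "$ea_indir" -type f -name '*.nxml' | while IFS= read -r ea_file
    do
        # string(...) yields the first matching node as text, or "" if none.
        ea_id=$(xmllint --nonet --xpath \
            "string(//article-meta/article-id[@pub-id-type='pmc'])" \
            "$ea_file" 2>/dev/null)
        # Discard articles without a PMC ID.
        [ -n "$ea_id" ] || continue
        # normalize-space(...) flattens the first abstract to a single line.
        ea_abstract=$(xmllint --nonet --xpath \
            "normalize-space(//article-meta/abstract)" \
            "$ea_file" 2>/dev/null)
        # Discard articles without an abstract.
        [ -n "$ea_abstract" ] || continue
        # Strip a leading "PMC" if present so the file name has it only once.
        printf '%s\n' "$ea_abstract" > "$ea_outdir/PMC${ea_id#PMC}.txt"
    done
}
```

Calling `extract_abstracts unpacked_dir abstracts` would then fill the abstracts directory; to key the files on the PubMed ID instead, swap 'pmc' for 'pmid' in the first XPath expression and drop the PMC prefix from the file name.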

It will take me some time to digest this script.
