I am trying to do text classification on PubMed abstracts. I downloaded the data from there. My initial plan is to write a Perl script that stores each abstract as a plain-text file named after its PMC ID or PubMed ID, which would make it easier to use GNAT in the later analysis. A hash with the PMC ID as the key and the abstract as the value would be one option. My question is how to write a regular expression to extract this information from the nxml data. I read the DTD file and found that each article may have one or more abstracts and zero or one PMC ID. If I have to discard the articles without a PMC ID, how should I do that? If you have a better suggestion, please let me know.
I hope my question is clear.
This program is very convenient for extracting batches of files from NCBI. However, what I need is to parse local XML files, and I could not solve my problem from their examples. Could you give me a tip for writing a script that parses all nxml files with Entrez Direct, so that the result file contains each PMID and its corresponding abstract?
Just read the documentation. Here's something I did a few days ago (the parsing part is highlighted; adapt it to your needs, and drop the xtract pipe to see the available elements).
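For local nxml files, an XML parser is usually a safer choice than a regular expression. The following is only a minimal sketch in Python (not Perl and not Entrez Direct, and not the command referred to above). It assumes the files follow the usual JATS layout, with the identifier in `<article-id pub-id-type="pmc">` and the abstract in `<abstract>`, both inside `<article-meta>`; check this against your own files. The function names `extract_abstract` and `write_abstracts` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_abstract(root, id_type="pmc"):
    """Return (article_id, abstract_text), or None if either is missing.

    Assumes a JATS-style tree: <article-meta> holds <article-id> elements
    with a pub-id-type attribute, plus zero or more <abstract> elements.
    """
    meta = root.find(".//article-meta")
    if meta is None:
        return None

    article_id = None
    for node in meta.findall("article-id"):
        if node.get("pub-id-type") == id_type:
            article_id = (node.text or "").strip()
            break
    if not article_id:
        return None  # discard articles without the requested id

    parts = []
    for abstract in meta.findall("abstract"):
        # Join all nested text and collapse whitespace. Joining with a space
        # keeps section titles apart from paragraphs, at the cost of an extra
        # space around inline markup such as <italic>.
        text = " ".join(" ".join(abstract.itertext()).split())
        if text:
            parts.append(text)
    if not parts:
        return None
    return article_id, "\n\n".join(parts)


def write_abstracts(src_dir, out_dir, id_type="pmc"):
    """Write one <id>.txt per article and return a dict of id -> abstract."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    abstracts = {}
    for path in sorted(Path(src_dir).rglob("*.nxml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue  # skip files the parser cannot read
        result = extract_abstract(root, id_type)
        if result is None:
            continue
        article_id, text = result
        abstracts[article_id] = text
        (out / (article_id + ".txt")).write_text(text, encoding="utf-8")
    return abstracts
```

Calling `write_abstracts("nxml_dir", "abstracts_txt")` (both directory names are placeholders) would give one text file per article plus the dictionary equivalent of the hash described in the question. Passing `id_type="pmid"` keys the output by PubMed ID instead, under the same assumption about the `pub-id-type` attribute.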