PMC XML Parsing
13.5 years ago
Alex ★ 1.5k

I want to extract article text from PMC XML data with Python.

I used event-driven parsing, but it doesn't work well with XML like this:

<p>There is a limimited ... Journal, 
<ext-link ext-link-type="uri" xlink:href="http://...index.asp">
http://...index.asp</ext-link>). Published ... </p>

I think the problem is with tag-text-other_tag-text-tag sequences (mixed content). When I use event-driven parsing, I get something like this:

Code example:

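from xml.etree import ElementTree as etree  # assumption: lxml.etree offers the same iterparse call

# fh is an open file handle for the PMC XML file
n = 0
result = ""
for event, element in etree.iterparse(fh, events=['start', 'end']):
    if event == "start":
        n += 2
        result += "%s<%s>\n" % (" " * n, element.tag)
        # only element.text is read here; element.tail is never used
        result += "%s%s\n" % (" " * (n + 2), element.text)
    if event == "end":
        result += "%s</%s>\n" % (" " * n, element.tag)
        n -= 2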

Output of this code:

      

There is a limimited ... Journal, <ext-link> <http://www.aapsj.org/theme_issues/virtual/index.asp> </ext-link> <xref> 1 </xref> <xref> 2 </xref> <xref> 3 </xref> <xref> 4 </xref>
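The output above is missing the text that comes after `</ext-link>`. A minimal sketch of where that text probably ends up, assuming the standard-library `xml.etree.ElementTree` model (lxml uses the same `.text`/`.tail` convention): text that follows a closing tag is stored in the `.tail` of the element that just closed, not in the parent's `.text`. The `SAMPLE` string is a hypothetical fragment modelled on the one in the question, with the `xlink:href` attribute left out so that no namespace declaration is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the paragraph in the question.
SAMPLE = (
    "<p>There is a ... Journal, "
    '<ext-link ext-link-type="uri">http://...index.asp</ext-link>'
    "). Published ... </p>"
)

root = ET.fromstring(SAMPLE)
link = root.find("ext-link")

# The paragraph text is split over three places:
#   root.text  -> text before the first child element
#   link.text  -> text inside <ext-link>
#   link.tail  -> text after </ext-link>, up to the next tag
pieces = [root.text, link.text, link.tail]
full_text = "".join(piece or "" for piece in pieces)

print(pieces)
print(full_text)
```

Reading only `.text`, as the loop in the question does, therefore skips every `.tail` chunk.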

I suspect there is a simple solution for this case. How do you handle XML like this with Python?

Are there any tools or libraries for converting PMC XML to raw text?
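One possible approach, sketched here with the standard library only: keep `iterparse`, but act on the "end" event of each `<p>` and let `itertext()` walk every `.text` and `.tail` inside it. The helper name `iter_paragraph_text` and the `DEMO` document are hypothetical, and the sketch assumes that paragraphs are un-namespaced `<p>` elements; a `<p>` nested inside another `<p>` would be reported twice.

```python
import io
import xml.etree.ElementTree as ET

def iter_paragraph_text(source, tag="p"):
    """Yield the whitespace-normalised text of each `tag` element in `source`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # At the "end" event all descendants are complete, so itertext()
            # sees every .text and .tail chunk inside the paragraph.
            raw = "".join(elem.itertext())
            yield " ".join(raw.split())

# Hypothetical miniature document shaped like the fragment in the question.
DEMO = b"""<article xmlns:xlink="http://www.w3.org/1999/xlink"><body>
<p>There is a ... Journal,
<ext-link ext-link-type="uri" xlink:href="http://...index.asp">
http://...index.asp</ext-link>). Published ... <xref>1</xref></p>
<p>Second paragraph.</p>
</body></article>"""

paragraphs = list(iter_paragraph_text(io.BytesIO(DEMO)))
for text in paragraphs:
    print(text)
```

The sketch keeps the whole tree in memory; for very large files, processed elements would also need to be cleared, which is left out here for simplicity.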

pubmed xml parsing • 5.8k views

show us the code please :-)


I've added the code example =)


I don't know your Python parser, but I would guess that it is only data-oriented (it only expects text data between two tags).


Your googleable code (http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/pubmed2exhibit.xsl) was a helpful hint in the right direction =) Now I know that any SAX parser is a bad idea in my case; it is better to use XSLT.


This might help.

13.5 years ago
Luwening ▴ 50

You should look at the tagging guidelines (http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html) and the DTD (http://dtd.nlm.nih.gov/publishing/). Biopython may also help you parse the XML file (it also depends on the DTD).

13.1 years ago
Chris Maloney ▴ 360

This question seems to be specifically about parsing XML with Python, and not about bioinformatics or about the NLM DTDs (JATS). I am not a Python user, but I'd suggest asking on the python mailing list, or Googling "parsing xml with python".
