I want to extract article text from PMC XML data with Python.
I tried event-driven parsing, but it doesn't work well with XML like this:
<p>There is a limimited ... Journal,
<ext-link ext-link-type="uri" xlink:href="http://...index.asp">
http://...index.asp</ext-link>). Published ... </p>
I think the problem is with tag-text-other_tag-text-tag sequences (mixed content). When I use event-driven parsing I get something like this:
Code example:
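    import xml.etree.ElementTree as etree  # assumed; the original import is not shown

    # fh is an open file handle for the PMC XML document
    n = 0
    result = ""
    for event, element in etree.iterparse(fh, events=['start', 'end']):
        if event == "start":
            n += 2
            result += "%s<%s>\n" % (" " * n, element.tag)
            # Only element.text is emitted; text after a child's closing tag (element.tail) is skipped
            result += "%s%s\n" % (" " * (n + 2), element.text)
        if event == "end":
            result += "%s</%s>\n" % (" " * n, element.tag)
            n -= 2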
Output of this code:
There is a limimited ... Journal,
<ext-link>
<http://www.aapsj.org/theme_issues/virtual/index.asp>
</ext-link>
<xref>
1
</xref>
<xref>
2
</xref>
<xref>
3
</xref>
<xref>
4
</xref>
I suspect there is a simple solution for this case. How do you handle XML like this with Python?
Are there any tools or libraries for converting PMC XML to raw text?
Show us the code, please :-)
Code example added =)
I don't know your Python parser, but I would guess that it is only data-oriented (it expects only text data between two tags).
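To illustrate the guess above: in the ElementTree data model, text that follows a child's closing tag is not part of the parent's text. It is stored on the child as `.tail`. The sketch below uses the standard library `xml.etree.ElementTree` and a made-up sample modelled on the fragment in the question (the example.org URL and the wording are placeholders, not the real PMC data).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the fragment in the question.
SAMPLE = (
    '<p xmlns:xlink="http://www.w3.org/1999/xlink">See the Journal, '
    '<ext-link ext-link-type="uri" xlink:href="http://example.org/index.asp">'
    'http://example.org/index.asp</ext-link>). Published online.</p>'
)

root = ET.fromstring(SAMPLE)
link = root[0]  # the <ext-link> child

# Text before the first child is stored on the parent as .text
print(repr(root.text))
# Text inside the child is the child's .text
print(repr(link.text))
# Text after the child's closing tag is stored on the child as .tail
print(repr(link.tail))
```

So code that reads only `element.text` silently drops everything that follows an inline element such as `ext-link` or `xref`.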
Your googleable code (http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/pubmed2exhibit.xsl) was a helpful hint in the right direction =) Now I know that any SAX parser is a bad idea in my case; it is better to use XSLT.
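For anyone who wants to stay in Python rather than XSLT, here is a minimal sketch of another non-event-driven route: parse the fragment into a tree and let `itertext()` from the standard library `xml.etree.ElementTree` walk the mixed content. This is not the XSLT approach from the linked stylesheet, just an alternative that assumes the paragraph fits in memory. The function name `paragraph_text` and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def paragraph_text(xml_fragment):
    """Return the character data of an XML fragment as one plain-text string."""
    root = ET.fromstring(xml_fragment)
    # itertext() yields every .text and .tail below root in document order,
    # so text between and after inline elements is kept.
    raw = "".join(root.itertext())
    # Collapse line breaks and repeated spaces left over from the XML layout.
    return " ".join(raw.split())


# Hypothetical sample with inline elements inside a paragraph.
sample = "<p>There is a <b>bold</b> claim\n(<xref>1</xref>).</p>"
print(paragraph_text(sample))
```

The same idea should carry over to a whole article by calling the function on each `p` element, although real PMC files may need extra handling for tables, formulas and references.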
This might help.