Question

How to parse a PubMed abstract that contains HTML tags or Unicode in Python?

0

6.7 years ago

aqsaawan.459 • 0

I'm working on a data mining project that reads PubMed article XML files in Python and parses their data into a database. The problem is that text containing inline tags such as <sup></sup> is not read completely, for example the abstract of a paper. The code used to read the XML file and the XML file I'm trying to read are posted here.

python xml pubmed parsing html • 3.5k views

ADD COMMENT • link updated 6.7 years ago by Pierre Lindenbaum 164k • written 6.7 years ago by aqsaawan.459 • 0

1

It seems you forgot some input information:

What does your XML file look like?
What did you try in Python?
What do you want to achieve?
Please submit an example that doesn't work as expected.

ADD REPLY • link 6.7 years ago by Bastien Hervé 5.9k

0

Please provide a reproducible example of "xml files" and your code.

ADD REPLY • link 6.7 years ago by Michael 55k

0

The tag interrupts text reading in Python: for "Categorical variables were compared using the χ2test." it stops at χ and doesn't print the rest of the text.

    <Abstract>
         <AbstractText> The disease free survival (DFS) and overall survival (OS) were calculated by 
         the Kaplan-Meier method. Categorical variables were compared using the 
         χ<sup>2</sup>test.</AbstractText>
    </Abstract>
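
One possible explanation for the truncation, sketched under an assumption: the code that reads the file was not posted, so the snippet below assumes the standard-library `xml.etree.ElementTree` parser and the `.text` attribute were used. With that parser, `.text` holds only the characters before the first child element, so the text stops right after χ when it reaches `<sup>`. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment above (hypothetical variable name).
xml_fragment = (
    "<Abstract><AbstractText>Categorical variables were compared using the "
    "χ<sup>2</sup>test.</AbstractText></Abstract>"
)

root = ET.fromstring(xml_fragment)
abstract = root.find("AbstractText")

# .text only holds the characters before the first child element (<sup>).
text_only = abstract.text

# itertext() yields the text of the element, of its children, and the
# "tail" text that follows each child, in document order.
full_text = "".join(abstract.itertext())

print(repr(text_only))
print(repr(full_text))
```

If this assumption matches the real code, joining `itertext()` instead of reading `.text` would return the whole abstract, with the inline tags dropped and their content kept.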

ADD REPLY • link updated 6.7 years ago by Michael 55k • written 6.7 years ago by aqsaawan.459 • 0

0

Please provide a reproducible example of "xml files" and your code. Please edit your original post to make it coherent.

ADD REPLY • link 6.7 years ago by Michael 55k

0

Sometimes PubMed abstracts contain rudimentary HTML code. That shouldn't cause problems, but without the code you are using it is hard to say. I have edited the title to better reflect what this is about.

ADD REPLY • link 6.7 years ago by Michael 55k

0

It looks like Python struggles with the "χ" of your χ2test (chi-squared test). This is a special character that you need to take into account. Would you please share a link to your XML file and your Python code?

ADD REPLY • link 6.7 years ago by Bastien Hervé 5.9k

0

Does Python have problems with Unicode?

ADD REPLY • link 6.7 years ago by Michael 55k

score 0 · Answer 1 · 2018-04-04

0

6.7 years ago

Bastien Hervé 5.9k

Try this in your Python code:

    import codecs

    # Open the file with an explicit encoding instead of the platform default
    with codecs.open("greek.xml", 'r', encoding='ISO-8859-7') as f:
        for line in f:
            print(line)

This assumes that your χ (lowercase chi) is encoded as ISO-8859-7, which seems to be the case: https://en.wikipedia.org/wiki/ISO/IEC_8859-7
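
The encoding assumption can be checked before relying on it. The sketch below is only an illustration, not part of the answer above: it shows that χ is a different byte sequence in UTF-8 and in ISO-8859-7, so decoding with the wrong one will not give χ back. An XML file with no encoding declaration defaults to UTF-8, so it is worth looking at the first line of the file first. The `read_text` helper and its default are hypothetical.

```python
chi = "χ"  # GREEK SMALL LETTER CHI, U+03C7

# The same character is stored as different bytes in the two encodings.
utf8_bytes = chi.encode("utf-8")
greek_bytes = chi.encode("iso-8859-7")

# Decoding UTF-8 bytes as ISO-8859-7 silently produces other characters.
misread = utf8_bytes.decode("iso-8859-7", errors="replace")


def read_text(path, encoding="utf-8"):
    """Read a whole file with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


print(utf8_bytes, greek_bytes, repr(misread))
```

If the file turns out to be UTF-8, passing `encoding="utf-8"` to the built-in `open` (or handing the raw bytes to the XML parser and letting it read the declaration) would be the matching choice.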

ADD COMMENT • link 6.7 years ago by Bastien Hervé 5.9k

score 0 · Answer 2 · 2018-04-04

You can use an XSLT stylesheet to extract the fields you need. See, e.g.: Pubmed Xml To Other Format, Such As Enw ; What is the simplest way to go from Pubmed ID to citation, programmatically? ; Importing Pubmed Medline Details Into A Local Rdbms To Execute Data Mining Methods ; parse nxml file from pubmed ; Getting Tab-Delimited Pmids And Abstracts From Pubmed ; https://www.biostars.org/p/14051/; ...