Parse XML PredictProtein files
6.9 years ago
Biojl ★ 1.7k

Hi,

I have several thousand XML files from https://www.predictprotein.org/ calculations for different proteins. Does anyone know of a package to parse that information in Python or R, so that I can easily perform some calculations on it?

I am mostly interested in the secondary structure information: I would like to obtain the relative coordinates of every feature and convert them to a BED file. An example file is here: https://raw.githubusercontent.com/gyachdav/pp-results/master/examples/ADRB2_HUMAN.xml

proteinpredict parse xml biopython • 1.8k views

Please provide a sample of the XML. What kind of information do you want to retrieve? Most of the time, a simple XSL stylesheet does the job if it is a simple query.


Hi, I have updated the question with the relevant information, including an example file. I am not familiar with XSL stylesheets but I'll dig into it.


"Obtaining the relative coordinates of every feature and convert it to a bed file"

That's not clear to me. Please give me a few lines as an example.


I'm interested in the info inside this element: <featuretypegroup type="secondary structures">...</featuretypegroup>

Some example lines inside those tags:

<feature type="helix" soTermId="SO:0001114">
  <location>
    <begin position="89"/>
    <end position="108"/>
  </location>
</feature>
<feature type="strand" soTermId="SO:0001111">
  <location>
    <begin position="109"/>
    <end position="113"/>
  </location>
</feature>

The idea would be to get:

ENSP0001, helix, 89, 108
ENSP0001, strand, 109, 113
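
For readers who prefer Python, as the question asks, the extraction described above could be sketched roughly as follows with the standard library. The embedded sample is a trimmed, hypothetical document modelled on the quoted lines, and the helper names `local` and `secondary_structures` are made up for illustration. Real PredictProtein files may declare an XML namespace, so the sketch matches on local tag names only.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample modelled on the lines quoted above; real
# PredictProtein files carry more elements and may declare an XML namespace.
SAMPLE = """<featuretypegroup type="secondary structures">
  <feature type="helix" soTermId="SO:0001114">
    <location><begin position="89"/><end position="108"/></location>
  </feature>
  <feature type="strand" soTermId="SO:0001111">
    <location><begin position="109"/><end position="113"/></location>
  </feature>
</featuretypegroup>"""


def local(tag):
    """Drop a '{namespace}' prefix so matching works with or without one."""
    return tag.rsplit("}", 1)[-1]


def secondary_structures(root, protein_id):
    """Yield (protein_id, type, begin, end) for each secondary-structure feature."""
    for group in root.iter():
        if local(group.tag) != "featuretypegroup":
            continue
        if group.get("type") != "secondary structures":
            continue
        for feature in group.iter():
            if local(feature.tag) != "feature":
                continue
            # Collect the begin/end positions found under this feature.
            positions = {
                local(node.tag): int(node.get("position"))
                for node in feature.iter()
                if local(node.tag) in ("begin", "end")
            }
            yield protein_id, feature.get("type"), positions["begin"], positions["end"]


rows = list(secondary_structures(ET.fromstring(SAMPLE), "ENSP0001"))
for row in rows:
    print(", ".join(str(field) for field in row))
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.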

I cannot find ENSP0001 in your example.


It's not there; it is a made-up protein ID. The protein ID is usually in the file name.
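
If the protein ID really is the file name without its extension, it could be recovered along these lines. This is only a sketch: the function names and the `results/` directory are hypothetical, and the `*.xml` pattern assumes all result files sit in one directory.

```python
from pathlib import Path


def protein_id_from_path(path):
    """Return the file name without its extension, e.g. 'ADRB2_HUMAN'."""
    return Path(path).stem


def xml_files_by_id(directory):
    """Map each protein ID to its XML file (assumes one '<ID>.xml' file per protein)."""
    return {protein_id_from_path(p): p for p in sorted(Path(directory).glob("*.xml"))}


print(protein_id_from_path("results/ADRB2_HUMAN.xml"))
```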

6.9 years ago

Using an XSLT stylesheet (saved as biostars293423.xsl) with xsltproc:

$ xsltproc biostars293423.xsl input.xml

ADRB2_HUMAN protein binding region  26  28  
ADRB2_HUMAN protein binding region  147 147 
ADRB2_HUMAN protein binding region  179 180 
ADRB2_HUMAN protein binding region  236 236 
ADRB2_HUMAN protein binding region  243 243 
ADRB2_HUMAN protein binding region  248 252 
ADRB2_HUMAN protein binding region  343 347 
ADRB2_HUMAN disordered region   9   10  
ADRB2_HUMAN disordered region   19  21  
ADRB2_HUMAN disordered region   389 389 
ADRB2_HUMAN disordered region   406 406 
ADRB2_HUMAN disordered region   1   30  
ADRB2_HUMAN disordered region   61  65  
ADRB2_HUMAN disordered region   140 148 
ADRB2_HUMAN disordered region   175 183 
ADRB2_HUMAN disordered region   186 196 
ADRB2_HUMAN disordered region   228 270 
ADRB2_HUMAN disordered region   299 305 
ADRB2_HUMAN disordered region   330 331 
ADRB2_HUMAN disordered region   334 334 
ADRB2_HUMAN disordered region   343 413 
ADRB2_HUMAN disordered region   21  25  
ADRB2_HUMAN disordered region   228 233 
ADRB2_HUMAN disordered region   235 235 
ADRB2_HUMAN disordered region   359 359 
ADRB2_HUMAN disordered region   366 376 
ADRB2_HUMAN disordered region   394 401 
ADRB2_HUMAN disordered region   404 405 
ADRB2_HUMAN disordered region   1   2   
ADRB2_HUMAN disordered region   356 359 
ADRB2_HUMAN disordered region   363 376 
ADRB2_HUMAN disordered region   379 379 
ADRB2_HUMAN disordered region   381 411 
ADRB2_HUMAN strand  31  38  
ADRB2_HUMAN helix   39  41  
ADRB2_HUMAN strand  42  48  
ADRB2_HUMAN strand  52  60  
ADRB2_HUMAN helix   68  86  
ADRB2_HUMAN helix   89  97  
ADRB2_HUMAN helix   103 115 
ADRB2_HUMAN helix   117 122 
ADRB2_HUMAN strand  123 128 
ADRB2_HUMAN strand  133 136 
ADRB2_HUMAN helix   147 151 
ADRB2_HUMAN strand  152 154 
ADRB2_HUMAN helix   155 165 
ADRB2_HUMAN strand  170 175 
ADRB2_HUMAN strand  197 205 
ADRB2_HUMAN helix   209 231 
ADRB2_HUMAN helix   262 272 
ADRB2_HUMAN strand  273 285 
ADRB2_HUMAN helix   288 298 
ADRB2_HUMAN helix   305 318 
ADRB2_HUMAN strand  324 327 
ADRB2_HUMAN helix   330 342
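
Since the original goal was a BED file, the output above still needs a coordinate shift. BED uses 0-based starts and exclusive ends, so a 1-based inclusive range keeps its end and loses one from its start. The sketch below assumes the xsltproc output is tab-separated (with a possible trailing tab) and that positions are 1-based and inclusive. The function name `to_bed_line` is hypothetical, and replacing spaces in the feature name with underscores is a choice made here, not something the thread specifies.

```python
def to_bed_line(row):
    """Convert 'ID<TAB>feature<TAB>begin<TAB>end' to a four-column BED line."""
    # rstrip() removes the newline and any trailing tab before splitting.
    protein_id, feature, begin, end = row.rstrip().split("\t")
    start = int(begin) - 1  # BED starts are 0-based
    stop = int(end)         # BED ends are exclusive, so the inclusive end is unchanged
    name = feature.replace(" ", "_")  # avoid whitespace inside the name column
    return f"{protein_id}\t{start}\t{stop}\t{name}"


rows = [
    "ADRB2_HUMAN\thelix\t39\t41\t\n",
    "ADRB2_HUMAN\tprotein binding region\t147\t147\t\n",
]
bed_lines = [to_bed_line(r) for r in rows]
print("\n".join(bed_lines))
```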

Wow, it is definitely worth learning how to use XSL! Just one question about the code: in which part do you select the info for the first column (ADRB2_HUMAN)? I don't understand this language yet.


"in which part you select the info for the first column (ADRB2_HUMAN)"

<xsl:value-of select="/p:predictprotein/p:entry/p:accession"/>
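
The same lookup can be sketched in Python with ElementTree, which may make the XPath easier to read. The `p:` prefix stands for an XML namespace; the URI used below is a placeholder invented for this example, so the real one would have to be copied from the `xmlns` attribute of an actual file. The sample document is likewise a minimal stand-in that only mirrors the path in the XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; copy the real one from the xmlns attribute
# of an actual PredictProtein file.
NS = {"p": "http://example.org/predictprotein"}

# Minimal stand-in document mirroring /p:predictprotein/p:entry/p:accession.
SAMPLE = """<predictprotein xmlns="http://example.org/predictprotein">
  <entry>
    <accession>ADRB2_HUMAN</accession>
  </entry>
</predictprotein>"""

root = ET.fromstring(SAMPLE)  # root is the <predictprotein> element
# The path is relative to the root, so the leading /p:predictprotein is dropped.
accession = root.findtext("p:entry/p:accession", namespaces=NS)
print(accession)
```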