Parse PredictProtein XML files

7.3 years ago

Biojl ★ 1.7k

Hi,

I have several thousand XML files from https://www.predictprotein.org/ calculations for different proteins. Does anyone know of a package to parse that information in Python or R, so that I can perform some calculations easily?

I am mostly interested in the secondary structure information: I would like to obtain the relative coordinates of every feature and convert them to a BED file. An example file is here: https://raw.githubusercontent.com/gyachdav/pp-results/master/examples/ADRB2_HUMAN.xml

proteinpredict parse xml biopython • 2.0k views

Please provide a sample of the XML. What kind of information do you want to retrieve? Most of the time, a simple XSL stylesheet does the job if it is a simple query.

7.3 years ago by Pierre Lindenbaum 166k

Hi, I have updated the question with the relevant information, including an example file. I am not familiar with XSL stylesheets but I'll dig into it.

7.3 years ago by Biojl ★ 1.7k

"Obtaining the relative coordinates of every feature and convert it to a bed file"

That's not clear to me. Please give me a few example lines.

7.3 years ago by Pierre Lindenbaum 166k

I'm interested in the info inside this element: <featuretypegroup type="secondary structures">...</featuretypegroup>

Some example lines inside those tags:

<feature type="helix" soTermId="SO:0001114">
<location>
<begin position="89"/>
<end position="108"/>
</location>
</feature>
<feature type="strand" soTermId="SO:0001111">
<location>
<begin position="109"/>
<end position="113"/>
</location>
</feature>

The idea would be to get:

ENSP0001, helix, 89, 108
ENSP0001, strand, 109, 113

updated 7.3 years ago by Pierre Lindenbaum 166k • written 7.3 years ago by Biojl ★ 1.7k

I cannot find ENSP0001 in your example.

7.3 years ago by Pierre Lindenbaum 166k

It's not there; it is a made-up protein ID. The protein ID is usually in the file name.

7.3 years ago by Biojl ★ 1.7k
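
Since the protein ID lives in the file name and the target is a BED file, the two conversions involved can be sketched in Python as below. This is only a sketch under stated assumptions: the helper names `protein_id_from_path` and `to_bed_line` and the `results/` directory are hypothetical, and it assumes the `begin`/`end` positions in the XML are 1-based and inclusive, which would need to be checked against the PredictProtein output. BED itself uses a 0-based start and an exclusive end.

```python
from pathlib import Path


def protein_id_from_path(path):
    """Use the file name without its extension as the protein ID (assumed convention)."""
    return Path(path).stem


def to_bed_line(protein_id, feature_type, begin, end):
    """Turn a 1-based, inclusive range (assumption) into a tab-separated BED line.

    BED columns used here: chrom (the protein ID), start (0-based),
    end (exclusive), name (the feature type).
    """
    return f"{protein_id}\t{begin - 1}\t{end}\t{feature_type}"


# Hypothetical file name, reusing the made-up ID from the comment above.
protein_id = protein_id_from_path("results/ENSP0001.xml")
print(to_bed_line(protein_id, "helix", 89, 108))
print(to_bed_line(protein_id, "strand", 109, 113))
```

For several thousand files, the same two helpers could be called inside a loop over `Path("results").glob("*.xml")`.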

7.3 years ago

Pierre Lindenbaum 166k

Using the following XSLT stylesheet:

	<?xml version='1.0' encoding="UTF-8"?>
	<!-- Prints one tab-separated line per feature: accession, type, begin, end -->
	<xsl:stylesheet
	xmlns:p="http://www.predictprotein.org/predictprotein"
	xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
	version='1.0'>
	<xsl:output method="text"/>

	<xsl:template match="/">
	<!-- every feature in the document, not only the secondary structures -->
	<xsl:for-each select="//p:feature">
	<!-- absolute path: the same accession is printed on every line -->
	<xsl:value-of select="/p:predictprotein/p:entry/p:accession"/>
	<xsl:text>&#9;</xsl:text>
	<xsl:value-of select="@type"/>
	<xsl:text>&#9;</xsl:text>
	<xsl:value-of select="p:location/p:begin/@position"/>
	<xsl:text>&#9;</xsl:text>
	<xsl:value-of select="p:location/p:end/@position"/>
	<xsl:text>&#10;</xsl:text>
	</xsl:for-each>
	</xsl:template>

	</xsl:stylesheet>

$ xsltproc biostars293423.xsl input.xml

ADRB2_HUMAN protein binding region  26  28  
ADRB2_HUMAN protein binding region  147 147 
ADRB2_HUMAN protein binding region  179 180 
ADRB2_HUMAN protein binding region  236 236 
ADRB2_HUMAN protein binding region  243 243 
ADRB2_HUMAN protein binding region  248 252 
ADRB2_HUMAN protein binding region  343 347 
ADRB2_HUMAN disordered region   9   10  
ADRB2_HUMAN disordered region   19  21  
ADRB2_HUMAN disordered region   389 389 
ADRB2_HUMAN disordered region   406 406 
ADRB2_HUMAN disordered region   1   30  
ADRB2_HUMAN disordered region   61  65  
ADRB2_HUMAN disordered region   140 148 
ADRB2_HUMAN disordered region   175 183 
ADRB2_HUMAN disordered region   186 196 
ADRB2_HUMAN disordered region   228 270 
ADRB2_HUMAN disordered region   299 305 
ADRB2_HUMAN disordered region   330 331 
ADRB2_HUMAN disordered region   334 334 
ADRB2_HUMAN disordered region   343 413 
ADRB2_HUMAN disordered region   21  25  
ADRB2_HUMAN disordered region   228 233 
ADRB2_HUMAN disordered region   235 235 
ADRB2_HUMAN disordered region   359 359 
ADRB2_HUMAN disordered region   366 376 
ADRB2_HUMAN disordered region   394 401 
ADRB2_HUMAN disordered region   404 405 
ADRB2_HUMAN disordered region   1   2   
ADRB2_HUMAN disordered region   356 359 
ADRB2_HUMAN disordered region   363 376 
ADRB2_HUMAN disordered region   379 379 
ADRB2_HUMAN disordered region   381 411 
ADRB2_HUMAN strand  31  38  
ADRB2_HUMAN helix   39  41  
ADRB2_HUMAN strand  42  48  
ADRB2_HUMAN strand  52  60  
ADRB2_HUMAN helix   68  86  
ADRB2_HUMAN helix   89  97  
ADRB2_HUMAN helix   103 115 
ADRB2_HUMAN helix   117 122 
ADRB2_HUMAN strand  123 128 
ADRB2_HUMAN strand  133 136 
ADRB2_HUMAN helix   147 151 
ADRB2_HUMAN strand  152 154 
ADRB2_HUMAN helix   155 165 
ADRB2_HUMAN strand  170 175 
ADRB2_HUMAN strand  197 205 
ADRB2_HUMAN helix   209 231 
ADRB2_HUMAN helix   262 272 
ADRB2_HUMAN strand  273 285 
ADRB2_HUMAN helix   288 298 
ADRB2_HUMAN helix   305 318 
ADRB2_HUMAN strand  324 327 
ADRB2_HUMAN helix   330 342
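
For anyone who prefers to stay in Python, as the question asked, roughly the same extraction can be sketched with the standard library's `xml.etree.ElementTree`. The namespace URI and the element and attribute names are taken from the stylesheet above. The `SAMPLE` document is a made-up miniature whose nesting is an assumption, not the real file layout, and the function name `secondary_structure_rows` is hypothetical. Unlike the stylesheet, this sketch keeps only `helix` and `strand` features.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared in the stylesheet; the "p" prefix is arbitrary.
NS = {"p": "http://www.predictprotein.org/predictprotein"}
SECONDARY = {"helix", "strand"}

# Made-up miniature document; the real files are much larger and the exact
# nesting of <feature> elements is an assumption here.
SAMPLE = """
<predictprotein xmlns="http://www.predictprotein.org/predictprotein">
  <entry>
    <accession>ADRB2_HUMAN</accession>
    <feature type="disordered region">
      <location><begin position="9"/><end position="10"/></location>
    </feature>
    <feature type="helix" soTermId="SO:0001114">
      <location><begin position="89"/><end position="97"/></location>
    </feature>
    <feature type="strand" soTermId="SO:0001111">
      <location><begin position="123"/><end position="128"/></location>
    </feature>
  </entry>
</predictprotein>
"""


def secondary_structure_rows(xml_text):
    """Return (accession, type, begin, end) tuples for helix and strand features."""
    root = ET.fromstring(xml_text.strip())
    accession = root.findtext("p:entry/p:accession", default="", namespaces=NS).strip()
    rows = []
    for feature in root.iterfind(".//p:feature", NS):
        feature_type = feature.get("type")
        if feature_type not in SECONDARY:
            continue
        begin = feature.find("p:location/p:begin", NS)
        end = feature.find("p:location/p:end", NS)
        if begin is None or end is None:
            continue  # skip features without a begin/end location
        rows.append(
            (accession, feature_type, int(begin.get("position")), int(end.get("position")))
        )
    return rows


for row in secondary_structure_rows(SAMPLE):
    print("\t".join(str(value) for value in row))
```

Filtering on the `type` attribute avoids depending on the exact spelling of the enclosing group element. For real files, `ET.parse(path).getroot()` would replace `ET.fromstring`.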

Wow, it is definitely worth learning how to use XSL! Just one question about the code: in which part do you select the info for the first column (ADRB2_HUMAN)? I don't understand this language yet.

7.3 years ago by Biojl ★ 1.7k

"In which part you select the info for the first column (ADRB2_HUMAN)"

<xsl:value-of select="/p:predictprotein/p:entry/p:accession"/>

7.3 years ago by Pierre Lindenbaum 166k
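
That path is absolute (it starts at the document root) and every step carries the `p:` prefix bound to the PredictProtein namespace, which is why it yields the same accession for each feature. A small Python sketch of the same lookup may make the role of the prefix clearer. The one-line document below is a made-up stand-in for the real file, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

PP_NS = "http://www.predictprotein.org/predictprotein"

# Made-up stand-in for the real file: only the path to the accession matters.
doc = ET.fromstring(
    f'<predictprotein xmlns="{PP_NS}">'
    "<entry><accession>ADRB2_HUMAN</accession></entry>"
    "</predictprotein>"
)

# Without the namespace the elements are not found at all.
without_ns = doc.findtext("entry/accession")

# With a prefix mapping, mirroring xmlns:p="..." in the stylesheet.
with_prefix = doc.findtext("p:entry/p:accession", namespaces={"p": PP_NS})

# The same lookup with the namespace URI written out in braces.
with_braces = doc.findtext(f"{{{PP_NS}}}entry/{{{PP_NS}}}accession")

print(without_ns, with_prefix, with_braces)
```

The prefix letter itself is arbitrary; only the namespace URI it maps to has to match the one used in the XML file.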
