Question

Is it true that there is no parser for InterProScan 5 XML?

11.5 years ago

Michael 56k

I have the output of an InterProScan 5 RC7 run in XML format, but I was unable to find a suitable parser. The BioPerl parser (http://search.cpan.org/~cjfields/BioPerl/Bio/SeqIO/interpro.pm) does not understand it; it seems to support only formats up to version 4. The error message is:

no element found at line 206, column 0, byte 14045 at /opt/local/lib/perl5/site_perl/5.12.3/darwin-thread-multi-2level/XML/Parser.pm line 187

Also, this seems to be a known issue: https://redmine.open-bio.org/issues/3452

I searched for a while but was unable to find:

a parser in any Bio* library, in any language (other than a generic XML parser, of course; please do not recommend generic XML parsing)
an XSLT stylesheet to convert InterProScan 5 RC7 output to, e.g., InterProScan 4 format

The schema definitions are here: http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5 There are two of them: one for RC1-6 and one for RC7.

If you are among the people responsible for this project, maybe you could also explain:

why there was this drastic change to a completely different format
without providing a parser or conversion solution
or without informing the BioPerl/Biopython communities?

Edit: Thank you for proving me wrong; the developers have taken care of the conversion from the beginning. So we are just lacking native BioPerl/Python support.

This is how my file begins. I think this is also a mistake, because the schema is not referenced correctly (nothing points to a correct schema version).


<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">
<protein>
    <sequence md5="alksjfojfjkjhaiuy948iued">MCCXXX ...



score 3 · Answer 1 · 2014-03-04

InterProScan 5 RC7 is rather old now and uses an obsolete version of the InterPro data, so I suggest you upgrade to the current version if possible (see https://code.google.com/p/interproscan/).

You can use the conversion mode (see https://code.google.com/p/interproscan/wiki/InterProScan5ConvertMode) to reformat InterProScan 5 XML into the other InterProScan 5 formats and into the old InterProScan 4 raw format (tab-delimited text), which is used by many tools that supported InterProScan 4. Alternatively, since InterProScan 5 can produce GFF3, you should be able to use modules such as Bio::Tools::GFF to parse the InterProScan 5 GFF3 output and get most of the information that is available in the InterProScan 5 XML.
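To illustrate the GFF3 route, here is a minimal sketch in Python rather than BioPerl, using only the standard library. It assumes the file follows the general GFF3 layout (nine tab-separated columns, `key=value` attributes separated by `;`, and an optional `##FASTA` section at the end). The function names, the sample line and the `PF00000` accession are made up for the example; check which attribute keys your own InterProScan 5 output actually contains.

```python
from urllib.parse import unquote

# The first eight GFF3 columns; the ninth (attributes) is parsed separately.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_attributes(field):
    """Turn 'key=value;key=value' into a dict, percent-decoding keys and values."""
    attributes = {}
    for pair in field.strip().split(";"):
        if "=" not in pair:
            continue  # skips empty entries and the '.' placeholder
        key, value = pair.split("=", 1)
        attributes[unquote(key)] = unquote(value)
    return attributes


def parse_gff3(lines):
    """Yield one dict per feature line; stop at the embedded FASTA section."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line or line.startswith("#"):
            continue  # blank lines, comments and other directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        record = dict(zip(GFF3_COLUMNS, fields[:8]))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(fields[8])
        yield record


# Hypothetical input, only to show the call; real output will have more attributes.
sample = [
    "##gff-version 3",
    "seq1\tPfam\tprotein_match\t10\t120\t1.0E-5\t+\t.\tName=PF00000;status=T",
    "##FASTA",
    ">seq1",
    "MCCXXX",
]
for match in parse_gff3(sample):
    print(match["seqid"], match["source"], match["start"], match["end"],
          match["attributes"].get("Name"))
```

For a real file you would pass an open file handle instead of the list, e.g. `parse_gff3(open("results.gff3"))`, where the file name is again just a placeholder.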

If you have feedback, comments, suggestions or issues for InterProScan 5 please direct them to the authors as detailed in the InterProScan documentation: https://code.google.com/p/interproscan/wiki/InterProScan5Feedback. This will ensure they are aware of the issue, and can prioritise any related work based on feedback from the user community.

For what it is worth... I am aware of support for InterProScan 5 output in:

Blast2GO
Galaxy (see http://toolshed.g2.bx.psu.edu/view/bgruening/interproscan5)
Geneious (see https://bitbucket.org/mthon/interproscanplugin)

So they may be useful options to investigate, depending on what you are attempting to do.