How To Convert Xml Into A Decent Parseable Format?
8
9
Entering edit mode
13.7 years ago
Lyco ★ 2.3k

Please excuse the provocative title, I am well aware that for most of you out there XML is the best thing since sliced bread. However, I am stupid and I don't understand XML - or rather, I do understand the concept but I find it difficult to operate.

Nowadays, many data collections present their data in XML format rather than the old-fashioned tab-delimited files (or tag/value pairs or other formats that are easy to grasp and to parse). I am not talking about data with a horribly complicated internal structure that really requires XML - I am talking about data that would be perfectly suited for a tab-delimited format. Something like

gene1 prediction1 pvalue1 prediction2 pvalue2

Can anybody give advice how I can convert such an XML into something simpler? If possible, something that does not require me to learn SQL/Java/Python. A stand-alone converter would be optimal, but a Perl module would also be helpful. Some web searches pointed me to LibXML.pl but I found this mostly useless.

On a sidenote, it would be great if somebody could explain the rationale for presenting date like the above example in XML format. A simple tab-delimited format allows me to visually inspect the data, to use grep and awk to filter the data, to use cut for extracting subsets, and to do lots of other things that cannot be done in XML. Given the widespread use of XML, there must be a good reason to use it. Or is it just to prevent the stupid wet-lab people from reading the data :-)

xml parsing perl subjective • 20k views
ADD COMMENT
2
Entering edit mode

@Lycopersicon. If you have to write a table, tab delimited is fine. But if you have to write the output of a blast, tab delimited is terrible! XML, on contrary, can do it. However, you could write the XML file wrong (forgetting a [?] would make parser go mad), so you can "validate" the XML to make sure it conforms to what it should look like (in case you make your own blast output, for instance). If it is "valid" everybody knows how to handle it. If tab delimited you forget a tab or put two in a row....

ADD REPLY
1
Entering edit mode

You mean "into another decent parseable format"? Because, provocation aside, an XML document format (with a specific XML-Schema) is a decent parseable (even automatically parseable) format. What you didn't know, is that you tab-delimited output 'format' is actually not a format! That said, this is the main advantage of XML, it allows to specify flexible complex document formats against which the validity of a document can be automatically validated, documents can be parsed and transformed. For the solution Pierre's answer is also demonstrating the advantages of XML.

ADD REPLY
0
Entering edit mode

Hey, this line with the 'decent format' was supposed to be a joke!. But at least I got your attention. Pierre's approach looks interesting and I will see if it gets me anywhere. And sorry if I called 'tab-delimited files' a format. Please remember, I am a biologist, so you have to talk SLOWLY!

I also fail to appreciate the importance of 'validation' that the XML folks talk so much about. To me, XML looks so obfuscated that it NEEDs a validation, while with a simple format you can see right away if its ok. But this is probably just because I don't understand what validation is. Sigh.

ADD REPLY
0
Entering edit mode

@Stefano, for the last 10 years, I have been living happily with blast -m8 (you know what I mean)

ADD REPLY
6
Entering edit mode
13.7 years ago

I'm going to assume you're mostly blowing off steam. I spend lots of time upset at software myself. At one level I feel your pain, since I often prefer simple text formats to XML, provided they're well-defined. However, doing any sufficiently complicated task requires some level of specialized knowledge. Could I walk into your lab, grab some cells out of the hood, dump them into the gel with a sprinkle of antibody, and expect my western to work in 20 minutes? What, you mean there's a protocol? It takes more than 20 minutes? That's so complicated! This stuff should just work!

Tools could always be better, and we're happy to help where we can.

ADD COMMENT
0
Entering edit mode

Actually, it's not so much about steam. Assume that I'am interested in the LOCATE database mentioned above. I wan't to get a list of all proteins predicted to be nuclear by methods X and Y (not caring about method Z). I could use the web server submitting one protein at a time. Or I could download the dataset. If the dataset were a table, I would have the result in 1 second using 1 line of awk (or Excel). But the table is in XML and I assumed that there must be an easy way to extract the required information. I didn't find anything on the web, so I asked here. Seems like I have to learn xslt.

ADD REPLY
0
Entering edit mode

David, would you follow a western-protocol that takes longer if it gives the same end-result? If a tab-file covers 80% of the possible use cases of a data-base it might cover 99% of the real users. I would argue that any database should be available as SQL or RDF/XML AND TAB to be accessible to as many people as possible.

ADD REPLY
0
Entering edit mode

I'm not arguing in favor of XML Everywhere. It has a place in the world, but when someone hands me XML for something that should be a flat file, now I have two problems. Still, a lot of bioinformatics is setting up data, just as a lot of labwork is prepping reagents (breed the mouse, purify the protein, etc). It's part of the process.

ADD REPLY
6
Entering edit mode
13.7 years ago

My answer on A: How To Filter A Blast Xml Output ? does apply here as well

ADD COMMENT
0
Entering edit mode

Great. this (almost) solves my problem! Many thanks !!!

ADD REPLY
5
Entering edit mode
13.7 years ago
Lyco ★ 2.3k

Andra's idea to use biostarlet turned out to be extremely helpful. I must admit that I had problems understanding the documentation at http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html but Andra's example on parsing BLAST output and some trial & error did the job. Mostly, that is.

I would like to summarize what I learned about the use of xmlstarlet, because it might be useful for other dummies like me, who are facing similar problems. For that purpose, let us assume a simple XML file (inspired by the structure of LOCATE, but greatly simplified)


<TEST_doc xmlns:xsi="&lt;a href=" http:="" www.w3.org="" 2001="" XMLSchema-instance"="" rel="nofollow">http://www.w3.org/2001/XMLSchema-instance">
  <ENTRY uid="123456">
    <protein>
      <name>PROT001</name>
      <organism>Human</organism>
      <class>cytoplasmic</class>
    </protein>
    <xrefs>
      <xref>
          <database>Ensembl</database>
          <accn>ENSG00000105829</accn>
      </xref>
    </xrefs>
  </ENTRY>
  <ENTRY uid="45678">
    <protein>
      <name>PROT002</name>
      <organism>Human</organism>
      <class>nuclear</class>
    </protein>
    <xrefs>
      <xref>
          <database>Ensembl</database>
          <accn>ENSG00000105333</accn>
      </xref>
    </xrefs>
  </ENTRY>
</TEST_doc>

This XML file describes two database entries (each of them bracketed by <ENTRY> </ENTRY>). Each entry has a name, an associated organism and a 'class' telling us something about the localization of the protein. Each entry also contains a section called <xrefs> which might contain cross-references to other databases. In this simple example, there is only one such xref per entry.

For following Andra's suggestion, I need the xmlstarlet command-line program, which I could install under Ubuntu by apt-get install xmlstarlet.

For getting a first overview of the XML structure, I have tried

cat simple.xml | xmlstarlet el

which gave the following output (XML structure without data):

TEST_doc
TEST_doc/ENTRY
TEST_doc/ENTRY/protein
TEST_doc/ENTRY/protein/name
TEST_doc/ENTRY/protein/organism
TEST_doc/ENTRY/protein/class
TEST_doc/ENTRY/xrefs
TEST_doc/ENTRY/xrefs/xref
TEST_doc/ENTRY/xrefs/xref/database
TEST_doc/ENTRY/xrefs/xref/accn
TEST_doc/ENTRY
TEST_doc/ENTRY/protein
TEST_doc/ENTRY/protein/name
TEST_doc/ENTRY/protein/organism
TEST_doc/ENTRY/protein/class
TEST_doc/ENTRY/xrefs
TEST_doc/ENTRY/xrefs/xref
TEST_doc/ENTRY/xrefs/xref/database
TEST_doc/ENTRY/xrefs/xref/accn

The next step was to extract the data fields that are of interest to me. With xmlstarlet, this can apparently be done by

cat simple.xml | xmlstarlet sel -t -m //ENTRY -v "concat(field1,' ',field2)" -n

where 'sel -t' tells the program to go into field extraction mode, '-m //ENTRY' means that the subsequent part is applied to each ENTRY (I have no idea what the initial slashes are good for), and the last part indicates the fields that are to be extracted (-n generates a linebreak after each entry). The syntax of the field specification is special: it appears to be a 'relative path' from the bit that has been matched by the -m option. For example, ./protein/name means TEST_doc/ENTRY/protein/name (relative to ENTRY)

The command

cat simple.xml | xmlstarlet sel -t -m //ENTRY -v "concat(./protein/name,' ',./protein/class,' ',./xrefs/xref/database,' ',./xrefs/xref/accn)" -n

results in the output

PROT001 cytoplasmic Ensembl ENSG00000105829
PROT002 nuclear Ensembl ENSG00000105333

which is exactly what I need.

BUT, there remains a little problem. If the XML is a little more complex than the above example, e.g. because each entry has two xref sections pointing to different databases:


<TEST_doc xmlns:xsi="&lt;a href=" http:="" www.w3.org="" 2001="" XMLSchema-instance"="" rel="nofollow">http://www.w3.org/2001/XMLSchema-instance">
  <ENTRY uid="123456">
    <protein>
      <name>PROT001</name>
      <organism>Human</organism>
      <class>cytoplasmic</class>
    </protein>
    <xrefs>
      <xref>
          <database>Ensembl</database>
          <accn>ENSG00000105829</accn>
      </xref>
      <xref>
          <database>UNIPROT</database>
          <accn>Q12345</accn>
      </xref>
    </xrefs>
  </ENTRY>
  <ENTRY uid="45678">
    <protein>
      <name>PROT002</name>
      <organism>Human</organism>
      <class>nuclear</class>
    </protein>
    <xrefs>
      <xref>
          <database>Ensembl</database>
          <accn>ENSG00000105333</accn>
      </xref>
      <xref>
          <database>UNIPROT</database>
          <accn>Q14789</accn>
      </xref>
    </xrefs>
  </ENTRY>
</TEST_doc>

In this case, the same xmlstarlet call will retrieve only the first xref it encounters:

cat complex.xml | xmlstarlet sel -t -m //ENTRY -v "concat(./protein/name,' ',./protein/class,' ',./xrefs/xref/database,' ',./xrefs/xref/accn)" -n

results in the output

PROT001 cytoplasmic Ensembl ENSG00000105829
PROT002 nuclear Ensembl ENSG00000105333

and the xref to the uniprot database is not listed in the output.

After doing some experiments, I found that this problem can be circumvented by matching on 'xref' rather than the more intuitive 'ENTRY'. When specifying the reported fields relative to the match, it is possible to 'go backwards' by using the unix-style '../..' paths.

the command

cat complex.xml | xmlstarlet sel -t -m //xref -v "concat(../../protein/name,' ',../../protein/class,' ',./database,' ',./accn)" -n

gave me the desired output

PROT001 cytoplasmic Ensembl ENSG00000105829
PROT001 cytoplasmic UNIPROT Q12345
PROT002 nuclear Ensembl ENSG00000105333
PROT002 nuclear UNIPROT Q14789

In the real LOCATE database, things are even more complicated, as each entry can have multiple xrefs and also multiple predictions. I guess that in those cases, I will have to do two or more xmlstarlet extractions and combine the resulting tables later by some sed/awk/join magic.

Anyway, I consider my problem solved. Many thanks to all who have answered!

ADD COMMENT
0
Entering edit mode

Oops, I added mine before I saw this.

ADD REPLY
0
Entering edit mode

// mean fit any. Be aware of this. Sometimes nodes have similar names in a different context. [?] [?]1234566[?] [?] [?]ABC[?] [?]87643839AB[?] [?] [?] Selecting //id here would return both identifier, while they have different meaning. In this case use the exact path as you extract with the xmlstarlet el option.

One of the benefits of xml is that the context is there. In tab delimited files you have to guess what the meaning of a value is. In most cases this is given in the header, but one dimensional

ADD REPLY
0
Entering edit mode

I just noticed such a case of ambiguity in the LOCATE database. In my case, I did not circumvent it by using the exact path but by using a bigger (multi-level) part (as in -m //lev1/lev2/lev3). It turns out that the fields had to be specified relative to the deepest level of the match (lev3) - I assume this is always the case.

ADD REPLY
4
Entering edit mode
13.7 years ago

If your input looks like this

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<table>
<row>
    <result>
        <gene>A1</gene>
        <prediction>B1</prediction>
        <p-value>C1</p-value>
    </result>
    <result>
        <gene>A2</gene>
        <prediction>B2</prediction>
        <p-value>C2</p-value>
    </result>
</row>
<row>
    <result>
        <gene>E1</gene>
        <prediction>F1</prediction>
        <p-value>G1</p-value>
    </result>
    <result>
        <gene>E2</gene>
        <prediction>F2</prediction>
        <p-value>G2</p-value>
    </result>
</row>
</table>

You can transform it to TSV with a XSLT stylesheet without any (perl/java/python...) programming:


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:apply-templates select="table/row"/>
</xsl:template>

<xsl:template match="row">
<xsl:apply-templates select="result[1]"/>
<xsl:text>    </xsl:text>
<xsl:apply-templates select="result[2]"/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="result">
<xsl:value-of select="gene"/>
<xsl:text>    </xsl:text>
<xsl:value-of select="prediction"/>
<xsl:text>    </xsl:text>
<xsl:value-of select="p-value"/>
</xsl:template>



</xsl:stylesheet>

:

xsltproc file.xsl file.xml 
A1    B1    C1    A2    B2    C2
E1    F1    G1    E2    F2    G2

I agree you, for this simple kind of data, XML (or JSON) can be overrated. But with XML, you can:

  • validate your input
  • add semantics to your data
  • store some complex data (blast, pubmed record, psmi, etc...)
  • transform a complex input to something else
  • generate a (validating) parser
  • etc...

See also this question: Xml In Bioinformatics, Relevance And Uses

ADD COMMENT
0
Entering edit mode

Hmm, great. But where do I get the 'XSLT stylesheet' from? Am I supposed to read the 100+ page manual that you linked? Imagine me as a slightly retarded biologist who tries to extract some data from a source like http://locate.imb.uq.edu.au which is explicitly targetted at the biologist but apparently requires me to learn about stylesheets, XSLT and other stuff before telling where the proteins are located. Don't worry about LOCATE, this is just an example. I have written a parser in Perl, which took me many hours, only to find that the data wasn't worth it. At least the XML was valid.

ADD REPLY
0
Entering edit mode

Actually, I did read this post before posting, but it didn't really help. You mentioned something called xjc (googling didn't find anything on xjc that would have been intelligible to me). And you talked about 'validation', of which I don't see the point. To people like me, validation means 'doing the same experment twice hoping that the result is reproducible'. How should XML protect me from having invalid data?

ADD REPLY
0
Entering edit mode

xjc generates a java code to parse a defined XML structure.

ADD REPLY
0
Entering edit mode

validation: you might need this to check if your users are sending their data in the right format.

ADD REPLY
0
Entering edit mode

"where do I get the XSLT stylesheet' from" ,"... xjc that would have been intelligible to me", "and you talked about 'validation', of which I don't see the point."... I'm sorry, but if you want to use/understand a tool or a technology, you have to learn how to use it, isn't it ? (or you can ask someone else to do the job).

ADD REPLY
0
Entering edit mode

interestingly, there is an error in the schema for LOCATE http://locate.imb.uq.edu.au/info_files/LOCATE_v6.xsd ( unclosed xsd:attribute )

ADD REPLY
0
Entering edit mode

... and IMHO, it is not possible to "quickly" learn how play with XML.

ADD REPLY
0
Entering edit mode

Pierre, please don't feel offended. You have invested far more work into this stupid question than anybody could ask for. My point is that I as a biologist don't want to learn the technology, I want to see the data in a format that I can use.

Imagine that you get a new digital camera, just to find out that it won't let you look at the pictures but gives you some numerical format instead. When you start complaining, everybody keeps telling you that the numerical format is more powerful because it can store additional information and you can easily see if all pixels are valid.

ADD REPLY
0
Entering edit mode

BTW, thanks for looking into LOCATE. The error in the schema is not the only problem. I am more worried about the errors in the data. Many proteins are located elsewhere than what LOCATE tells me.

ADD REPLY
0
Entering edit mode

no problem , I wasn't offended at all :-)

ADD REPLY
3
Entering edit mode
13.7 years ago

XML is more than tab delimited. From a XML you can produce a tab delimited but not the other way round.

Furthermore, with a tab delimited text, you have to parse each of them knowing the internal structure (how many columns, is there a header, what each column mean...) and you are limited with a table. A very basic structure. Furthermore, if lines are very long they cannot span across several lines... As soon as things become a bit more complicated, some fields of the table store several values, (coma delimited) and some of these are made by values (semicolon delimited). A nightmere.

I am working with XML right now and, in Perl, you can do:

use XML::Simple;
use Data::Dumper;
my $in = XMLin("$xmlFile", KeepRoot => 1);
print Dumper($in);

and you have it parsed in a hash reference and then printed following the structure!

You won't need to parse any text file anymore. Just start from the parsed input.

And this is just the very basic. There are tons of little trick you can do with XML!

Of course everithing has a cost. XML is not so fast, and kind of force you to learn about data structures (which is a good thing!)

ADD COMMENT
0
Entering edit mode

Stefano, Thanks for the lead. If I am not mistaken, I came across XML::Simple when searching the web for a solution, and I even did a few experiments (which failed miserably). I tried to read the documentation, which told me something about working with different 'name spaces' and this got me all confused. I somehow expected that it would be straightforward to have a Perl module for reading a XML file, and then giving me direct access to the different levels of the XML structure, e.g. as in $gene->$prediction->$pvalue It turned out not to be that simple.

ADD REPLY
3
Entering edit mode
13.7 years ago

Uh, I'm kind of on your side on this one. Keep it simple, stupid. XML seems like overkill in this case.

When I've dealt with parsing XML, I've used perl module XML::Simple

Here's a little function someone preceding me wrote in perl to read an XML file into a perl hash. Also uses Data::Dumper. It seems to work.

# Read XML file into a hash. Return 1 if successfully completed
# Check for zero padding at end of file
sub readFileXML
{
    my ($fileName)=@_;
    return undef unless (-f $fileName);
    unless(open (FILE, $fileName))
    {
       warn "Failed to open $fileName: $!"; return undef;
    }
    my $text="";
    while (<FILE>)
    {
      $text.=$_;
    } 
    close (FILE);

    my $l=length($text);

    while (($l!=0)&&(substr($text,$l-1,1) eq "\0"))
    {
       $l--;
    } 

    if ($l==0)
    {
       warn "file $fileName seems to be completely blank";
       return undef;
    } 

    $text=substr($text,0,$l) unless ($l==length($text));

    my $data = $xml->XMLin($text);

    return $data;
} 

$xmlData=readFileXML($summaryFile));

#then you can do
print Dumper($xmlData); #to see what's in it, or just look in the file
my $runFolder = $xmlData->{ChipSummary}->{RunFolder}; #in our case, the illumina Summary.xml files contain these types of fields, yours will be different...
ADD COMMENT
1
Entering edit mode

I love that even out here in Seattle, half the answers I find on here are still from you. Hi Madelaine! :)

ADD REPLY
1
Entering edit mode

Awesome! Hi Summer! Nice seeing you around the internets. Sorry you have to deal with XML.

ADD REPLY
1
Entering edit mode

Also, since this I have used the R package for XML which also works well.

ADD REPLY
2
Entering edit mode
13.7 years ago
Sdk ▴ 20

An engineer once told me that if your VCR (this was a while ago) is blinking 12:00 12:00 12:00 then you're an idiot. If everyone's is blinking 12:00 12:00 12:00 then the engineer is an idiot.

A dry-lab biologist shouldn't expect you to parse that, and you shouldn't expect him (her) to pipette.

ADD COMMENT
0
Entering edit mode
13.7 years ago
Jrbeaman ▴ 30

Yeah, try and put html or javascript code into an XML format, and watch everything blow up. Why do we need to restate the same field identifyer over and over when a tab file does just fine in 1/100 the space?

ADD COMMENT

Login before adding your answer.

Traffic: 3483 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6