XML in Bioinformatics: Relevance and Uses
6
13
13.9 years ago
Thaman ★ 3.3k

Hi,

We all know how important XML is across the web and in applications. As a foundation for creating documents and document systems, XML has many uses in bioinformatics and a real influence on the field. Examples I have encountered so far: PDB XML files and phyloXML.

My main concern is BioXML, which no longer seems to be available; I only found broken links. However, I did find something informative about XML use in bioinformatics applications through Paul Gordon's link [2].

I found this article to be more informative: An XML application for genomic data interoperation.

So my questions are:

  • Is there any BioXML governing body?
  • How has XML facilitated bioinformatics in general?
  • What is the impact and relevance of XML in bioinformatics?
  • Besides Paul Gordon's info, PDB XML files and phyloXML, what am I missing?
  • In practice, how can an XML schema be converted into a biological database?
  • What are the big-hitter BioXML applications that have been developed?
  • For what kinds of biological data can XML be set as a standard file format?

Thank you

xml subjective • 7.6k views
11
13.9 years ago
User 59 13k

I think perhaps you're missing the point of XML use in Bioinformatics. Any data can be represented in an XML format, really. The point is about interoperability, data exchange and parsing.

One of the things we have learned from years of flat-file formats is that our parsers break with every format change in a new release. Lots of tools now expect BLAST output to be in XML format, so that there are no ambiguities in parsing it.

Programmatically, it is certainly a lot easier to interface with an XML document, and so MANY biological resources release some kind of XML representation of their data. This is true for sequence data, proteomics data and microarray data. The representation in XML is generally community-agreed, but it is unlikely to conform to some 'higher' standard or, indeed, to bear any resemblance to any other XML resource, other than the fact that it is XML.

BioXML appears at some point to have been a collection of DTDs for biology. I was not aware of its existence until you mentioned it.
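
As a rough illustration of the point about unambiguous parsing, here is a minimal Python sketch that reads hits from a BLAST XML report using only the standard library. The element names follow the NCBI BlastOutput DTD; the embedded sample document and the function name `iter_hits` are made up for illustration, and a real report contains many more fields.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written fragment in the style of NCBI BLAST XML output.
# The identifiers and values are illustrative only, not real results.
SAMPLE = """<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>example|1</Hit_id>
          <Hit_len>1200</Hit_len>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>example|2</Hit_id>
          <Hit_len>800</Hit_len>
          <Hit_hsps>
            <Hsp><Hsp_evalue>0.002</Hsp_evalue></Hsp>
            <Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def iter_hits(xml_text):
    """Yield (hit_id, hit_len, best_evalue) for every Hit element."""
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        # Each field is addressed by name, so extra or reordered
        # elements in a newer release do not break the lookup.
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        yield hit.findtext("Hit_id"), int(hit.findtext("Hit_len")), min(evalues)


hits = list(iter_hits(SAMPLE))
for hit_id, hit_len, evalue in hits:
    print(hit_id, hit_len, evalue)
```

Because fields are looked up by element name rather than by column position or line layout, this kind of reader tends to survive format additions that would break a flat-file parser.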

0

I should have mentioned how greatly it has helped in parsing.

11
13.9 years ago

How has XML facilitated bioinformatics in general?

From an XML schema, you can quickly generate source code to read and validate your data, e.g.:

xjc -d src -p blast -dtd "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"
xjc "http://www.uniprot.org/docs/uniprot.xsd"

What is the impact and relevance of XML in bioinformatics?

XML is just a format. It can make things faster but, as with any other format, you can't say that it has had an impact on science or bioinformatics.

Besides Paul Gordon's info, PDB XML files and phyloXML, what am I missing?

All the schemas defined by the NCBI: http://www.ncbi.nlm.nih.gov/dtd/

In practice, how can an XML schema be converted into a biological database?

It depends on what kind of information you want to store in the database. If you just want to put the whole XML document in the database, then you can use something like BerkeleyDB-XML or eXist.

If you want to store the content of your XML in a relational database, you can use XSLT to transform the nodes into a set of SQL queries, e.g.: http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/psi2sql.xslt

For what kinds of biological data can XML be a standard file format?
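
The same idea can be sketched without XSLT. The following Python example is a substitute for the XSLT approach, not a port of the linked stylesheet: it walks the nodes of a small XML document and loads them into a relational table with parameterized SQL. The XML layout, the table name and the column names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this example.
SAMPLE = """<proteins>
  <protein acc="ACC1"><name>Example kinase</name><length>350</length></protein>
  <protein acc="ACC2"><name>Example phosphatase</name><length>210</length></protein>
</proteins>"""


def load_proteins(xml_text, conn):
    """Insert one row per <protein> element and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein "
        "(acc TEXT PRIMARY KEY, name TEXT, length INTEGER)"
    )
    rows = [
        (p.get("acc"), p.findtext("name"), int(p.findtext("length")))
        for p in ET.fromstring(xml_text).iter("protein")
    ]
    # Parameterized statements avoid quoting problems that hand-built
    # SQL strings can run into.
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_proteins(SAMPLE, conn)
result = conn.execute("SELECT acc, length FROM protein ORDER BY acc").fetchall()
print(count, result)
```

An XSLT stylesheet has the advantage of being usable from any language with an XSLT processor; the Python version trades that for direct access to the database driver.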

Any data can be stored in RDF! See http://bio2rdf.org

0

That is very descriptive, Pierre, thanks :)

0

Hi Pierre, just a minor point: in your first point you talk about XML schema, but your example uses a DTD. Do you have a schema-based example? I think the future is in XML Schema, not DTDs.

0

Thanks, I hadn't used jax before; I think it's really easy to use.

0

"in your first point you talk about XML schema, but your example uses a DTD". This seems to be from the common misconception that DTDs and "schema" are distinct. In fact, DTDs are a kind of schema. From Wikipedia, "An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents". The misconception comes from the common use of "XML schema" to refer to W3C's XML Schema language, also referred to as "XSD".

0

@Chris, I agree. It is just semantics :-)

7
13.9 years ago
lh3 33k

Firstly, +1 to the two posts above. I have learned a lot. Here is an outsider's view:

  • For what kinds of biological data can XML be set as a standard file format?

XML is an ideal format for presenting complex data of small to medium size (e.g. <1GB). It is much clearer and much less error-prone. I see it as having the potential to become the standard format for phylogenetic trees, BLAST output, GenBank/EMBL, PDB and networks (where it already is). On the other hand, XML may not be appropriate for very large data. It is also a bit of overkill for simple, structureless data that can be represented in TAB-delimited formats.

  • How has XML facilitated bioinformatics in general?

XML usually eases format parsing once you get used to XML. As Daniel said, it avoids a lot of pitfalls in parsing formats, especially those formats without a formal spec (e.g. the standard BLAST output and the output of most programs).

However, as an outsider, I also see the following factors that hamper the adoption of XML in Bioinformatics.

  • Memory. After googling a few XML parser benchmarks (e.g. this one), I have the impression that many streaming XML parsers may still use more memory than the size of the file itself. A few parsers (e.g. libxml2/PULL) implement streaming properly, but they are not the default parsers in Perl and Python.

  • Speed. Without any evidence, I tend to believe parsing XML is slower than parsing a plain text file. Parsing XML is almost certainly slower than parsing specialized binary formats, probably a lot. This could be a concern for large data sets.

  • Other factors may be Unix unfriendliness and technical complexity, but perhaps once we get used to XML, these are not major concerns. I do not know.
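
On the memory point, here is a minimal Python sketch of streaming with the standard library's `xml.etree.ElementTree.iterparse`, which can keep memory use down if each element is cleared once it has been processed. The `<record>` element, its `length` attribute and the sample data are assumptions made for illustration; no claim is made here about how this compares with other parsers.

```python
import io
import xml.etree.ElementTree as ET


def count_long_records(source, min_length):
    """Stream <record> elements; count those with length >= min_length."""
    count = 0
    # iterparse hands back each element when its end tag is read,
    # rather than building the whole tree before returning.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            if int(elem.get("length")) >= min_length:
                count += 1
            # Discard the element's children, text and attributes so
            # that processed records hold little memory.
            elem.clear()
    return count


# Hypothetical input, invented for this example.
SAMPLE = b"""<records>
  <record id="r1" length="120"/>
  <record id="r2" length="3400"/>
  <record id="r3" length="980"/>
</records>"""

n = count_long_records(io.BytesIO(SAMPLE), 500)
print(n)
```

Note that the root element still keeps references to the emptied children, so very large files may need further cleanup; this is only a sketch of the general pattern.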

EDIT: Let me elaborate more on the last bullet.

  • Unix unfriendliness. I know there are tools to convert XML to line-based formats (I have used them). But when we want to open multiple XML files without creating temporary files, it becomes a little painful, though solvable.

  • Technical complexity. I certainly do not mean that using a DOM parser is complex, but using a SAX/StAX/PULL parser is more complicated. The other thing I mean by "complexity" is that XML is overkill for very simple data.
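
For the Unix-friendliness point, one common workaround is to flatten XML into tab-delimited lines that tools such as `grep`, `cut` and `sort` can consume. A minimal Python sketch follows; the `<gene>` element and its attributes are hypothetical, and the function name `xml_to_tsv` is invented here.

```python
import io
import xml.etree.ElementTree as ET


def xml_to_tsv(source, tag, fields):
    """Yield one tab-delimited line per <tag> element.

    `fields` is a list of attribute names; a missing attribute
    becomes an empty column.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield "\t".join(elem.get(f, "") for f in fields)
            elem.clear()


# Hypothetical input, invented for this example.
SAMPLE = b"""<genes>
  <gene id="g1" chrom="chr1" start="100" end="900"/>
  <gene id="g2" chrom="chr2" start="50"/>
</genes>"""

lines = list(xml_to_tsv(io.BytesIO(SAMPLE), "gene", ["id", "chrom", "start", "end"]))
for line in lines:
    print(line)
```

Reading from standard input instead of an in-memory buffer would let a script like this sit in the middle of a shell pipeline, which avoids temporary files for a single input; combining several XML inputs still takes extra plumbing.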

1

Unix unfriendliness: xsltproc is your best XML friend http://xmlsoft.org/XSLT/xsltproc2.html :-)

0

+1, except for your last bullet point. 'Unix unfriendliness' is just ill-defined. And technical complexity? That is simply not true anymore (see Pierre's answer); I don't really get why you insist on that.

0

+1 Well mentioned about speed, memory and complexity.

0

+1 for the last edit

7
13.9 years ago

Most of the caBIO/caBIG tools at NCI are XML-based.

One of the best implementations of XML as a biological markup is Tommy Liu's Stanford HIVdb Sierra web service, in which you submit drug-resistant HIV sequences programmatically and get back well-structured reports.

I think XML for Bioinformatics by Ethan Cerami is an underrated book on this subject. Despite being roughly 6 years old, it still discusses stuff like Axis!

I wrote some errata for this book a while back.

A rival to XML is YAML.

1
12.9 years ago
Hamish ★ 3.3k

For a list of bioinformatics XML formats see: http://www.ebi.ac.uk/Tools/webservices/tutorials/aa_xml_formats

The use of XML has not solved the problem of proliferating data formats. Although XSLT helps with format conversion, there are issues with formats that don't quite map onto each other. There is also the issue that many common tools don't support XML as input, even if they support it for output.

0
12.9 years ago
Chris Maloney ▴ 360

I'd suggest that this question is a little bit like asking "Computer files in Bioinformatics -- relevance and uses." XML formats are ubiquitous nowadays. Any attempt to list or catalog all the XML formats used in bioinformatics is obsolete before it's ever finished.
