Parse Ncbi Eutils Xml To Csv
2
1
Entering edit mode
12.1 years ago

Do tools exist for parsing specifically NCBI Eutils XML to CSV?

This is related to tools parsing NCBI blast -m 7 xml output format?

An example file is available at : https://raw.github.com/low-decarie/stack_examples/master/ncbi_example.xml

I would like to associate the accession numbers with basic metadata on the study (title,alt_elev,environment,envo_biome,envo_feature,lat,lon,lat_long, number of reads).

xml eutils ncbi • 4.8k views
ADD COMMENT
1
Entering edit mode

your XML for SRA is a structured/complex document: it cannot be converted to CSV unless you explain us what kind of CSV data is expected, or what you want to achieve with it.

ADD REPLY
2
Entering edit mode
12.1 years ago

Given that this is SRA xml, you might consider using SRAdb (Bioconductor). There is a separately packaged sqlite database into which we have placed all the SRA metadata. You can use the R package to do the queries or simply use any sqlite client on the same sqlite file. Documentation for the R package includes multiple example queries.

http://www.bioconductor.org/packages/release/bioc/html/SRAdb.html

ADD COMMENT
0
Entering edit mode

That is excellent! Wish I had know of this sooner, but I currently have all the XML files on hand and would like to parse those (into a csv or into an sqlite db would be fine. Thank you

ADD REPLY
2
Entering edit mode
12.0 years ago

Inserting a complex XML like this is complicated. You should rather keep the whole XML structure and query it with xslt or xquery. However, here is a XSLT stylesheet as an example that would insert 3 tables in sqlite3. Morevover, the single quotes in the TEXT field should be escaped.

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'
>
<xsl:output method="text"/>
<xsl:template match="/">
create table if not exists EXPERIMENT
(
id INTEGER PRIMARY KEY AUTOINCREMENT,
alias TEXT,
center_name TEXT,
accession TEXT
);
create table if not exists DESIGN
(
id INTEGER PRIMARY KEY AUTOINCREMENT,
experiment_id INTEGER
);
create table if not exists LIBRARY_DESCRIPTOR
(
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT,
design_id INTEGER
);
BEGIN TRANSACTION;
<xsl:apply-templates select="EXPERIMENT_PACKAGE_SET/EXPERIMENT_PACKAGE/EXPERIMENT"/>
COMMIT TRANSACTION;
</xsl:template>
<xsl:template match="EXPERIMENT">
insert into EXPERIMENT(alias,center_name,accession) values
('<xsl:value-of select='@alias'/>','<xsl:value-of select='@center_name'/>','<xsl:value-of select='@accession'/>');
<xsl:apply-templates select="DESIGN"/>
</xsl:template>
<xsl:template match="DESIGN">
insert into DESIGN(experiment_id) select MAX(id) from EXPERIMENT;
<xsl:apply-templates select="LIBRARY_DESCRIPTOR"/>
</xsl:template>
<xsl:template match="LIBRARY_DESCRIPTOR">
insert into LIBRARY_DESCRIPTOR(name,design_id) select '<xsl:value-of select='LIBRARY_NAME'/>',MAX(id) from DESIGN;
</xsl:template>
</xsl:stylesheet>

Usage:

xsltproc sra2sqlite.xsl sra.xml | sqlite3 sra.sqlite

.

~$ sqlite3 sra.sqlite ' select * from EXPERIMENT limit 10'
1|NSZ2O1P00|MCDB, U of Colorado|SRX159081
2|NSZ1O1P00|MCDB, U of Colorado|SRX159080
3|NSY2O1P00|MCDB, U of Colorado|SRX159079
4|NSU3O1P00|MCDB, U of Colorado|SRX159078
5|NSU2O1P00|MCDB, U of Colorado|SRX159077
6|NST2O1P00|MCDB, U of Colorado|SRX159076
7|NST1O1P00|MCDB, U of Colorado|SRX159075
8|NSQ3O1P00|MCDB, U of Colorado|SRX159074
9|NSP3O1P00|MCDB, U of Colorado|SRX159073
10|NSP1O1P00|MCDB, U of Colorado|SRX159072
ADD COMMENT
0
Entering edit mode

Oracle/Berkeley XML db is quite useful for storing and querying xml. Here is a reasonably nice (though dated) walkthrough:

http://www.xml.com/pub/a/2008/05/07/under-the-hood-oracle-berkeley-db-xml.html

ADD REPLY
0
Entering edit mode

@Sean I tested BerkeleyXML and eXist: the second was lighter and easier to use.

ADD REPLY

Login before adding your answer.

Traffic: 1781 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6