Help with analyzing NCBI tissue expression data - Solr XML file
I'm trying to access the complete NCBI tissue expression dataset. When you look at any individual gene, NCBI provides an
expression chart showing RNA-seq counts across tissues. For example, https://www.ncbi.nlm.nih.gov/gene/6304
There you can see higher expression in brain and lymph nodes.
I contacted NCBI, and they told me the data for all genes is accessible here: https://ftp.ncbi.nih.gov/gene/DATA/expression/
The data is a giant XML file formatted for an Apache Solr database. They provide a schema file to help read the data.
However, my first attempt at loading the data into Solr failed completely. Has anyone set up scripts for loading and querying this data?
solr
ncbi
tissue expression
xml
RNA-Seq
The XML file is not well-formed: it has no XML root element.
Download it: wget "https://ftp.ncbi.nih.gov/gene/DATA/expression/Mammalia/Homo_sapiens/PRJEB2445_GRCh38.p2_107_expression.xml.gz"
Fix the XML by adding a root element:
(echo "<root>" && gunzip -c PRJEB2445_GRCh38.p2_107_expression.xml.gz && echo "</root>" ) > tmp.xml
Process it with an XSLT stylesheet (saved here as biostar492866.xsl) to generate a table. This is slow and memory-consuming.
xsltproc biostar492866.xsl tmp.xml
Output:
entropy exp_Mcount exp_rpkm exp_total full_rpkm gene id is_metadata is_sample project_desc sample_id source_name sra_id taxid var
16177.9 metadata_9606_SAMEA962332 true true PRJEB2445 SAMEA962332 thyroid ERS025090 9606
16970.1 metadata_9606_SAMEA962333 true true PRJEB2445 SAMEA962333 testes ERS025094 9606
17645.9 metadata_9606_SAMEA962334 true true PRJEB2445 SAMEA962334 prostate ERS025095 9606
15620.7 metadata_9606_SAMEA962335 true true PRJEB2445 SAMEA962335 liver ERS025096 9606
17816.5 metadata_9606_SAMEA962336 true true PRJEB2445 SAMEA962336 white blood cells ERS025091 9606
24649.8 metadata_9606_SAMEA962337 true true PRJEB2445 SAMEA962337 16 tissues mixture ERS025093 9606
17701.4 metadata_9606_SAMEA962338 true true PRJEB2445 SAMEA962338 lung ERS025099 9606
19777.8 metadata_9606_SAMEA962339 true true PRJEB2445 SAMEA962339 adipose ERS025098 9606
18398.2 metadata_9606_SAMEA962340 true true PRJEB2445 SAMEA962340 breast ERS025088 9606
However, the best way to process such a big XML file is to use a StAX or SAX parser. (See: A: Is There Any Tool To Extract Demanded Information From An Asn/Xml File? ; Convert XML file to FASTA ... )
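As a rough illustration of the streaming approach, here is a minimal Python sketch. It does not use SAX or StAX proper, but the standard library's XMLPullParser, which is event-based in the same spirit. It feeds the missing root element to the parser on the fly, so no tmp.xml is needed. The assumptions are mine and should be checked against the schema file: that each record is a <doc> element whose values sit in <field name="..."> children (the usual Solr layout), and that a field name occurs at most once per record. The function names iter_docs, write_tsv and convert are made up for this example; the column list is copied from the output header above.

```python
import csv
import gzip
from xml.etree.ElementTree import XMLPullParser

# Column names copied from the xsltproc output header shown above.
COLUMNS = [
    "entropy", "exp_Mcount", "exp_rpkm", "exp_total", "full_rpkm", "gene",
    "id", "is_metadata", "is_sample", "project_desc", "sample_id",
    "source_name", "sra_id", "taxid", "var",
]


def iter_docs(stream, chunk_size=1 << 16):
    """Yield one dict per <doc> element read from a root-less XML byte stream.

    Assumption: records look like <doc><field name="...">value</field>...</doc>.
    """
    parser = XMLPullParser(events=("start", "end"))
    parser.feed(b"<root>")  # supply the missing root element on the fly
    root = None

    def drain():
        nonlocal root
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the first element seen is our synthetic root
            elif elem.tag == "doc":
                # If a field name repeats within a record, the last value wins.
                yield {f.get("name"): (f.text or "") for f in elem.iter("field")}
                root.clear()  # detach finished records so memory stays bounded

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        yield from drain()
    parser.feed(b"</root>")
    yield from drain()
    parser.close()


def write_tsv(docs, out, columns=COLUMNS):
    """Write records as tab-separated text, leaving absent fields empty."""
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t",
        restval="", extrasaction="ignore", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(docs)


def convert(path, out):
    """Stream a gzipped expression file straight into a table."""
    with gzip.open(path, "rb") as fh:
        write_tsv(iter_docs(fh), out)
```

Calling convert("PRJEB2445_GRCh38.p2_107_expression.xml.gz", sys.stdout) after importing sys should print a table with the same columns as the XSLT output. To keep only one gene, you could filter the records from iter_docs on doc.get("gene") before writing, assuming that field holds the GeneID.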
Wow that worked perfectly! Thanks.
To close the question, please validate my answer by clicking the green tick on the left.