Is There Any Tool To Extract Specific Information From An ASN.1/XML File?
14.2 years ago
Jdk ▴ 20

I have downloaded Homo_sapiens.ags.gz from NCBI's FTP site (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/Mammalia/). Homo_sapiens.ags contains all the information included in a gene's full report (e.g. MTF1's full report) for every human gene. Homo_sapiens.ags is now in XML format, for example:

gene {
locus "A1BG" ,
desc "alpha-1-B glycoprotein" ,
maploc "19q13.4" ,
db {
  {
    db "HGNC" ,
    tag
      id 5 } ,
  {
    db "Ensembl" ,
    tag
      str "ENSG00000121410" } ,
  {
    db "HPRD" ,
    tag
      str "00726" } ,
  {
    db "MIM" ,
    tag
      id 138670 } } ,
syn {
  "A1B" ,
  "ABG" ,
  "GAB" ,
  "HYST2477" ,
  "DKFZp686F0970" } } ,

Is there a convenient tool for extracting one field, such as 'maploc', for every gene? Ideally I would only supply the field name 'maploc', and the tool would write a file containing all gene names and their corresponding maploc values.

xml human gene extraction • 5.2k views

This is not XML, but Abstract Syntax Notation One (ASN.1).

14.2 years ago

Last week I wrote a complete tutorial "Dumping NCBI Gene as XML: my notebook".

To extract a subpart of an XML file use XSLT (See also this post).

If the XML is too large to load in one go, you can extract the information using a StAX or a SAX parser, e.g. here.
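
As a rough illustration of the streaming idea, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree.iterparse` (an event-based parser, swapped in here for StAX/SAX). The element names `gene`, `locus` and `maploc` are assumptions taken from the field names used in this thread; the names in a real NCBI XML dump may well differ, so they are exposed as parameters.

```python
import io
import xml.etree.ElementTree as ET


def iter_maploc(source, gene_tag="gene", locus_tag="locus", maploc_tag="maploc"):
    """Yield (locus, maploc) pairs while streaming through an XML source.

    `source` is a file name or a binary file object. The tag names are
    assumptions, not the confirmed NCBI schema.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != gene_tag:
            continue
        # Look only at direct children of the finished <gene> element.
        locus = elem.findtext(locus_tag)
        maploc = elem.findtext(maploc_tag)
        if locus and maploc:
            yield locus, maploc
        # Drop the children of the finished element to limit memory use.
        elem.clear()


# Tiny in-memory document standing in for a large file (hypothetical layout).
demo = io.BytesIO(
    b"<genes><gene><locus>A1BG</locus><maploc>19q13.4</maploc></gene></genes>"
)
for locus, maploc in iter_maploc(demo):
    print(locus, maploc, sep="\t")
```

For a real file you would pass its path instead of the `io.BytesIO` object and write the pairs to a tab-delimited file.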

If the file is not XML but ASN.1, as you wrote in your question, you can either use the NCBI toolkit to extract the information (this won't be easy, as you'll have to learn the NCBI API) or write a short ASN.1 parser for your "gene" grammar. See here
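
To give an idea of how small such a parser can be when only one field is needed, here is a hedged Python sketch. It is not a real ASN.1 parser: it assumes the data is in the text form shown in the question, that each `gene {` block has `locus "..."` and `maploc "..."` on their own lines, and that those string values do not wrap across lines. The function name `extract_maploc` is made up for this example.

```python
import re

# Matches a quoted field such as:   locus "A1BG" ,
FIELD_RE = re.compile(r'^\s*(locus|maploc)\s+"([^"]*)"')

# Excerpt of the text shown in the question.
SAMPLE = '''gene {
locus "A1BG" ,
desc "alpha-1-B glycoprotein" ,
maploc "19q13.4" ,
db {
  {
    db "HGNC" ,
    tag
      id 5 } } }
'''


def extract_maploc(lines):
    """Yield (locus, maploc) pairs from lines of ASN.1 text (line-based heuristic)."""
    locus = None
    for line in lines:
        if line.strip().startswith("gene {"):
            locus = None  # a new gene block starts: forget the previous locus
            continue
        match = FIELD_RE.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "locus":
            locus = value
        elif locus is not None:  # key == "maploc" and a locus was seen first
            yield locus, value
            locus = None


for locus, maploc in extract_maploc(SAMPLE.splitlines()):
    print(locus, maploc, sep="\t")
```

Genes without a maploc line are simply skipped. Anything beyond this kind of flat field lookup is better handled by the NCBI toolkit or a proper grammar.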

Hope it helps
Pierre

14.2 years ago

That file is in ASN.1 format, which can be a bit difficult to parse and deal with. You'd likely get better answers to your question if you described what you are trying to achieve and at what level you want to attack the problem. For instance, are you a programmer? If so, what languages do you use? What biological information are you interested in?

Based on your query alone, a good approach might be to try the BioMart tool at Ensembl. This provides a web interface to build up and retrieve the type of data you are interested in. For example, here is a query to get gene names, chromosomes and map locations. Click on Results to see the table of information. You can download in tab delimited format for further manipulation.

14.2 years ago

Because ASN.1 and XML share a hierarchical structure, I have found it very useful to parse ASN.1 data as XML in memory and then query it using XPath. PHP's SimpleXML library is my favorite for processing XML.
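
The same in-memory XPath idea can be sketched in Python (used here instead of PHP purely for illustration) with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The element names and the `NOMAP` entry are hypothetical; a real NCBI XML dump is likely to use different names.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the second gene deliberately has no maploc.
XML_TEXT = """<genes>
  <gene><locus>A1BG</locus><maploc>19q13.4</maploc></gene>
  <gene><locus>NOMAP</locus></gene>
</genes>"""


def maploc_table(xml_text):
    """Return (locus, maploc) pairs for every <gene> that has a <maploc> child."""
    root = ET.fromstring(xml_text)
    # ".//gene[maploc]" selects gene elements at any depth with a maploc child.
    return [
        (gene.findtext("locus"), gene.findtext("maploc"))
        for gene in root.findall(".//gene[maploc]")
    ]


for locus, maploc in maploc_table(XML_TEXT):
    print(locus, maploc, sep="\t")
```

This loads the whole document into memory, so it only suits files that fit comfortably in RAM.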

13.8 years ago

For this kind of task I use xmlstarlet. To get the maploc of every gene you would typically do:

xmlstarlet sel -t -m "//maploc" -v . -o ": " -v ../locus -n Homo_sapiens.ags.xml

The output would be: 19q13.4: A1BG.

I have also used xmlstarlet to learn XSLT, since adding -C prints the XSLT stylesheet that does the same job:

xmlstarlet sel -C -t -m "//maploc" -v . -o ": " -v ../locus -n Homo_sapiens.ags.xml