Entrez xtract "unrecognized argument '-match'" error
10.1 years ago
Nancy Ouyang ▴ 170

I can't figure out how to use the "-match" syntax even after reading all the documentation I could find. I get these errors:

$ cat xml.txt | xtract -pattern Gene-commentary -match Gene-commentary_type:1

Unrecognized argument '-match'
No -element before 'Gene-commentary_type:1'

$ cat xml.txt | xtract -element Gene-commentary -match Gene-commentary_type:1

Unrecognized argument '-match'
No -element before 'Gene-commentary_type:1'

What am I doing wrong?


What I am trying to do is pull the accession of the reference sequence and the coordinates of the region for a given entry in NCBI Gene (see "Retrieve all FASTA RefSeq files for a given entry in NCBI gene?") so that I can run efetch -format fasta with -seq_start and -seq_stop and get the appropriate sequence.
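
To make the goal concrete, here is a minimal Python sketch of that last step: turning an accession and an interval into an efetch command line. The helper name efetch_fasta_command is hypothetical, and the +1 shift is an assumption: it supposes that Seq-interval_from/Seq-interval_to are 0-based while efetch's -seq_start/-seq_stop are 1-based, which is worth checking against a known sequence before relying on it.

```python
import shlex


def efetch_fasta_command(accession, seq_from, seq_to, zero_based=True):
    """Build an efetch argument list for a sub-region (hypothetical helper).

    Assumption: the Seq-interval coordinates are 0-based and efetch
    expects 1-based positions, so both ends are shifted by one.
    """
    offset = 1 if zero_based else 0
    return [
        "efetch", "-db", "nuccore", "-id", accession, "-format", "fasta",
        "-seq_start", str(seq_from + offset),
        "-seq_stop", str(seq_to + offset),
    ]


# Values taken from the RefSeqGene entry shown in the XML excerpt.
cmd = efetch_fasta_command("NG_005905", 92500, 173688)
print(shlex.join(cmd))
```

The list form can be passed straight to subprocess.run; shlex.join (Python 3.8+) is only used here to show the command as it would be typed in a shell.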

I could parse the XML in Python to do it, but it really seems like I should be able to do this in "one line" using Entrez Direct, if only I could get -match to work :/

Here is what the XML looks like:

Say I have a gene record in XML

epost -db gene -id 672 | efetch -format xml > xml.txt

According to the outline,

cat xml.txt | xtract -outline

    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference assembly</Gene-commentary_heading>
      <Gene-commentary_label>RefSeqGene</Gene-commentary_label>
      <Gene-commentary_accession>NG_005905</Gene-commentary_accession>
      <Gene-commentary_version>2</Gene-commentary_version>
      <Gene-commentary_seqs>
        <Seq-loc>
          <Seq-loc_int>
            <Seq-interval>
              <Seq-interval_from>92500</Seq-interval_from>
              <Seq-interval_to>173688</Seq-interval_to>

I have read:

Attempting To Utilise The New Entrez Direct Package But Having Difficulty With Pubmed And Nucleotide Xml Parsing

http://www.ncbi.nlm.nih.gov/books/NBK179288/ (I followed these instructions to install it, so "which epost" returns ~/edirect/epost)

http://www.ncbi.nlm.nih.gov/news/02-06-2014-entrez-direct-released/?campaign=facebook-02072014

http://elane.stanford.edu/laneconnex/public/media/documents/EntrezDirect.pdf

entrez ncbi • 3.7k views
10.1 years ago
hpmcwill ★ 1.2k

From a bit of experimentation with 'xtract', it appears that the order of the command-line arguments matters: a '-match' must be followed by an '-element' option. The missing '-element' appears to be the source of the error messages you received.

Using just 'xtract' the closest I've gotten so far is:

cat xml.txt | edirect/xtract \
  -pattern Gene-commentary \
  -match 'Gene-commentary_type:1' \
  -element 'Gene-commentary_accession' Seq-interval

You may be able to further anchor the patterns to make the extraction more specific.
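
For comparison, the same selection can be sketched in Python with the standard library. This is not what xtract does internally; it is only an illustration of the filter, run on a hypothetical, trimmed-down sample shaped like the excerpt in the question (the second Gene-commentary is made up to show a non-matching record).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one matching record (type 1) and one that should be skipped.
SAMPLE = """\
<Entrezgene_locus>
  <Gene-commentary>
    <Gene-commentary_type value="genomic">1</Gene-commentary_type>
    <Gene-commentary_accession>NG_005905</Gene-commentary_accession>
    <Gene-commentary_seqs>
      <Seq-loc>
        <Seq-loc_int>
          <Seq-interval>
            <Seq-interval_from>92500</Seq-interval_from>
            <Seq-interval_to>173688</Seq-interval_to>
          </Seq-interval>
        </Seq-loc_int>
      </Seq-loc>
    </Gene-commentary_seqs>
  </Gene-commentary>
  <Gene-commentary>
    <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
    <Gene-commentary_accession>NM_007294</Gene-commentary_accession>
  </Gene-commentary>
</Entrezgene_locus>
"""

# Explicit path to the record's own intervals, so intervals that belong to
# nested commentaries deeper in the tree are not picked up by accident.
INTERVAL_PATH = "Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval"


def genomic_regions(root):
    """Yield (accession, from, to) for each Gene-commentary whose type is 1."""
    for gc in root.iter("Gene-commentary"):
        if gc.findtext("Gene-commentary_type") != "1":
            continue
        accession = gc.findtext("Gene-commentary_accession")
        for interval in gc.findall(INTERVAL_PATH):
            yield (
                accession,
                int(interval.findtext("Seq-interval_from")),
                int(interval.findtext("Seq-interval_to")),
            )


regions = list(genomic_regions(ET.fromstring(SAMPLE)))
for accession, start, end in regions:
    print(accession, start, end, sep="\t")
```

On the real record, ET.parse("xml.txt").getroot() would replace ET.fromstring(SAMPLE), assuming xml.txt holds the efetch output from the question.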


Thanks for answering my specific question! It's weird their error message says "before" :/ I wasn't sure who to give the checkmark to, but I think Pierre Lindenbaum answered my actual question.

10.1 years ago

Using a good old XSLT stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="/Entrezgene-Set/Entrezgene/Entrezgene_locus">
      <xsl:for-each select="Gene-commentary[Gene-commentary_type/@value='genomic' and Gene-commentary_type/text()='1']">
        <xsl:variable name="acn">
          <xsl:value-of select="concat('(',Gene-commentary_heading,')',Gene-commentary_accession)"/>
        </xsl:variable>
        <xsl:for-each select="Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval">
          <xsl:value-of select="concat($acn,':',Seq-interval_from,'-',Seq-interval_to)"/>
          <xsl:text>
</xsl:text>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

run:

xsltproc --novalid transform.xsl "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=672&retmode=xml"

output:

(Reference GRCh38 Primary Assembly)NC_000017:43044294-43125482
(Reference assembly)NG_005905:92500-173688
(Alternate CHM1_1.1)NC_018928:41431850-41513017
(Alternate HuRef)AC_000149:36962662-37043808
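
For readers who do not use XSLT, here is a rough Python rendering of the same logic: the same path, the same test on both the value attribute and the text, and the same output format. It is a sketch on a hypothetical one-record sample, not a drop-in replacement for the stylesheet, and the function name format_regions is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the nesting the stylesheet expects.
SAMPLE = """\
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_locus>
      <Gene-commentary>
        <Gene-commentary_type value="genomic">1</Gene-commentary_type>
        <Gene-commentary_heading>Reference assembly</Gene-commentary_heading>
        <Gene-commentary_accession>NG_005905</Gene-commentary_accession>
        <Gene-commentary_seqs>
          <Seq-loc><Seq-loc_int><Seq-interval>
            <Seq-interval_from>92500</Seq-interval_from>
            <Seq-interval_to>173688</Seq-interval_to>
          </Seq-interval></Seq-loc_int></Seq-loc>
        </Gene-commentary_seqs>
      </Gene-commentary>
    </Entrezgene_locus>
  </Entrezgene>
</Entrezgene-Set>
"""


def format_regions(root):
    """Return '(heading)accession:from-to' lines, mirroring the stylesheet."""
    lines = []
    # root is the Entrezgene-Set element, so the path starts below it.
    for gc in root.findall("Entrezgene/Entrezgene_locus/Gene-commentary"):
        gc_type = gc.find("Gene-commentary_type")
        if gc_type is None:
            continue
        # Same condition as the XPath predicate: @value='genomic' and text()='1'.
        if gc_type.get("value") != "genomic" or (gc_type.text or "").strip() != "1":
            continue
        label = "({}){}".format(
            gc.findtext("Gene-commentary_heading", ""),
            gc.findtext("Gene-commentary_accession", ""),
        )
        path = "Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval"
        for interval in gc.findall(path):
            lines.append("{}:{}-{}".format(
                label,
                interval.findtext("Seq-interval_from"),
                interval.findtext("Seq-interval_to"),
            ))
    return lines


result = format_regions(ET.fromstring(SAMPLE))
print("\n".join(result))
```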

Hah, my initial reaction was "what sorcery is this?" Neat, I'd never heard of XSLT before. Thanks!
