Entrez Direct E-utilities - using match and xtract to filter by data value
8.1 years ago
al-ash ▴ 210

I'm trying to use Entrez Direct to extract the "Gene-commentary_accession" information from an XML file using:

esearch -db gene -query XP_003399880.1| efetch -format xml | xtract -pattern Gene-commentary  -match Gene-commentary_type:1 -element Gene-commentary_accession > Bter_FAR_genome_shotgun_sequences2.txt

An example of the XML file (shortened):

  <Entrezgene_locus>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference Bter_1.0</Gene-commentary_heading>
      <Gene-commentary_label>Chromosome LG B12 Reference Bter_1.0</Gene-commentary_label>
      <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
      <Gene-commentary_version>1</Gene-commentary_version>
      <Gene-commentary_seqs>
        <Seq-loc>
          <Seq-loc_int>
            <Seq-interval>
              <Seq-interval_from>7277254</Seq-interval_from>
              <Seq-interval_to>7286174</Seq-interval_to>
              <Seq-interval_strand>
                <Na-strand value="minus"/>
              </Seq-interval_strand>
              <Seq-interval_id>
                <Seq-id>
                  <Seq-id_gi>339751241</Seq-id_gi>
                </Seq-id>
              </Seq-interval_id>
            </Seq-interval>
          </Seq-loc_int>
        </Seq-loc>
      </Gene-commentary_seqs>
      <Gene-commentary_products>
        <Gene-commentary>
          <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
          <Gene-commentary_heading>Reference</Gene-commentary_heading>
          <Gene-commentary_label>transcript variant X1</Gene-commentary_label>
          <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
          <Gene-commentary_version>2</Gene-commentary_version>
          <Gene-commentary_genomic-coords>

I'd like to retrieve only the genomic accession using the -match command, but I keep extracting other Gene-commentary_accession values as well, such as the one for "mRNA". Could you help me with the correct syntax?

(I find the use of -match quite difficult to understand from NCBI's documentation on this topic (https://www.ncbi.nlm.nih.gov/books/NBK179288/), so another example on Biostars might also help others with a similar question.)

Entrez Direct E-utilities match xtract • 4.3k views

8.0 years ago
DCGenomics ▴ 330

Xtract 5.50, part of today's EDirect release, has better methods for handling recursive objects, with two specific improvements:

1) Nested exploration (e.g., "*/Gene-commentary") masks deeper objects from being seen by the -element selection command. It is no longer necessary to use -first instead of -element to exclude information from lower levels.

2) Recursive exploration (e.g., "**/Gene-commentary") flattens the recursive structure, visiting every indicated object regardless of depth. The same -element masking applies here.
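
To make the two ideas above concrete, here is a minimal Python sketch of "visit every Gene-commentary regardless of depth, but hide nested Gene-commentary objects when reading fields from the current one". It is only an illustration of the behaviour described, not xtract's actual implementation; the SAMPLE string is a trimmed version of the XML from the question, and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the XML in the question (assumption: only the
# fields needed for the illustration are kept).
SAMPLE = """
<Entrezgene_locus>
  <Gene-commentary>
    <Gene-commentary_type value="genomic">1</Gene-commentary_type>
    <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
    <Gene-commentary_products>
      <Gene-commentary>
        <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
        <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
      </Gene-commentary>
    </Gene-commentary_products>
  </Gene-commentary>
</Entrezgene_locus>
""".strip()


def masked_fields(node, tag):
    """Yield descendants of node, without descending into nested tag objects."""
    for child in node:
        if child.tag == tag:
            continue  # nested object of the same kind: masked at this level
        yield child
        yield from masked_fields(child, tag)


def genomic_accessions(root, tag="Gene-commentary"):
    """Collect accessions of every tag object whose own type is 'genomic'."""
    found = []
    for obj in root.iter(tag):  # recursive: every object, any depth
        fields = list(masked_fields(obj, tag))
        types = [f.get("value") for f in fields if f.tag == tag + "_type"]
        if "genomic" in types:
            found += [f.text for f in fields if f.tag == tag + "_accession"]
    return found


print(genomic_accessions(ET.fromstring(SAMPLE)))  # ['NC_015773']
```

The mRNA accession is skipped because its Gene-commentary is visited as a separate object with its own type, rather than being seen through its genomic parent.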

In addition, the -match and -avoid commands, along with the "object:value" selection construct, have been deprecated, so that colon can be used to indicate namespace prefixes.

Conditional execution now uses -if and -unless commands, and has compound statements for string comparison (e.g., -contains) or numeric comparison (e.g., -lt).

Retrieving genomic accessions from Bombus terrestris can be done with:

   esearch -db gene -query XP_003399880.1 |
   efetch -format xml |
   xtract -pattern Entrezgene -block "**/Gene-commentary" \
     -if Gene-commentary_type@value -equals genomic \
       -tab "\n" -element Gene-commentary_accession |
   sort | uniq

This returns two accessions:

   AELG01001811
   NC_015773

Note that the efetch.fcgi "id" argument should have rejected a non-integer (accession) value sent to the gene database. This oversight has been reported to the program's maintainers. EDirect's efetch front-end now issues an error message if an accession is passed to -id and the -db argument is not a sequence database.
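
The kind of check described above can be sketched in a few lines of Python. This is a hypothetical guard, not EDirect's code: the function name check_id and the SEQUENCE_DBS set are assumptions made for the example, and the real list of sequence databases may differ.

```python
# Assumption: an illustrative subset of sequence databases, not EDirect's list.
SEQUENCE_DBS = {"nuccore", "nucleotide", "protein"}


def check_id(db, identifier):
    """Return identifier if it is acceptable for db, else raise ValueError."""
    if identifier.isdigit():
        return identifier  # integer UIDs are accepted by any database
    if db in SEQUENCE_DBS:
        return identifier  # accessions are accepted by sequence databases
    raise ValueError(
        f"'{identifier}' looks like an accession, but '{db}' expects an integer UID"
    )


print(check_id("protein", "XP_003399880.1"))  # accepted
try:
    check_id("gene", "XP_003399880.1")
except ValueError as err:
    print("rejected:", err)
```

With a guard like this, the accession would have to be resolved to a gene UID first (for example through esearch, as in the pipeline above) instead of being passed directly as an id.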

Please update to the latest version of EDirect by rerunning the download instructions in:

https://www.ncbi.nlm.nih.gov/books/NBK179288/

8.1 years ago

I'd like to retrieve only the genomic accession using the -match command, but I keep extracting other Gene-commentary_accession values as well, such as the one for "mRNA". Could you help me with the correct syntax?

Yes, because a Gene-commentary element can contain nested Gene-commentary elements that have a different Gene-commentary_type.
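
A small Python sketch of the nesting problem, using a trimmed copy of the XML from the question (the SAMPLE string and variable names are assumptions for illustration; this does not model xtract internals). Collecting accessions from everything under the genomic Gene-commentary also picks up the mRNA one, while looking only at its direct children does not.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
SAMPLE = """
<Entrezgene_locus>
  <Gene-commentary>
    <Gene-commentary_type value="genomic">1</Gene-commentary_type>
    <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
    <Gene-commentary_products>
      <Gene-commentary>
        <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
        <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
      </Gene-commentary>
    </Gene-commentary_products>
  </Gene-commentary>
</Entrezgene_locus>
""".strip()

# The outer (genomic) Gene-commentary element.
outer = ET.fromstring(SAMPLE).find("Gene-commentary")

# Every accession anywhere below the outer element, nested ones included.
all_levels = [e.text for e in outer.iter("Gene-commentary_accession")]

# Only the accession that is a direct child of the outer element.
own_level = [e.text for e in outer.findall("Gene-commentary_accession")]

print(all_levels)  # ['NC_015773', 'XM_003399832']
print(own_level)   # ['NC_015773']
```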

Use an XSLT stylesheet instead of EDirect, or an XPath expression as shown below:

$ curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=XP_003399880.1&retmode=xml&rettype=db" |
    xmllint --xpath '//Gene-commentary[Gene-commentary_type/text()="1"]/Gene-commentary_accession' - |
    tr "<>" "\n" |
    grep -vF 'Gene-commentary_accession' |
    grep -v '^$' |
    sort | uniq
AC_000062
AC_000151
AC010642
AC012313
AMYH02036533
CH471135
CP000040
NC_000019
NC_007103
NC_018930

(Ugly; an XSLT stylesheet would be better.)
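
As one possible way to avoid the tr/grep post-processing, the same XPath idea can be sketched with Python's standard xml.etree.ElementTree, which supports the [child='text'] predicate. This is an alternative technique, not the XSLT stylesheet mentioned above; the function name accessions_by_type is made up for the example, and SAMPLE is a trimmed copy of the XML from the question rather than a live efetch result.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
SAMPLE = """
<Entrezgene_locus>
  <Gene-commentary>
    <Gene-commentary_type value="genomic">1</Gene-commentary_type>
    <Gene-commentary_accession>NC_015773</Gene-commentary_accession>
    <Gene-commentary_products>
      <Gene-commentary>
        <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
        <Gene-commentary_accession>XM_003399832</Gene-commentary_accession>
      </Gene-commentary>
    </Gene-commentary_products>
  </Gene-commentary>
</Entrezgene_locus>
""".strip()


def accessions_by_type(xml_text, type_code):
    """Return sorted, unique accessions of Gene-commentary elements of one type.

    type_code is the element text of Gene-commentary_type, e.g. "1" for genomic.
    """
    root = ET.fromstring(xml_text)
    path = (
        f".//Gene-commentary[Gene-commentary_type='{type_code}']"
        "/Gene-commentary_accession"
    )
    # sorted(set(...)) plays the role of `sort | uniq` in the shell pipeline.
    return sorted({e.text for e in root.findall(path)})


print(accessions_by_type(SAMPLE, "1"))  # ['NC_015773']
print(accessions_by_type(SAMPLE, "3"))  # ['XM_003399832']
```

The predicate tests only the direct Gene-commentary_type child of each Gene-commentary, so nested commentaries of another type are not matched.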


Pierre, thanks for suggesting an alternative solution.

I'm aware of the multiple gene commentary types, which is why I tried to select one with -match Gene-commentary_type:1, but apparently I don't have the syntax right. I still hope to get it working, because otherwise I'm satisfied with EDirect and it works for my other tasks.

By the way, I'm wondering what is going on with the link in your code, https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=XP_003399880.1&retmode=xml&rettype=db, because it leads to the XML of a human protein even though XP_003399880.1 is a completely different insect protein (https://www.ncbi.nlm.nih.gov/protein/XP_003399880.1)?
