Question

Attempting To Utilise The New Entrez Direct Package But Having Difficulty With PubMed And Nucleotide XML Parsing

10.8 years ago

Daniel ★ 4.0k

The new tool appears to do exactly what I want and I was keen to try it out, but I'm having some difficulty.

I am trying to retrieve the sequences within a taxon that match a single gene, and to produce a table of Accession, Author(s), Affiliation and Title. The table is for collaborators, so that they can pick out trusted sources; I will fetch the chosen FASTA sequences at a later date.

When parsing pubmed records the documentation is quite clear, and I can confirm it works for me:

esearch -db pubmed -query "Garber ED [AUTH] AND PNAS [JOUR]" | elink -related | efilter -query "mouse" | efetch -format docsum | xtract -pattern DocumentSummary -element Id SortFirstAuthor Title

I am now trying to search the nucleotide database, but I cannot get the 'Authors' or other details back using the 'xtract' command, and I can't find any examples of how to do so.

My best attempt is as follows, but it only gives the Id:

esearch -db nucleotide -query "txid2836[Organism:exp] AND rbcl[GENE]" | efetch -format docsum | xtract -pattern DocumentSummary -element Id Authors

Alternatively, I have been trying efetch -format xml and extracting the information from there with xtract, but I can't work out how to select the correct hierarchy level. From the documentation:

The xtract function is used for processing XML data:

Exploration Argument Hierarchy
-pattern       (Highest Rank)
-division
-group
-branch
-block
-section
-subset
-unit          (Lowest Rank)

One such attempt looks like this:

esearch -db nucleotide -query "txid2836[Organism:exp] AND rbcl[GENE]" | efetch -format xml | xtract -division Authors -unit Name

entrez eutils xml parsing • 5.0k views

updated 10.8 years ago by Neilfws 49k • written 10.8 years ago by Daniel ★ 4.0k

score 0 · Answer 1 · 2014-03-04

At the moment I can't get your second example to run to completion, so I'm trying a simpler query:

esearch -db nucleotide -query "NM_182762.3"

The first step is to run xtract with the -outline option, to see what is in the XML:

esearch -db nucleotide -query "NM_182762.3" | efetch -format xml | xtract -outline > ed.out

If you examine the file ed.out, you will see the hierarchy for an author:

Author
    Author_name
        Person-id
            Person-id_ml

Then run xtract again. The -pattern argument should be the element "one level" above the -element that you want in the hierarchy:

esearch -db nuccore -query "NM_182762.3" | efetch -format xml | xtract -pattern Person-id -element Person-id_ml | head -10

Ren B
Zakharov V
Yang Q
McMahon L
Yu J
Cao W
Xie C
Wu J
Yun J
Lai J
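
To see the -pattern / -element relationship without querying NCBI, here is a rough offline sketch. The XML fragment is hand-written to match the shape of the outline above (it is not real efetch output), and extract_names is a hypothetical sed stand-in, not part of EDirect. It only mimics what the xtract call prints for this simple case:

```shell
#!/bin/sh
# Hand-written fragment shaped like the outline above (an assumption,
# not real efetch output). The two names are taken from the listing above.
sample_xml='<Author>
  <Author_name>
    <Person-id>
      <Person-id_ml>Ren B</Person-id_ml>
    </Person-id>
  </Author_name>
</Author>
<Author>
  <Author_name>
    <Person-id>
      <Person-id_ml>Zakharov V</Person-id_ml>
    </Person-id>
  </Author_name>
</Author>'

# Hypothetical helper: print the text of every <Person-id_ml> element,
# one per line. With EDirect installed, the real call would be:
#   xtract -pattern Person-id -element Person-id_ml
extract_names() {
  sed -n 's:.*<Person-id_ml>\(.*\)</Person-id_ml>.*:\1:p'
}

printf '%s\n' "$sample_xml" | extract_names
```

Each Person-id block is one "record", and Person-id_ml is the value pulled out of it, which is why the names come out one per line.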

It's worth spending some time with the complete edirect documentation as opposed to the simplified introduction.
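
For the table asked for in the question, one possible approach is to fetch the GenBank records as INSDSeq XML and let xtract walk each reference. This is an untested sketch: it assumes efetch -format gbc returns INSDSeq XML with the element names used below, and accession_table is just an illustrative function name. Affiliation is not attempted here. Run xtract -outline on the fetched XML first to confirm the element names.

```shell
#!/bin/sh
# Query copied from the question.
query='txid2836[Organism:exp] AND rbcl[GENE]'

# Illustrative helper (not part of EDirect): one row per record, with the
# accession.version followed by the authors and title of each reference.
# Assumption: "efetch -format gbc" yields INSDSeq XML with these element names.
accession_table() {
  esearch -db nucleotide -query "$1" |
    efetch -format gbc |
    xtract -pattern INSDSeq -element INSDSeq_accession-version \
      -block INSDReference -sep "; " -element INSDAuthor INSDReference_title
}

# Only hit NCBI when EDirect is actually installed.
if command -v esearch >/dev/null 2>&1; then
  accession_table "$query"
else
  echo "EDirect not found on PATH; skipping live query" >&2
fi
```

The -block argument restarts the -element extraction for every INSDReference inside a record, and -sep joins the multiple INSDAuthor values within one reference.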