Difference between NCBI Entrez formats (e.g., xml, docsum) for efetch
10.6 years ago
paulparsons ▴ 150

I am using the NCBI EDirect UNIX command-line tools to query the gene database and retrieve some basic information about the results (e.g., chromosome location, description, gene name). I find the documentation obscure and confusing (maybe because I don't have a bioinformatics background). After playing with the different formats, I have found that the docsum format seems to suit my needs best. However, I reached this conclusion through trial and error, so I still do not know whether it is really true, nor do I understand the difference between the possible formats for efetch. For example, what is the difference between the xml and docsum formats? Why and when should one use each?

Although I can retrieve the format outlines by doing the following:

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format docsum | \
  xtract -outline

and

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format xml | \
  xtract -outline

this gives only the syntax of the formats and not the semantics. It does not help me understand when or why someone might prefer one format over the other. Obviously they are different, but without a background in this area, it is impossible to infer the semantics from the syntax.

To make things more confusing, the documentation seems to state that the 'docsum' format is not actually a 'specified format':

Records can be retrieved in specified formats or as document summaries:

  • efetch downloads records or reports in a designated format.

Moreover, the naming of the formats suggests that docsum is a summary of the full xml document. However, certain fields in the docsum do not seem to appear in the xml. As I don't have a biology or genomics background, I can't tell whether certain terms refer to different things or are simply synonymous.

The answer seems like it should be so simple, yet I can't find it anywhere! Any help is much appreciated.

entrez command-line efetch edirect ncbi • 7.3k views

I understand that the fields in the 'docsum' format are organised differently and do not necessarily follow the same structure as the full XML record. If you save the output of efetch to a file before passing it to xtract, you will see that the xml output is much larger. Which fields are in the docsum that you did not find in the XML? The EDirect tools seem to be very handy and powerful, but I agree that the documentation is not yet complete.

10.6 years ago
paulparsons ▴ 150

It turns out that with EDirect, efetch -format docsum is equivalent to the E-utilities esummary call, whereas efetch -format xml is equivalent to the E-utilities efetch call.

Here's an answer I received from NCBI:

The edirect efetch is a wrapper around a combination of two eutils fcgis: efetch.fcgi and esummary.fcgi. Basically, the edirect efetch -format docsum is the same as the eutils esummary.fcgi.

Edirect:

efetch -db nuccore -id 6092233 -format docsum
  

...is the same as

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=nuccore&id=6092233&version=2.0

This is very different from

efetch -db nuccore -id 6092233 -format xml
  

...which is the same as

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=6092233&retmode=xml&rettype=gb
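
As a rough sketch of that mapping, the two URLs can be assembled in the shell from a database name and a record ID. The function names `esummary_url` and `efetch_xml_url` are made up for illustration, the base address is the HTTPS form of the host quoted above, and `rettype=gb` is simply copied from the nuccore example, so other databases may need a different value.

```shell
# Base URL of the NCBI E-utilities (HTTPS form of the host quoted above).
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# URL that corresponds to: efetch -db DB -id ID -format docsum
# (hypothetical helper; $1 = database, $2 = record ID)
esummary_url() {
  printf '%s/esummary.fcgi?db=%s&id=%s&version=2.0\n' "$EUTILS_BASE" "$1" "$2"
}

# URL that corresponds to: efetch -db DB -id ID -format xml
# Assumption: rettype=gb is taken from the nuccore example and may not suit other databases.
efetch_xml_url() {
  printf '%s/efetch.fcgi?db=%s&id=%s&retmode=xml&rettype=gb\n' "$EUTILS_BASE" "$1" "$2"
}

# Print both URLs for the record used in the example.
esummary_url nuccore 6092233
efetch_xml_url nuccore 6092233

# To download instead of printing (needs network access and curl):
#   curl -s "$(esummary_url nuccore 6092233)" > summary.xml
```

Printing the URLs rather than fetching them keeps the sketch runnable offline; pasting either URL into a browser shows the raw response that the corresponding EDirect command wraps.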

This still doesn't give me a complete understanding of their differences, but it is helpful in suggesting where else to look for information. As Zag commented previously, it is possible to redirect the results to a file for easier investigation, by doing something like

esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
  efetch -format docsum > results.out

and comparing the docsum with the xml. Also, the outlines can be compared using the original method I mentioned.
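
One possible way to make that comparison concrete is to list the element names that occur in one saved file but not in the other. The sketch below uses only standard Unix tools (grep, sed, sort, comm) rather than xtract. The function names `tag_names` and `tags_only_in_first` and the file names in the usage comment are hypothetical, and it assumes the docsum and xml outputs have each been saved to their own file.

```shell
# List the distinct XML element names that appear in a file ($1 = path).
# Matches opening tags only; closing tags, comments and the XML declaration are skipped.
tag_names() {
  grep -o '<[A-Za-z_][A-Za-z0-9_.-]*' "$1" | sed 's/^<//' | sort -u
}

# Print element names present in the first file but absent from the second.
tags_only_in_first() {
  tmp_a=$(mktemp)
  tmp_b=$(mktemp)
  tag_names "$1" > "$tmp_a"
  tag_names "$2" > "$tmp_b"
  # comm -23 keeps lines unique to the first (sorted) input.
  comm -23 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}

# Example usage (file names are placeholders for saved efetch output):
#   tags_only_in_first docsum.out xml.out
#   tags_only_in_first xml.out docsum.out
```

This compares element names only, so a field that exists in both formats under different names will still show up as a difference; it narrows down where to look rather than explaining what each field means.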


In the link provided by NCBI, for example, a big difference is that the XML has a field for the sequence. In your examples though (you are querying the gene database) this is not the case.

7.6 years ago
DCGenomics ▴ 330

Perhaps this will be useful to you:

https://github.com/NCBI-Hackathons/EDirectCookbook


Please post this as a new "tutorial" post. Would be helpful for many.
