I am using the NCBI EDirect UNIX command line tools to query the gene database and get some basic information about the results (e.g., chromosome location, description, gene name). The documentation seems obscure and confusing to me (maybe because I don't have a bioinformatics background). After playing with the different formats, I have found that the docsum format seems to suit my needs best. However, I reached this conclusion through trial and error, so I still don't know whether it is really true, and I don't understand the difference between the possible formats for efetch. For example, what is the difference between the xml and docsum formats? Why and when should one use each of them?
Although I can retrieve the format outlines by doing the following:
esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
efetch -format docsum | \
xtract -outline
and
esearch -db gene -query "brca1 [ALL]human[ORGN]" | \
efetch -format xml | \
xtract -outline
this gives only the syntax of the formats and not the semantics. It does not help me understand when or why someone might prefer one format over the other. Obviously they are different, but without a background in this area, it is impossible to infer the semantics from the syntax.
To make things more confusing, the documentation seems to state that the 'docsum' format is not actually a 'specified format':
Records can be retrieved in specified formats or as document summaries:
- efetch downloads records or reports in a designated format.
Moreover, the naming of the formats suggests that docsum is a summary of the full xml document. However, there seem to be certain fields in the docsum that are not in the xml. As I don't have a biology or genomics background, I can't tell whether certain terms refer to different things or are simply synonymous.
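A rough way to make this comparison concrete is to diff the element names in the two outlines. The sketch below assumes both outlines have already been saved to files; the helper name only_in_first and the file names are made up for illustration and are not part of EDirect. It only compares names, so it cannot tell whether two differently named fields mean the same thing.

```shell
#!/bin/sh
# only_in_first A B: print element names that occur in outline A but not in B.
# Assumes each file holds the output of `xtract -outline`, i.e. one
# (possibly indented) element name per line.
only_in_first() {
  tmp_a=$(mktemp) || return 1
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
  # Strip indentation, drop blank lines, and de-duplicate the names.
  sed 's/^[[:space:]]*//' "$1" | grep -v '^$' | sort -u > "$tmp_a"
  sed 's/^[[:space:]]*//' "$2" | grep -v '^$' | sort -u > "$tmp_b"
  # comm -23 keeps only the lines unique to the first sorted input.
  comm -23 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}

# Hypothetical usage (needs EDirect installed and network access):
#   esearch -db gene -query "brca1 [ALL]human[ORGN]" | efetch -format docsum | xtract -outline > docsum.outline
#   esearch -db gene -query "brca1 [ALL]human[ORGN]" | efetch -format xml | xtract -outline > xml.outline
#   only_in_first docsum.outline xml.outline
```

Swapping the two arguments lists the names that appear only in the xml outline instead.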
The answer seems like it should be so simple, yet I can't find it anywhere! Any help is much appreciated.
I understand that the fields in the 'docsum' format are organised differently and do not necessarily follow the same structure as the full XML record. If you save the output of efetch to a file before piping it to xtract, you will see that the xml output takes up far more space. Which fields are in the docsum that you did not find in the XML? The EDirect tools seem to be very handy and powerful, but I agree that the documentation is not yet complete.
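The size comparison can be sketched as follows, assuming the efetch output for each format has been saved to a file first. The helper name report_sizes and the file names are hypothetical, chosen only for this example.

```shell
#!/bin/sh
# report_sizes FILE...: print each file name and its size in bytes,
# separated by a tab, one file per line.
report_sizes() {
  for f in "$@"; do
    # Arithmetic expansion strips the padding some wc implementations add.
    bytes=$(( $(wc -c < "$f") ))
    printf '%s\t%d\n' "$f" "$bytes"
  done
}

# Hypothetical usage (needs EDirect installed and network access):
#   esearch -db gene -query "brca1 [ALL]human[ORGN]" | efetch -format docsum > gene_docsum.xml
#   esearch -db gene -query "brca1 [ALL]human[ORGN]" | efetch -format xml > gene_full.xml
#   report_sizes gene_docsum.xml gene_full.xml
```

Keeping the saved files around also makes it easy to open them in a text editor and read the two records side by side.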