Question

esearch command-line tool only returns summary of hits?

1

Entering edit mode

5.4 years ago

JV ▴ 470

I want to be able to automatically download only the representative assemblies for a specific taxon (necessary for many clinically relevent taxons, as there are often hundreds to thousands of assemblies for the same species).

I thought it would be obvious to use the ncbi e-utils for these. So the first step would be to get a list of accession numbers of all assemblies that match my query (chosen taxon + labelled as "representative genome"). I used the following command for that:

esearch -db assembly -query '((txid662[Organism])) AND "representative genome"[RefSeq Category]'|

However this only gives me this overview, with which i can't really do anything:

<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>NCID_1_30175195_130.14.18.48_9001_1584976543_326121091_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>111</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

I know the query works, as I expect exactly 111 representative genomes for this taxon, but how can i get to the actual list of accession numbers?

In order to try some troubleshooting, I tried going through the examples in the tutorial of the corresponding web-based esearch API and modified the protein example (that is supposed to give me list of protein accession numbers with a certain molecular weight) for the UNIX CLI version:

esearch -db protein -query '70000:90000[molecular weight]'

But here I also just get a rather useless overview:

 <ENTREZ_DIRECT>   <Db>protein</Db>  
 <WebEnv>NCID_1_30143830_130.14.22.76_9001_1584976742_1372775979_0MetA0_S_MegaStore</WebEnv>
 <QueryKey>1</QueryKey>   <Count>35421275</Count>   <Step>1</Step>
 </ENTREZ_DIRECT>

The available command line options seem to be very limited, and I can find none that would give me something else than an overview. How do I get to a list of accessions using esearch on the command line?

ncbi e-utils esearch assembly • 2.2k views

ADD COMMENT • link updated 5.4 years ago by vkkodali_ncbi ★ 3.8k • written 5.4 years ago by JV ▴ 470

1

Entering edit mode

esearch must be followed by efetch to download the data.

ADD REPLY • link 5.4 years ago by Pierre Lindenbaum 166k

score 2 · Answer 1 · 2020-03-23

2

Entering edit mode

5.4 years ago

vkkodali_ncbi ★ 3.8k

I know the query works, as I expect exactly 111 representative genomes for this taxon, but how can i get to the actual list of accession numbers?

Typically you can pipe esearch results to efetch with -format acc to get a list of accessions. For example, if we use your second query, you can do something like:

esearch -db protein -query '70000:90000[molecular weight]' \
  | efetch -format acc -stop 10
WP_165794287.1
WP_165794283.1
WP_165794228.1
...

Note, I have added -stop 10 to restrict the number of results to 10. If you want all of the results, you should remove that.

Assembly db on the other hand does not support this. You can, however, pipe esearch output to efetch -format uid to get a list of unique identifiers but I doubt that is what you are looking for. In order to get a list of accessions, you will have to go through esummary and use xtract as shown below:

esearch -db assembly -query '((txid662[Organism])) AND "representative genome"[RefSeq Category]' \
  | esummary \
  | xtract -pattern DocumentSummary -element Organism,Taxid,AssemblyAccession
Vibrio astriarenae (g-proteobacteria)   1481923    GCF_010587385.1
Vibrio ponticus (g-proteobacteria)      265668     GCF_009938225.1
Vibrio taketomensis (g-proteobacteria)  2572923    GCF_009938165.1
...

ADD COMMENT • link 5.4 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

thanks that helped a lot. However, it seems i can't download the corresponding assemblies either from the uids or from the accession numbers?

Trying, to download even just one of them does not work: getting an overview for single entries is ok:

efetch -db assembly -format docsum -id 5805891

but it is not possible to get the genbank or fasta files for these entries. if i try "-format" options "gb" or even just "fasta", I just get errors. Do you have advice on how would i get to the corresponding genbank files using efetch?

I know I can loop through the accessions to download them from the ftp: server using wget, but that does not seem very robust in case the folder structure on that server should change sometime. SO I would prefer a way using the e-util tools if possible...

ADD REPLY • link 5.4 years ago by JV ▴ 470

1

Entering edit mode

You cannot download an entire assembly using efetch. You can use the following to download entire genome sequences in fasta format (see https://www.biostars.org/p/427745/):

esearch -db assembly -query '((txid662[Organism])) AND "representative genome"[RefSeq Category]' \
| esummary \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| while read -r url ; do 
    path=$(echo $url | perl -pe 's/(GC[FA]_\d+.*)/\1\/\1_genomic.fna.gz/g') ; 
    wget -q --show-progress "$path" -P genome_data ; 
done

Alternatively, you can execute the exact same query in the NCBI Assembly portal and use the blue 'Download Assemblies' button to download the RefSeq genomic sequence in FASTA format for all of the results from your query.

ADD REPLY • link 5.4 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

thanks for the help. However, I already frequently used the web-based query and download, and was now looking for an option to automize this via a script. Using wget instead to access the assemblies via their accession numbers from the ftp server was also an alternative solution i had already implemented, but i felt that this assumes that the ncbi ftp folder structure never changes. I had hoped that by implementing all steps via ncbi-tools, i could have a long-term usable script as long as i keep the ncbi-tools updated (eg. via conda). But it seems strange that i can't download the assemblies via efetch, no? I mean it is perfectly possible to download protein and gene sequences from genbank via accession number with efetch... why not assemblies?

ADD REPLY • link 5.4 years ago by JV ▴ 470

0

Entering edit mode

The NCBI genomes folder architecture is actually quite stable. That is not to say that it will never change. However, any potential changes to it will be announced beforehand to the community to give everyone enough time to make necessary changes on their end.

On a related note, there is indeed an effort at NCBI to provide new tools to download assembly, annotation and other data. You can find more information about it here: https://www.ncbi.nlm.nih.gov/datasets. As you can tell from information on the webpage, this project is still in its inception and will only expand in functionality in the near future. As a user, your feedback would surely be most welcome.

ADD REPLY • link 5.4 years ago by vkkodali_ncbi ★ 3.8k