Hello,
I did some testing with NCBI Datasets CLI (both gene and genome endpoints) and e-utils, and wanted to share some thoughts. The best approach will depend on the questions you are trying to answer and the data you need. :)
I used @Genomax's approach to get the taxids and also to download the protein sequences using e-utils. Here's the summary:
datasets gene:
It returns information for the 17 reference genomes annotated by NCBI's RefSeq annotation pipeline, plus mitochondrial proteins annotated as part of the NCBI Organelle RefSeq Project. It took around 4 hours to download everything while iterating over the list of Hemiptera taxids.
# Get the species-level taxids under Hemiptera (taxid 7524) with the helper script that ships with BLAST+
get_species_taxids.sh -t 7524 > 7524-taxid.list
# Get number of taxids
wc -l 7524-taxid.list
49847 7524-taxid.list
# download protein sequences from all taxids
cat 7524-taxid.list | while read -r TAXID; do datasets download gene taxon "$TAXID" --filename "$TAXID.zip"; done
873 data packages downloaded
# Count number of proteins (after unzipping each package into its own directory):
cat */ncbi_dataset/data/protein.faa > all_hemiptera_proteins.faa
grep -c ">" all_hemiptera_proteins.faa
464,271 proteins
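Since only 873 of the 49,847 taxids produced a data package, it can help to record which taxids actually returned something. Here is a minimal sketch of the same loop with that bookkeeping added. The function name and the two list file names are made up for illustration, and it assumes that a taxid without gene records leaves no (or an empty) zip file behind, which I have not checked against every datasets version.

```shell
# Download one gene package per taxid and record which taxids produced one.
# download_gene_packages, with_package.list and no_package.list are
# hypothetical names used only for this sketch.
download_gene_packages() {
    local list=$1     # file with one taxid per line
    local outdir=$2   # directory for the .zip packages and the two lists
    local taxid
    mkdir -p "$outdir"
    while read -r taxid; do
        # Skip blank lines in the taxid list.
        [ -n "$taxid" ] || continue
        # Ignore the exit status and judge success by the zip file instead.
        datasets download gene taxon "$taxid" --filename "$outdir/$taxid.zip" || true
        if [ -s "$outdir/$taxid.zip" ]; then
            echo "$taxid" >> "$outdir/with_package.list"
        else
            echo "$taxid" >> "$outdir/no_package.list"
        fi
    done < "$list"
}
```

It would be called as `download_gene_packages 7524-taxid.list gene_packages`. On a rerun, the loop could be pointed at `with_package.list` only, so the taxids that returned nothing are not queried again.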
datasets genome:
This command downloads protein sequences from all assembled genomes annotated by either NCBI's RefSeq annotation pipeline (GCF accessions) or annotations submitted to GenBank (GCA accessions). It downloaded everything in less than a minute.
# download protein sequences using the genome endpoint
datasets download genome taxon 7524 --include protein --filename 7524-genome-protein.zip
# Count number of proteins
cat 7524-genome-protein/ncbi_dataset/data/*/protein.faa | grep -c ">"
969,059 (22 GCF and 17 GCA annotated genomes)
551,399 (22 GCF)
417,660 (17 GCA)
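One way to get the GCF/GCA split is to count per accession prefix. This is only a sketch: it assumes the package was unzipped so that each assembly sits in its own directory named after its accession (GCF_... or GCA_...) under ncbi_dataset/data, and the function name count_proteins is made up here.

```shell
# Count FASTA headers in the protein.faa files of all assemblies whose
# directory name starts with the given accession prefix (GCF or GCA).
# count_proteins is a hypothetical helper name.
count_proteins() {
    local datadir=$1   # e.g. 7524-genome-protein/ncbi_dataset/data
    local prefix=$2    # GCF or GCA
    # '^>' anchors the match to the start of the line, so only header
    # lines are counted. grep -c exits non-zero on a count of 0, hence || true.
    cat "$datadir"/"$prefix"_*/protein.faa 2>/dev/null | grep -c '^>' || true
}
```

For example, `count_proteins 7524-genome-protein/ncbi_dataset/data GCF` and the same call with `GCA` give the two subtotals.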
e-utils:
time esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa
grep -c ">" file.fa
1,511,837
There are a few things I want to point out regarding e-utils:
- This search returns sequences that are not part of Hemiptera. If you look at the top left corner of the web results, you can see the number of results for plants, bacteria and fungi. The reason is that this was a string search and not a taxonomic one. You can restrict the results to the desired taxonomy both on the web (using the advanced search option) and in e-utils (by adding the flag -organism Hemiptera).
- A lot of the sequences returned are partial, in contrast to the results obtained using datasets.
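Another way to make the search taxonomic is to put the restriction in the query itself using the Entrez [Organism] field, for example with the taxid and the :exp qualifier so that all taxa below it are included. A rough sketch, with both function names made up for illustration:

```shell
# Build an Entrez query that matches a taxid and everything below it.
# taxon_query and fetch_taxon_proteins are hypothetical helper names.
taxon_query() {
    printf 'txid%s[Organism:exp]' "$1"
}

# Fetch all protein records for the taxon as FASTA (needs EDirect installed).
fetch_taxon_proteins() {
    esearch -db protein -query "$(taxon_query "$1")" | efetch -format fasta
}

# Example, not run here:
#   fetch_taxon_proteins 7524 > hemiptera_proteins.fa
```

Searching by taxid rather than by name also avoids matching records that merely mention the word "hemiptera" somewhere in their text.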
Let me know if you have any questions or if there's anything we can do to help you.
This is super useful, thanks. However, I wanted to report that upon unzipping I got a few instances of bad CRC. I tried to re-download and got errors again, but not in the same files. For example:
Tried this just now. No errors either with 6157 or 7524. Must be a local issue.
This issue was reported on GitHub, and the fix of going through a dehydrated download works for me (incidentally, it also seems much faster). Basically:
I was using the following for my test. No rehydrate was required.
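For anyone landing here later, a dehydrated download with the datasets CLI generally follows the three steps sketched below. This is not necessarily what either poster above ran; the function name fetch_dehydrated and the output naming are assumptions for illustration, and the genome endpoint with --include protein is used only because that is the command from the first post.

```shell
# Download a dehydrated genome package, unzip it, then rehydrate it.
# fetch_dehydrated is a hypothetical helper name for this sketch.
fetch_dehydrated() {
    local taxon=$1     # taxon name or taxid, e.g. 7524
    local workdir=$2   # directory to unpack into; the zip is saved as $workdir.zip
    # 1. Fetch only the metadata and the list of files to download.
    datasets download genome taxon "$taxon" --include protein \
        --dehydrated --filename "$workdir.zip" &&
    # 2. Unpack the small dehydrated package.
    unzip -q "$workdir.zip" -d "$workdir" &&
    # 3. Download the actual sequence files into the unpacked directory.
    datasets rehydrate --directory "$workdir"
}
```

A call would look like `fetch_dehydrated 7524 7524-genome-protein`. Because the zip in step 1 contains no sequence data, it stays small, which may be why the unzip step is less prone to the CRC errors described above.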