Question

NCBI CLI Download all proteins from Taxid

2

13 months ago

dthorbur ★ 3.0k

Among other taxonomic groups, I want to download all Hemiptera proteins from NCBI using the CLI tool ncbi-datasets-cli v16.10.1, installed with conda v23.5.0.

I tried the following command, but got an error:

datasets download gene taxon 7524

Error: The taxonomy ID '7524' is valid for Hemiptera, but the command 'gene download by taxon' requires an at-or-below-species taxon

Alternatively, I can use the genome function over gene:

datasets download genome taxon 7524 --include protein

This works, but it only downloads proteins associated with genome assemblies: ~930,000, rather than the ~1,400,000 listed in NCBI Protein.

I want to see if there is a significant difference in clustering and redundancy removal with MMseqs when constructing a database for these two similar datasets. I realise most of the additional proteins will be alleles of annotated genes. This is just a test dataset for a later larger project.

Regardless, is there a way to download all proteins from NCBI using a CLI tool?

ncbi • 1.7k views

ADD COMMENT • link updated 10 months ago by GenoMax 151k • written 13 months ago by dthorbur ★ 3.0k

3

11 months ago

MirianT_NCBI ▴ 770

Hello,
I did some testing with the NCBI Datasets CLI (both the gene and genome endpoints) and e-utils, and wanted to share some thoughts. The best approach will depend on the questions you are trying to answer and the data you need. :) I used GenoMax's approach both to get the taxids and to download the protein sequences using e-utils. Here's the summary:

datasets gene:

It returns information for the 17 reference genomes annotated by NCBI's RefSeq annotation pipeline, plus mitochondrial proteins annotated as part of the NCBI Organelle RefSeq Project. It took around 4 hours to download everything while iterating over the list of Hemiptera taxids.

# Get species-level taxids (get_species_taxids.sh ships with BLAST+)
get_species_taxids.sh -t 7524 > 7524-taxid.list

# Get number of taxids
wc -l 7524-taxid.list 
49847 7524-taxid.list

# download protein sequences from all taxids

while read -r TAXID; do datasets download gene taxon "$TAXID" --filename "$TAXID.zip"; done < 7524-taxid.list

873 data packages downloaded

# Count number of proteins (assumes each .zip was extracted into its own directory):
cat */ncbi_dataset/data/protein.faa > all_hemiptera_proteins.faa
grep -c ">" all_hemiptera_proteins.faa
464,271 proteins
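
The loop above tries all 49,847 taxids, yet only 873 of them return a package. A minimal sketch of a restartable version follows: it skips taxids that already have a package and records the ones that return nothing. The helper name download_gene_packages, the output directory layout and the no_data.txt log are my own assumptions, not part of datasets.

```shell
# Sketch only: download_gene_packages is a hypothetical helper, not a datasets command.
# $1 = file with one species-level taxid per line
# $2 = directory that receives <taxid>.zip
download_gene_packages() {
    list=$1
    outdir=$2
    mkdir -p "$outdir"
    while read -r taxid; do
        [ -n "$taxid" ] || continue               # skip blank lines
        [ -s "$outdir/$taxid.zip" ] && continue   # already downloaded on a previous run
        # </dev/null keeps the command from consuming the taxid list on stdin
        if ! datasets download gene taxon "$taxid" \
                --filename "$outdir/$taxid.zip" </dev/null >/dev/null 2>&1; then
            echo "$taxid" >> "$outdir/no_data.txt"   # failed or returned no gene data
        fi
    done < "$list"
}
```

Usage would be something like `download_gene_packages 7524-taxid.list gene_packages`. Because finished packages are skipped, an interrupted run can simply be started again.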

datasets genome:

This command downloads protein sequences from all assembled genomes annotated by either NCBI's RefSeq annotation pipeline (GCF accessions) or annotations submitted to GenBank (GCA accessions). It downloaded everything in less than a minute.

# download protein sequences using the genome endpoint

datasets download genome taxon 7524 --include protein --filename 7524-genome-protein.zip

# Unzip, then count number of proteins

unzip 7524-genome-protein.zip -d 7524-genome-protein
cat 7524-genome-protein/ncbi_dataset/data/*/protein.faa | grep -c ">"

969,059 (22 GCF and 17 GCA annotated genomes)
    551,399 (22 GCF)
    417,660 (17 GCA)
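
To get a GCF/GCA breakdown like the one above, something along these lines should work on the unzipped package. It is a sketch that assumes the usual layout of one directory per assembly accession, each holding a protein.faa; count_by_source is a hypothetical helper name.

```shell
# Sketch only: count_by_source is a hypothetical helper.
# $1 = the ncbi_dataset/data directory of an unzipped genome package
# Prints: accession prefix, number of annotated genomes, number of proteins (tab-separated).
count_by_source() {
    data_dir=$1
    for faa in "$data_dir"/GC[AF]_*/protein.faa; do
        [ -f "$faa" ] || continue                    # glob matched nothing
        acc=$(basename "$(dirname "$faa")")          # e.g. GCF_000001.1
        printf '%s\t%s\n' "${acc%%_*}" "$(grep -c '^>' "$faa")"
    done | awk -F'\t' '{ n[$1] += $2; g[$1]++ }
        END { for (k in n) printf "%s\t%d\t%d\n", k, g[k], n[k] }' | sort
}
```

For example: `count_by_source 7524-genome-protein/ncbi_dataset/data`.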

e-utils:

time esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa 
grep -c ">" file.fa                                          
1,511,837

There are a few things I want to point out regarding e-utils:

This search returns sequences that are not part of Hemiptera. If you look at the top left corner of the web results, you can see the number of results for plants, bacteria and fungi. The reason is that this was a string search, not a taxonomic one. You can restrict the results to the desired taxonomy both on the web (using the advanced search option) and in e-utils (by adding the flag -organism Hemiptera).
Many of the sequences returned are partial, in contrast to the results obtained using datasets.
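
Both points can be handled on the command line. The sketch below builds a taxonomy-restricted Entrez query using the standard txid…[Organism:exp] syntax, and drops FASTA records whose defline contains the string "partial". The helper names are hypothetical, and relying on the defline to flag partial sequences is an assumption about how these records are titled, so check it against your own download.

```shell
# Sketch only: taxon_query and drop_partial are hypothetical helper names.

# Build an Entrez query that matches a taxon and all of its descendants.
taxon_query() {
    printf 'txid%s[Organism:exp]' "$1"
}

# Drop FASTA records whose defline contains "partial" (assumed marker).
# Reads the files given as arguments, or stdin if there are none.
drop_partial() {
    awk '/^>/ { keep = ($0 !~ /partial/) } keep' "$@"
}

# Example (needs EDirect and network access, so it is left commented out):
# esearch -db protein -query "$(taxon_query 7524)" | efetch -format fasta > hemiptera_all.fa
# drop_partial hemiptera_all.fa > hemiptera_complete.fa
```

Keeping the unfiltered file around makes it easy to compare counts with and without partial sequences.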

Let me know if you have any questions or if there's anything we can do to help you.

ADD COMMENT • link 11 months ago by MirianT_NCBI ▴ 770

0

datasets download genome taxon 7524 --include protein --filename 7524-genome-protein.zip

This is super useful, thanks. However, I wanted to report that upon unzipping I got a few instances of bad CRC. I tried to re-download and got errors again, but not in the same files. For example:

unzip 6157-genome-protein.zip 
Archive:  6157-genome-protein.zip
  inflating: README.md               
  ...
  inflating: ncbi_dataset/data/GCA_006461475.1/protein.faa  
  error:  invalid compressed data to inflate
 bad CRC c512d781  (should be 9c5082e0)
 ...

ADD REPLY • link 10 months ago by dariober 15k

1

Tried this just now. No errors either with 6157 or 7524. Must be a local issue.

ADD REPLY • link 10 months ago by GenoMax 151k

0

This issue was reported on GitHub, and the workaround of downloading a dehydrated package and then rehydrating it works for me (incidentally, it also seems much faster). Basically:

datasets download genome taxon 6157 --dehydrated --include protein --filename 6157-genome-protein.zip
unzip 6157-genome-protein.zip -d 6157-genome-protein
datasets rehydrate --directory 6157-genome-protein

datasets --version
datasets version: 16.15.0
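
If you want to script this, here is a minimal sketch that chains the three steps and adds an archive test (unzip -t) before extracting, so a corrupt zip stops the run instead of producing truncated files. fetch_genome_proteins is a hypothetical wrapper name, and the file naming just follows the commands above.

```shell
# Sketch only: fetch_genome_proteins is a hypothetical wrapper.
# $1 = taxid, e.g. 6157
fetch_genome_proteins() {
    taxid=$1
    zip="${taxid}-genome-protein.zip"
    dir="${taxid}-genome-protein"
    # Each step runs only if the previous one succeeded.
    datasets download genome taxon "$taxid" --dehydrated --include protein --filename "$zip" &&
        unzip -tq "$zip" &&              # integrity test; non-zero exit on CRC errors
        unzip -q "$zip" -d "$dir" &&
        datasets rehydrate --directory "$dir"
}
```

Called as `fetch_genome_proteins 6157`, it returns non-zero as soon as any step fails, which makes it easy to wrap in a retry.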

ADD REPLY • link 10 months ago by dariober 15k

0

I was using the following version for my test. No rehydration was required.

$ datasets --version
datasets version: 16.22.1

ADD REPLY • link 10 months ago by GenoMax 151k

score 3 · Accepted Answer · 2024-04-02

One option is Entrez Direct (EDirect). This should fetch 1,466,558 sequences as of today.

$ esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa
>sp|A0A7D0AGU9.1|TPS_MATON RecName: Full=Terpene synthase; Short=EoTPS
MEGLVNNSGDKDLDEKLLQPFTYILQVPGKQIRAKLAHAFNYWLKIPNDKLNIVGEIIQMLHNSSLLIDD
IQDNSILRRGIPVAHSIYGVASTINAANYVIFLAVEKVLRLEHPEATRVCIDQLLELHRGQGIEIYWRDN
FQCPSEDEYKLMTIRKTGGLFMLAIRLMQLFSESDADFTKLAGILGLYFQIRDDYCNLCLQEYSENKSFC

Or you could get the species-level taxids using a utility script included in the BLAST+ distribution, which would then allow you to use datasets.

$ get_species_taxids.sh -t 7524 > taxidlist