Question

Download a list of all bacterial protein accessions from NCBI

0

3.8 years ago

K.Gee ▴ 40

Hello, biostars,

I want to download the accession numbers of all bacterial proteins. On https://www.ncbi.nlm.nih.gov/protein/?term=Bacteria, Send to --> File --> Format (Accession List) --> Create File does not seem to work for bacteria (I tested it with viruses and archaea, and it works perfectly there). After that, I tried to extract the list of accession numbers from the command line, but I could not. Even the command that NCBI proposes for genomes under the "command-line tool" option does not seem to work. It gives:

datasets download genome taxon 2 --filename bacteria.zip

I got this error: unknown flag: --filename

I also tried changing the command from genome to gene, e.g. datasets download gene taxon 2 --filename bacteria.zip, but that downloads the gene with ID 2 (it parses the term taxon). Finally, I tried curl 'ftp://ftp.ncbi.nlm.nih.gov/protein/?term=bacteria%5BAll+Fields%5D'

Does anybody have an idea how to solve this issue?

number accession NCBI • 3.0k views

ADD COMMENT • link 3.8 years ago by K.Gee ▴ 40

0

A related Python script that you could use (search by FASTA title): How to download all sequences of a list of proteins for a particular organism

ADD REPLY • link 3.8 years ago by Kevin Blighe 88k

0

Thanks for the response. I will use the script if I need to download the respective sequences. Again, thanks a lot for the script :D

ADD REPLY • link 3.8 years ago by K.Gee ▴ 40

0

AFAIK, datasets is only meant to work with genome-level data. Running

./datasets download genome taxon 2

will get you information about bacterial genome accessions. You can narrow the results with these options:

--reference         limit to reference and representative (GCF_ and GCA_) assemblies
--refseq            limit to RefSeq (GCF_) assemblies

ADD REPLY • link 3.8 years ago by GenoMax 148k

0

Thanks again for the response. I knew that it worked at the genome level, but I saw a gene option, so my idea was to download all the genes and then extract the accession numbers... I know that my idea was a bit stupid and complicated :P

ADD REPLY • link 3.8 years ago by K.Gee ▴ 40

1

3.8 years ago

GenoMax 148k

If you have access to the nr BLAST database, then use blastdbcmd, which is part of the BLAST+ package.

blastdbcmd -db nr -taxids 2 -outfmt %a

If your next question is going to be about creating a FASTA subset of these sequences, then use

blastdbcmd -db nr -taxids 2 -outfmt %f > bacteria.fa

ADD COMMENT • link 3.8 years ago by GenoMax 148k

0

This command looks very, very interesting; however, if I understand your response well:

If you have access to nr

Did you mean locally? I'm asking because the command doesn't accept the term "taxids"

ADD REPLY • link 3.8 years ago by K.Gee ▴ 40

1

Correct. You will need to have the nr BLAST indexes downloaded locally, along with the taxonomy files. Make sure you have the latest BLAST+ installed.
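An error about the "taxids" term is what one would expect from an older BLAST+ release that predates that option. A minimal sketch of how one might check the installed version before running the query; the function name supports_taxids is made up for illustration, and it assumes blastdbcmd is on the PATH:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BLAST+): succeeds only if the installed
# blastdbcmd lists a -taxids option in its full help text.
supports_taxids() {
    blastdbcmd -help 2>&1 | grep -q -- '-taxids'
}

# Example use, assuming a local nr database with taxonomy files:
#   if supports_taxids; then
#       blastdbcmd -db nr -taxids 2 -outfmt %a > bacteria_acc.txt
#   else
#       echo "This blastdbcmd has no -taxids option; update BLAST+." >&2
#   fi
```

The output file name bacteria_acc.txt in the comment is only a placeholder.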

ADD REPLY • link 3.8 years ago by GenoMax 148k

score 3 · Accepted Answer · 2021-04-15

3

3.8 years ago

Sej Modha 5.3k

You could use NCBI's command-line E-utilities (Entrez Direct) instead.

esearch -db protein -query 'txid2 [Orgn]' | efetch -format acc > txid2_protein_acc.txt
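Because the bacterial protein set is very large, it may be worth checking how many records the query matches before fetching them all. A minimal sketch, assuming EDirect (esearch and xtract) is installed; the function name count_protein_hits is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of protein records matching a query.
# Assumes EDirect (esearch, xtract) is installed and on the PATH.
count_protein_hits() {
    local query="$1"
    # esearch emits an ENTREZ_DIRECT XML summary; xtract pulls out its Count.
    esearch -db protein -query "$query" |
        xtract -pattern ENTREZ_DIRECT -element Count
}

# Example (needs network access):
#   count_protein_hits 'txid2[Orgn]'
```

If the count is larger than expected, the same query can be narrowed before it is piped to efetch.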

ADD COMMENT • link 3.8 years ago by Sej Modha 5.3k

1

A lot of the records are going to be WP_ accessions (non-redundant RefSeq proteins), each of which can point to multiple organisms. Something to keep in mind.
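If the WP_ records need to be handled separately, one possible approach is to split the accession file by prefix. This is only a sketch: it assumes one accession per line (as efetch -format acc writes them), and the function name split_wp and both output file names are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: split an accession list into WP_ (non-redundant RefSeq)
# accessions and everything else, removing duplicate lines in each output.
split_wp() {
    local infile="$1"
    grep    '^WP_' "$infile" | sort -u > wp_accessions.txt
    grep -v '^WP_' "$infile" | sort -u > other_accessions.txt
}

# Example:
#   split_wp txid2_protein_acc.txt
```

For a file of this size, sort may need a lot of temporary disk space, so the -u deduplication step could be dropped if the list is already unique.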

ADD REPLY • link 3.8 years ago by GenoMax 148k

0

Thanks a lot for the tip :-) !!!

ADD REPLY • link 3.8 years ago by K.Gee ▴ 40

0

Super thank you! Works exactly as I want!!!

ADD REPLY • link 3.8 years ago by K.Gee ▴ 40