Question

What is the best way to get all bacteria proteins from nr?

3.4 years ago

O.rka ▴ 740

I'm trying to figure out the best way to do this. I have the newest taxdump.tar.gz and prot.accession2taxid.gz files from NCBI.

Is there a way to use TaxonKit to get all of the species-level identifiers from bacteria and then use this to pull out the proteins from nr?

protists database nr • 1.0k views

updated 3.4 years ago by GenoMax 148k • written 3.4 years ago by O.rka ▴ 740

I am reasonably certain this was asked recently. Have you searched Biostars via Google?

3.4 years ago by GenoMax 148k

score 0 · Answer 1 · 2021-08-04

The fastest way I know is not to get them from nr at all. UniProt provides files that split all of its sequences into taxonomic divisions, and these are updated regularly.

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/

You will need both the sprot and trembl files for bacteria, the ones that end in .dat.gz. esl-reformat, one of the Easel tools distributed with the HMMER package, can convert these files to FASTA.
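If esl-reformat is not available, the conversion can also be sketched in plain Python. This is a minimal sketch and not a replacement for esl-reformat: it assumes the standard UniProt flat-file layout (an `ID` line, an `AC` line, an `SQ` line followed by indented sequence lines, and `//` closing each entry), keeps only the primary accession and entry name in the FASTA header, and uses function names (`iter_uniprot_records`, `dat_gz_to_fasta`) that are made up for illustration.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def iter_uniprot_records(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (primary_accession, entry_name, sequence) for each flat-file entry."""
    entry_name, accession, seq_parts, in_seq = "", "", [], False
    for line in handle:
        if line.startswith("//"):
            # End of entry: emit it and reset the state for the next one.
            if accession:
                yield accession, entry_name, "".join(seq_parts)
            entry_name, accession, seq_parts, in_seq = "", "", [], False
        elif in_seq:
            # Sequence lines are indented and split into space-separated blocks.
            seq_parts.append(line.replace(" ", "").strip())
        elif line.startswith("ID   "):
            entry_name = line.split()[1]
        elif line.startswith("AC   ") and not accession:
            # Only the first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("SQ   "):
            in_seq = True


def dat_gz_to_fasta(dat_gz_path: str, fasta_path: str, width: int = 60) -> int:
    """Stream a gzipped .dat file to FASTA; return the number of entries written."""
    count = 0
    with gzip.open(dat_gz_path, "rt") as src, open(fasta_path, "w") as out:
        for accession, entry_name, seq in iter_uniprot_records(src):
            out.write(f">{accession} {entry_name}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
            count += 1
    return count
```

The file is streamed entry by entry, so memory use stays small even for the trembl file. You would call `dat_gz_to_fasta` once per downloaded .dat.gz file and concatenate the resulting FASTA files.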

But if you really want proteins from nr, blastdbcmd can extract them if you have a list of accession numbers. It needs the BLAST-formatted nr database files, indexed by accession number, the same ones BLAST itself uses. I don't think this will be faster than what I described above, because bacterial proteins will make up at least half of the database.

blastdbcmd -db nr -dbtype prot -entry_batch protein_list -out proteins.fas -outfmt %f -logfile proteins.log
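The protein_list file used above has to come from somewhere. One way to build it, sketched here under some assumptions, is to filter prot.accession2taxid.gz against a list of bacterial taxids. The sketch assumes the taxid list is a plain text file with one taxid per line (the file name `bacteria.taxids` is hypothetical), for example as produced by something like `taxonkit list --ids 2 --indent ""`, where 2 is the NCBI taxid for Bacteria. It also assumes the usual four tab-separated columns in prot.accession2taxid.gz (accession, accession.version, taxid, gi) with a header line. The function names are made up for illustration.

```python
import gzip
from typing import Set


def load_taxids(path: str) -> Set[str]:
    """Read one taxid per line, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def write_accession_list(acc2taxid_gz: str, taxids: Set[str], out_path: str) -> int:
    """Write accession.version for every row whose taxid is in `taxids`.

    Returns the number of accessions written.
    """
    kept = 0
    with gzip.open(acc2taxid_gz, "rt") as src, open(out_path, "w") as out:
        next(src, None)  # skip the header line
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Column 2 is accession.version, column 3 is the taxid.
            if len(fields) >= 3 and fields[2] in taxids:
                out.write(fields[1] + "\n")
                kept += 1
    return kept
```

A typical call would be `write_accession_list("prot.accession2taxid.gz", load_taxids("bacteria.taxids"), "protein_list")`. Only the taxid set is held in memory; the mapping file is streamed. Note that the mapping file covers more than nr, so some accessions in the list may be missing from the database; those should show up in the blastdbcmd log.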

By the way, if an answer solves your problem, please consider accepting it.