get bacteria data from NT database
6.3 years ago
agata88 ▴ 870

Hi all!

I am downloading the nt database with update_blastdb --decompress nt

Now I would like to limit the database to bacteria (taxid 2).

What is the fastest way to do that?

PS. I checked similar posts, but only found a solution for the nr database.

Best, Agata

ncbi nt • 3.4k views

I used this tutorial, but with the nt database instead of nr: https://bioinf.shenwei.me/taxonkit/tutorial/

The FASTA was downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz

Taxonomy: nucl_gb.accession2taxid

Taxonomy ID: 2

Best, Agata


What's the end goal? Are you planning on having a local copy of NR and a bacterial DB?

If so you might be better off just restricting your BLAST searches based on GI/Accession lists, when querying the full database.
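
A minimal sketch of that idea, assuming a reasonably recent BLAST+ and a hypothetical file with one accession.version per line. The -seqidlist option limits the search to the listed sequences while still using the full nt index. The function name and all file names are placeholders, and depending on the BLAST+ and database version the list may first need converting to binary form with blastdb_aliastool.

```shell
# Search the full nt database, but only against sequences named in an accession list.
# Usage (file names are placeholders):
#   search_restricted /path_to/nt query.fa bacteria_acc.txt hits.tsv
search_restricted() {
    db=$1; query=$2; acc_list=$3; out=$4
    # -outfmt 6 writes tab-separated hits; change it to whatever format you need
    blastn -db "$db" -query "$query" -seqidlist "$acc_list" -outfmt 6 -out "$out"
}
```

This keeps a single copy of nt on disk, at the cost of passing the list on every search.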

6.3 years ago
GenoMax 147k

The following should work. I tested it with a different taxID (9925, not 2), so replace 9925 with 2. You will need the BLAST index files for nt.

blastdbcmd -db /path_to/nt -outfmt "%T %a" -entry all | awk '$1 == "9925" {print $2}' | xargs -n 1 sh -c 'blastdbcmd -db /path_to/nt -outfmt "%f" -entry "$0"' > bacteria_nt.fa

This will take a while. No way around it.

You could save the accession numbers you need by doing this:

blastdbcmd -db /path_to/nt -outfmt "%T %a" -entry all | awk '$1 == "9925" {print $2}' > acc_bact

and then extract the sequences from the nt FASTA files you have using faSomeRecords from the Kent utilities. I don't know if that would be any faster.
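
If the accession list is already saved, blastdbcmd can also read the whole file through -entry_batch instead of being started once per accession, which should avoid most of the per-process overhead. A sketch, assuming acc_bact holds one accession per line; the function name is only for illustration:

```shell
# Pull every accession listed in a file out of a BLAST database as FASTA.
# Usage: extract_fasta /path_to/nt acc_bact bacteria_nt.fa
extract_fasta() {
    db=$1; acc_list=$2; out=$3
    # one blastdbcmd run for the whole list, rather than one run per accession
    blastdbcmd -db "$db" -entry_batch "$acc_list" -outfmt "%f" -out "$out"
}
```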

EDIT 10/09/2019: This idea does not work for top-level taxIDs (e.g. 2 for Bacteria or 2759 for Eukaryota), since the nt sequences are not annotated at that level.
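
One possible workaround for the top-level case, not tested on the full nt: collect every taxID that descends from the one of interest using nodes.dmp from the NCBI taxdump, then keep the accessions in nucl_gb.accession2taxid whose taxID is in that set. The sketch below assumes the standard layouts: nodes.dmp fields separated by tab, pipe, tab with taxID and parent taxID first, and accession2taxid as a tab-separated file with a header line and the columns accession, accession.version, taxid, gi. The function names are made up for illustration.

```shell
# Print the given taxID and all of its descendants, one per line (unsorted).
# Usage: descendant_taxids 2 nodes.dmp > bacteria.taxids
descendant_taxids() {
    awk -F '\t[|]\t' -v root="$1" '
        { parent[$1] = $2 }                      # remember child -> parent
        END {
            for (t in parent) {
                # climb towards the top of the tree; stop at the requested
                # taxID, at a node that is its own parent, or at a gap
                n = t
                while (n != root && (n in parent) && parent[n] != n) n = parent[n]
                if (n == root) print t
            }
        }' "$2"
}

# Print accession.version for accession2taxid rows whose taxID is in the list.
# Usage: filter_accessions bacteria.taxids nucl_gb.accession2taxid > acc_bact
filter_accessions() {
    awk -F '\t' 'NR == FNR { keep[$1]; next }    # first file: taxID list
                 FNR > 1 && ($3 in keep) { print $2 }' "$1" "$2"
}
```

The resulting acc_bact can then be used for the extraction step as before. The taxonkit list command from the tutorial linked above produces the same kind of descendant list.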
