Question

Which additional files are needed for the -taxids option in blast?

0

3 months ago

WaspInSpace • 0

Hello,

To decontaminate my de novo genome assembly, I'd like to BLAST my scaffolds against NCBI's core_nt database (I downloaded the preformatted version). Unfortunately, with this huge database I constantly run out of time on my cluster, even if I split the genome file into smaller pieces. To solve this problem I tried the -taxids option of the blastn command to search only a few species within core_nt, but for some reason it doesn't work: it still searches the whole database and returns hits outside the selected taxIDs.

The warning message looks like this: "The -taxids command line option requires additional data files. Please see the section 'Taxonomic filtering for BLAST databases' in https://www.ncbi.nlm.nih.gov/books/NBK569839/ for details."

Here it says "If you are using your own BLAST database(s) and would like to take advantage of this feature, you must set the taxonomy IDs in your database(s) and can get the taxonomy4blast.sqlite3 database by downloading https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz , decompressing it and installing it alongside your other BLAST database(s)."

I have those files: they came with the preformatted database and are in the same folder. sqlite3 is also installed in my conda environment. What am I missing?

Here is my command:

assembly="path_to_assembly"
database_used="path_to_database_folder"
taxIDs_used="185587,239422,9606"
thread_number=8
out_name="path_to_hitfiles_folder/hitfile"

blastn \
 -task blastn \
 -db $database_used \
 -taxids $taxIDs_used \
 -query $assembly \
 -outfmt "6 qseqid staxids bitscore sseqid pident length mismatch gapopen qstart qend sstart send evalue" \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -evalue 1e-28 \
 -num_threads $thread_number \
 -mt_mode 0 \
 -out $out_name

Thank you in advance for your help!

decontamination taxids blastn • 819 views

ADD COMMENT • link 9 weeks ago by WaspInSpace • 0

1

3 months ago

JustinZhang ▴ 140

See previous topic here

ADD COMMENT • link 3 months ago by JustinZhang ▴ 140

0

Thank you for the hint on how to use the -taxids option to build few-species databases with makeblastdb and blastdb_aliastool. This might be very helpful.

ADD REPLY • link 3 months ago by WaspInSpace • 0

0

A quick comment for everyone who plans to do this: for me, blastdb_aliastool wasn't able to do the job because the combined databases (8 species) were too big ("BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing"). The files weren't missing; it was just too much data. So I ended up concatenating the genome FASTA files of the species, adding the scaffold names and species taxIDs to the taxID map, and building a new database from this concatenated FASTA using the taxID map. This worked fine. I assume blastdb_aliastool is only the right choice if you plan to merge a small number of very small databases (it worked with up to two species for me, but not with all of them; the final database was 6 GB).
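The concatenate-and-map approach described above could be sketched roughly as follows. The function name `append_genome`, the file names, and the assumption that the first word of each FASTA header is a unique sequence ID are illustrative and not taken from the thread.

```shell
#!/usr/bin/env bash
# Sketch: append one genome to a combined FASTA and record a
# "sequence-ID taxid" line for every sequence in it.
# Assumes sequence IDs are unique across all genomes that get combined.

append_genome() {
    local fasta="$1"       # genome FASTA of one species
    local taxid="$2"       # NCBI taxID of that species
    local out_prefix="$3"  # prefix of the combined output files

    # The first word of each header line (without ">") is used as the ID
    awk -v taxid="$taxid" '/^>/ { id = substr($1, 2); print id, taxid }' \
        "$fasta" >> "${out_prefix}.taxid_map.txt" || return 1

    cat "$fasta" >> "${out_prefix}.fasta"
}

# Example use (hypothetical file names):
#   append_genome speciesA.fa 185587 custom_db
#   append_genome speciesB.fa 239422 custom_db
#   makeblastdb -in custom_db.fasta -dbtype nucl -parse_seqids \
#       -taxid_map custom_db.taxid_map.txt -out custom_db
```

The IDs in the map have to match the IDs that makeblastdb parses from the FASTA headers, so it is worth checking a few entries before building a large database.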

ADD REPLY • link 9 weeks ago by WaspInSpace • 0

score 2 · Accepted Answer · 2025-04-10

2

3 months ago

GenoMax 152k

Using the -taxids option filters the BLAST search results, which come from the entire database. That option does not pre-filter the BLAST database before the search is run.

If you need only a certain set of taxIDs, then you should extract those sequences from core_nt using blastdbcmd and then build a new local database of just those sequences. Use the custom taxID file as shown in https://www.ncbi.nlm.nih.gov/books/NBK569841/ with that local database.
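A minimal sketch of that extract-and-rebuild workflow, assuming a BLAST+ release with version-5 databases (where blastdbcmd accepts -taxids). The function name `build_subset_db` and all paths are placeholders, not something from the thread.

```shell
#!/usr/bin/env bash
# Sketch: pull the sequences for selected taxIDs out of a large database and
# build a smaller local database from them. Names and paths are placeholders.

build_subset_db() {
    local source_db="$1"   # e.g. /path/to/db/core_nt (database name, not only the folder)
    local taxids="$2"      # comma-separated, e.g. 185587,239422,9606
    local out_prefix="$3"  # prefix for the new database and helper files

    # FASTA of all sequences assigned to the requested taxIDs
    blastdbcmd -db "$source_db" -taxids "$taxids" -outfmt "%f" \
        -out "${out_prefix}.fasta" || return 1

    # Two-column map (accession, taxID) for the same sequences; assumed here
    # to match the IDs that makeblastdb parses from the FASTA headers
    blastdbcmd -db "$source_db" -taxids "$taxids" -outfmt "%a %T" \
        -out "${out_prefix}.taxid_map.txt" || return 1

    # New nucleotide database that keeps the taxID assignments
    makeblastdb -in "${out_prefix}.fasta" -dbtype nucl -parse_seqids \
        -taxid_map "${out_prefix}.taxid_map.txt" -out "$out_prefix"
}
```

It could then be called as `build_subset_db /path/to/db/core_nt 185587,239422,9606 subset_db`, and the resulting `subset_db` passed to `blastn -db`.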

ADD COMMENT • link 3 months ago by GenoMax 152k

0

Ok, so I was wrong about what taxids does. Thanks a lot for the clarification! I will try what you suggest, this sounds great.

ADD REPLY • link 3 months ago by WaspInSpace • 0

0

Thank you very much, this worked! But a word of caution for everyone doing the same thing: the more taxIDs you give blastdbcmd, the longer it takes to finish. Because I was limited in time and wanted to use 8 species, I extracted the sequences for each species individually with blastdbcmd and concatenated them afterwards to build a combined database. This was much quicker.
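The per-species extraction could be sketched like this. The function name `extract_per_taxid` and the output layout are assumptions for illustration; on a cluster, each blastdbcmd call could equally be submitted as its own job.

```shell
#!/usr/bin/env bash
# Sketch: run blastdbcmd once per taxID, then concatenate the per-species
# FASTA files and taxID maps. Function name and file layout are placeholders.

extract_per_taxid() {
    local source_db="$1"   # e.g. /path/to/db/core_nt
    local out_dir="$2"     # directory for per-species and combined files
    shift 2                # remaining arguments: one taxID each

    mkdir -p "$out_dir" || return 1
    : > "$out_dir/combined.fasta"           # start with empty combined files
    : > "$out_dir/combined.taxid_map.txt"

    local taxid
    for taxid in "$@"; do
        # Sequences of this species only
        blastdbcmd -db "$source_db" -taxids "$taxid" -outfmt "%f" \
            -out "$out_dir/$taxid.fasta" || return 1
        # Matching accession-to-taxID lines
        blastdbcmd -db "$source_db" -taxids "$taxid" -outfmt "%a %T" \
            -out "$out_dir/$taxid.map" || return 1

        cat "$out_dir/$taxid.fasta" >> "$out_dir/combined.fasta"
        cat "$out_dir/$taxid.map"   >> "$out_dir/combined.taxid_map.txt"
    done
}
```

For example, `extract_per_taxid /path/to/db/core_nt extracted 185587 239422 9606` would leave `extracted/combined.fasta` and `extracted/combined.taxid_map.txt` ready for makeblastdb with -taxid_map.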

Because Homo sapiens was included, the database built that way was quite big, even though some of my species of interest were not well covered in core_nt. In the end I used a custom database built from the genome files of my species of interest, because it gave me many more, and more reliable, hits. Depending on what you want, this might be the better solution. In this case, be aware that you might have to add taxIDs to the taxID mapping file available on the NCBI website.

But thank you again for introducing me to blastdbcmd; it may well be very helpful again in the future in another context! And in fact, it did solve my problem.

ADD REPLY • link 9 weeks ago by WaspInSpace • 0