Hello,
To decontaminate my de novo genome assembly, I'd like to BLAST my scaffolds against NCBI's core_nt database (I got the preformatted one). Unfortunately, with this huge database I constantly run out of time on my cluster, even if I split the genome file into smaller pieces. To solve this problem I tried the -taxids option of the blastn command to search only selected species within core_nt, but for some reason it doesn't work: it still searches the whole database and returns hits for taxIDs other than the selected ones.
I get the following warning message: "The -taxids command line option requires additional data files. Please see the section 'Taxonomic filtering for BLAST databases' in https://www.ncbi.nlm.nih.gov/books/NBK569839/ for details."
That page says: "If you are using your own BLAST database(s) and would like to take advantage of this feature, you must set the taxonomy IDs in your database(s) and can get the taxonomy4blast.sqlite3 database by downloading https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz , decompressing it and installing it alongside your other BLAST database(s)."
I have those files; they came with the preformatted database and are in the same folder. Also, sqlite3 is in my conda environment. What am I missing?
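For anyone who wants to double-check the same thing, here is a minimal sketch of a test for whether taxonomy4blast.sqlite3 really sits in the database folder. The function name check_taxonomy_files and the folder path are placeholders I made up for illustration, and the check only confirms that the file exists, not that BLAST can actually use it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether taxonomy4blast.sqlite3 is present
# in the folder that holds the BLAST database volumes.
check_taxonomy_files() {
    local db_dir="$1"
    if [ -f "$db_dir/taxonomy4blast.sqlite3" ]; then
        echo "found: $db_dir/taxonomy4blast.sqlite3"
    else
        echo "missing: $db_dir/taxonomy4blast.sqlite3"
        return 1
    fi
}

# "path_to_database_folder" is a placeholder; replace it with the real folder.
check_taxonomy_files "path_to_database_folder" || true
```

If the file is reported as missing, the quoted NCBI instructions (download taxdb.tar.gz and unpack it next to the database) would be the first thing to try.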
Here is my command:
# Placeholders for the input assembly, the database and the output file
assembly="path_to_assembly"
database_used="path_to_database_folder"
# Comma-separated taxIDs the search should be restricted to
taxIDs_used="185587,239422,9606"
thread_number=8
out_name="path_to_hitfiles_folder/hitfile"
# Variables are quoted so that paths containing spaces do not break the command
blastn \
-task blastn \
-db "$database_used" \
-taxids "$taxIDs_used" \
-query "$assembly" \
-outfmt "6 qseqid staxids bitscore sseqid pident length mismatch gapopen qstart qend sstart send evalue" \
-max_target_seqs 1 \
-max_hsps 1 \
-evalue 1e-28 \
-num_threads "$thread_number" \
-mt_mode 0 \
-out "$out_name"
Thank you in advance for your help!
OK, so I was wrong about what taxids does. Thanks a lot for the clarification! I will try what you suggest, this sounds great.

Thank you very much, this worked! But a note of caution for everyone doing the same thing: the more taxIDs you give blastdbcmd, the longer it takes to finish. Because I was limited in time and wanted to use 8 species, I extracted the sequences for each species individually with blastdbcmd and concatenated them afterwards to build a common database. This was much quicker.

Because Homo sapiens was included, the database built that way was quite big, even though some of my species of interest were not well covered in core_nt. In the end I used a custom database built from genome files of my species of interest, because it gave me many more, and more reliable, hits. Depending on what you want, this might be the better solution. In that case, be aware that you might have to add taxIDs to the taxID-mapping file available on the NCBI website.

But thank you again for introducing me to blastdbcmd; it might become very helpful again in the future in another context! And in fact, it was able to solve my problem.
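
For reference, the per-species extraction described above could be sketched roughly like this. It is not my exact script: the database prefix path_to_database_folder/core_nt, the output folder subset_db and the database name subset are placeholders, paths are assumed to contain no spaces, and the three taxIDs are simply the ones from my original command. The script only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: one blastdbcmd call per taxID, then a single combined database.
# All paths and names are placeholders (assumptions), not real locations.
db="path_to_database_folder/core_nt"   # assumed database name prefix
out_dir="subset_db"                    # hypothetical output folder
taxids="185587 239422 9606"            # taxIDs from the original command

# Print the commands instead of running them, so they can be checked first.
print_commands() {
    echo "mkdir -p $out_dir"
    for t in $taxids; do
        # One extraction per species; -outfmt %f requests FASTA output
        echo "blastdbcmd -db $db -taxids $t -outfmt %f -out $out_dir/taxid_$t.fasta"
    done
    # Concatenate the per-species FASTA files into one file
    echo "cat $out_dir/taxid_*.fasta > $out_dir/combined.fasta"
    # Build a nucleotide database from the combined FASTA file
    echo "makeblastdb -in $out_dir/combined.fasta -dbtype nucl -parse_seqids -out $out_dir/subset"
}

print_commands
```

Saved as a script, the printed commands can be executed by piping the output to bash. The per-species calls are independent of each other, so they could also be submitted as separate cluster jobs. If taxonomy information is needed in the new database, makeblastdb also accepts a mapping file via -taxid_map, which I left out here.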