Question

Question on how to include TAXIDs when running blast on database

15 months ago

Mani • 0

Hi,

I have downloaded 2500 genome assembly FASTA files and converted them into BLAST databases using "makeblastdb". I then BLAST my query against them with this command (shown for one genome file, GCA_000143925.2.fasta, as an example; I run BLAST for all of them in parallel):

blastn -query query.fasta -db GCA_000143925.2.fasta -outfmt "6 std qlen slen staxids sscinames" -task dc-megablast -out blast.out

I get this error message:

Warning: [blastn] Taxonomy name lookup from taxid requires installation of taxdb database with ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

I have downloaded and unpacked the taxdb.tar.gz file and added its path to my .bashrc (export TAXDB=/home/manighanipoorsamami/local/taxdb), but I still get the same message.

Also here is the first hit line of blast.out file:

ERV2-1_H.orn#LTR/ERV    GL380075.1    81.633    49    7    1    7855    7901    77025    77073    0.018    46.4    8632    90134    0    N/A

As you can see, the last two columns (staxids and sscinames) are not filled in: they show 0 and N/A.

Then I created the file "taxid_mapping_file.txt", which maps the genome file name to its taxid:

cat taxid_mapping_file.txt

GCA_000143925.2.fasta    135651

Then I ran this:

makeblastdb -in GCA_000143925.2.fasta -taxid_map taxid_mapping_file.txt -parse_seqids -dbtype nucl

but got this error:

Building a new DB, current time: 01/25/2024 16:35:44
New DB name: /home/manighanipoorsamami/New_PhD_program/HTT_sea_snake/test/GCA_000143925.2.fasta
New DB title:  GCA_000143925.2.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 3305 sequences in 1.22134 seconds.


Error: [makeblastdb] No sequences matched any of the taxids provided.

Can you please help me resolve this and run a BLAST command that adds "staxids" and "sscinames" to the output?

I could not find anything about this in the NCBI BLAST manuals.

Cheers,

Mani

NCBI blastn taxid blast makeblastdb • 1.1k views


GenoMax · Answer 1 · 2024-01-25

The taxid mapping file is not quite right:

cat taxid_mapping_file.txt

GCA_000143925.2.fasta 135651

The first column needs to hold the ID of each sequence in the FASTA file, not the name of the file itself.

So for a fasta file like this:

>seq1 blablabla   
ATCTAGCTAGCTAGCTAGCTAGA   
>seq2 blablabla   
ATCYAYGTCGACTGATCGA

The taxid_mapping_file would look like this:

seq1   135651   
seq2   135651
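
For an assembly with thousands of sequences, the mapping file can be generated from the FASTA headers instead of being written by hand. This is only a rough sketch: it assumes every sequence in the file belongs to the same taxid and that the sequence ID is the first whitespace-delimited token of each header. The function name make_taxid_map is made up for illustration.

```shell
# Hypothetical helper: print "<seqid><TAB><taxid>" for every header in a FASTA file.
# Assumes the sequence ID is the first whitespace-delimited token after ">".
# $1 = FASTA file, $2 = taxid shared by all sequences in that file
make_taxid_map() {
    awk -v taxid="$2" '/^>/ { sub(/^>/, "", $1); print $1 "\t" taxid }' "$1"
}

# Usage:
#   make_taxid_map GCA_000143925.2.fasta 135651 > taxid_mapping_file.txt
```

The resulting file can then be passed to makeblastdb with -taxid_map as in the original command.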

score 0 · Answer 2 · 2024-01-25

If each assembly is of a single species, there is no need to use -taxid_map, because all sequences in the db have the same taxid. You can simply use the -taxid parameter:

makeblastdb -in GCA_000143925.2.fasta -taxid 135651 -parse_seqids -dbtype nucl

You can get the taxid for the assembly in different ways: either by issuing an E-utilities query or, if you already have your mapping file, by using grep to look it up in a script, e.g.

TAXID=$(grep -F -e "$ASSEMBLY" "$mapping_file" | cut -f2)
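
Since there are 2500 assemblies, the lookup and the makeblastdb call can be wrapped in a loop. The sketch below assumes a two-column, tab-separated mapping file (accession, then taxid) and that each FASTA file is named after its accession plus ".fasta". Both are assumptions about your setup, and lookup_taxid and build_all_dbs are hypothetical names. It uses an exact match on the first column rather than grep, so that one accession cannot match another one that merely contains it.

```shell
# Hypothetical helper: print the taxid for one accession from a
# tab-separated file laid out as accession<TAB>taxid (assumed layout).
# $1 = accession, $2 = mapping file
lookup_taxid() {
    awk -F'\t' -v acc="$1" '$1 == acc { print $2; exit }' "$2"
}

# Build one BLAST database per assembly, skipping any assembly without a taxid.
# $1 = mapping file, remaining arguments = FASTA files
build_all_dbs() {
    mapping_file=$1
    shift
    for fasta in "$@"; do
        assembly=$(basename "$fasta" .fasta)
        taxid=$(lookup_taxid "$assembly" "$mapping_file")
        if [ -z "$taxid" ]; then
            echo "no taxid found for $assembly, skipping" >&2
            continue
        fi
        makeblastdb -in "$fasta" -taxid "$taxid" -parse_seqids -dbtype nucl
    done
}

# Usage:
#   build_all_dbs assembly_taxids.tsv GCA_*.fasta
```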

You will still need the taxdb files downloaded for BLAST to look up names, but once the databases carry taxids that should work immediately. If you want to BLAST against all databases at once, create a single alias DB with blastdb_aliastool.
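
Putting the last two points together might look like the sketch below. BLAST+ looks for taxdb.bti and taxdb.btd in the current directory and in the directories listed in the BLASTDB environment variable. As far as I know it does not read a variable called TAXDB, which would explain why the export in .bashrc had no effect. The function name search_all_genomes, the alias name all_genomes, and the list file name are placeholders.

```shell
# Sketch: search every per-assembly database at once, with names resolved from taxdb.
# $1 = directory holding taxdb.bti/taxdb.btd, $2 = query FASTA,
# remaining arguments = names of the databases built with makeblastdb
search_all_genomes() {
    taxdb_dir=$1
    query=$2
    shift 2
    # Make the taxdb directory visible to BLAST+ (prepended to any existing BLASTDB).
    export BLASTDB="$taxdb_dir${BLASTDB:+:$BLASTDB}"
    # One database name per line; a list file avoids a very long command line.
    printf '%s\n' "$@" > all_genomes.dblist
    blastdb_aliastool -dblist_file all_genomes.dblist -dbtype nucl \
        -out all_genomes -title "all_genomes"
    blastn -query "$query" -db all_genomes \
        -outfmt "6 std qlen slen staxids sscinames" \
        -task dc-megablast -out blast.out
}

# Usage:
#   search_all_genomes /home/manighanipoorsamami/local/taxdb query.fasta GCA_*.fasta
```

For a single database, setting BLASTDB to the taxdb directory (or copying the two taxdb files next to the database) before running the original blastn command should be enough.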