Question

How to create a tax_id_map file for getting scientific names in BLASTN output

0

3 months ago

Maria • 0

Hi,

I would like to get scientific names (sscinames) in my BLASTN output. Below is an example FASTA header from my reference sequences. How can I use these FASTA headers to create the taxid_map.txt file? Is there an easy way to get the tax IDs and names needed for the taxid_map file?

NW_021628210.1 Plasmodium gallinaceum strain 8A genome assembly, contig: PGAL8A_v1_53, whole genome shotgun sequence

I extracted the list of tax IDs for Plasmodium species using the following command. How do I get the names?

./get_species_taxids.sh -t 5820 > 5820_taxids.txt

BLAST NCBI • 573 views

ADD COMMENT • link 3 months ago by Maria • 0

1

3 months ago

GenoMax 147k

Using Entrez Direct (install it via conda, as shown in Philipp Bayer's answer), you can get the names:

$ efetch -db taxonomy -id 6990 -format docsum | xtract -pattern DocumentSummary -element Id,ScientificName
6990    Nauphoeta cinerea
$ efetch -db taxonomy -id 5820 -format docsum | xtract -pattern DocumentSummary -element Id,ScientificName
5820    Plasmodium
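
If you have a long list of tax IDs and would rather not query Entrez for each one, an offline lookup is another option. The sketch below is not the efetch method above: it assumes you have a names.dmp file from the NCBI taxonomy dump, whose fields are separated by tab, pipe, tab. The file names are made up for the demo, the demo names.dmp is a three-line stand-in, and the "placeholder synonym" row is invented only to show the filter on the name class.

```shell
# Tiny stand-in for names.dmp: tax_id | name | unique name | name class |
printf '5820\t|\tPlasmodium\t|\t\t|\tscientific name\t|\n' > demo_names.dmp
printf '6990\t|\tNauphoeta cinerea\t|\t\t|\tscientific name\t|\n' >> demo_names.dmp
printf '6990\t|\tplaceholder synonym\t|\t\t|\tsynonym\t|\n' >> demo_names.dmp

# The tax IDs we want names for, one per line (hypothetical file name).
printf '5820\n6990\n' > demo_taxids.txt

# First pass reads the wanted IDs; second pass keeps only the
# "scientific name" row of each wanted ID and prints ID <tab> name.
awk -F '\t[|]\t' '
  NR == FNR { want[$1] = 1; next }
  ($1 in want) && $4 ~ /^scientific name/ { print $1 "\t" $2 }
' demo_taxids.txt demo_names.dmp > demo_taxid_names.txt
```

With a real taxonomy dump you would point the awk command at the full names.dmp and at your own tax ID list instead of the demo files.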

ADD COMMENT • link 3 months ago by GenoMax 147k

0

Ah I see, I'm tired. I went down the taxids for blast rabbit hole, didn't think to add the names!

ADD REPLY • link 3 months ago by Philipp Bayer 8.7k

score 2 · Accepted Answer · 2024-08-20

2

3 months ago

Philipp Bayer 8.7k

Are all of your sequences from NCBI? If so, you can use Entrez to download the taxonomy IDs associated with your sequence IDs.

# get the IDs
grep '>' your_database.fasta | cut -f 1 -d ' ' > all_ids.txt

# query entrez nucleotide database - your data may be from somewhere else!
cat all_ids.txt | efetch -db nuccore -format docsum | xtract -pattern DocumentSummary -element AccessionVersion,TaxId > taxids_for_blast.txt

You can install efetch and xtract via conda:

conda install -c bioconda entrez-direct

You can also pull the genus and species names out of your sequence headers and hope for the best, but that will bite you when headers contain subspecies, cf., or sp. names. Below I assume that each header has exactly one two-word species name (e.g. 'Homo sapiens'). Many sequences have longer names, like "Bacillus sp. blabla -five", in which case the code below breaks, but it is faster than querying Entrez as above.

Here I use awk to put a tab between the sequence ID and the genus and species names, then pass the result to taxonkit, which can be installed via conda like entrez-direct above.

grep '>' your_database.fasta | awk '{print $1"\t"$2" "$3}' | taxonkit name2taxid -i 2  > names_and_taxids.txt
cut -f 1,3 -d "\t" names_and_taxids.txt > taxids_for_blast.txt

The taxonkit way is substantially faster if you have thousands of sequences but will break easily.
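
One way to soften the "will break easily" part is to split the headers before the name lookup: plain two-word names go on to the fast route, and anything that looks like sp., cf., or aff. is set aside for an Entrez query. This is only a sketch under assumptions: the file names and the two demo headers are invented, and the pattern catches only the abbreviation sitting in the third field.

```shell
# Two demo headers: one plain binomial, one "sp." name.
printf '>id1 Homo sapiens some description\nACGT\n' > demo_headers.fasta
printf '>id2 Bacillus sp. blabla -five\nTTGA\n' >> demo_headers.fasta

# Drop the leading ">" so the IDs match what BLAST sees, then route each header:
#   - third field is sp./cf./aff., or a name word is missing: set the ID aside
#   - otherwise: print ID <tab> Genus species for the name lookup
grep '>' demo_headers.fasta | sed 's/^>//' | awk '
  NF < 3 || $3 ~ /^(sp|cf|aff)\.$/ { print $1 > "demo_needs_lookup.txt"; next }
  { print $1 "\t" $2 " " $3 }
' > demo_ids_and_names.txt
```

The IDs collected in demo_needs_lookup.txt could then be sent through the slower efetch route, while demo_ids_and_names.txt goes to taxonkit name2taxid -i 2 as in the command above.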

ADD COMMENT • link 3 months ago by Philipp Bayer 8.7k

0

@Philipp Thank you for your answer. When I used the cut command, I got the error shown below.

cut -f 1,3 -d "\t" names_and_taxids.txt > taxids_for_blast.txt

 cut: the delimiter must be a single character

I couldn't solve it. Could you help me?

ADD REPLY • link 3 months ago by Maria • 0

0

Ah I see. Instead of typing backslash t, press CONTROL-V and then the Tab key once to insert a literal tab character. cut treats the backslash and the t as two separate characters, hence the error.

See this Unix & Linux Stack Exchange answer: https://unix.stackexchange.com/a/35370
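
If typing a literal tab is awkward (for example inside a script), there are two other ways to get the same result. The sketch below uses an invented one-line demo file; the file names are placeholders.

```shell
# Demo input: sequence ID, species name, tax ID, separated by tabs.
printf 'seq1\tHomo sapiens\t9606\n' > demo_names_and_taxids.txt

# Option 1: leave out -d. Tab is already cut's default delimiter.
cut -f 1,3 demo_names_and_taxids.txt > demo_taxids_default.txt

# Option 2: let printf produce the tab and pass it to -d as one character.
tab=$(printf '\t')
cut -f 1,3 -d "$tab" demo_names_and_taxids.txt > demo_taxids_explicit.txt
```

Both output files contain the sequence ID and the tax ID separated by a single tab.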

ADD REPLY • link 3 months ago by Philipp Bayer 8.7k

0

It worked well. Thank you! When I created the database using extracted ids, I got the following error:

makeblastdb -in plasmodium_seqs.fna -out plasmodium_db -parse_seqids -dbtype nucl -taxid_map taxids_for_blast.txt

Maximum file size: 1000000000B
Adding sequences from FASTA; added 8264 sequences in 4.29002 seconds.

Error: [makeblastdb] No sequences matched any of the taxids provided.

Could you help me to solve this error?

ADD REPLY • link 3 months ago by Maria • 0

0

It works now. I removed the > symbol.
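
For anyone hitting the same makeblastdb error, here is a rough sketch of that fix plus a quick check that every sequence ID in the FASTA file has an entry in the map before building the database. The demo files and their contents are invented (only the tax ID 5820 comes from this thread), and the check assumes the map's first column should equal the first word of each FASTA header.

```shell
# Demo FASTA and a map whose IDs still carry the ">" from the headers.
printf '>seqA some description\nACGT\n>seqB other description\nTTGA\n' > demo_seqs.fna
printf '>seqA\t5820\n>seqB\t5820\n' > demo_map_bad.txt

# Remove the leading ">" from the first column of the map.
sed 's/^>//' demo_map_bad.txt > demo_map_fixed.txt

# Sequence IDs as they appear in the FASTA: no ">" and no description.
grep '>' demo_seqs.fna | sed 's/^>//' | cut -d ' ' -f 1 | sort > demo_fasta_ids.txt

# IDs present in the map.
cut -f 1 demo_map_fixed.txt | sort > demo_map_ids.txt

# IDs that are in the FASTA but missing from the map (empty file = all covered).
comm -23 demo_fasta_ids.txt demo_map_ids.txt > demo_missing.txt
```

If demo_missing.txt is not empty, those sequences would end up without a tax ID in the database.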

ADD REPLY • link 3 months ago by Maria • 0