Question

TaxID mapping file

0

12 months ago

Lada ▴ 40

Hi guys,

does anyone know how I can get a TaxID mapping file for the NR or UniProt database?

Background: I use Diamond for my de novo transcriptome annotation. My next goal is to use the hits TSV file in blobtools for contamination detection. For that, hits.tsv needs my query transcript IDs together with the corresponding subject TaxIDs. Diamond doesn't give that information, but I can use the blobtools taxify option to match the corresponding TaxIDs to my subject hits. According to the blobtools documentation, this requires a TaxID mapping file for the database that I used for annotation, and that file consists of information such as.

in this example

I am not sure how to get that file so please help. :)

annotation blobtools RNAseq decontamination transcriptomes • 943 views

ADD COMMENT • link updated 28 days ago by WaspInSpace • 0 • written 12 months ago by Lada ▴ 40

1

The nodesDB file should have been installed if you used the "install" script for blobtools, according to: https://blobtools.readme.io/docs/taxonomy-database

You can find the NCBI taxonomy database files here: https://ftp.ncbi.nih.gov/pub/taxonomy/ Take a look at https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt for the contents.

ADD REPLY • link 12 months ago by GenoMax 151k

0

Thank you, I'll look at these files/documents.

ADD REPLY • link 12 months ago by Lada ▴ 40

0

If I understood correctly, I might need the prot.accession2taxid.gz file? According to the documentation, column 2 is Accession.version and column 3 is TaxID. I should download that file from NCBI, unpack it, and then do:

# -s: column of the subject sequenceID in the TaxID mapping file
# -t: column of the TaxID in the TaxID mapping file
blobtools taxify \
 -f diamond.out \
 -m prot.accession2taxid.taxids \
 -s 2 \
 -t 3

Does that make sense?

Did anyone try this?

I also saw this post about getting taxonomy info in Diamond output. Still, it seems that has to be incorporated in the makedb step, and according to the Diamond documentation I might get more than one TaxID per hit, which I am not sure would work with blobtools.
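As a rough sketch of the mapping-file step discussed in this comment: the function below reduces prot.accession2taxid to just the two columns named above (Accession.version and TaxID). The function name `make_taxid_map` is made up for illustration, and the code assumes the file is tab-separated and starts with a header row.

```shell
# Hypothetical helper (name is illustrative, not part of blobtools or NCBI tools).
# Reads an unpacked prot.accession2taxid stream on stdin and prints two
# tab-separated columns: accession.version and taxid.
# Assumption: the first line is a header row and is skipped.
make_taxid_map() {
    awk -F'\t' 'NR > 1 { print $2 "\t" $3 }'
}

# Possible usage (output file name is an assumption):
#   gunzip -c prot.accession2taxid.gz | make_taxid_map > acc2taxid.tsv
```

With a two-column file like this, the sequence ID and the TaxID sit in the first and second column, so the -s and -t values passed to blobtools taxify would have to point at those two columns instead of columns 2 and 3 of the full file.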

ADD REPLY • link 12 months ago by Lada ▴ 40

0

I know this was posted some time ago, and I hope my comment is not off-topic, but in case anybody needs it, here is something concerning the columns of diamond.out and blobtools taxify:

The columns of the .out file generated by Diamond with --outfmt 6 (!) have this order: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore. This can vary with other --outfmt values. Depending on the database used, the second column may consist of several parts separated by "|", so blobtools taxify might not recognize it.

Example:

    contig 1      tr|A0A7M7T6N7|A0A7M7T6N7_NASVI    60.8    11166   562 63  1452449 1419066 2096    9453    0.0 12295
    contig 2      tr|A0A7M7SXJ9|A0A7M7SXJ9_STRPU    44.5    3645    1274    38  4485486 4475128 1457    4529    0.0 2803
    contig 3      tr|A0AAJ7E2M5|A0AAJ7E2M5_9HYME    70.5    1700    400 6   11202992    11208085    1   1599    0.0 2347

My (not very elegant) solution was to insert a second column that only contains the protein ID:

    sed 's/\(\t\)[^|]*|\([^|]*\)|.*$/\1\2&/' "$INPUT_FILE" > "$OUTPUT_FILE"

I then used this $OUTPUT_FILE for taxification (-f).

Example:

    contig 1    A0A7M7T6N7  tr|A0A7M7T6N7|A0A7M7T6N7_NASVI  60.8    11166   562 63  1452449 1419066 2096    9453    0.0 12295
    contig 2    A0A7M7SXJ9  tr|A0A7M7SXJ9|A0A7M7SXJ9_STRPU  44.5    3645    1274    38  4485486 4475128 1457    4529    0.0 2803
    contig 3    A0AAJ7E2M5  tr|A0AAJ7E2M5|A0AAJ7E2M5_9HYME  70.5    1700    400 6   11202992    11208085    1   1599    0.0 2347
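For anyone who finds the sed expression hard to read, the same column insertion can be sketched with awk. This is only an alternative under assumptions: the hits file is tab-separated, the subject ID in column 2 looks like db|ACCESSION|NAME, and the function name `add_accession_column` is made up for illustration.

```shell
# Hypothetical helper: reads tab-separated Diamond hits on stdin and inserts
# a new second column holding only the accession taken from the subject ID.
# Assumption: subject IDs look like "tr|A0A7M7T6N7|A0A7M7T6N7_NASVI".
# IDs without "|" are copied unchanged into the new column.
add_accession_column() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             n = split($2, parts, "[|]")       # split subject ID on "|"
             acc = (n >= 2) ? parts[2] : $2    # middle part is the accession
             $2 = acc OFS $2                   # put accession before the full ID
             print
         }'
}

# Possible usage:
#   add_accession_column < "$INPUT_FILE" > "$OUTPUT_FILE"
```

Falling back to the full ID when there is no "|" keeps the column count the same on every line, so the bitscore stays in the same column for all hits.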

Then you use -s 0 -t 1 -c 12. (I think that, since this is Python, the columns start at 0; -s and -t refer to the columns of your taxid_map, and -c refers to the bitscore column of your hits file.)

Your taxified.out file is then the -t input for blobtools create; the TaxID will be in the second column of this file.

Besides, more than one hit per contig works with blobtools: it selects the best one! When there are several TaxIDs, it uses the first one.
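To check by hand which hit would count as "best" for each contig, a small filter like the one below can help. It is not how blobtools does it internally, just a sketch that keeps the highest-bitscore line per query; the function name `best_hit_per_query` is made up, and the bitscore column is passed as a 1-based number because awk counts from 1.

```shell
# Hypothetical helper: reads tab-separated hits on stdin and prints, for each
# query ID in column 1, the line with the highest bitscore.
# $1: 1-based column number of the bitscore (13 after inserting the extra
#     accession column into a 12-column hits file, 12 otherwise).
# On ties, the first line seen is kept. Queries are printed in input order.
best_hit_per_query() {
    awk -F'\t' -v col="$1" '
        !($1 in best) || $col + 0 > best[$1] + 0 { best[$1] = $col; line[$1] = $0 }
        !seen[$1]++ { order[++n] = $1 }
        END { for (i = 1; i <= n; i++) print line[order[i]] }'
}

# Possible usage:
#   best_hit_per_query 13 < "$OUTPUT_FILE"
```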

ADD REPLY • link 28 days ago by WaspInSpace • 0