Hi guys,
does anyone know how I can get a TaxID mapping file for the NR or UniProt database?
Background:
I use Diamond for my de novo transcriptome annotation. My next goal is to use the hits TSV file in blobtools for contamination detection. To do that, I need my query transcript IDs together with the corresponding subject TaxIDs in the hits.tsv file. Diamond doesn't give that information, but I can use the blobtools taxify option to match the corresponding TaxIDs to my subject hits. I read the blobtools documentation, and to do that I need a TaxID mapping file for the database that I used for annotation, i.e. a file that lists the TaxID for each subject sequence ID.
I am not sure how to get that file, so please help. :)
The nodesDB file should have been installed if you had used the "Install" script for blobtools, according to: https://blobtools.readme.io/docs/taxonomy-database
You can find the NCBI taxonomy database files here: https://ftp.ncbi.nih.gov/pub/taxonomy/ Take a look at https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt for the contents.
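For orientation, here is a minimal Python sketch of reading nodes.dmp-style lines from the taxdump archive into a lookup table. It assumes the layout described in taxdump_readme.txt (fields separated by tab, pipe, tab; tax_id first, parent tax_id second, rank third). The function name parse_nodes and the two sample lines are made up for illustration, and this is not how blobtools builds its own nodesDB.

```python
import io


def parse_nodes(handle):
    """Map each tax_id to (parent_tax_id, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Fields are separated by "\t|\t" and each line ends with "\t|".
        fields = [f.strip() for f in line.rstrip("\t|").split("\t|\t")]
        tax_id, parent_id, rank = fields[0], fields[1], fields[2]
        nodes[int(tax_id)] = (int(parent_id), rank)
    return nodes


# Illustrative sample lines only (real nodes.dmp lines carry more fields).
sample = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)
nodes = parse_nodes(sample)
```

For the real file you would pass an open file handle for nodes.dmp instead of the in-memory sample.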
Thank you, I'll look at these files/documents.
If I understood correctly, I might need the prot.accession2taxid.gz file? According to the documentation, column 2 is Accession.version and column 3 is TaxID. I should download that file from NCBI, unpack it, and then use it as the TaxID mapping file for blobtools taxify. Does that make sense?
Did anyone try this?
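To make the idea concrete, here is a rough Python sketch of what such a lookup amounts to: read Accession.version (column 2) and TaxID (column 3) from an accession2taxid-style table, then append the TaxID to each hit line. The function names, the optional wanted filter, and the accession ABC12345.1 are hypothetical; this only illustrates the join and is not what blobtools taxify does internally.

```python
import io


def load_taxid_map(handle, acc_col=1, taxid_col=2, wanted=None):
    """Read accession.version -> TaxID (0-based columns) from a tab-separated table."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(acc_col, taxid_col):
            continue  # malformed or short line
        if not fields[taxid_col].isdigit():
            continue  # skips the header row
        if wanted is not None and fields[acc_col] not in wanted:
            continue  # keep memory use down by storing only needed accessions
        mapping[fields[acc_col]] = fields[taxid_col]
    return mapping


def add_taxids(hits, mapping, sseqid_col=1, missing="N/A"):
    """Yield each hit as a list of fields with the TaxID appended as the last field."""
    for line in hits:
        fields = line.rstrip("\n").split("\t")
        yield fields + [mapping.get(fields[sseqid_col], missing)]


# Made-up example data: one header line, one mapping line, two hits.
map_file = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "ABC12345\tABC12345.1\t9606\t0\n"
)
hits = ["contig_1\tABC12345.1\t95.0\n", "contig_2\tXYZ00001.1\t80.0\n"]
taxid_map = load_taxid_map(map_file)
annotated = list(add_taxids(hits, taxid_map))
```

The real mapping file is large, so passing the set of subject IDs that actually occur in the hits as wanted may be worthwhile. A gzipped file can be read directly with gzip.open(path, "rt") from the standard library.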
I also saw this post about getting taxonomy info in Diamond output. Still, it seems it has to be incorporated in the makedb step, and according to the Diamond documentation I might get more than one TaxID per hit, which I am not sure will work with blobtools.
I know this was posted some time ago, and I hope my comment is not off-topic, but in case anybody needs it, here is something concerning the columns of diamond.out and blobtools taxify:
The columns of the .out file generated by Diamond with -outfmt 6 (!) have this order: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore. This can vary with other -outfmt values. Depending on the database used, the second column may consist of several parts separated by |, so blobtools taxify might not recognize it.
My (not very elegant) solution was to insert a second column that only contains the protein ID, with:
sed 's/\(\t\)[^|]*|\([^|]*\)|.*$/\1\2&/' "$INPUT_FILE" > "$OUTPUT_FILE"
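For readers who prefer to avoid the regular expression, here is a rough Python equivalent of that sed call. It assumes the subject ID has at least three |-separated parts with the protein ID in the middle; the function name and the ID sp|P12345|EXAMPLE_HUMAN are hypothetical. Unlike the sed command, which leaves non-matching lines unchanged, this sketch always inserts a column and falls back to the full subject ID.

```python
def insert_protein_id(line, sseqid_col=1):
    """Insert a column holding only the protein ID in front of the subject ID column."""
    fields = line.rstrip("\n").split("\t")
    sseqid = fields[sseqid_col]
    parts = sseqid.split("|")
    # "db|ID|name" layout: the ID is the second part; otherwise keep the ID as is.
    protein_id = parts[1] if len(parts) >= 3 else sseqid
    return "\t".join(fields[:sseqid_col] + [protein_id] + fields[sseqid_col:])


# Made-up hit line with only three of the twelve columns, for brevity.
converted = insert_protein_id("contig_1\tsp|P12345|EXAMPLE_HUMAN\t95.0\n")
plain = insert_protein_id("contig_2\tABC12345.1\t80.0\n")
```

Applied to every line of the Diamond output, this gives the same column layout as the sed version: query ID, protein ID, original subject ID, then the remaining columns.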
I used this $OUTPUT_FILE for taxification (-f).
Then you use -s 0 -t 1 -c 12 (I think; we are in Python, so the columns start at 0. -s and -t refer to the columns of your taxid_map, and -c refers to the bitscore column of your hits file).
Your taxified.out file is your -t in blobtools create; the TaxID will be in the second column of this file.
Besides, more than one hit per contig works with blobtools; it selects the best! For several TaxIDs it uses the first one.
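As a quick sanity check on the 0-based numbering, this small Python sketch lists the -outfmt 6 columns named above and looks up where bitscore ends up once the extra protein ID column has been inserted. The names OUTFMT6, column_index and protein_id are made up for illustration.

```python
OUTFMT6 = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()


def column_index(name, with_protein_id=False):
    """Return the 0-based index of a column, optionally after inserting protein_id."""
    cols = list(OUTFMT6)
    if with_protein_id:
        # The extra column sits right after qseqid, shifting the rest by one.
        cols.insert(1, "protein_id")
    return cols.index(name)


bitscore_plain = column_index("bitscore")
bitscore_shifted = column_index("bitscore", with_protein_id=True)
protein_id_pos = column_index("protein_id", with_protein_id=True)
```

So bitscore is column 11 in plain -outfmt 6 output and column 12 after the insertion, which matches the -c 12 used above.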