Getting full taxonomy from BLAST results without "staxids" in output
7.8 years ago
ScubaChris ▴ 10

Hi everyone,

long story short, I dun goof'd: I ran DIAMOND on a huge number of metagenomics samples, but I didn't include the "staxids" parameter in the output. (In my puny defense, this parameter wasn't mentioned in the DIAMOND manual.) Now I have a couple of hundred thousand output lines looking like this:

042SRF022_1 gi|751637161|ref|WP_041104882.1|    40.4    151 82  2   999 547 1   143 2.8e-21 110.9

Is there a sane way of getting the taxonomy for each output line so I can create a report without having to run the entire thing again? I tried getting the "gi_taxid_nucl.dmp.gz" from NCBI and running grep on each gi, but it a) takes ages and b) doesn't seem to work. I am thinking of putting the entire file in an SQL database and running queries on it. Any ideas welcome.

taxonomy blast diamond metagenomics • 2.5k views
7.8 years ago

I tried getting the "gi_taxid_nucl.dmp.gz" from NCBI and running grep on each gi, but it a) takes ages

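because it's the wrong method. Extract the gi from the BLAST output and sort on that column (blast_output.tsv stands for your own output file):

awk -F '|' '{printf("%s\t%s\n",$2,$0);}' blast_output.tsv | sort -t $'\t' -k1,1

Decompress gi_taxid_nucl.dmp.gz and sort it on the gi column (its first column) in the same way,

and then use the Linux join command to merge both files on that column.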
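The sort-and-join steps above can be put together roughly as follows. This is a minimal sketch, not a tested pipeline for the full NCBI dump: the function name annotate_hits and its two arguments are placeholders, and it assumes the mapping file has already been decompressed into two tab-separated columns (gi, taxid). Since the hit in the question carries a WP_ protein accession, the protein mapping may be the relevant table rather than the nucleotide one. LC_ALL=C is set so that sort and join agree on the ordering.

```shell
# Sketch only: annotate_hits and its arguments are hypothetical names.
# $1 = tabular DIAMOND/BLAST output whose subject ids look like gi|NNN|ref|ACC|
# $2 = uncompressed gi -> taxid table, two tab-separated columns: gi, taxid
annotate_hits() (
    hits=$1
    gi2taxid=$2
    tab=$(printf '\t')
    tmp=$(mktemp -d) || exit 1
    export LC_ALL=C   # byte-wise collation, so sort and join use the same order

    # Prefix every hit with its gi (2nd '|'-separated field), then sort on it.
    awk -F '|' '{printf("%s\t%s\n", $2, $0)}' "$hits" |
        sort -t "$tab" -k1,1 > "$tmp/hits"

    # Sort the mapping on its gi column with the same collation.
    sort -t "$tab" -k1,1 "$gi2taxid" > "$tmp/map"

    # Output columns: gi, the original hit line, taxid.
    join -t "$tab" "$tmp/hits" "$tmp/map"
    status=$?

    rm -rf "$tmp"
    exit "$status"
)
```

It would be called as annotate_hits blast_output.tsv gi_taxid.dmp > hits_with_taxid.tsv (both file names are placeholders). Hits whose gi is missing from the table are dropped silently by join, so comparing line counts before and after is a cheap sanity check.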

and b) doesn't seem to work.

because a plain grep for "2" matches every line that contains that substring: you'll get "2", but also "22", "222", "gene2", etc.
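To illustrate the difference, here is a small sketch of an exact lookup that compares the whole first column instead of searching for a substring. The function name lookup_taxid is hypothetical, and the table is again assumed to be an uncompressed, tab-separated gi/taxid file. It still scans the file once per gi, so it only fixes the wrong matches, not the speed; for hundreds of thousands of hits the sort-and-join approach remains the better fit.

```shell
# Sketch only: lookup_taxid is a hypothetical helper name.
# $1 = gi to look up
# $2 = uncompressed gi -> taxid table, two tab-separated columns: gi, taxid
lookup_taxid() {
    # Compare the entire first column with the gi, so looking up 2 does not
    # also match 22 or 222; stop at the first match.
    awk -F '\t' -v gi="$1" '$1 == gi { print $2; exit }' "$2"
}
```

Called as lookup_taxid 751637161 gi_taxid.dmp (placeholder file name), it prints the taxid of that one gi, or nothing if the gi is absent from the table.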
