4.7 years ago
O.rka
I recently discovered the staxids field in diamond (staxid is not a thing in diamond). I'm trying to assign taxonomy identifiers to all of my ORFs, but in many cases a hit comes back with 2 or more identifiers (sometimes many more).
What is the recommended way for picking the "best" one? I don't want to randomly choose one, grab the first, etc. Is there a systematic way I can do this that is robust? Maybe the one that is the "most reliable"?
Here's the example output below. I can't use a regex search for [] in the stitle because not all of them have this suffix.
qseqid NODE_100002_length_1286_cov_2.42892_1132_1285_-
sseqid WP_021626941.1
pident 94.4
length 18
mismatch 1
gapopen 0
qstart 1
qend 18
sstart 110
send 127
evalue 0.22
bitscore 44.3
staxids 1227265;1227266
sscinames Capnocytophaga sp. oral taxon 863;Capnocytophaga sp. oral taxon 863 str. F0517
stitle WP_021626941.1 hypothetical protein [Capnocytophaga sp. oral taxon 863]
Name: 6422120, dtype: object
What do you mean by best one? Both are pointing to the same genus. If you are using an 18 AA long hit, it is likely not enough to give you absolute confidence.
That's a good point! I hadn't realized this is one of the shorter ORF calls. So in this case, would it have mapped equally well to 1227265 and 1227266?
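One systematic option, instead of picking an identifier at random, is to collapse the multiple staxids of a hit to their lowest common ancestor (LCA). The sketch below is only an illustration: the function names are made up for this example, and the parent map is a toy lineage assumed for demonstration (in practice it would be built from a taxonomy dump such as NCBI's nodes.dmp). It is not something diamond provides through the staxids field itself.

```python
def parse_staxids(field):
    """Split a staxids value such as '1227265;1227266' into a list of ints."""
    return [int(t) for t in str(field).split(";") if t.strip()]


def lineage(taxid, parent):
    """Return the path from taxid up to the root by following the parent map."""
    path = [taxid]
    while taxid in parent and parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent):
    """Return the deepest node shared by the lineages of all taxids, or None."""
    if not taxids:
        return None
    first = lineage(taxids[0], parent)
    shared = set(first)
    for t in taxids[1:]:
        shared &= set(lineage(t, parent))
    # Walk the first lineage from the leaf upward; the first shared node is the LCA.
    for node in first:
        if node in shared:
            return node
    return None


# Toy parent map, ASSUMED for illustration only (child -> parent).
# It treats the strain 1227266 as a child of 1227265, which is what the
# sscinames in the example suggest; the higher ranks are placeholders.
TOY_PARENT = {1227266: 1227265, 1227265: 1016, 1016: 1}

taxids = parse_staxids("1227265;1227266")
print(lowest_common_ancestor(taxids, TOY_PARENT))
```

Under that assumed lineage, the hit above collapses to 1227265, since the second identifier is a strain of the first. With hits this short, filtering on alignment length or evalue before assigning any taxonomy is probably worth doing as well.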