BLAST multiple staxids
5.9 years ago

I am using following output format to get my blastp output:

-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids"

However, I get multiple values for staxids. What are they, and why do they appear? I was expecting only one taxonomy ID for the subject. I cannot find this in the BLAST documentation.

Look at this example: https://ibb.co/9ZzJd9z

alignment sequence • 3.7k views

What is the difference between those two? How can a subject have multiple taxonomy IDs?

5.9 years ago
gb ★ 2.2k

From the help page:

staxid means Subject Taxonomy ID
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)

So you can use staxid instead of staxids
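If you do keep staxids, the ';'-separated field described in the help text can be split after the fact. Here is a minimal Python sketch, assuming the 13-column tab-separated layout from the question. The function name `parse_hits` and the example line are made up for illustration and are not real BLAST output.

```python
import csv
import io

# Column names matching the -outfmt 6 specifier used in the question.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore staxids").split()


def parse_hits(handle):
    """Yield one dict per tabular hit, with staxids split into a list of ints."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        # staxids may hold several IDs separated by ';'.
        # Non-numeric values (e.g. a placeholder for missing taxonomy) are skipped.
        hit["staxids"] = [int(t) for t in hit["staxids"].split(";") if t.isdigit()]
        yield hit


# Made-up example line (hypothetical query/subject names and scores);
# only the two taxonomy IDs are taken from this thread.
example = ("query1\tsubjectA\t100.000\t120\t0\t0\t1\t120\t1\t120"
           "\t1e-80\t250\t562;1444084\n")

hits = list(parse_hits(io.StringIO(example)))
print(hits[0]["staxids"])
```

For a real result file you would pass an open file handle instead of the `io.StringIO` wrapper.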


I agree, thanks! But I am still curious how we can get multiple taxonomy IDs when we use the staxids option.


I do not have an exact explanation for you, but look up the taxon IDs. In your case the protein comes from Escherichia coli (562), which has species rank, and Escherichia coli 3-105-05_S1_C2 (1444084), which is the same species but a specific strain. So I think it has something to do with taxon IDs that share the same species but carry an extra strain number or code.


Yes, I also looked at exactly those TaxIDs and came to the same understanding as you. But now I think other E. coli strains may also have the same sequence, which is why I get multiple taxonomy IDs.


Do you know the taxonomic assignment program MEGAN? Its manual suggests that those multiple IDs are indeed from other organisms with the same sequence: "...an entry in a reference database may have more than one taxon associated with it. For example, in the NCBI-NR database, an entry may be associated with up to 1000 different taxa. This implies, in particular, that a read that may be assigned to a high level node (even the root node), even though it only has one significant hit, if the corresponding reference sequence is associated with a number of very different species." http://ab.inf.uni-tuebingen.de/data/software/megan6/download/manual.pdf

So, if a reference sequence has multiple associated IDs, MEGAN assigns it to their "lowest common ancestor" instead of just to the organism that the sequence came from.
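To make the "lowest common ancestor" idea concrete, here is a naive Python sketch. It is not MEGAN's actual implementation, and the `PARENT` map is a hand-written toy, not read from the NCBI taxonomy dump: only the strain-to-species link (1444084 under 562) comes from this thread, the upper levels are shortened for brevity, and 900000001 is a hypothetical ID invented for the example.

```python
# Toy child -> parent map (an assumption for illustration, not the real taxonomy).
PARENT = {
    1444084: 562,    # E. coli strain -> E. coli species (as discussed in this thread)
    562: 561,        # species -> genus
    561: 543,        # genus -> family
    900000001: 543,  # hypothetical taxon in the same family, invented for this example
    543: 1,          # shortcut straight to the root (real lineage has more ranks)
    1: 1,            # the root is its own parent
}


def lineage(taxid, parent=PARENT):
    """Return the path from taxid up to the root, starting with taxid itself."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxids, parent=PARENT):
    """Return the deepest node that lies on the lineage of every given taxid."""
    taxids = list(taxids)
    shared = set(lineage(taxids[0], parent))
    for taxid in taxids[1:]:
        shared &= set(lineage(taxid, parent))
    # Walk upwards from the first taxid; the first shared node is the LCA.
    for node in lineage(taxids[0], parent):
        if node in shared:
            return node


print(lowest_common_ancestor([562, 1444084]))        # species and its own strain
print(lowest_common_ancestor([1444084, 900000001]))  # strain and hypothetical taxon
```

With a species and one of its own strains the result stays at the species, whereas taxa from different branches push the assignment further up the tree.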

The BLAST help suggests that it's not just 'staxids' that can have multiple entries. 'sscinames', 'scomnames', 'sblastnames' and 'sskingdoms' might refer to these associated taxa too. It would be nice if the documentation was clearer!


This is not exactly how MEGAN works or how you can use it. Also, you can only determine a species if you look at a specific marker gene like 16S or COI. If you BLAST a COI sequence and get a significant, good hit, you can say that it is the right species. If you do not have a good hit, you can use MEGAN: you BLAST your COI marker and get, say, 5 hits above a certain threshold. Then with MEGAN you can find the lowest common ancestor of those 5 hits, and that will be the identification for your gene.


staxids doesn't work for diamond unfortunately.
