Hi,
I'm interested in getting the scientific names of my BLAST hits from searches run locally. I see the BLAST+ search applications have an -outfmt option that can take the sscinames specifier (it seems to be new in BLAST+ 2.2.28), but even using nt from NCBI (no luck with local databases either) I get N/A for this specifier. The same happens with '%S' in the -outfmt option of blastdbcmd.
For example:
$ blastdbcmd -db nt -entry 229577210 -outfmt '%a || %g || %T || %S || %t'
NM_001743.4 || 229577210 || 9606 || N/A || Homo sapiens calmodulin 2 (phosphorylase kinase, delta) (CALM2), mRNA
Until now I've been using taxids in a very convoluted way: I get the GIs from my hits, query the BLAST database with blastdbcmd to get the taxid, and then query a local copy of the NCBI taxonomy database with BioPerl to get the scientific name. Now that BLAST+ seems able to output the scientific name directly, I would like to simplify things. I can already simplify things a little with the staxids output format specifier, which is also new, so I can now get the taxid directly from the BLAST output.
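The taxid-to-name step of the workflow above can be sketched without BioPerl. This is a minimal sketch, assuming the local taxonomy copy is the names.dmp file from the NCBI taxonomy dump (fields separated by a tab, a pipe and a tab, with the name class in the fourth field). The function name taxid_to_sciname is hypothetical, and it expects one taxid per line on standard input.

```shell
# Look up the scientific name for each taxid read from stdin.
# Usage: taxid_to_sciname path/to/names.dmp < taxids.txt
# (taxid_to_sciname is a hypothetical helper, not part of BLAST+.)
taxid_to_sciname() {
    awk -F '\t[|]\t' '
        # First file (names.dmp): keep only the "scientific name" rows.
        NR == FNR {
            sub(/\t[|]$/, "", $4)    # strip the trailing tab-pipe terminator
            if ($4 == "scientific name") name[$1] = $2
            next
        }
        # Second input (stdin): print taxid and name, or N/A when unknown.
        { print $1 "\t" (($1 in name) ? name[$1] : "N/A") }
    ' "$1" -
}
```

If the taxids come from a tabular BLAST report that includes staxids, the relevant column could be cut out and piped into the function. Which column that is depends on the -outfmt string used.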
So my question is:
- Is there a way to build local blast databases in a way so 'sscinames' can be used to output the scientific name in blast+ results?
On a side note: if there is a way, it seems odd that NCBI's nt is not built using it. At least that is the case for the version I got, dated Jul 11 2013.
Thanks in advance,
Carlos
EDIT: I found I can now use staxids to simplify my life a little. Some additional question formatting. NT updated to version from Jul 11.
Generally the sequence headers are taken from the FASTA sequences, so if the FASTA header has the information, the BLAST output will display it. makeblastdb is used to create a local database.
Sorry, but I think it is more complicated than that. For example, the taxid won't be parsed from the FASTA header. If you want your locally built BLAST database to have taxid information for each record, you need to provide a GI-to-taxid map file. You can do this using the makeblastdb option -taxid_map. My question is how I can now include scientific names when building a BLAST database, so that I can use the new output format specifier sscinames.
Thanks, Carlos
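For reference, the -taxid_map step described in the comment above might look like the following. This is a sketch only: the wrapper name make_db_with_taxids and the file names in the usage line are hypothetical, and the map file is assumed to hold one sequence ID and one taxid per line, separated by whitespace (for example, 229577210 9606). It attaches taxids to the records; whether that alone is enough for sscinames to be filled in is exactly the open question in this thread.

```shell
# Build a nucleotide BLAST database whose records carry taxids.
# Usage: make_db_with_taxids seqs.fasta gi_taxid.map my_db
# (make_db_with_taxids and the file names are hypothetical.)
make_db_with_taxids() {
    # -parse_seqids lets makeblastdb match the IDs in the map file
    # against the IDs parsed from the FASTA deflines.
    makeblastdb -in "$1" -dbtype nucl -parse_seqids \
                -taxid_map "$2" -out "$3"
}
```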
Since the input has to include the information for it to be available in the BLAST database, I suspect this is one of the cases where you have to build the BLAST database from ASN.1 format data. However, as you have noticed, it appears that the BLAST databases provided by NCBI, at least 'nt' and 'nr', are missing the additional information for '%S' (and '%L').
This could be related to compatibility with the legacy NCBI BLAST programs, it might be a decision made because of the resulting increase in database file size, or it could be that the methods used to create these databases have problems including this information. In any case, it looks like your best bet is to contact the BLAST folks at NCBI (see http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs) and ask whether they can provide further information about which of their databases contain this information and how to create your own databases containing it.