If you set up the ncbi taxonomy db (explained in the manual) then you could have used custom output to include e.g. species name:
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
Options 6, 7, and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
..
..
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
You don't have to redo the blast though, because you can connect the GI numbers to NCBI taxonomy IDs with Entrez Direct.
epost -db protein -id 807531834 | elink -target taxonomy | efetch -format docsum | xtract -element ScientificName
Campethera nivosa
Or maybe LCA classification method would be better?
thanks for your comment. For blast output, I used
-outfmt '6 std sscinames scomnames stitle'
, but I just found that there is N/A forsscinames
andscomnames
even thetaxdb.btd
andtaxdb.bti
were in the same directory with blast database. Could you please let me know what's wrong here?. Is it possible to extract species name based onstitle
information?Create
.ncbirc
file in your$home
(nano ~/.ncbirc
) and paste the following there (obviously change the path to the correct one). Assuming you're using a prebuilt database (in contrast to making one yourself from a fasta file), the taxdb addition should now work as long as it and your db are in the dir where BLASTDB points..thanks. I creat this file, but the problem still exist. Also, when I try to apply the
source ~/.ncbirc
, it gave me the error :while blast program is in the PATH and for example blastx can called from anywhere. I'm totally confused. What do you think dear friend?
I think the blast binaries source the ~/.ncbirc file before execution. Can you
cat ~/.ncbirc
andls -la /path/to/your/db/dir
Yes,
cat ~/.ncbirc
show what you suggested me to writ that is:and the output of second command is showing the content of the related directory.
Please let me know if there is any command to extract and count species based on stitle information?
So the preformatted db and taxdb have been extracted to
/home/seta/software/ncbi-blast-2.2.30+/bin/dbtest/
and you're blasting against the db file that is in that dir with blast version 2.2.30+?which blastx
returns/home/seta/software/ncbi-blast-2.2.30+/bin/blastx
?Is /home right in your root? I don't recall what is in stitle, but as I wrote above, you can link gi to taxonomy..
Yeah, that's exactly right. Also,
which blastx
return the output like what you wrote in the post. I'm working on a server as a regular user and not in the root position.the
stitle
information is something like this linewhich the species name was written in the bracket. I'm newbie in this field, but I think there should be a command to extract the species name with their counts, say Citrus sinensis 4. As I see, you are so expert. Please let me to ask again for your help. Many thanks
It's in the column of 17.
So, for example fields 1-12 and species name:
Or if you're just interested in counts:
Note, for this to be reasonable, your output should be sorted for best hits per query. But how can you sort blastx output for best hits when different ORFs from the same contig have the same ID in column 1? I guess you could parse the coordinates of ORFs into this and then require no overlap (although sometimes there's overlap), but this is starting to be too complicated for a one-liner. This is one of the main reasons why I prefer protein prediction + blastp over blastx. Another is speed.
Also, I can't stress enough how this is a very poor way to assign taxonomy to sequences. For example, I recently blasted some bacterial proteome against nr, removed selfhits, sorted for best hits, and then checked distribution of species. Instead of just one, there were dozens and dozens spanning multiple bacterial phyla. I think a few hits were even to archaea. This is due to 1) horizontal gene transfer and 2) because there's no hand of God that forces proteins to evolve exactly the same way. Just because two species are deemed the closest relatives due to the highest similarity of the 16S rRNAs doesn't mean that every single one of their proteins will represent reciprocal blast hits. LCA method with a reasonable bit score cutoff (like 0.9X best hit) is a much better way to assign taxonomy to sequences..
hanks for being with me. I applied your command on the blast output, but it generates the normal tabular format of output, like below:
My original output is like here, which the stitle information is the last column,
I need help to extract the species name (appear in the bracket in
stitle
filed) with their counts in this output.It's in column 16, not 17. Anyway, since it's the last column you can just:
But really, this is a very bad way to assign taxonomy to sequences and will overestimate the actual number of species like at least 20 fold. More than that, I doubt that you have somehow managed to sort your blastx output for best hits in a reasonable way so your result is the species distribution of number of hits against each query.
edit. Well maybe with transcriptome you can expect that the majority of the contigs represent single protein mRNAs instead of operons? Median lenght of transcripts could give you a clue? So you could sort based on column1 first, which would then restrict it to one protein per contig..
Thanks so much for your help and time. I'm glad to hear why you believed it's bad way and may overstimate the real counts?
Given your experience, I may shift to LCA as you suggested. However, it has not been mentioned the way in many papers, usually, author just show the species distribution in the pie chart.
No problem. Also, please no piecharts. When there are like 10 different hues of every color nobody can ever gain anything from piecharts. I wish all journals would just ban them. Heatmaps are a far superior way to represent species distribution, IMO.
Also, you can read a little bit about best, representative and lca hits in e.g. mg-rast manual: ftp://ftp.metagenomics.anl.gov/data/manual/mg-rast-tech-report-v3_r1.pdf
Thanks for your comment. I got the best hit from tabular format output (6) using the command:
Please let me know it's a right way?
I just found your sentences about your preference for ORF prediction+blastp over blastx. There is a lot of program for ORF finding, I'll be happy to know which program you use for ORF prediction?
So, the above command to extract best hit can not be right, yes?
As long as you're fine with discarding everything but the highest scoring ORF from operons and possible other mRNAs that include multiple ORFs (and from misassemblies in which two or more mRNAs are joined somehow).
As I told earlier, I'm new in this filed and talking with you is really helpful for me, thanks. I would be highly appreciate if you could please let me know which ORF prediction program you used, then what is your command to get the best hits? I'm working on plant.
No problem. I learn a lot from the Internet as well. The command you posted for best hits gives preference to bitscore > evalue > identity. I think it's the best option for rudimentary blast-based functional annotation. But for relative taxa abundance in protein space, the LCA method is IMO more reasonable. Instead of species level, which is unreasonable expectation because of what I wrote earlier, the LCA results are a mix of up to genus, family, etc. taxonomic rank annotations. In this context, you can think of proteins that are associated with low taxonomic rank LCAs as marker genes for those taxa. I think one of the best uses of metatranscriptomic reads is mapping coverage of corresponding metagenomic assemblies. I am not going to advertise any particular algorithm for protein prediction but prodigal has very convenient output options.
Thanks for backing to me. As I see, the prodigal tool designed for prokaryotic genome. I'm working on plants, so it can not be suitable for me. You know, there is a lot of ORF prediction tool, which each of them has some pros and cons that I have any experience about them, so it's a bit difficult to choose the tool. Could you please let me to have your email address just for urgent question and consultation with you?
Thanks again for all helpful comment
I can't tell you anything about plants. However, I'm sure there's relative literature out there for you. Or make more specific questions here and maybe somebody who knows will help you out. Before I give you my email, I need to know how many $$$ an hour you offer for consultation. For urgent matters hourly rates start at $500. Weekday out of office hours rate is $750. Weekends $1000. Special rates for longer projects.
Thanks friend. I'm just a student and you're like a perfect teacher, I don't think $$. Also, as I told you I'm working on plants, I request your email for consultation and especially introduce you to one of my friends hardly busy with metagenomics and metatranscriptomics. Again thanks for all your help.