Species distribution in the blast output
2
1
Entering edit mode
9.2 years ago
seta ★ 1.9k

Hi friends,

I have done blastx of transcriptome assembly against nr and Uniprot database and have a output as tabular format (-outfmt 6), I would like to show species distribution based on these outputs. Could you please help me out to do this, is there any command or script for this purpose?

Thanks in advance.

species-distribution sequence blast alignment • 4.7k views
ADD COMMENT
0
Entering edit mode
9.2 years ago
5heikki 11k

If you set up the ncbi taxonomy db (explained in the manual) then you could have used custom output to include e.g. species name:

-outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)
   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
           qseqid means Query Seq-id
..
..
          staxids means unique Subject Taxonomy ID(s), separated by a ';'
                (in numerical order)
        sscinames means unique Subject Scientific Name(s), separated by a ';'
        scomnames means unique Subject Common Name(s), separated by a ';'
       sblastnames means unique Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order)

You don't have to redo the blast though, because you can connect the GI numbers to NCBI taxonomy IDs with Entrez Direct.

epost -db protein -id 807531834 | elink -target taxonomy | efetch -format docsum | xtract -element ScientificName
Campethera nivosa

Or maybe LCA classification method would be better?

ADD COMMENT
0
Entering edit mode

thanks for your comment. For blast output, I used -outfmt '6 std sscinames scomnames stitle', but I just found that there is N/A for sscinames and scomnames even the taxdb.btd and taxdb.bti were in the same directory with blast database. Could you please let me know what's wrong here?. Is it possible to extract species name based on stitle information?

ADD REPLY
1
Entering edit mode

Create .ncbirc file in your $home (nano ~/.ncbirc) and paste the following there (obviously change the path to the correct one). Assuming you're using a prebuilt database (in contrast to making one yourself from a fasta file), the taxdb addition should now work as long as it and your db are in the dir where BLASTDB points..

[BLAST]
BLASTDB=/path/to/db
ADD REPLY
0
Entering edit mode

thanks. I creat this file, but the problem still exist. Also, when I try to apply the source ~/.ncbirc, it gave me the error :

[BLAST]:command not found

while blast program is in the PATH and for example blastx can called from anywhere. I'm totally confused. What do you think dear friend?

ADD REPLY
0
Entering edit mode

I think the blast binaries source the ~/.ncbirc file before execution. Can you cat ~/.ncbirc and ls -la /path/to/your/db/dir

ADD REPLY
0
Entering edit mode

Yes, cat ~/.ncbirc show what you suggested me to writ that is:

[BLAST]
BLASTDB=/home/seta/software/ncbi-blast-2.2.30+/bin/dbtest

and the output of second command is showing the content of the related directory.

Please let me know if there is any command to extract and count species based on stitle information?

ADD REPLY
0
Entering edit mode

So the preformatted db and taxdb have been extracted to /home/seta/software/ncbi-blast-2.2.30+/bin/dbtest/ and you're blasting against the db file that is in that dir with blast version 2.2.30+?

which blastx returns /home/seta/software/ncbi-blast-2.2.30+/bin/blastx ?

Is /home right in your root? I don't recall what is in stitle, but as I wrote above, you can link gi to taxonomy..

ADD REPLY
0
Entering edit mode

Yeah, that's exactly right. Also, which blastx return the output like what you wrote in the post. I'm working on a server as a regular user and not in the root position.

the stitle information is something like this line

PREDICTED: zinc finger CCCH domain-containing protein 53-like isoform X1 [Citrus sinensis]

which the species name was written in the bracket. I'm newbie in this field, but I think there should be a command to extract the species name with their counts, say Citrus sinensis 4. As I see, you are so expert. Please let me to ask again for your help. Many thanks

ADD REPLY
0
Entering edit mode
Which field number is stitle in your output?
ADD REPLY
0
Entering edit mode

It's in the column of 17.

ADD REPLY
0
Entering edit mode

So, for example fields 1-12 and species name:

paste <(cut -f1-12 -d $'\t' blast.tsv) <(cut -f17 -d $'\t' blast.tsv | rev | cut -f1 -d "[" | sed 's/^]//' | rev) > fields1-12-and-species

Or if you're just interested in counts:

cut -f17 -d $'\t' blast.tsv | rev | cut -f1 -d "[" | sed 's/^]//'| rev | sort | uniq -c | sort -k1,1gr

Note, for this to be reasonable, your output should be sorted for best hits per query. But how can you sort blastx output for best hits when different ORFs from the same contig have the same ID in column 1? I guess you could parse the coordinates of ORFs into this and then require no overlap (although sometimes there's overlap), but this is starting to be too complicated for a one-liner. This is one of the main reasons why I prefer protein prediction + blastp over blastx. Another is speed.

Also, I can't stress enough how this is a very poor way to assign taxonomy to sequences. For example, I recently blasted some bacterial proteome against nr, removed selfhits, sorted for best hits, and then checked distribution of species. Instead of just one, there were dozens and dozens spanning multiple bacterial phyla. I think a few hits were even to archaea. This is due to 1) horizontal gene transfer and 2) because there's no hand of God that forces proteins to evolve exactly the same way. Just because two species are deemed the closest relatives due to the highest similarity of the 16S rRNAs doesn't mean that every single one of their proteins will represent reciprocal blast hits. LCA method with a reasonable bit score cutoff (like 0.9X best hit) is a much better way to assign taxonomy to sequences..

ADD REPLY
0
Entering edit mode

hanks for being with me. I applied your command on the blast output, but it generates the normal tabular format of output, like below:

contig10        gi|657979668|ref|XP_008381807.1|        85.00   480     70      2       1558    122     1       479     0.0       857
contig100       gi|255565025|ref|XP_002523505.1|        87.50   184     23      0       909     358     4       187     5e-115    343

My original output is like here, which the stitle information is the last column,

contig10        gi|657979668|ref|XP_008381807.1|        85.00   480     70      2       1558    122     1       479     0.0       857   479     N/A     N/A     PREDICTED: histidine decarboxylase [Malus domestica]
contig100       gi|255565025|ref|XP_002523505.1|        87.50   184     23      0       909     358     4       187     5e-115    343   187     N/A     N/A     apolipoprotein d, putative [Ricinus communis]

I need help to extract the species name (appear in the bracket in stitle filed) with their counts in this output.

ADD REPLY
0
Entering edit mode

It's in column 16, not 17. Anyway, since it's the last column you can just:

rev blast.tsv | cut -f1 -d "[" | sed 's/^]//' | rev | sort | uniq -c | sort -k1,1gr

But really, this is a very bad way to assign taxonomy to sequences and will overestimate the actual number of species like at least 20 fold. More than that, I doubt that you have somehow managed to sort your blastx output for best hits in a reasonable way so your result is the species distribution of number of hits against each query.

edit. Well maybe with transcriptome you can expect that the majority of the contigs represent single protein mRNAs instead of operons? Median lenght of transcripts could give you a clue? So you could sort based on column1 first, which would then restrict it to one protein per contig..

ADD REPLY
0
Entering edit mode

Thanks so much for your help and time. I'm glad to hear why you believed it's bad way and may overstimate the real counts?

Given your experience, I may shift to LCA as you suggested. However, it has not been mentioned the way in many papers, usually, author just show the species distribution in the pie chart.

ADD REPLY
0
Entering edit mode

No problem. Also, please no piecharts. When there are like 10 different hues of every color nobody can ever gain anything from piecharts. I wish all journals would just ban them. Heatmaps are a far superior way to represent species distribution, IMO.

Also, you can read a little bit about best, representative and lca hits in e.g. mg-rast manual: ftp://ftp.metagenomics.anl.gov/data/manual/mg-rast-tech-report-v3_r1.pdf

ADD REPLY
0
Entering edit mode

Thanks for your comment. I got the best hit from tabular format output (6) using the command:

export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastout.txt | sort -u -k1,1 --merge > bestHits

Please let me know it's a right way?

I just found your sentences about your preference for ORF prediction+blastp over blastx. There is a lot of program for ORF finding, I'll be happy to know which program you use for ORF prediction?

So, the above command to extract best hit can not be right, yes?

ADD REPLY
0
Entering edit mode

As long as you're fine with discarding everything but the highest scoring ORF from operons and possible other mRNAs that include multiple ORFs (and from misassemblies in which two or more mRNAs are joined somehow).

ADD REPLY
0
Entering edit mode

As I told earlier, I'm new in this filed and talking with you is really helpful for me, thanks. I would be highly appreciate if you could please let me know which ORF prediction program you used, then what is your command to get the best hits? I'm working on plant.

ADD REPLY
0
Entering edit mode

No problem. I learn a lot from the Internet as well. The command you posted for best hits gives preference to bitscore > evalue > identity. I think it's the best option for rudimentary blast-based functional annotation. But for relative taxa abundance in protein space, the LCA method is IMO more reasonable. Instead of species level, which is unreasonable expectation because of what I wrote earlier, the LCA results are a mix of up to genus, family, etc. taxonomic rank annotations. In this context, you can think of proteins that are associated with low taxonomic rank LCAs as marker genes for those taxa. I think one of the best uses of metatranscriptomic reads is mapping coverage of corresponding metagenomic assemblies. I am not going to advertise any particular algorithm for protein prediction but prodigal has very convenient output options.

ADD REPLY
0
Entering edit mode

Thanks for backing to me. As I see, the prodigal tool designed for prokaryotic genome. I'm working on plants, so it can not be suitable for me. You know, there is a lot of ORF prediction tool, which each of them has some pros and cons that I have any experience about them, so it's a bit difficult to choose the tool. Could you please let me to have your email address just for urgent question and consultation with you?

Thanks again for all helpful comment

ADD REPLY
0
Entering edit mode

I can't tell you anything about plants. However, I'm sure there's relative literature out there for you. Or make more specific questions here and maybe somebody who knows will help you out. Before I give you my email, I need to know how many $$$ an hour you offer for consultation. For urgent matters hourly rates start at $500. Weekday out of office hours rate is $750. Weekends $1000. Special rates for longer projects.

ADD REPLY
0
Entering edit mode

Thanks friend. I'm just a student and you're like a perfect teacher, I don't think $$. Also, as I told you I'm working on plants, I request your email for consultation and especially introduce you to one of my friends hardly busy with metagenomics and metatranscriptomics. Again thanks for all your help.

ADD REPLY
0
Entering edit mode
9.2 years ago
Sishuo Wang ▴ 230

I think that this script might be helpful.

You can extract species names from your blast output one by one and search for their detailed taxonomic information using the script.

Please make sure that the package DBI is installed prior to using the script. In addition, BioPerl is optional.

An example is

perl search_taxa_info.pl --taxa Arabidopsis --concise --rank all

The output is like:

 Arabidopsis    Eukaryota    Viridiplantae    Streptophyta    None    Brassicales    Brassicaceae    Arabidopsis    None

Please type perl search_taxa_info.pl for its detailed usage.

Alternatively, The following script is designed to extract sequences and the taxonomic information (as well as ACCN or GI) from the output of BLAST downloaded from NCBI in the format of FASTA. Type perl process_from_blast_seq.pl -h to get the usage.

Get process_from_blast_seq.pl here.

Please write e-mail to the e-mail address shown in the README if your have any question.

ADD COMMENT

Login before adding your answer.

Traffic: 1544 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6