Best Hits From Blast Tabular Output Of Multiple-Queries
3
4
Entering edit mode
11.9 years ago
Rohit ★ 1.5k

Hello.

I have been trying to get the best hits out of blast output files. I know that -v option or maxtargetseqs in standalone blast gives the best ones. But I already have a huge output of multiple queries which have to be sorted according to best cut-off values. I am not using XML parser as my output is already tabular.

Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 DNA,... 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
Locus3035v2rpkm3.64 gi|87248160|gb|DQ311188.1| Bombyx mori hydroxymethylglutaryl-Co... 98.94 94 1 0 1 94 523 616 4e-37 165
Locus303v1rpkm4.98 gi|346706507|dbj|AK380484.1| Bombyx mori mRNA clone: ftes50G24 100.00 159 0 0 1 159 4395 4553 1e-74 288

This is the tabular output I have. I have more than one criterion to take. I am looking for hits such that my the alignment length should be high but no compromise on the sequence identity. E-value should be less than 1e-06, less gaps and mismatches.

I tried to introduce a score such that - score = (alignmentlength) / ( 101- %identity) s = L / (101 - I%)

I used 101 as if 100 is taken there can be undefined values if completely identical. But I am missing the point for alignment length here. How can I select such that all these criterion are followed to give best hits in the following order of preference -

  1. Alignment length
  2. %identity
  3. less gaps and mismatches
  4. e-value 1e-06

Regards,
Rohit

statistics blast • 12k views
ADD COMMENT
0
Entering edit mode

Play around with awk, sort and sed. It can be easily achieved. Alternatively try biopython.

ADD REPLY
0
Entering edit mode

I don't get it. Why not to take first hit from each BLAST output? Hits are sorted by decreasing score which depends on the alignment length: the higher score, the longer the alignment.

ADD REPLY
0
Entering edit mode

Yes but the top scoring alignment may not align the whole query sequence. In that case, one has to select several alignments (Threading). It all depends on the research question to address.

ADD REPLY
7
Entering edit mode
11.9 years ago

In Salse et al. 2009 they introduced two parameters to allow the identification of the best BLAST alignment - the highest cumulative percentage of identity in the longest cumulative length.

The first parameter, cumulative identity percentage (CIP) corresponds to the cumulative percentage of sequence identity obtained for all the HSPs (CIP = [∑ ID by HSP/AL] × 100) where AL is the sum of all hsp lengths.

The second parameter, CALP, is the cumulative alignment length percentage which represents the sum of the HSP lengths (AL) for all the HSPs divided by the length of the query sequence (CALP = AL/query).

ADD COMMENT
1
Entering edit mode

This scoring I ought to try. Thank you for the advice.

Think taking the HSP and the alignments makes a huge difference when trying to work on my data.

Regards, Rohit

ADD REPLY
1
Entering edit mode
11.9 years ago
Pavel Senin ★ 1.9k

Maybe there is an easier way, but what I do in similar situations - is to dump my BLAST output into MySQL at first using BioPerl. Secondly, I use Java/MyBATIS to iterate over query sequences spawning custom best hit search threads over the DB. I have some code around, but you will need to adjust it for your own needs.

Overall, I found that database provide a lot of advantage when used this way - imposing some structure over flat BLAST output - it enables you to slice and dice those huge BLAST outputs any way you like, so it might be worth investing into some code and SQL stuff.

ADD COMMENT
1
Entering edit mode

Thank you.

I will try SQL and PHP when I make a db for my blast hits. That idea was unique btw.

ADD REPLY
1
Entering edit mode

Thank you. Well, in DB world, they successfully solved the problem of structuring and interactively querying large datasets with some "crazy" questions. SQL can do magic in real time - taking away most of that bioinformatics shell scripting. But yes, one need resources to run DB engine and to get some knowledge/experience before getting comfortable.

ADD REPLY
0
Entering edit mode
11.9 years ago

The bit-score takes into account the other criteria you want to consider in that task so I'm wondering why you want to make it more difficult. I could understand you need a min alignment length for post-analysis purpose though, so why don't you filter with awk your results according to a min length and after you keep only the best hits (using the score)?

Awk syntax to keep hits with at least 200 aligned residues:

awk '$4>=200' file.blast

For the latter, I made a little script:

http://code.google.com/p/bioman/source/browse/BlastBestHit.py

I made a more elaborate one intended to do the same task for hmmer table outputs, with which you can choose a min score cutoff. As you can choose your score column in the options, it should work on a blast output with -c 12 .

http://code.google.com/p/bioman/source/browse/HMMerBestHit.py

In case you want to use your home-made score (which seems weird to me as, again, it's a combination of 2 non independent criteria), you can compute it for every hit (awk is once again your best friend) and insert it as a new column, and use -c new_column_number to keep the best hit based on that criterium.

As mentioned in the source, both of my scripts assume the queries are grouped together (which I think is a "classical" blast or hmmscan output, but not hmmsearch for instance). If you prefer using e-values instead of the score, you would have to modify my script because one increases as the other decreases. However, if you find out what score corresponds to your e-value in your specific search, you can you that score and then obtain what you want.

ADD COMMENT
2
Entering edit mode

I did not try that home-made score, but I guess using the alignment and HSP will help a lot.

But your work is bound to help me in the latter stages too. Thank you for the code. I have a small-script which will help in extracting according to the various cut-offs we need. I have no idea where to post it though.

Regards, Rohit

ADD REPLY
1
Entering edit mode

I am a programmer analphabet. I can not figure out from the HMMerBestHit.py script that if it is able to sort out from a hmmscan output table file only the top HMMer model hit. I want to sort my proteins in to clusters this is the reason why I want to get rid of the query proteins lower hits.

ADD REPLY

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6