Hello.
I have been trying to get the best hits out of blast output files. I know that -v
option or maxtargetseqs
in standalone blast gives the best ones. But I already have a huge output of multiple queries which have to be sorted according to best cut-off values. I am not using XML parser as my output is already tabular.
Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 DNA,... 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
Locus3035v2rpkm3.64 gi|87248160|gb|DQ311188.1| Bombyx mori hydroxymethylglutaryl-Co... 98.94 94 1 0 1 94 523 616 4e-37 165
Locus303v1rpkm4.98 gi|346706507|dbj|AK380484.1| Bombyx mori mRNA clone: ftes50G24 100.00 159 0 0 1 159 4395 4553 1e-74 288
This is the tabular output I have. I have more than one criterion to take. I am looking for hits such that my the alignment length should be high but no compromise on the sequence identity. E-value should be less than 1e-06, less gaps and mismatches.
I tried to introduce a score such that - score = (alignmentlength) / ( 101- %identity) s = L / (101 - I%)
I used 101 as if 100 is taken there can be undefined values if completely identical. But I am missing the point for alignment length here. How can I select such that all these criterion are followed to give best hits in the following order of preference -
- Alignment length
- %identity
- less gaps and mismatches
- e-value 1e-06
Regards,
Rohit
Play around with awk, sort and sed. It can be easily achieved. Alternatively try biopython.
I don't get it. Why not to take first hit from each BLAST output? Hits are sorted by decreasing score which depends on the alignment length: the higher score, the longer the alignment.
Yes but the top scoring alignment may not align the whole query sequence. In that case, one has to select several alignments (Threading). It all depends on the research question to address.