Dear all,
I have blast result in tabulated format. Each of the query hits multiple subject sequences. From all of those hits, I want to assign that subject sequence which covers the query sequence the most. Does any of you have a way to achieve this. I would really appreciate your ideas and resources.
EDIT: I should have been much clear in my question. I have the following file. The best hit is uti_contig0036
, which covers more than "uti_contig7141" in the query.
contig996 uti_contig0036 80.4 903 110 34 7491 8352 2709010 2709886 1.00E-175 625
contig996 uti_contig0036 85.28 197 20 3 7793 7986 6423228 6423418 3.00E-46 195
contig996 uti_contig0036 76.76 327 46 15 7868 8168 6388503 6388181 2.00E-34 156
contig996 uti_contig0036 88.51 87 9 1 8203 8288 855009 855095 6.00E-19 104
contig996 uti_contig0036 93.44 61 3 1 8228 8288 3695407 3695348 2.00E-14 89.8
contig996 uti_contig0036 93.75 48 3 0 23516 23563 2295958 2295911 2.00E-09 73.1
contig996 uti_contig7141 80.22 819 97 35 7481 8280 2616628 2617400 1.00E-154 555
contig996 uti_contig7141 81.26 619 66 20 7721 8320 468207 467620 1.00E-124 455
contig996 uti_contig7141 80.25 486 67 19 7889 8352 3620643 3620165 1.00E-89 339
contig996 uti_contig7141 86.59 246 22 4 7677 7922 863436 863670 3.00E-66 261
So the required output is something like this:
"contig996" "uti_contig7141" "2170" (addition of 819,619,486,246). "contig996" "uti_contig7141" "871" (8352-7481), whereas uti_contig0036
covers only 861 bases in contig996
(8352-7491). But my worry is, If i just add up the alignment lengths, I might over exaggerate the coverage. For example, check the yellow lines, where if I cannot add the alignment length because it is already embedded in other aligned region of the same contig. I hope there is some easy way to do this. Thanks you for your ideas.
Another small correction after 2.5 years:
The blast results are one-based coordinate system but the bed format is zero-based for the coordinate start and one-based for the coordinate end.
So, accordingly the first line in the above script should be :
This brings the Blast results to zero-based coordinate to bed format as described above.
Edit 1: In such case, we need not add 1 base as in : $3-$2+1 instead it can be $3-$2 because it is now zero coordinate system. So basically, the above script should be as follows :
Excellent!! This is what I exactly want. The sum of the lengths of aligned regions of the query ignoring the embedded aligned regions. Fantastic. Thank you very much for your valuable resource.
Ok, good. Make sure it is doing the right thing as I haven't tested it. You might have noticed that query and subject names are first joined and then split using
_vs_
as joiner. So if your query or subject names contain this string, things will go wrong. You can check for the presence of this string in your input with something likecut -f1,2 blast.txt | grep '_vs_'
(which should return no lines)a small correction after 4 years :
awk '{print $0 "\t" $3-$2}'
from above should beawk '{print $0 "\t" $3-$2+1}'
. Otherwise, the length of all the alignments will be 1bp less.Well spotted! Corrected now.