Assigning subject to query based on coverage from blast output file
Prakki Rama ★ 2.7k · 10.3 years ago

Dear all,

I have a BLAST result in tabular format. Each query hits multiple subject sequences. From all of those hits, I want to assign to each query the subject sequence that covers it the most. Does anyone have a way to achieve this? I would really appreciate your ideas and resources.

EDIT: I should have been much clearer in my question. I have the following file. The best hit is uti_contig0036, which covers more of the query than uti_contig7141.

contig996     uti_contig0036     80.4      903     110     34     7491      8352      2709010     2709886     1.00E-175    625
contig996     uti_contig0036     85.28     197     20      3      7793      7986      6423228     6423418     3.00E-46     195
contig996     uti_contig0036     76.76     327     46      15     7868      8168      6388503     6388181     2.00E-34     156
contig996     uti_contig0036     88.51     87      9       1      8203      8288      855009      855095      6.00E-19     104
contig996     uti_contig0036     93.44     61      3       1      8228      8288      3695407     3695348     2.00E-14     89.8
contig996     uti_contig0036     93.75     48      3       0      23516     23563     2295958     2295911     2.00E-09     73.1
contig996     uti_contig7141     80.22     819     97      35     7481      8280      2616628     2617400     1.00E-154    555
contig996     uti_contig7141     81.26     619     66      20     7721      8320      468207      467620      1.00E-124    455
contig996     uti_contig7141     80.25     486     67      19     7889      8352      3620643     3620165     1.00E-89     339
contig996     uti_contig7141     86.59     246     22      4      7677      7922      863436      863670      3.00E-66     261

So the required output is something like this:

"contig996" "uti_contig7141" "2170" (addition of 819,619,486,246). "contig996" "uti_contig7141" "871" (8352-7481), whereas uti_contig0036 covers only 861 bases in contig996 (8352-7491). But my worry is, If i just add up the alignment lengths, I might over exaggerate the coverage. For example, check the yellow lines, where if I cannot add the alignment length because it is already embedded in other aligned region of the same contig. I hope there is some easy way to do this. Thanks you for your ideas.

10.3 years ago

Let's try again, if still relevant... The pipeline below merges the query intervals within each subject, calculates the sum of the merged query intervals per subject, and, for each query, picks the subject with the largest sum. Joining the query and subject names with _vs_ makes each query/subject pair look like a single "chromosome", so mergeBed only merges intervals within that pair.

awk 'BEGIN{OFS="\t"}{print $1"_vs_"$2, $7, $8}' blast.txt |  # one interval per hit, keyed by query_vs_subject
sort -k1,1 -k2,2n |                                          # sort by key, then by query start
mergeBed |                                                   # merge overlapping and embedded query intervals
awk '{print $0 "\t" $3-$2+1}' |                              # append the length of each merged interval
groupBy -g 1 -c 4 -o sum |                                   # sum the merged lengths per query_vs_subject
awk 'sub("_vs_", "\t", $1)' |                                # split the key back into query and subject
sort -k1,1 -k3,3nr |                                         # within each query, largest coverage first
awk '$1 != x {print $0} {x = $1}'                            # keep only the first (best) subject per query

From the example data you posted, the non-redundant sums of query intervals are (script above without the last line):

contig996    uti_contig0036 908
contig996    uti_contig7141 871
contig996    uti_contig 300

Including the last line, the output is contig996 uti_contig0036 908. Does that get closer to what you need?

mergeBed and groupBy come from bedtools
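Note: depending on the bedtools version, mergeBed and groupBy may not read a pipe implicitly and will complain about a missing input file. If so, name the stream explicitly with the classic -i stdin idiom, i.e. in the pipeline above use:

mergeBed -i stdin |                  # instead of bare mergeBed
groupBy -i stdin -g 1 -c 4 -o sum |  # instead of bare groupBy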


Another small correction after 2.5 years:

BLAST results use a one-based coordinate system, but the BED format is zero-based for the start coordinate and one-based for the end coordinate.

So, accordingly, the first line in the above script should be:

awk 'BEGIN{OFS="\t"}{print $1"_vs_"$2, $7-1, $8}' blast.txt

This converts the BLAST results to the zero-based BED coordinate system described above.

Edit 1: In that case we need not add 1 base as in $3-$2+1; it can simply be $3-$2, because the intervals are now in the zero-based system. So the script should be as follows:

awk 'BEGIN{OFS="\t"}{print $1"_vs_"$2, $7-1, $8}' blast.txt |  # convert to zero-based BED coordinates
sort -k1,1 -k2,2n |
mergeBed |
awk '{print $0 "\t" $3-$2}' |                                  # no +1: in BED, length is simply end - start
groupBy -g 1 -c 4 -o sum |
awk 'sub("_vs_", "\t", $1)' |
sort -k1,1 -k3,3nr |
awk '$1 != x {print $0} {x = $1}'

Excellent!! This is exactly what I want: the sum of the lengths of the aligned regions of the query, ignoring the embedded aligned regions. Fantastic. Thank you very much for your valuable resource.


Ok, good. Make sure it is doing the right thing, as I haven't tested it. You might have noticed that the query and subject names are first joined and then split using _vs_ as the joiner, so if your query or subject names contain this string, things will go wrong. You can check for its presence in your input with something like cut -f1,2 blast.txt | grep '_vs_' (which should return no lines).


A small correction after 4 years: awk '{print $0 "\t" $3-$2}' in the original answer's script above (the version that keeps the one-based BLAST coordinates) should be awk '{print $0 "\t" $3-$2+1}'; otherwise the length of every alignment comes out 1 bp short. (This does not apply to the zero-based variant, where $3-$2 is already correct.)


Well spotted! Corrected now.

5heikki 11k · 10.3 years ago

Assuming outfmt 6, alignment length is in the 4th column, so:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

sort -k1,1 -k4,4gr -k12,12gr -k11,11g -k3,3gr blastout.txt | sort -u -k1,1 --merge > bestHits.txt

This sorts by 1) query, 2) alignment length, 3) bitscore, 4) e-value, and 5) % identity; the second sort then keeps only the first, i.e. best, line for each query.
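For anyone unfamiliar with the second sort: given input already sorted by the key, sort -u --merge keeps only the first line of each run of equal keys. A tiny illustration with hypothetical two-column data:

printf 'q1\t900\nq1\t100\nq2\t500\n' | sort -u -k1,1 --merge
# q1    900
# q2    500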


Since you mention LC_ALL, consider setting it to LC_ALL=C, as it can make a big difference in speed. E.g. with test.txt containing 1 million lines:

export LC_ALL=en_US.UTF-8
time sort test.txt > /dev/null

real    0m23.471s
user    0m23.423s
sys    0m0.047s

export LC_ALL=C
time sort test.txt > /dev/null

real    0m0.959s
user    0m0.913s
sys    0m0.046s
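
As an aside, if you prefer not to change the locale for the whole shell session, it can be scoped to a single command:

LC_ALL=C sort test.txt > /dev/null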

Provided LC_ALL=C works correctly with scientific notation, this is a good suggestion :)


Thank you for the answer. Sorry for my unclear question. Please check my Edit.


I see. Now the problem is more complicated.

"contig996" "uti_contig7141" "2170" (addition of 819,619,486,246)

Is not correct though since those hits overlap: column 7 (query start) & 8 (query end).


Exactly!! I cannot even simply add them up; some regions are already embedded in other regions. I just have to ignore them and see how much of the query is covered.

edrezen ▴ 730 · 10.3 years ago

Hello,

If by coverage you mean the size of the match in the query (column 8 - column 7 + 1 in the tabular output), you can try the following:

awk 'BEGIN{query=""; cover=-1} {if (query!=$1) {if (query!="") {print align}; query=$1; cover=-1}; if ($8-$7>cover) {cover=$8-$7; align=$0}} END{if (query!="") print align}' blastresult

It should work as well with gawk if you don't have awk.
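Note that this assumes all hits for the same query sit on consecutive lines, which is how BLAST tabular output is normally grouped; if the file has been reordered, run sort -k1,1 on it first.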


Thank you. Apologies for my unclear question. Please check my EDIT.

10.3 years ago

What about this, using bedtools groupby:

sort -k1,1 -k2,2 blast.txt |          # group by query, then subject
groupBy -g 1,2 -c 4 -o sum |          # sum alignment lengths (column 4) per query/subject pair
sort -k1,1 -k3,3nr |                  # within each query, largest sum first
awk '$1 != x {print $0} {x = $1}'     # keep the best subject per query

And since it was mentioned above, set export LC_ALL=C

NB: this takes the sum of the lengths (column 4) regardless of possible overlap. Is this what you want?


Thank you. But the problem seems more complicated: some regions are embedded in other aligned regions, and I have to ignore those regions.
