How To Extract Max Score Blast Hit Among A Data Set?
1
1
Entering edit mode
10.8 years ago
HG ★ 1.2k

Hi everyone, I performed a local blastp with a query protein vs 5 genome (ORF data ). Now i want to extract all the top 5 sequence automatically and make a final top hit file through awk/perl/shell. Can any one please help me out how to do it.

    # BLASTP 2.2.25 [Feb-01-2011]
# Query: gnl|ECOLI|EG10158-MONOMER ClpP 455901..456524 Escherichia coli K-12 substr. MG1655
# Database: mybd
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gnl|ECOLI|EG10158-MONOMER    V161_10_length_174349_cov_20.0403_ID_19_63    100.00    207    0    0    1    207    1    207    9e-157     431
gnl|ECOLI|EG10158-MONOMER    RS374_6_length_260428_cov_19.5118_ID_11_179    100.00    207    0    0    1    207    1    207    9e-157     431
gnl|ECOLI|EG10158-MONOMER    H127_1_length_711606_cov_31.5057_ID_1_431    100.00    207    0    0    1    207    1    207    9e-157     431
gnl|ECOLI|EG10158-MONOMER    sff_11_length_151555_cov_28.2432_ID_21_27    100.00    207    0    0    1    207    1    207    9e-157     431
gnl|ECOLI|EG10158-MONOMER    H75_1_length_711552_cov_31.2637_ID_1_431    100.00    207    0    0    1    207    1    207    9e-157     431
gnl|ECOLI|EG10158-MONOMER    H75_26_length_34692_cov_38.2655_ID_51_32    28.24    131    90    2    73    202    63    190    3e-11    60.8
gnl|ECOLI|EG10158-MONOMER    RS374_42_length_11164_cov_52.9994_ID_83_6    28.24    131    90    2    73    202    63    190    4e-11    60.8
gnl|ECOLI|EG10158-MONOMER    H127_2_length_512186_cov_33.056_ID_3_271    26.05    119    67    5    22    121    302    418    5.6    26.9
gnl|ECOLI|EG10158-MONOMER    H75_3_length_319828_cov_30.2145_ID_5_224    26.05    119    67    5    22    121    302    418    5.6    26.9
gnl|ECOLI|EG10158-MONOMER    V161_3_length_303745_cov_18.9698_ID_5_105    26.05    119    67    5    22    121    302    418    6.3    26.9
gnl|ECOLI|EG10158-MONOMER    RS374_2_length_370043_cov_19.1337_ID_3_206    26.27    118    66    5    23    121    303    418    6.4    26.9
gnl|ECOLI|EG10158-MONOMER    sff_1_length_375148_cov_28.8962_ID_1_190    26.27    118    66    5    23    121    303    418    6.4    26.9
gnl|ECOLI|EG10158-MONOMER    RS374_4_length_285072_cov_19.5222_ID_7_219    30.30    33    23    0    128    160    176    208    7.2    26.6
blast awk perl • 3.6k views
ADD COMMENT
3
Entering edit mode

Have you tried solving this yourself?
How can I sort blast output based on e-value?
I would use linux sort and head.

ADD REPLY
1
Entering edit mode

I appreciate sort and head are useful command. but for my purpose i dont know how much it will be useful. In context of my question if you got properly i need top 5 max score sequence in a separate file so i can pass those sequences in next phase of my pipe.
for example i need : star position to end position of each top 5 sequences.

ADD REPLY
1
Entering edit mode

Getting top 10 sequences of BLAST results Bio Python have a look here. Also, you may use python script taking only top 5 scores

ADD REPLY
2
Entering edit mode
10.8 years ago
5heikki 11k
sort -k1,1 -k12,12gr -k11,11g output | sort -u -k1,1 --merge > best
grep -v -f best output > outputWithoutBest
sort -k1,1 -k12,12gr -k11,11g outputWithoutBest | sort -u -k1,1 --merge > secondBest
etc.
cat best secondBest .. | sort -k1,1 > best5

There's probably a one-liner but I'm too lazy to construct it. Also, stop being lazy. Google would have known this..

ADD COMMENT

Login before adding your answer.

Traffic: 1724 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6