Hi all,
I ran a BLASTp analysis of a specific genome against large databases. Here is an example of my command line:
blastp -query SEQUENCE.fasta -db database -out blast_results -outfmt 6 -evalue 0.01 -max_target_seqs 5
As expected, my output files are very large and I would like to select only the matches with a percentage of identical matches of 60 or higher.
Here is an example of an output file with only the first few lines:
Prokka_00003 VFDB_prot_DB_VFG012837(gi:82524546)_(ipaH2.5)_invasion_plasmid_antigen,_fragment_[Mxi-Spa_TTSS_effectors_controlled_by_MxiE_(CVF465)]_[Shigella_dysenteriae_Sd197] 28.947 114 77 3 134 247 342 451 7.17e-05 44.7
Prokka_00007 VFDB_prot_DB_VFG042740(gi:15599489)_(pprA)_two-component_sensor_PprA_[type_IV_pili_(AI098)]_[Pseudomonas_aeruginosa_PAO1] 29.496 278 175 5 463 735 654 915 8.18e-20 94.7
Prokka_00007 VFDB_prot_DB_VFG038759(gb|YP_857329)_(fleS/flrB)_two-component_sensor_kinase_[ND_(AI144)]_[Aeromonas_hydrophila_subsp._hydrophila_ATCC_7966] 24.194 372 224 11 366 734 17 333 4.81e-19 89.4
Prokka_00007 VFDB_prot_DB_VFG038760(gb|YP_008044159)_(flrB)_flagellar-related_two-component_sensor_kinase_[Polar_flagella_(VF0473)]_[Aeromonas_hydrophila_ML09-119] 24.194 372 224 11 366 734 17 333 6.71e-19 89.0
Prokka_00007 VFDB_prot_DB_VFG038762(gi:330830289)_(flrB)_two_component_system_flagellar_sensor_histidine_kinase_FlrB_[Polar_flagella_(CVF786)]_[Aeromonas_veronii_B565] 26.141 241 163 5 497 734 105 333 7.11e-18 85.9
Prokka_00007 VFDB_prot_DB_VFG038761(gi:145298503)_(flrB)_two-component_system_flagellar_sensor_histidine_kinase_FlrB_[Polar_flagella_(CVF786)]_[Aeromonas_salmonicida_subsp._salmonicida_A449] 25.311 241 165 5 497 734 105 333 1.28e-17 85.1
Do you know how I could do that?
Thank you so much for your much appreciated help!!
Have a nice day
A word of caution: these values are reported per match/HSP, not per hit, so do not confuse the two. This means that in some cases you might get two (or more) values for the same query-hit pair.
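To check whether this affects a given result file, one option is to count the rows for each query-subject pair. Here is a minimal Python sketch, assuming the file is the default tab-delimited -outfmt 6 (query ID in column 1, subject ID in column 2). The function names are made up for illustration.

```python
from collections import Counter


def count_hsps_per_pair(lines):
    """Count how many HSP rows each (query, subject) pair has.

    Assumes default tab-delimited -outfmt 6 rows:
    column 1 = query ID, column 2 = subject ID.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (comments only occur with -outfmt 7)
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        counts[(fields[0], fields[1])] += 1
    return counts


def pairs_with_multiple_hsps(lines):
    """Return only the pairs that appear on more than one row."""
    return {pair: n for pair, n in count_hsps_per_pair(lines).items() if n > 1}
```

Passing an open file handle (for example `open("blast_results")`) as `lines` should work, since the functions only iterate over lines. Any pair returned by `pairs_with_multiple_hsps` has more than one % identity value in the table.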
Getting the % identity for each query-hit pair is quite tricky, but in most cases the proxy approach of Michael Dondrup will indeed do (and I would do the same, tbh).
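For reference, a simple per-HSP filter on the % identity column could look like the sketch below. This is my own illustration and not necessarily the approach referred to above. It assumes default tab-delimited -outfmt 6, where column 3 holds the percentage of identical matches. The function names and the output file name in the comment are hypothetical.

```python
def filter_by_identity(lines, min_identity=60.0):
    """Yield -outfmt 6 rows whose % identity (column 3) is >= min_identity.

    This works per HSP row, so a query-subject pair with several HSPs
    can have some rows kept and others dropped.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        # Skip blank lines and comment lines (comments only occur with -outfmt 7)
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split("\t")
        if float(fields[2]) >= min_identity:
            yield stripped


def filter_file(in_path, out_path, min_identity=60.0):
    """Write the rows that pass the filter from in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for row in filter_by_identity(src, min_identity):
            dst.write(row + "\n")


# Example call (output file name is hypothetical):
# filter_file("blast_results", "blast_results.pident60", 60.0)
```

Because the filter only reads column 3, the long subject IDs in column 2 do not need any special handling as long as the file is tab-delimited.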