I have a contig file from an assembly of Illumina shotgun reads from soil samples. I ran a local blastx against virulence genes from different bacteria. After parsing the BLAST report with MEGAN, I saw that many "assigned" reads have a low percent identity, such as the one below with only 27%. So my question is: why is the E-value (expect value) so low when the percent identity is so low? Can I say a read really is what it was assigned to based only on the fact that the E-value is low? Is there an option in MEGAN or BLAST to filter reads by percent identity?
>VFG002283(gi:18309707) (nanI) exo-alpha-sialidase [sialidase (VF0391)] [Clostridium
perfringens str. 13]
Length=694
Score = 129 bits (323), Expect = 6e-33, Method: Compositional matrix adjust.
Identities = 119/447 (27%), Positives = 188/447 (42%), Gaps = 108/447 (24%)
Frame = +2
Query 80 IGVRHAGDDGVAAYRIPGLVTSNKGTLLGVYDIRYNNSADLQER-VDIGLSRSTDGGQTW 256
+ + H G + YRIP L + +GTL+ D R + AD +D + RS DGG+TW
Sbjct 252 VDLFHPGFLNSSNYRIPALFKTKEGTLIASIDARRHGGADAPNNDIDTAVRRSEDGGKTW 311
Query 257 EPMRVAMTFGEEGGLPSAQNGVGDPAILVDKKTGTIWIVAA--------WTHGMG----- 397
+ ++ M + + ++ V D ++ D +TG I+++ W G+G
Sbjct 312 DEGQIIMDYPD-------KSSVIDTTLIQDDETGRIFLLVTHFPSKYGFWNAGLGSGFKN 364
Query 398 -NGRAWFNSQDGMDKNHTAQ---------------------------------------- 454
+G+ + D K T +
Sbjct 365 IDGKEYLCLYDSSGKEFTVRENVVYDKDGNKTEYTTNALGDLFKNGTKIDNINSSTAPLK 424
Query 455 ------LVLAKSDDDGKTWSNPINITSQVKDPSWKFLLQGPGSGITMQDGT----LVFAT 604
+ L SDDDGKTWS P NI QVK KFL PG GI +++G +V
Sbjct 425 AKGTSYINLVYSDDDGKTWSEPQNINFQVKKDWMKFLGIAPGRGIQIKNGEHKGRIVVPV 484
Query 605 QFIDSTRVPNAGIMYSKDHGKTW----------KMHNYARTNT----------TEAQVAE 724
+ + ++ ++YS D GK W K+ N N+ TE QV E
Sbjct 485 YYTNEKGKQSSAVIYSDDSGKNWTIGESPNDNRKLENGKIINSKTLSDDAPQLTECQVVE 544
Query 725 VEPGVLMLNMRDNRGGSRAVSVTKDLGKTWTEHPSNRSVLQESVCMASLIKVEAKDNVLN 904
+ G L L MR N G ++ + D G TW E + + E C S+I K +
Sbjct 545 MPNGQLKLFMR-NLSGYLNIATSFDGGATWDETVEKDTNVLEPYCQLSVINYSQK--IDG 601
Query 905 KGILLFSNPNTTKGRHSITIKASLDGGL-TFPN---------EYDVLLDEGHGWGYSCLT 1054
K ++FSNPN + R + T++ L + T+ N +Y+ L+ G+ + YSCLT
Sbjct 602 KDAVIFSNPN-ARSRSNGTVRIGLINQVGTYENGEPKYEFDWKYNKLVKPGY-YAYSCLT 659
Query 1055 MIDKETVGILYEGS-TAHMVFQAVKLK 1132
+ +G+LYEG+ + M + + LK
Sbjct 660 ELSNGNIGLLYEGTPSEEMSYIEMNLK 686
Percent identity as a stand-alone metric is not very informative. Your queries are not guaranteed to be full-length sequences, so keep that in mind as well. If that xx% identity includes a near-perfect match over a known domain or active site (see if you can find one for sialidase in the hit above), it would be meaningful. If it does not, it may be a random match that just happens to be there.
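If you still want to filter on percent identity yourself, one possible approach is to re-run BLAST with tabular output (`-outfmt 6`) and filter the table before interpreting the hits. The sketch below assumes the default 12-column layout of that format. The function name `filter_hits`, the thresholds, and the sample rows are made up for illustration (the first row loosely mirrors the hit above, with a placeholder gap-open count), so adjust them to your own data. It combines identity with alignment length and E-value, since identity on its own says little.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_pident=40.0, min_length=100, max_evalue=1e-5):
    """Yield hits (as dicts of strings) that pass all three thresholds.

    The default thresholds are arbitrary placeholders, not recommendations.
    """
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present with -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        if (float(hit["pident"]) >= min_pident
                and int(hit["length"]) >= min_length
                and float(hit["evalue"]) <= max_evalue):
            yield hit


# Illustrative rows only; contig names and the second hit are hypothetical.
SAMPLE = "\n".join([
    "\t".join(["contig_1", "VFG002283", "26.62", "447", "220", "10",
               "80", "1132", "252", "686", "6e-33", "129"]),
    "\t".join(["contig_2", "VFG000001", "85.00", "200", "30", "0",
               "1", "600", "10", "209", "1e-50", "350"]),
])

kept = [hit["qseqid"] for hit in filter_hits(io.StringIO(SAMPLE))]
print(kept)  # ['contig_2']
```

With a real file you would pass an open file handle instead of `io.StringIO(SAMPLE)`. Whatever cutoff you pick, it is still worth checking by eye whether the conserved motifs of the protein family are covered by the alignment.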