I'm trying to find homology between sequences in a genome and several viral proteins in a database. As you all know, viruses evolve very quickly, so I expect to find a lot of divergence between the sequences I compare.
In the literature I see that BLOSUM matrices below 62 are used for relatively divergent sequences, and likewise PAM matrices around 250.
I tried all the available matrices, and here are the results for one particular sequence that should be found by blastx (a true positive). The e-value is very high, I guess because the database is quite big (around 150,000 protein sequences).
BLOSUM 90
scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 1.8e+01 38.0
BLOSUM 80
scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 2.0e+01 37.9
BLOSUM 65
scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 1.7e+01 38.1
BLOSUM 45
scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 9.8e+01 35.6
PAM250
scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 1.2e+02 35.3
PAM70
scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 1.5e+00 41.7
PAM30
scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 2.4e-01 44.3
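To compare these rows more easily, here is a minimal Python sketch that parses them and groups the matrices by aligned query region. It assumes that the fourth and fifth columns are percent identity and alignment length, that the eighth and ninth are the query start and end, and that the last two are e-value and bit score. That is my reading of the tabular output, not something verified against the exact output format used. The `rows` dictionary and the `parse_hit` helper are names made up for this illustration.

```python
from collections import defaultdict

# Rows copied verbatim from the results above, keyed by scoring matrix.
rows = {
    "BLOSUM 90": "scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 1.8e+01 38.0",
    "BLOSUM 80": "scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 2.0e+01 37.9",
    "BLOSUM 65": "scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 1.7e+01 38.1",
    "BLOSUM 45": "scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 9.8e+01 35.6",
    "PAM250": "scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 1.2e+02 35.3",
    "PAM70": "scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 1.5e+00 41.7",
    "PAM30": "scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 2.4e-01 44.3",
}


def parse_hit(line):
    """Split one whitespace-separated row into named fields (assumed layout)."""
    f = line.split()
    return {
        "pident": float(f[3]),    # assumed: percent identity
        "length": int(f[4]),      # assumed: alignment length
        "qstart": int(f[7]),      # assumed: start on the query
        "qend": int(f[8]),        # assumed: end on the query
        "evalue": float(f[11]),
        "bitscore": float(f[12]),
    }


hits = {matrix: parse_hit(line) for matrix, line in rows.items()}

# Group matrices by the query region they aligned.
by_region = defaultdict(list)
for matrix, hit in hits.items():
    by_region[(hit["qstart"], hit["qend"])].append(matrix)

for region, matrices in by_region.items():
    print(region, matrices)

# Matrix with the lowest e-value among the rows above.
best = min(hits, key=lambda m: hits[m]["evalue"])
print("lowest e-value:", best, hits[best]["evalue"])
```

Grouped this way, the matrices report two different alignments: a short 14-residue one at 71.4% identity (BLOSUM 90, BLOSUM 80, PAM70, PAM30) and a longer 107-residue one at 21.5% identity (BLOSUM 65, BLOSUM 45, PAM250), so the e-values are not all scoring the same region.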
So I wondered: since PAM30 gives the best e-value for this protein, does that mean it is the best matrix to use for my data?
This is odd, because I expected to get a more relevant result with a matrix meant for more distantly related species, such as BLOSUM45 (but there the e-value is even higher), since I'm working on viral proteins and they evolve very quickly.
Do you know the best options for finding this hit while keeping a reasonable e-value even though the database is big?
By the way, I'm using Diamond, a BLAST-like program that runs faster than BLAST.
Thank you for your advice.
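On the database-size point, here is a rough sketch of how the e-value depends on the search space. It assumes the standard BLAST relation E = m*n * 2^(-S'), where S' is the bit score and m*n is the effective search space, and it assumes this relation also applies to Diamond's output (I have not checked how Diamond computes its effective lengths). The function names are made up for this illustration.

```python
def implied_search_space(evalue, bitscore):
    """Effective search space m*n implied by E = m*n * 2**(-bitscore)."""
    return evalue * 2.0 ** bitscore


def rescaled_evalue(evalue, scale):
    """E-value for the same bit score if the search space is multiplied by `scale`."""
    return evalue * scale


# (e-value, bit score) pairs taken from three of the rows above.
reported = {
    "BLOSUM 90": (1.8e+01, 38.0),
    "PAM70": (1.5e+00, 41.7),
    "PAM30": (2.4e-01, 44.3),
}

for matrix, (evalue, bitscore) in reported.items():
    print(matrix, f"{implied_search_space(evalue, bitscore):.2e}")

# Hypothetical scenario: the same PAM30 hit against a search space 100 times smaller.
print(rescaled_evalue(2.4e-01, 1 / 100))
```

Under that assumption the three rows imply roughly the same search space, and the e-value scales linearly with it, so a hit with a fixed bit score gets a proportionally larger e-value in a bigger database.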