I'm trying to find homologs of a set of proteins using BLASTp. I'm working with custom databases.
I'm using evalue of 0.00001 as threshold.
I would like to filter queries having hits with >90% identities. Because BLASTp output is based on HSPs, I cannot filter by %identities/query, only by HSP.
I would like to know how to do this and also if I'm following a reasonable strategy.
Thanks, Bernardo
Here is an alignment example: qcovs=100 but qcovhsp lower.
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs qcovhsp
HPNK_01698 HAPS_0519 81.88 596 75 5 630 1225 615 1177 0.0 889 100 49
HPNK_01698 HAPS_0519 49.17 301 115 8 84 366 201 481 2e-56 214 100 23
HPNK_01698 HAPS_0519 53.64 261 61 6 436 684 616 828 6e-49 191 100 20
HPNK_01698 HAPS_0519 46.61 251 62 3 332 510 584 834 6e-46 181 100 15
HPNK_01698 HAPS_0519 53.27 214 79 4 1 194 1 213 1e-45 180 100 16
HPNK_01698 HAPS_0519 55.96 218 60 8 550 764 643 827 1e-40 164 100 18
HPNK_01698 HAPS_0519 51.56 225 61 7 516 731 642 827 1e-38 157 100 18
HPNK_01698 HAPS_0519 49.57 230 77 6 484 713 643 833 1e-37 154 100 19
HPNK_01698 HAPS_0519 57.89 76 26 1 364 433 760 835 1e-13 76.3 100 6
Code used:
# blast database
$makeblastdb -in $Hparasuisfastadatabase \
-out H_parasuis_strains_gb_ALL.fna_databaseBLAST \
-dbtype prot \
-parse_seqids
$blastp -db H_parasuis_strains_gb_ALL.fna_databaseBLAST \
-query 'out_2.fasta' \
-out HPNK_selected_vs_H_parasuis_strainss.tblastn \
-evalue 0.00001 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs qcovhsp" \
-max_target_seqs 50
Hello biotech!
It appears that your post has been cross-posted to another site: http://biology.stackexchange.com/questions/23953/find-protein-homologs-with-blastp
This is typically not recommended as it runs the risk of annoying people in both communities.