Find protein homologs with BLASTp
2
2
Entering edit mode
10.0 years ago
biotech ▴ 570

I'm trying to find homologs of a set of proteins using BLASTp. I'm working with custom databases.

I'm using evalue of 0.00001 as threshold.

I would like to filter queries having hits with >90% identities. Because BLASTp output is based on HSPs, I cannot filter by %identities/query, only by HSP.

I would like to know how to do this and also if I'm following a reasonable strategy.

Thanks, Bernardo

Here is an alignment example: qcovs=100 but qcovhsp lower.

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs qcovhsp
HPNK_01698      HAPS_0519       81.88   596     75      5       630     1225    615     1177    0.0      889    100     49
HPNK_01698      HAPS_0519       49.17   301     115     8       84      366     201     481     2e-56    214    100     23
HPNK_01698      HAPS_0519       53.64   261     61      6       436     684     616     828     6e-49    191    100     20
HPNK_01698      HAPS_0519       46.61   251     62      3       332     510     584     834     6e-46    181    100     15
HPNK_01698      HAPS_0519       53.27   214     79      4       1       194     1       213     1e-45    180    100     16
HPNK_01698      HAPS_0519       55.96   218     60      8       550     764     643     827     1e-40    164    100     18
HPNK_01698      HAPS_0519       51.56   225     61      7       516     731     642     827     1e-38    157    100     18
HPNK_01698      HAPS_0519       49.57   230     77      6       484     713     643     833     1e-37    154    100     19
HPNK_01698      HAPS_0519       57.89   76      26      1       364     433     760     835     1e-13   76.3    100     6

Code used:

# blast database
$makeblastdb -in $Hparasuisfastadatabase \
  -out H_parasuis_strains_gb_ALL.fna_databaseBLAST \
  -dbtype prot \
  -parse_seqids
$blastp -db H_parasuis_strains_gb_ALL.fna_databaseBLAST \
  -query 'out_2.fasta' \
  -out HPNK_selected_vs_H_parasuis_strainss.tblastn \
  -evalue 0.00001 \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs qcovhsp" \
  -max_target_seqs 50
blast homology • 3.7k views
ADD COMMENT
1
Entering edit mode

Hello biotech!

It appears that your post has been cross-posted to another site: http://biology.stackexchange.com/questions/23953/find-protein-homologs-with-blastp

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
1
Entering edit mode
10.0 years ago
Siva ★ 1.9k

As 5heikki pointed out correctly, you cannot get what you are looking for directly from the BLAST output. If you really want to sort the BLAST hits by "global similarity" and select only top scoring hits, there is a work around using BLAST score which corresponds to similarity.

1. If there is only one HSP for a subject/hit, it is the score for that subject sequence.

2. If there are more more than one HSP for a subject:

2a. For non-overlapping HSPs: Add the scores of all non-overlapping HSPs

2b. For overlapping HSPs: Take the sum or the maximum score of the overlapping HSPs

3. Sort the subjects based on score and select high scoring subjects with whatever threshold you set.

ADD COMMENT
0
Entering edit mode
10.0 years ago
5heikki 11k

You mean you want percent identities of global alignments? If yes, Basic Local Alignment Search Tool is not the tool you're looking for. Also, beware that e-value depends on database size..

ADD COMMENT

Login before adding your answer.

Traffic: 1674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6