Hi Everyone.
I am trying to BLAST many short peptide sequences against a protein database. I am looking for nearly exact matches, so I selected a word_size of 4.
However, there are some HSP matches where the longest stretch of consecutive amino acids that are identical between query and subject is 3. Could someone please clarify why this is? I thought every HSP alignment should contain at least one stretch of identical consecutive amino acids of length equal to or greater than the word_size.
Here is my command that demonstrates this:
blastp -query testPeptide.fasta -matrix PAM30 -outfmt 5 -word_size 4 -subject test.fasta
Query:
>peptide
FTDFQGGV
Subject:
>S507_scaffold13_size114854|S507_scaffold13_size114854_recno_56.0|(+)20770:21546
WVVVDRGVDRGARRAAGSGMQLRPPSGVLHAGAGTAQPVGSAPLAVLITGHDLEPIAAQV
TGLAELDRLAKHPGAARPPIGHVPDCPHRAGSPDLAGGDDTGGVVQQGAQRTGRCRRGAQ
RRRNDAKTQHARSRRREFEHITPRDRHMPQGTTKTTTVTLVSVVTDASHWQNTCMRPYRH
RCGLGQAASPCDHYYGVIAYAPNGAMGKIVAPPHSRPGGYRRIRTLRRLSCKVLSNFTNY
HGGVRRSRPLAEPGRATS
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.4.0+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db></BlastOutput_db>
<BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
<BlastOutput_query-def>peptide &lt;unknown description&gt;</BlastOutput_query-def>
<BlastOutput_query-len>8</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>PAM30</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>9</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>peptide &lt;unknown description&gt;</Iteration_query-def>
<Iteration_query-len>8</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>S507_scaffold13_size114854|S507_scaffold13_size114854_recno_56.0|(+)20770:21546</Hit_id>
<Hit_def>S507_scaffold13_size114854|S507_scaffold13_size114854_recno_56.0|(+)20770:21546 Six_Frame_ORF</Hit_def>
<Hit_accession>Subject_1</Hit_accession>
<Hit_len>258</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>20.5747</Hsp_bit-score>
<Hsp_score>41</Hsp_score>
<Hsp_evalue>0.000156356</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>8</Hsp_query-to>
<Hsp_hit-from>237</Hsp_hit-from>
<Hsp_hit-to>244</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>5</Hsp_identity>
<Hsp_positive>8</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>8</Hsp_align-len>
<Hsp_qseq>FTDFQGGV</Hsp_qseq>
<Hsp_hseq>FTNYHGGV</Hsp_hseq>
<Hsp_midline>FT+++GGV</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>2064</Statistics_eff-space>
<Statistics_kappa>0.11</Statistics_kappa>
<Statistics_lambda>0.294</Statistics_lambda>
<Statistics_entropy>0.61</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
..... More specifically:
FTDFQGGV
FT+++GGV
FTNYHGGV
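To check this across a whole result file rather than by eye, the longest identical run can be computed per HSP. This is only a minimal sketch, assuming the -outfmt 5 XML layout shown above (Hit, Hit_id, Hsp, Hsp_qseq, Hsp_hseq); the function names longest_identical_run and hsps_below_word_size are made up for illustration and are not part of BLAST or any library.

```python
import xml.etree.ElementTree as ET


def longest_identical_run(qseq, hseq):
    """Length of the longest stretch of consecutive identical aligned residues."""
    best = current = 0
    for q, h in zip(qseq, hseq):
        # A gap character never counts as an identity.
        if q == h and q != "-":
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def hsps_below_word_size(xml_text, word_size):
    """Yield (hit_id, qseq, hseq, run) for HSPs whose longest identical run is < word_size.

    Assumes the element names of BLAST's -outfmt 5 output.
    """
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id")
        for hsp in hit.iter("Hsp"):
            qseq = hsp.findtext("Hsp_qseq")
            hseq = hsp.findtext("Hsp_hseq")
            run = longest_identical_run(qseq, hseq)
            if run < word_size:
                yield hit_id, qseq, hseq, run


# The HSP from this post: identities are FT and GGV, so the longest run is 3.
print(longest_identical_run("FTDFQGGV", "FTNYHGGV"))  # 3
```

For a file on disk, the contents can be passed in as bytes, e.g. hsps_below_word_size(open("results.xml", "rb").read(), 4), where results.xml is a hypothetical file name.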
Any advice would be much appreciated!
Kind regards
Thys
Not a solution to this particular problem, but you could try adding
-task blastp-short
to the blastp command, as described in the NCBI BLAST help page.