Question

How is sequence similarity different from percent identity in BLASTP?

0

2.7 years ago

Viraj • 0

Hi Everyone,

I am new to Biostars. I am having trouble finding a concrete answer to the post's question.

My understanding is that sequence similarity is the fraction of residues that are similar between two protein sequences, while percent identity is the percentage of residues that match exactly between them.

I read that sequence similarity is strongly correlated with percent identity. I also read that it is a subset of percent identity. These two statements seem to contradict each other.

Can someone help me distinguish between the two concepts? Thanks

Homology BLASTP • 19k views

1

Hi! Have you read this webpage? I think it is nicely explained :)

ADD REPLY • link 2.7 years ago by iraun 6.2k

0

Thank you for the link.

Looking at the link and these sequences:

A: AAGGCTT

B: AAGGC

I understand this has 100% identity. How is this 60% similar?

Edit distance is the minimal number of edit operations (insertions, deletions, and substitutions) needed to transform one sequence into an exact copy of the other sequence being aligned.

Similarity = 1 - (edit distance / unaligned length of the shorter sequence)

Therefore, similar = 1 - (2/2) or 1. Not sure how the author got 60%. Either the author made a typo in the similar definition or the math is wrong.

Can someone explain? Thanks.

ADD REPLY • link 2.7 years ago by Viraj • 0
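
The definition quoted in the comment above can be checked numerically. The following is only a minimal sketch in Python, assuming the linked page means the standard Levenshtein edit distance; the function names `edit_distance` and `similarity` are made up for illustration. For A and B, the edit distance is 2 (delete the trailing TT) and the shorter sequence has length 5, so the formula gives 1 - 2/5 = 0.6, which is consistent with the 60% figure.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity as quoted above: 1 - edit distance / shorter length."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))


seq_a = "AAGGCTT"
seq_b = "AAGGC"
print(edit_distance(seq_a, seq_b))         # 2
print(round(similarity(seq_a, seq_b), 2))  # 0.6
```

Note that this edit-distance "similarity" is a different quantity from the Positives column that BLASTP reports, which is based on substitution matrix scores.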

score 1 · Accepted Answer · 2022-10-07

Sequence identity has a literal meaning that should be easy to understand. When the two sequences are aligned, any pair of residues is either identical or it is not.

Sequence similarity is a broader term, and always includes identity. That means identical residues are always similar by definition, while the opposite is not necessarily true. Therefore, sequence similarity is equal to or greater than sequence identity. Similarity includes conservative substitutions that usually have positive scores in substitution matrices.

The alignment below has 430/432 identical residues (see Identities) and 432/432 similar residues (see Positives). In the middle row of each alignment block, a similar but non-identical pair is marked with a + sign instead of a residue letter (positions 287 and 340, both S/T substitutions). If the residues were not similar, there would be an empty space instead of the +.

Score           Expect  Method                          Identities    Positives      Gaps
877 bits(2265)  0.0     Compositional matrix adjust.    430/432(99%)  432/432(100%)  0/432(0%)
Query  1    MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGK  60
            MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGK
Sbjct  1    MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGK  60

Query  61   HVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLD  120
            HVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLD
Sbjct  61   HVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLD  120

Query  121  RIRKLADQCTGLQGFLVFHsfgggtgsgftsLLMERLSVDYGKKSKLEFSIYPAPQVSTA  180
            RIRKLADQCTGLQGFLVFHSFGGGTGSGFTSLLMERLSVDYGKKSKLEFSIYPAPQVSTA
Sbjct  121  RIRKLADQCTGLQGFLVFHSFGGGTGSGFTSLLMERLSVDYGKKSKLEFSIYPAPQVSTA  180

Query  181  VVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITA  240
            VVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITA
Sbjct  181  VVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITA  240

Query  241  SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQLSVAEITNACFEPAN  300
            SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQL+VAEITNACFEPAN
Sbjct  241  SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQLTVAEITNACFEPAN  300

Query  301  QMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKRSIQFVDWCPTGFKVGINYQPP  360
            QMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKR+IQFVDWCPTGFKVGINYQPP
Sbjct  301  QMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKRTIQFVDWCPTGFKVGINYQPP  360

Query  361  TVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSE  420
            TVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSE
Sbjct  361  TVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSE  420

Query  421  AREDMAALEKDY  432
            AREDMAALEKDY
Sbjct  421  AREDMAALEKDY  432
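
The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not how BLASTP computes its statistics internally: it reads the middle row of one alignment block, treats letters as identities, and treats letters plus + signs as positives. The function name `count_midline` is hypothetical, and the midline string is copied from the 241-300 block above.

```python
def count_midline(midline):
    """Return (identities, positives, aligned_length) for a BLASTP midline.

    Letters mark identical pairs, '+' marks similar but non-identical
    pairs, and spaces mark pairs that are neither (or gap columns).
    """
    identities = sum(1 for ch in midline if ch.isalpha())
    positives = identities + midline.count("+")
    return identities, positives, len(midline)


# Middle row of the Query 241-300 block from the alignment above
midline = "SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQL+VAEITNACFEPAN"

identities, positives, length = count_midline(midline)
print(f"Identities: {identities}/{length}")  # Identities: 59/60
print(f"Positives:  {positives}/{length}")   # Positives:  60/60
```

Because every identical pair is also counted as a positive, the positives count can never be smaller than the identities count.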