How is sequence similarity different than percent identity from BLASTP?
2.1 years ago
Viraj • 0

Hi Everyone,

I am new to Biostars. I am having trouble finding a concrete answer to the post's question.

My understanding is that sequence similarity is the fraction of residues that are similar between two different protein sequences. Percent identity is the number of characters that match exactly between two different sequences.

I read that sequence similarity is strongly correlated with percent identity. I also read that it is a subset of percent identity. These two statements contradict each other.

Can someone help me distinguish between the two concepts? Thanks

Homology BLASTP • 11k views

Hi! Have you read this webpage? I think it is nicely explained :)


Thank you for the link.

Looking at the link and these sequences:

A: AAGGCTT

B: AAGGC

I understand this has 100% identity. How is this 60% similar?

Edit distance is the minimal number of edit operations (insertions, deletions, and substitutions) needed to transform one sequence into an exact copy of the other sequence being aligned.

Similarity = 1 - (edit distance / unaligned length of shorter sequence)

Therefore, similar = 1 - (2/2) or 1. Not sure how the author got 60%. Either the author made a typo in the similar definition or the math is wrong.

Can someone explain? Thanks.
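
For what it's worth, here is a minimal sketch of one reading of that formula that does reproduce 60%. It assumes "unaligned length of shorter sequence" means the full length of B (5 characters), and it uses plain Levenshtein distance as the edit distance. The function names `edit_distance` and `similarity` are made up for illustration and are not taken from the linked page.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into an empty string costs i deletions
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed reading: 1 - edit distance / full length of the shorter sequence."""
    shorter = min(len(a), len(b))
    return 1 - edit_distance(a, b) / shorter


seq_a = "AAGGCTT"
seq_b = "AAGGC"
print(edit_distance(seq_a, seq_b))  # two deletions (the trailing TT)
print(similarity(seq_a, seq_b))     # 1 - 2/5
```

Under this reading the edit distance is 2 and the denominator is 5, so the similarity is 1 - 2/5 = 60%, while the identity over the 5 aligned positions is still 100%.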

2.1 years ago
Mensur Dlakic ★ 28k

Sequence identity has a literal meaning that should be easy to understand. When the two sequences are aligned, any pair of residues is either identical or it isn't.

Sequence similarity is a broader term, and always includes identity. That means identical residues are always similar by definition, while the opposite is not necessarily true. Therefore, sequence similarity is equal to or greater than sequence identity. Similarity includes conservative substitutions that usually have positive scores in substitution matrices.

The alignment below has 430/432 identical residues (see under Identities) and 432/432 similar residues (see under Positives). If you look in the middle alignment row, similar residues have a + sign instead of residue letters (around positions 285 and 340). If the residues were not similar, there would be an empty space instead of +.

Score            Expect  Method                        Identities     Positives       Gaps
877 bits(2265)   0.0     Compositional matrix adjust.  430/432(99%)   432/432(100%)   0/432(0%)
Query  1    MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGK  60
            MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGK
Sbjct  1    MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGK  60

Query  61   HVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLD  120
            HVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLD
Sbjct  61   HVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLD  120

Query  121  RIRKLADQCTGLQGFLVFHsfgggtgsgftsLLMERLSVDYGKKSKLEFSIYPAPQVSTA  180
            RIRKLADQCTGLQGFLVFHSFGGGTGSGFTSLLMERLSVDYGKKSKLEFSIYPAPQVSTA
Sbjct  121  RIRKLADQCTGLQGFLVFHSFGGGTGSGFTSLLMERLSVDYGKKSKLEFSIYPAPQVSTA  180

Query  181  VVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITA  240
            VVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITA
Sbjct  181  VVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITA  240

Query  241  SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQLSVAEITNACFEPAN  300
            SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQL+VAEITNACFEPAN
Sbjct  241  SLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQLTVAEITNACFEPAN  300

Query  301  QMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKRSIQFVDWCPTGFKVGINYQPP  360
            QMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKR+IQFVDWCPTGFKVGINYQPP
Sbjct  301  QMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKRTIQFVDWCPTGFKVGINYQPP  360

Query  361  TVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSE  420
            TVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSE
Sbjct  361  TVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSE  420

Query  421  AREDMAALEKDY  432
            AREDMAALEKDY
Sbjct  421  AREDMAALEKDY  432
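
To make the counting concrete, here is a rough sketch of how Identities and Positives could be tallied for an aligned pair. It is not BLAST's own code. The helper name `count_identities_and_positives` and the `positive_pairs` set are made up for illustration, and the set holds only the S/T pair (which scores +1 in BLOSUM62) rather than a full substitution matrix. The example input is a 20-residue slice of the alignment above, around position 287.

```python
def count_identities_and_positives(query, subject, positive_pairs):
    """Count identical and similar residue pairs in two aligned sequences.

    positive_pairs: set of frozenset({x, y}) for non-identical residue pairs
    that score above zero in the chosen substitution matrix (an assumption
    supplied by the caller, not looked up from a real matrix here).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = 0
    positives = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue  # gap columns count as neither identical nor similar
        if q == s:
            identities += 1
            positives += 1  # identical residues are always counted as similar
        elif frozenset((q, s)) in positive_pairs:
            positives += 1  # conservative substitution, shown as '+'
    return identities, positives, len(query)


# Illustrative excerpt only: the S/T pair, which scores +1 in BLOSUM62.
positive_pairs = {frozenset(("S", "T"))}

query_slice = "AYHEQLSVAEITNACFEPAN"
sbjct_slice = "AYHEQLTVAEITNACFEPAN"

ident, pos, length = count_identities_and_positives(
    query_slice, sbjct_slice, positive_pairs
)
print(f"Identities = {ident}/{length}, Positives = {pos}/{length}")
```

For this slice the sketch reports 19/20 identities and 20/20 positives, which mirrors the full alignment: every identity is also a positive, so Positives can never be smaller than Identities.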

Thank you for the comprehensive answer! This means sequence similarity is the positive score and sequence identity is the identity score from BLAST.


This means sequence similarity is the positive score and sequence identity is the identity score from BLAST

Not quite. Sequence identity is a fraction of identical residues, and similarity is a fraction of similar residues. BLAST provides a single score that includes everything rather than breaking it down by identity or similarity.


In your example above, the identical residues are 430/432 and the similar residues are 432/432. Similar residues include the identical residues plus the similar residues marked with a plus sign. Is this not right?


It is right, but they are not scores in the same sense as bit-score. They are fractions.


I see. Thank you for your answers and clarification. I appreciate the help!
