Substitution matrices to score variation between protein sequences?
8.6 years ago
nchuang ▴ 260

I'm trying to understand substitution matrices. They seem to be a scoring scheme for alignments, particularly when you are looking for homology?

I am trying to see whether they would be applicable when looking at mutations between proteins from different people. Since my sequences are very similar, with only 1 or 2 mutations between them, would a substitution matrix probably not be applicable here? I am assuming that for a nonsynonymous mutation between two sequences, the matrix (say BLOSUM62) would give me a score based on how likely that substitution is to occur in nature. Are there other ways to interpret these scoring matrices?

alignment blosum62 • 2.7k views
8.6 years ago
Steven Lakin ★ 1.8k

Before we go into your question: given that your proteins are so similar, it may be best and most concise to simply describe the exact SNP sites and leave it at that. However, here are the differences between PAM and BLOSUM:

BLOSUM (BLOcks SUbstitution Matrix) matrices were derived by looking at alignments of highly conserved protein domains at different evolutionary distances, then counting how frequently one amino acid was substituted for another. They are described in the paper by Henikoff and Henikoff (1992). They are based on local, ungapped alignments of conserved protein regions (the "blocks").

PAM (Point Accepted Mutation) matrices were first described by Margaret Dayhoff (who was a fantastic scientist, even in the face of the challenges of her role given the time period). "Each entry in a PAM matrix indicates the likelihood of the amino acid of that row being replaced with the amino acid of that column through a series of one or more point accepted mutations during a specified evolutionary interval, rather than these two amino acids being aligned due to chance." They are based on global alignments.
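One way to read the "rather than ... due to chance" part of that quote: each matrix entry is a scaled, rounded log-odds ratio, so a score can be turned back into an approximate odds ratio. Here is a minimal sketch, assuming scores in half-bit units (the scaling BLOSUM62 uses; other matrices may be scaled differently). The function name `odds_ratio` and its `units_per_bit` parameter are made up for illustration.

```python
def odds_ratio(score, units_per_bit=2):
    """Convert a log-odds score back to an approximate odds ratio.

    Assumes score = units_per_bit * log2(observed / expected_by_chance),
    rounded to an integer. units_per_bit=2 means half-bit units.
    """
    return 2 ** (score / units_per_bit)


# A score of 0 means the pair is aligned about as often as chance predicts.
print(odds_ratio(0))
# Positive scores: seen more often than chance. Negative: less often.
print(odds_ratio(2))
print(odds_ratio(-4))
```

Because the published scores are rounded integers, the ratios recovered this way are only approximate.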

In short, this is what matters about the differences between the two:

  1. PAM matrices are typically used on more closely related proteins (such as your case); BLOSUM matrices are typically used on more evolutionarily divergent proteins.
  2. The greater the PAM number, the more DISTANT the sequences being compared should be; the greater the BLOSUM number, the more SIMILAR the sequences being compared should be.

So for your application, if you were to use these, you should use either a LOW PAM number or a HIGH BLOSUM number. Whether this is appropriate depends on what you want to get out of it (e.g. the whole-protein difference or just local protein domain differences). You're right that these matrices are typically used for alignment scoring, but they can also be used to generate an evolutionary cost or distance. However, there may be better methods for your purpose if you look into methods for building distance trees from some metric.
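If you do want to try it, scoring the one or two differing positions is simple once the sequences are aligned. Below is a rough sketch rather than a recommended pipeline: the dictionary holds only a handful of BLOSUM62 entries (a real analysis would load the full matrix), the two sequences are invented, and the names `BLOSUM62_EXCERPT`, `pair_score` and `score_substitutions` are hypothetical helpers.

```python
# A few BLOSUM62 entries only, keyed by alphabetically sorted residue pair.
# This is an excerpt for illustration; load the full matrix for real work.
BLOSUM62_EXCERPT = {
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("L", "P"): -3,
    ("G", "W"): -2,
}


def pair_score(a, b):
    """Look up a symmetric score; returns None if the pair is not in the excerpt."""
    return BLOSUM62_EXCERPT.get(tuple(sorted((a, b))))


def score_substitutions(ref, alt):
    """List (position, ref_residue, alt_residue, score) for each mismatch.

    Assumes ref and alt are already aligned, gap-free and of equal length.
    Positions are 1-based.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b, pair_score(a, b))
        for pos, (a, b) in enumerate(zip(ref, alt), start=1)
        if a != b
    ]


# Invented example sequences differing at positions 3 and 6.
for hit in score_substitutions("MKVLDG", "MKILDW"):
    print(hit)
```

Here V to I scores positive (a conservative swap) while G to W scores negative, which is the kind of per-site ranking a matrix can give you. It says how common a substitution is across protein families in general, not whether it matters at that particular site.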

Fantastic answer !!

Wow, this really clears it up. I read the intro to Biological Sequence Analysis by Durbin and understood it, but didn't know how it was applied.

I am trying to figure out whether these SNPs affect function, and was hoping a substitution matrix might offer some surrogate value.
