Question

Substitution matrices to score variation between protein sequences?


8.6 years ago

nchuang ▴ 260

I'm trying to understand substitution matrices. They seem to be a scoring scheme for alignments, particularly if you are looking for homology?

I am trying to see whether they would be applicable when looking at mutations between proteins from different people. Since my sequences are very similar, with only 1 or 2 mutations between them, a substitution matrix would probably not be applicable here? I am assuming that if there is a nonsynonymous mutation between two sequences, the matrix (say BLOSUM62) would give me a score based on how likely that substitution is to occur in nature? Are there other ways to interpret these scoring matrices?

alignment blosum62 • 2.7k views

updated 8.6 years ago by Steven Lakin ★ 1.8k • written 8.6 years ago by nchuang ▴ 260

score 5 · Accepted Answer · 2016-05-06

Before we go into your question, it may be best and most concise to simply describe the exact SNP sites and leave it at that, given that your proteins are so similar. However, here are the differences in PAM and BLOSUM:

BLOSUM (BLOcks SUbstitution Matrix) matrices were derived from alignments of highly conserved protein domains at different levels of evolutionary divergence, counting how frequently one amino acid was substituted for another. They are described in the paper by Henikoff and Henikoff (1992) and are based on local alignments of conserved protein regions.
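As a rough illustration of how a single matrix entry arises from substitution frequencies, here is a minimal sketch of the log-odds calculation. The function name and the frequencies are made up for illustration; BLOSUM62 reports scores in half-bit units, which is what `units_per_bit=2` assumes.

```python
import math

def log_odds_score(observed_pair_freq, expected_pair_freq, units_per_bit=2):
    """Rounded log-odds score for one amino acid pair.

    observed_pair_freq: how often the pair is seen aligned in the blocks.
    expected_pair_freq: how often it would be aligned by chance, given
                        the background amino acid frequencies.
    units_per_bit=2 yields half-bit units.
    """
    ratio = observed_pair_freq / expected_pair_freq
    return round(units_per_bit * math.log2(ratio))

# Hypothetical frequencies, NOT taken from the real BLOCKS data.
common_swap = log_odds_score(0.02, 0.005)   # seen 4x more often than chance
rare_swap = log_odds_score(0.0025, 0.01)    # seen 4x less often than chance

print(common_swap, rare_swap)
```

A positive score means the pair is aligned more often than chance would predict, and a negative score means it is aligned less often.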

PAM (Point Accepted Mutation) matrices were first described by Margaret Dayhoff (who was a fantastic scientist, even in the face of the challenges of her role given the time period). "Each entry in a PAM matrix indicates the likelihood of the amino acid of that row being replaced with the amino acid of that column through a series of one or more point accepted mutations during a specified evolutionary interval, rather than these two amino acids being aligned due to chance." They are based on global alignment.
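The "series of one or more point accepted mutations" idea can be sketched with a toy example: a PAM-n mutation probability matrix is obtained by multiplying the PAM1 matrix by itself n times. The two-letter alphabet and the 1% mutation rate below are assumptions chosen to keep the example small. Real PAM matrices cover 20 amino acids and are then converted to log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Toy "PAM1" over a hypothetical two-residue alphabet:
# each residue has a 1% chance of changing per evolutionary step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 30, 250):
    stay = mat_pow(TOY_PAM1, n)[0][0]
    print(f"toy PAM{n}: probability a residue is unchanged = {stay:.3f}")
```

The probability of seeing the original residue drops as n grows, which is why higher PAM numbers suit more distant sequences.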

In short, this is what matters about the differences between the two:

PAM matrices are typically used on more closely related proteins (such as your case), while BLOSUM matrices are typically used on more evolutionarily divergent proteins.
The greater the PAM number the more DISTANT the sequences being compared should be; the greater the BLOSUM number, the more SIMILAR the sequences being compared should be.

So for your application, if you were to use these, you should use either a LOW-numbered PAM matrix (e.g. PAM30) or a HIGH-numbered BLOSUM matrix (e.g. BLOSUM80 or BLOSUM90). Whether this is appropriate depends on what you want to get out of it (e.g. differences across the whole protein or just within local protein domains). You're right that these matrices are typically used for alignment scoring, but they can also be used to generate an evolutionary cost or distance. However, there may be better methods for your purpose if you look into methods for building distance trees from some metric.
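If you do want a per-substitution score for the one or two sites that differ, a minimal sketch could look like the following. It hard-codes only a handful of BLOSUM62 entries, the sequences are hypothetical, and it assumes the two sequences are already aligned without gaps. For real work you would load a full matrix, for example with Biopython's `Bio.Align.substitution_matrices.load("BLOSUM62")`.

```python
# A few BLOSUM62 entries (half-bit log-odds scores), for illustration only.
BLOSUM62_SUBSET = {
    ("I", "V"): 3,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("L", "D"): -4,
    ("G", "W"): -2,
}

def lookup(a, b, table=BLOSUM62_SUBSET):
    """Score for a residue pair in either order; None if not in the subset."""
    if (a, b) in table:
        return table[(a, b)]
    return table.get((b, a))

def score_differences(ref, alt):
    """List (position, ref_residue, alt_residue, score) for each mismatch.

    Positions are 1-based. Sequences must have equal length (no gaps).
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be the same length")
    return [(pos, r, a, lookup(r, a))
            for pos, (r, a) in enumerate(zip(ref, alt), start=1)
            if r != a]

# Hypothetical sequences differing at two sites.
reference = "MKTIIALSD"
variant = "MKTVIALSE"

for pos, r, a, score in score_differences(reference, variant):
    print(f"{r}{pos}{a}: BLOSUM62 score {score}")
```

Conservative changes such as I to V or D to E score positive, while a change such as L to D scores strongly negative, so the score is a rough proxy for how often that substitution is tolerated in conserved regions.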