Question

python calculation of protein multiple sequence alignment

0

Entering edit mode

4.2 years ago

ttubiana ▴ 30

Dear all :-)

I am trying to compute the conservation score of each position of a protein multiple sequence alignment. I already used the Shannon entropy, but I am not satisfied with it since it is not similarity-based but identity only. So I thought that maybe it could be a good idea to use a substitution matrix. I tried to implement two methods:

Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers (https://doi.org/10.1002/1097-0134(20010101)42:1%3C108::AID-PROT110%3E3.0.CO;2-O)
the "sum-of-pairs" method from AL2CO (https://doi.org/10.1093/bioinformatics/17.8.700)

The first method gives me wrong results (maybe because I used BLOSUM62 instead of PET91 used in the article...). The second method (AL2CO) doesn't give me satisfying results.

In practice, I would like a score in [0,1] with some sensitivity to sequence redundancy. I have a workflow in python that process my alignment and calculate properties, so I try as much as possible to avoid external tools...

Do you have some bits of advice or maybe a hidden magick package that I didn't found :-)?

Best,
Thibault.

python alignment conservation • 4.5k views

ADD COMMENT • link updated 2.1 years ago by hugo.avila ▴ 530 • written 4.2 years ago by ttubiana ▴ 30

score 4 · Answer 1 · 2020-09-25

I think what you're looking for is contained in scipy.spatial.distance:

https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

Among these measurements, I would suggest you to have a look at the Jaccard distance. It fits your requirement of the [0,1] range, and it's pretty easy to calculate. A Jaccard distance is calculated as the intersection over the union of two sets.

Since you're calculating distance between protein sequences in an MSA, you could consider each protein of a pairwise comparison as a set of ordered characters. The intersection of the two sets is the equal characters (i.e. the equal amino acids at the same position) between the two sets. The union is the total length of the pairwise alignment, gaps included.

EDIT:

One clarification: this calculation for J gives you a value that is closer to 1 the closer sequences are. So if you are measuring distance, you will have to do 1-J.

score 1 · Answer 2 · 2021-08-30

1

Entering edit mode

3.3 years ago

cpcantalapiedra ▴ 170

Hi,

maybe this may help

https://github.com/Cantalapiedra/msa_conservation_index

It is similarity based and yields values between 0 and 1.

Check

https://github.com/Cantalapiedra/msa_conservation_index#how-it-works

Best

ADD COMMENT • link 3.3 years ago by cpcantalapiedra ▴ 170

0

Entering edit mode

Hey cool script, thanks for making it !

ADD REPLY • link 2.1 years ago by hugo.avila ▴ 530