Question

Make matrix of protein pairwise identities/similarities from multiple protein sequences

7.2 years ago

al-ash ▴ 210

Is there an already existing tool to generate a matrix of pairwise protein identities/similarities for an input which consists of multiple protein sequences?

I did not find a working solution for macOS/Unix (MatGAT does not work for me, because I could only find executables for Windows).

I'm aware that one solution is to run pairwise alignments for all pairwise combinations of proteins in the input file, parse the results, and arrange them into a table. I'm trying to avoid this for now, because with my current skills it would take me a lot of time to write such a script.

UPDATE: To be more specific, I'm looking for % protein sequence identities from global sequence alignment (such as the % similarities/identities reported by https://www.ebi.ac.uk/Tools/psa/emboss_needle/).
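The all-against-all approach mentioned in the question can be sketched in a few lines of Python. This is only an illustration, not what EMBOSS Needle does: it uses a plain Needleman-Wunsch global alignment with an assumed scoring of +1/-1/-1 for match/mismatch/gap (Needle uses a substitution matrix with affine gap penalties), and it defines % identity as identical columns divided by alignment length. All function names are made up for this example.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch with a linear gap penalty; returns both aligned strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns as a percentage of the alignment length."""
    aln_a, aln_b = global_align(a, b)
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)


def identity_matrix(seqs):
    """seqs: dict of name -> sequence. Returns (names, symmetric matrix)."""
    names = list(seqs)
    idx = {name: k for k, name in enumerate(names)}
    matrix = [[100.0] * len(names) for _ in names]
    for p, q in combinations(names, 2):
        pid = percent_identity(seqs[p], seqs[q])
        matrix[idx[p]][idx[q]] = matrix[idx[q]][idx[p]] = pid
    return names, matrix


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    toy = {"s1": "MKTAYIAK", "s2": "MKTAYLAK", "s3": "MKAYIK"}
    names, matrix = identity_matrix(toy)
    print("\t" + "\t".join(names))
    for name, row in zip(names, matrix):
        print(name + "\t" + "\t".join(f"{v:.1f}" for v in row))
```

The dynamic-programming table is quadratic in sequence length and the number of pairs is quadratic in the number of sequences, so this is only practical for small inputs; a dedicated aligner is the better choice for real data.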

pairwise protein identity similarity matrix • 15k views

ADD COMMENT • link updated 3.4 years ago by sahudson777 • 0 • written 7.2 years ago by al-ash ▴ 210

score 1 · Answer 1 · 2018-03-02

Phylip uses its own special interleaved sequence alignment format, which is neither FASTA nor CLUSTAL format, but you can find programs that will convert to it. The Phylip format is well known and quite old (1980s).

The advantage of Phylip's protdist over Clustal is that it gives corrected (scaled) protein distances, not raw similarities/distances. As protein identity goes down (below 50% identity, which is very high for proteins), the distances go up exponentially, so that a 50% identical sequence might have a distance of PAM70, while a 30% identical sequence could be PAM160, and a 20% identical sequence PAM250. protdist converts the observed protein distance into a corrected evolutionary distance, using one of several evolutionary models.
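To illustrate how a correction of this kind behaves, here is a minimal sketch using Kimura's (1983) approximation, d = -ln(1 - p - 0.2p^2), where p is the observed fraction of differing residues. protdist offers this formula as one of its options, but its matrix-based models will give somewhat different numbers, so treat the output as indicative only. The function name is made up for this example.

```python
import math


def kimura_protein_distance(percent_identity):
    """Corrected distance (substitutions per site) from observed % identity.

    Uses Kimura's approximation d = -ln(1 - p - 0.2 * p**2), where p is the
    observed fraction of differing residues. The formula is undefined once
    p exceeds roughly 0.85, i.e. below about 15% identity.
    """
    p = 1.0 - percent_identity / 100.0
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0:
        raise ValueError("identity too low for Kimura's approximation")
    return -math.log(arg)


if __name__ == "__main__":
    for pid in (90, 70, 50, 30, 20):
        d = kimura_protein_distance(pid)
        # Multiply by 100 to express the distance per 100 residues (PAM-like units).
        print(f"{pid}% identity -> about {100 * d:.0f} substitutions per 100 residues")
```

The corrected distance grows much faster than the observed difference as identity falls, which is the effect described above.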

score 1 · Answer 2 · 2018-11-06

6.5 years ago

al-ash ▴ 210

I ended up with the following command-line solution using Clustal Omega, which converts the distance matrix into a percent identity matrix:

clustalo-1.2.4-Ubuntu-x86_64 --full --percent-id --distmat-out=output.distmat -i input.aa.fa
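For downstream work, the matrix file can be read back into Python. The sketch below assumes the file uses a PHYLIP-style layout (first line: number of sequences; then one row per sequence, starting with the sequence name followed by its values); check this against your own output.distmat before relying on it. The function name and the sample values are made up for illustration.

```python
def parse_distmat(text):
    """Parse a square, PHYLIP-style matrix into (names, rows of floats)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    if len(rows) != n or any(len(row) != n for row in rows):
        raise ValueError("matrix is not square or does not match the header count")
    return names, rows


# Invented sample in the assumed layout; real files come from --distmat-out.
sample = """3
seqA 100.000000 62.500000 40.000000
seqB 62.500000 100.000000 45.000000
seqC 40.000000 45.000000 100.000000
"""

names, matrix = parse_distmat(sample)
for name, row in zip(names, matrix):
    print(name, row)

# For a real run, something like:
#   with open("output.distmat") as fh:
#       names, matrix = parse_distmat(fh.read())
```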

ADD COMMENT • link 6.5 years ago by al-ash ▴ 210

What is a good threshold on percent identity (as produced by Clustal Omega) for saying that two sequences are similar? What is the minimum identity that indicates a good match? How do you interpret the numbers? Thank you!

ADD REPLY • link 3.9 years ago by taojincs ▴ 50

There is no magic number. It is context- and question-dependent, and different for protein and DNA. You have to decide what 'similarity' means in the context of your underlying question.

ADD REPLY • link 3.9 years ago by Joe 22k

Thank you!

I was looking to solve a similar problem: making a table of percent identity/percent matching for every pairwise comparison of 189 peptide sequences, WITHOUT/BEFORE any multiple sequence alignment (MSA).

The command line code/operation that you provided above worked well, thank you.

I used the Windows 64-bit precompiled binary of Clustal Omega downloaded from here: http://www.clustal.org/omega/

This README also gives further details on the individual command-line options: https://github.com/hybsearch/clustalo/blob/master/README

It reads:

"In order to produce a multiple alignment Clustal-Omega requires a guide tree which defines the order in which sequences/profiles are aligned. A guide tree in turn is constructed, based on a distance matrix. Conventionally, this distance matrix is comprised of all the pair-wise distances of the sequences. The distance measure Clustal-Omega uses for pair-wise distances of un-aligned sequences is the k-tuple measure [4], which was also implemented in Clustal 1.83 and ClustalW2 [5,6]..." etc.

--full

Use full distance matrix for guide-tree calculation (slow; mBed is default)

--percent-id

convert distances into percent identities (default no)

ADD REPLY • link 3.4 years ago by sahudson777 • 0

score 0 · Answer 3 · 2018-03-01

7.2 years ago

Joe 22k

Are you looking for something like a Position Specific Score Matrix? If so, Biopython can already build this for you.

http://biopython.org/DIST/docs/api/Bio.Align.AlignInfo.PSSM-class.html

ADD COMMENT • link 7.2 years ago by Joe 22k

score 0 · Answer 4 · 2018-03-01

7.2 years ago

Bill Pearson ★ 1.1k

The Phylip program package (http://evolution.genetics.washington.edu/phylip/getme-new1.html), which uses an unfortunate format for multiple sequence alignment, includes "protdist", which does exactly what you want, and converts from observed distance to evolutionary distance.

ADD COMMENT • link 7.2 years ago by Bill Pearson ★ 1.1k

Not having used Phylip before, I'm a bit confused by its documentation. According to http://evolution.genetics.washington.edu/phylip/doc/protdist.html the "program uses protein sequences", which suggested to me that the input is a multi-FASTA file, but from what you wrote it seems that the input is actually a multiple alignment (?). I'm also not sure that the output can be % identities and/or similarities (please see my updated question; I was apparently not clear enough).

ADD REPLY • link 7.2 years ago by al-ash ▴ 210

I believe Clustal can report pairwise identities, but it won't write you a matrix; you'd still have to parse that out yourself.

ADD REPLY • link 7.2 years ago by Joe 22k

You are right! Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) directly gives a sequence % identity matrix (Result Summary -> Percent Identity Matrix in the web interface).