Is there an existing tool that generates a matrix of pairwise protein identities/similarities from an input consisting of multiple protein sequences?
I have not found a working solution for macOS/Unix. The one tool I did find, MatGAT, does not work for me because I could only find executables for Windows.
I'm aware that one solution is to run pairwise alignments for every pair of proteins in the input file, parse the results, and arrange them into a table. I'm trying to avoid this for now because, with my current skills, writing such a script would take me a lot of time.
UPDATE: To be more specific, I'm looking for percent protein sequence identities from global sequence alignment (such as the % similarity/identity values reported by https://www.ebi.ac.uk/Tools/psa/emboss_needle/).
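For anyone who does want to script the all-pairs approach described in the question, here is a minimal sketch in pure Python. It is not what EMBOSS needle does internally: it assumes a simple Needleman-Wunsch global alignment with fixed match/mismatch scores and a linear gap penalty (needle uses a substitution matrix such as BLOSUM62 and affine gaps), and it assumes percent identity means identical columns divided by alignment length. The function names and scoring values are my own choices for illustration, so the numbers will not match needle exactly.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Scoring values are illustrative assumptions, not EMBOSS defaults.
    Returns the two aligned strings (with '-' for gaps).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns / alignment length * 100 (one common definition)."""
    x, y = global_align(a, b)
    if not x:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * same / len(x)


def identity_matrix(seqs):
    """seqs: dict of name -> sequence. Returns (names, square matrix)."""
    names = list(seqs)
    size = len(names)
    matrix = [[100.0] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):
        pid = percent_identity(seqs[names[i]], seqs[names[j]])
        matrix[i][j] = matrix[j][i] = pid
    return names, matrix


if __name__ == "__main__":
    # Made-up toy peptides, only to show the table layout.
    demo = {"p1": "MKTAYIAK", "p2": "MKTAYIAR", "p3": "MKAYIAK"}
    names, matrix = identity_matrix(demo)
    print("\t" + "\t".join(names))
    for name, row in zip(names, matrix):
        print(name + "\t" + "\t".join(f"{v:.1f}" for v in row))
```

The alignment is quadratic in sequence length and the matrix is quadratic in the number of sequences, so this is only practical for small inputs. For real data a tested aligner is the safer choice.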
What is a good threshold on percent identity (as produced by Clustal Omega) for deciding that two sequences are similar? What is the minimum identity that indicates a good match? How do you interpret the numbers? Thank you!
There is no magic number. It depends on the context and the question, and it differs between protein and DNA. You have to decide what 'similarity' means in the context of your underlying question.
Thank you!
I was looking to solve a similar problem: making a table of percent identity/percent matching for every pairwise comparison of 189 peptide sequences, WITHOUT/BEFORE any multiple sequence alignment (MSA).
The command line code/operation that you provided above worked well, thank you.
I used the Windows 64-bit precompiled binary of Clustal Omega downloaded from here: http://www.clustal.org/omega/
This README page also has complementary details on the individual command-line options: https://github.com/hybsearch/clustalo/blob/master/README
It reads:
"In order to produce a multiple alignment Clustal-Omega requires a guide tree which defines the order in which sequences/profiles are aligned. A guide tree in turn is constructed, based on a distance matrix. Conventionally, this distance matrix is comprised of all the pair-wise distances of the sequences. The distance measure Clustal-Omega uses for pair-wise distances of un-aligned sequences is the k-tuple measure [4], which was also implemented in Clustal 1.83 and ClustalW2 [5,6]..." etc.
--full
Use full distance matrix for guide-tree calculation (slow; mBed is default)
--percent-id
convert distances into percent identities (default no)
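In case it helps others, here is a small Python sketch of how the two options above can be combined with --distmat-out to write the matrix to a file, and how that file might then be read back into a table. The file names are placeholders. The parser assumes the matrix file starts with a line holding the number of sequences, followed by one line per sequence with its name and the values separated by whitespace. That layout is my assumption, so check it against your own output.

```python
def clustalo_command(fasta_in, matrix_out, aln_out):
    """Build a Clustal Omega argument list that writes a percent-identity matrix.

    The file names are placeholders. --distmat-out is used together with
    --full so that the complete pairwise matrix is calculated.
    """
    return [
        "clustalo",
        "-i", fasta_in,
        "-o", aln_out,
        "--distmat-out=" + matrix_out,
        "--full",
        "--percent-id",
        "--force",  # overwrite existing output files
    ]


def parse_matrix(text):
    """Parse matrix text into (names, rows).

    Assumed layout: first non-empty line is the sequence count, then one
    line per sequence: a name followed by whitespace-separated numbers.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0].strip())
    names, rows = [], []
    for ln in lines[1:count + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    if len(rows) != count or any(len(r) != count for r in rows):
        raise ValueError("matrix is not square; the assumed layout may be wrong")
    return names, rows


if __name__ == "__main__":
    import subprocess

    # Requires the clustalo binary on PATH; all three file names are hypothetical.
    subprocess.run(clustalo_command("peptides.fasta", "pim.txt", "peptides.aln"),
                   check=True)
    with open("pim.txt") as handle:
        names, rows = parse_matrix(handle.read())
    for name, row in zip(names, rows):
        print(name, *(f"{v:.1f}" for v in row), sep="\t")
```

Keep in mind that Clustal Omega derives these values from its own alignment step, so they are not guaranteed to match the percent identities from a pairwise global aligner such as EMBOSS needle.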