Hi,
Here is the Atchley factor matrix (one row per amino acid, columns are Factors 1 to 5):
A;-0,591;-1,302;-0,733;1,57;-0,146
C;-1,343;0,465;-0,862;-1,020;-0,255
D;1,05;0,302;-3,656;-0,259;-3,242
E;1,357;-1,453;1,477;0,113;-0,837
F;-1,006;-0,590;1,891;-0,397;0,412
G;-0,384;1,652;1,33;1,045;2,064
H;0,336;-0,417;-1,673;-1,474;-0,078
I;-1,239;-0,547;2,131;0,393;0,816
K;1,831;-0,561;0,533;-0,277;1,648
L;-1,019;-0,987;-1,505;1,266;-0,912
M;-0,663;-1,524;2,219;-1,005;1,212
N;0,945;0,828;1,299;-0,169;0,933
P;0,189;2,081;-1,628;0,421;-1,392
Q;0,931;-0,179;-3,005;-0,503;-1,853
R;1,538;-0,055;1,502;0,44;2,897
S;-0,228;1,399;-4,760;0,67;-2,647
T;-0,032;0,326;2,213;0,908;1,313
V;-1,337;-0,279;-0,544;1,242;-1,262
W;-0,595;0,009;0,672;-2,128;-0,184
Y;0,26;0,83;3,097;-0,838;1,512
Factor 1 is termed the polarity index. It correlates well with a large variety of descriptors including the number of hydrogen bond donors, polarity versus nonpolarity, and hydrophobicity versus hydrophilicity.
Factor 2 is a secondary structure index. It represents the propensity of an amino acid to occur in a particular type of secondary structure, such as a coil, turn or bend, versus its frequency in an α-helix.
Factor 3 is correlated with molecular size, volume and molecular weight.
Factor 4 reflects the number of codons coding for an amino acid and amino acid composition. These attributes are related to various physical properties including refractivity and heat capacity.
Factor 5 is related to the electrostatic charge.
I wrote some code to substitute each amino acid in my MSA with its numeric values, so I get about 1860 variables named F1_1, F1_2, ..., F1_5, F2_1, F2_2, ..., F2_5, and so on.
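For anyone who wants to reproduce the substitution step, it could look roughly like the sketch below. This is only an illustration, not my actual code: the names `ATCHLEY_TEXT`, `parse_factors` and `encode_msa`, the toy three-sequence alignment, the reading of the variable names as F<column>_<factor>, and the choice to encode gaps or unknown residues as five zeros are all assumptions. Only four rows of the matrix are included to keep it short.

```python
# Four rows copied from the matrix above (semicolon-separated, decimal commas).
ATCHLEY_TEXT = """\
A;-0,591;-1,302;-0,733;1,57;-0,146
C;-1,343;0,465;-0,862;-1,020;-0,255
D;1,05;0,302;-3,656;-0,259;-3,242
E;1,357;-1,453;1,477;0,113;-0,837
"""


def parse_factors(text):
    """Parse 'X;f1;f2;f3;f4;f5' lines into {residue: [f1, ..., f5]}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        residue, *values = line.strip().split(";")
        # Convert decimal commas to decimal points before parsing.
        table[residue] = [float(v.replace(",", ".")) for v in values]
    return table


def encode_msa(sequences, table, n_factors=5):
    """Replace every residue by its factors; returns (column names, rows).

    Assumption: gaps and residues missing from the table become zeros.
    """
    length = len(sequences[0])
    names = [f"F{pos}_{k}"
             for pos in range(1, length + 1)
             for k in range(1, n_factors + 1)]
    rows = []
    for seq in sequences:
        row = []
        for aa in seq:
            row.extend(table.get(aa, [0.0] * n_factors))
        rows.append(row)
    return names, rows


factors = parse_factors(ATCHLEY_TEXT)
# Toy alignment (hypothetical): 3 sequences, 3 columns, one gap.
names, X = encode_msa(["ACD", "AED", "A-E"], factors)
print(names[:6])
print(X[0][:5])
```

With a real alignment, every column yields five variables, which is how the variable count grows to five times the alignment length.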
Because of the large number of variables (each column in the MSA × 5), I performed a PCA using NIPALS.
I got 11 PCs.
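To make clear what I mean by NIPALS, here is a minimal pure-Python sketch of the algorithm. It is not the software I actually used. The function name `nipals_pca`, the tolerance, the iteration cap and the toy matrix `X_demo` are assumptions for illustration, and the sketch assumes complete data and fewer components than the rank of the centred matrix.

```python
def nipals_pca(X, n_components, tol=1e-12, max_iter=1000):
    """NIPALS PCA on a list-of-rows matrix.

    Returns (scores, loadings): one list per component, with scores of
    length n_rows and loadings of length n_columns.
    Assumption: no missing values, n_components below the matrix rank.
    """
    n, m = len(X), len(X[0])
    # Centre each column; NIPALS then works on the residual matrix E.
    means = [sum(row[j] for row in X) / n for j in range(m)]
    E = [[row[j] - means[j] for j in range(m)] for row in X]
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        start = max(range(m), key=lambda j: sum(r[j] ** 2 for r in E))
        t = [r[start] for r in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: p = E^T t / (t^T t), scaled to unit length.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = sum(v * v for v in p) ** 0.5
            p = [v / norm for v in p]
            # Score: t = E p.
            t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the component just found from E.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        scores.append(t)
        loadings.append(p)
    return scores, loadings


# Toy data (hypothetical): 4 sequences x 3 variables.
X_demo = [[2.0, 0.5, 1.0],
          [0.0, 1.5, -1.0],
          [1.0, 1.2, 0.0],
          [3.0, 0.0, 2.5]]
scores, loadings = nipals_pca(X_demo, 2)
print(loadings[0])
```

In this layout each loading vector has one entry per variable (F1_1, F1_2, ...), and each score vector has one entry per sequence.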
I had a look at the variable importance and got a table listing all the variables (F1_1, F1_2, ...), their power (ranging from 0 to 1) and their importance in the analysis, all in descending order.
Then I had a look at the loadings matrix. Here I found 11 variables (the PCs) as columns and cases on the rows, with both negative and positive values.
The score matrix is composed of 11 variables (the PCs) and the sequence names.
How could I identify the residues that determine specificity? I think the residues that discriminate most between groups are probably the ones that determine specificity.
Should I perform a PCA with a categorical variable to group the proteins into subfamilies?
Should I perform a discriminant analysis on the PCA resulting matrix?
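To illustrate what I mean by "discriminating between groups", one simple possibility (a sketch, not a full discriminant analysis) is to rank each variable by its between-group to within-group sum of squares. This assumes the subfamily of every sequence is already known. The names `fisher_ratio` and `rank_variables`, the labels and the toy data are hypothetical, and the ratio looks at one variable at a time, so it ignores correlations between positions.

```python
def fisher_ratio(values, labels):
    """Between-group over within-group sum of squares for one variable."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand_mean = sum(values) / len(values)
    between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                  for vs in groups.values())
    within = sum((v - sum(vs) / len(vs)) ** 2
                 for vs in groups.values() for v in vs)
    if between == 0:
        return 0.0           # no separation at all (e.g. a constant column)
    if within == 0:
        return float("inf")  # groups differ and are internally identical
    return between / within


def rank_variables(names, rows, labels):
    """Return (name, ratio) pairs sorted from most to least discriminating."""
    ratios = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        ratios.append((name, fisher_ratio(column, labels)))
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): 4 sequences, 3 variables, 2 known subfamilies.
demo_names = ["F1_1", "F1_2", "F2_1"]
demo_rows = [[1.0, 0.2, -0.5],
             [1.1, -0.3, -0.5],
             [-0.9, 0.1, -0.5],
             [-1.0, -0.2, -0.5]]
demo_labels = ["sub1", "sub1", "sub2", "sub2"]
ranked = rank_variables(demo_names, demo_rows, demo_labels)
for name, ratio in ranked:
    print(name, ratio)
```

In the toy data only F1_1 separates the two subfamilies, so it comes out on top; positions whose five variables all rank high would be candidates for specificity-determining residues.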
Jalview has a PCA option for MSAs; see if it helps.