I would like to automate checking whether one or more sequences within a cluster of seemingly homologous sequences in my MSA (in other words, sequences grouped together by a multiple sequence alignment (MSA) algorithm) genuinely belong to that group. The goal is to decide whether they represent the same protein as the rest of their group, and therefore whether they should be collapsed into the same branch as their group on my tree.
I think PCA could be a way to do this, but I am not sure. My idea is to run PCA on each defined group in my MSA and then choose a threshold eigenvalue to determine which sequences can be grouped together, i.e. to automatically detect which sequences may represent a different protein from the one represented by the group the MSA assigned them to.
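To make the idea concrete, here is a rough sketch of what I have in mind, not a tested pipeline. Everything in it is my own assumption: the one-hot encoding of alignment columns, the helper names (`one_hot`, `pca_scores`, `outlier_distances`, `flag_outliers`), the toy sequences, and the cutoff of 3.5. Note that an eigenvalue describes a component rather than an individual sequence, so instead of thresholding eigenvalues the sketch thresholds each sequence's distance from the group median in PC space, using a robust z-score based on the median absolute deviation (MAD).

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def one_hot(seqs):
    """Encode equal-length aligned sequences as an (n_seqs, L * 21) 0/1 matrix."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    k = len(ALPHABET)
    X = np.zeros((len(seqs), len(seqs[0]) * k))
    for row, seq in enumerate(seqs):
        for col, aa in enumerate(seq.upper()):
            # Assumption: unknown symbols (X, B, Z, ...) are treated as gaps.
            X[row, col * k + index.get(aa, index["-"])] = 1.0
    return X


def pca_scores(X, n_components=2):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # centre each column
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def outlier_distances(seqs, n_components=2):
    """Distance of each sequence from the group median in PC space."""
    scores = pca_scores(one_hot(seqs), n_components)
    return np.linalg.norm(scores - np.median(scores, axis=0), axis=1)


def flag_outliers(seqs, n_components=2, cutoff=3.5):
    """Indices of sequences whose robust z-score exceeds the (arbitrary) cutoff."""
    dist = outlier_distances(seqs, n_components)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    if mad < 1e-9:
        # Over half the group is equidistant: flag anything clearly farther out.
        return [int(i) for i in np.flatnonzero(dist - med > 1e-6)]
    robust_z = 0.6745 * (dist - med) / mad
    return [int(i) for i in np.flatnonzero(robust_z > cutoff)]


# Made-up toy group: five near-identical sequences and one unrelated one (index 5).
group = ["MKTAYIAK", "MKTAYIAK", "MKTAYIAR", "MKTGYIAK", "MKTAYLAK", "WQPLHDEC"]
print(flag_outliers(group))
```

The cutoff and the number of components are free parameters here, and I do not know how they should be chosen for real alignments, which is part of what I am asking.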
Please advise whether this is a good way to do it.