Hi all,
Basic question: I am interested in grouping a set of amino acid sequences into clusters that reflect evolutionary relationships.
I have a set of about 40 amino acid sequences from four yeast species. I want to know if there are any homologs (either orthologs between species or paralogs within a species) among the 40 sequences. The 40 sequences include 4 sequences (one from each species) which I identified as orthologs using phmmer. Additionally, I added three known mammal orthologs as a control.
I was advised to use Clustal Omega to align the sequences and then identify the clusters from the resulting cladogram. However, I am unsure how valid this method is when multiple non-homologous sequences are included. How can we trust the resulting MSA or any phylogeny based on it?
I used four aligners (Clustal Omega, MAFFT, T-Coffee, and MUSCLE). Each gives a different tree topology, although the three mammal sequences cluster together in all four trees and the four yeast homologs cluster together in two of them.
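In case it helps anyone reproduce the comparison, here is a minimal sketch of how "do these sequences group together in this tree" could be checked programmatically rather than by eye. It assumes each tree is available as a plain Newick string with unquoted names; the function names, the aligner labels, the toy trees, and the sequence IDs (mam1, y1, x1, ...) are all hypothetical placeholders, not my real data. The tree is treated as unrooted, so a group counts as clustered if either the group or its complement sits under a single internal node.

```python
def leaf_sets(newick):
    """Return (all leaf names, list of leaf sets under each internal node).

    Minimal parser: handles nesting, branch lengths and internal labels,
    but not quoted names or Newick comments.
    """
    s = newick.strip().rstrip(";")
    clades = []          # one frozenset of leaf names per internal node
    stack = []           # one growing set per currently open parenthesis
    all_leaves = set()
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "(":
            stack.append(set())
            i += 1
        elif ch == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
            i += 1
            # Skip an optional internal label, support value or branch length.
            while i < len(s) and s[i] not in ",()":
                i += 1
        elif ch == ",":
            i += 1
        else:
            # Read a leaf token up to the next delimiter, drop ":length".
            j = i
            while j < len(s) and s[j] not in ",()":
                j += 1
            name = s[i:j].split(":")[0].strip()
            if name:
                all_leaves.add(name)
                if stack:
                    stack[-1].add(name)
            i = j
    return all_leaves, clades


def groups_together(newick, taxa):
    """True if `taxa` forms one side of a split in the (unrooted) tree."""
    taxa = frozenset(taxa)
    all_leaves, clades = leaf_sets(newick)
    missing = taxa - all_leaves
    if missing:
        raise ValueError(f"not in tree: {sorted(missing)}")
    complement = frozenset(all_leaves - taxa)
    return any(c == taxa or c == complement for c in clades)


# Toy trees with placeholder IDs, standing in for the per-aligner trees.
trees = {
    "aligner_A": "(((mam1:0.1,mam2:0.1):0.2,mam3:0.3),((y1,y2),(y3,y4)),x1);",
    "aligner_B": "(((mam1,mam2),mam3),((y1,x1),(y2,(y3,y4))));",
}
mammals = {"mam1", "mam2", "mam3"}
yeast = {"y1", "y2", "y3", "y4"}

for label, nwk in trees.items():
    print(label, groups_together(nwk, mammals), groups_together(nwk, yeast))
```

This only answers whether a chosen group is recovered in each topology; it says nothing about branch support, so it does not by itself address whether the underlying alignment is trustworthy.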
I have also tried CD-HIT (using the lowest sequence identity threshold, 0.3). With this method, the only cluster identified contains the three mammal sequences.
tl;dr Any advice or suggestions for