Background
Earlier I asked a question on how to measure the "quality" of a multiple sequence alignment: Ab initio methods for inferring quality of Multiple Sequence Alignments
...there is also a duplicate: Multiple sequence alignment score
...and a different question about MSA similarity score: Similarity score of Multiple Sequence Alignment
So I have some tools to measure how "good" the multiple alignments are.
I can also make the MSAs "better" by using the scoring tools in the following way:
1. Measure the score of the initial MSA.
2. Remove the next sequence from the MSA, re-align, and re-measure the score.
3. If the score got worse, put the sequence back.
4. If the removal of every sequence has been tried then STOP, otherwise go to step 2.
This should work because of the garbage-in-garbage-out nature of MSA. Hopefully, if I filter the input so that it is no longer garbage, I should get a "good" output even for distantly related genes.
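To make the filtering loop above concrete, here is a minimal Python sketch. It is not tied to any particular aligner or scoring tool: `align` and `score` are hypothetical callables that you would wrap around your own tools, the names `greedy_filter` and `min_size` are made up for illustration, and it assumes that a higher score means a better alignment.

```python
from typing import Callable, Dict

Sequences = Dict[str, str]   # sequence id -> unaligned residues
Alignment = Dict[str, str]   # sequence id -> aligned (gapped) row


def greedy_filter(
    seqs: Sequences,
    align: Callable[[Sequences], Alignment],  # assumption: wraps your aligner
    score: Callable[[Alignment], float],      # assumption: higher is better
    min_size: int = 3,                        # assumption: never shrink below this
) -> Sequences:
    """Try removing each sequence once; put it back only if the score got worse."""
    kept = dict(seqs)
    best = score(align(kept))
    for name in list(seqs):
        if len(kept) <= min_size:
            break
        # Re-align and re-score without this sequence.
        trial = {k: v for k, v in kept.items() if k != name}
        trial_score = score(align(trial))
        if trial_score >= best:
            # Score did not get worse, so the removal stands.
            kept, best = trial, trial_score
    return kept
```

Note that a tie counts as "not worse", so the sequence stays removed, exactly as in the steps listed above. The result also depends on the order in which sequences are tried, which is one more reason to want an independent check of the final alignment.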
The Question
How can I verify that the MSAs actually did get better?
I'm interested in both closely and distantly related groups of proteins.
Things that I tried
...tried to think about.
- Verifying against the tiny fraction of protein groups that are known to be related to each other based on evidence from 3D structure superposition.
- Building a phylogenetic tree from the MSA and verifying the tree against the known taxonomy of the species, using the simple rule (assumption) that most genes, unless they were horizontally transferred, should have the same species ancestry as the whole organism.
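For the first idea, one common way to compare an alignment against a structure-based reference is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the alignment under test. Below is a minimal sketch of that idea. The function names are made up for illustration, and it assumes both alignments are given as dictionaries of equal-length gapped rows using `-` for gaps, with matching sequence ids.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pair = Tuple[str, int, str, int]  # (seq a, residue index in a, seq b, residue index in b)


def aligned_pairs(aln: Dict[str, str], gap: str = "-") -> Set[Pair]:
    """Collect every pair of residues that share a column."""
    names = sorted(aln)
    width = len(aln[names[0]])
    if any(len(aln[n]) != width for n in names):
        raise ValueError("all rows must have the same length")
    position = {n: 0 for n in names}  # next ungapped residue index per sequence
    pairs: Set[Pair] = set()
    for col in range(width):
        present = []
        for n in names:
            if aln[n][col] != gap:
                present.append((n, position[n]))
                position[n] += 1
        for (a, i), (b, j) in combinations(present, 2):
            pairs.add((a, i, b, j))
    return pairs


def sp_score(test: Dict[str, str], ref: Dict[str, str]) -> float:
    """Fraction of reference residue pairs reproduced by the test alignment."""
    shared = set(test) & set(ref)
    if len(shared) < 2:
        raise ValueError("need at least two sequences in common")
    ref_pairs = aligned_pairs({n: ref[n] for n in shared})
    test_pairs = aligned_pairs({n: test[n] for n in shared})
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(test_pairs & ref_pairs) / len(ref_pairs)
```

Because only sequences present in both alignments are compared, the same function can be applied before and after filtering, as long as the reference sequences survive the filter.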
We use a HMMER3-based method to update the groups of sequences (protein families). The search space is all of UniProt (15 million sequences as of April). Many different species get picked up. I thought of taking advantage of that for verification purposes.
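For the tree-versus-taxonomy idea, one simple measure is a rooted Robinson-Foulds-style distance: count the clades that appear in one tree but not the other. The sketch below is only an illustration under strong assumptions: trees are written as nested tuples of leaf labels, each gene-tree leaf has already been relabelled with its species, and each species appears exactly once. The function names are made up.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, tuple]  # a leaf label, or a tuple of subtrees


def clades(tree: Tree) -> Tuple[Set[FrozenSet[str]], FrozenSet[str]]:
    """Return the non-trivial clades of a rooted tree, plus its full leaf set."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found, all_leaves


def clade_distance(gene_tree: Tree, species_tree: Tree) -> int:
    """Number of clades present in exactly one of the two trees."""
    gene_clades, gene_leaves = clades(gene_tree)
    species_clades, species_leaves = clades(species_tree)
    if gene_leaves != species_leaves:
        raise ValueError("both trees must have the same leaf labels")
    return len(gene_clades ^ species_clades)
```

A distance of 0 means the gene tree reproduces the taxonomy exactly. A lower distance after filtering than before would be weak evidence that the alignment improved, subject to the horizontal-transfer caveat above.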