I've derived a novel algorithm for flat clustering of protein sequences that is part of a recent analysis I performed.
I'm completely stumped on how to validate it, as I don't have an external criterion to compare my results against:
1) What is an acceptable re-sampling scheme?
Would it be acceptable to split the alignment into two randomly selected sets of residue columns in the MSA?
2) Does there exist an external measure of cluster quality for sequence information?
I've read: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
It appears that there is no great internal measure of cluster quality. Is there any measure for sequences? My sequences contain orthologs and paralogs, so I'm hesitant to compare the clusters to phylogenetic data.
I would say that it is not acceptable to split the alignment into two randomly selected sets of residue columns in the MSA (Q1), because the residues in one half of the split, or the way they are aligned, may influence the alignment in the other half.
One measure of quality is to compare against known protein structures, because sequence begets structure and structure begets function: do different groups of rows in the alignment correspond to different protein structures or to the same one? If any members of the MSA have solved structures, you can begin this comparison.
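If structural classes are available for even a subset of the sequences, the comparison can be made quantitative with external measures such as purity and the Rand index, both described on the Stanford IR book page linked in the question. The following is only a minimal sketch: the cluster assignments and fold labels are made-up toy data, and the function names `purity` and `rand_index` are my own, not from any library. Only sequences with a known structural class would go into the two lists.

```python
from collections import Counter, defaultdict
from itertools import combinations


def purity(cluster_labels, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    members = defaultdict(list)
    for cluster, cls in zip(cluster_labels, class_labels):
        members[cluster].append(cls)
    # For each cluster, count how many items carry its most common class.
    majority = sum(
        Counter(classes).most_common(1)[0][1] for classes in members.values()
    )
    return majority / len(cluster_labels)


def rand_index(cluster_labels, class_labels):
    """Fraction of item pairs on which the clustering and the classes agree."""
    pairs = list(combinations(range(len(cluster_labels)), 2))
    agree = sum(
        (cluster_labels[i] == cluster_labels[j])
        == (class_labels[i] == class_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data: six sequences with solved structures.
clusters = [0, 0, 0, 1, 1, 1]            # output of the clustering algorithm
folds = ["a", "a", "b", "b", "b", "b"]   # structural class of each sequence

print(purity(clusters, folds))
print(rand_index(clusters, folds))
```

Purity rewards many small clusters, so it is worth reading it alongside a pair-based measure like the Rand index rather than on its own.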
Thanks, this helps. SCOP classifies them together, but I can make a case for their differences, as there's an insertion in one of the domains.
You're welcome. That insert is exactly the kind of discriminator that makes sense from both a computational/algorithmic perspective and a biological-function perspective.