Validation Of Sequence Clustering
13.2 years ago
Jake Mick ▴ 50

I've derived a novel algorithm for flat clustering of protein sequences that is part of a recent analysis I performed.

I'm completely stumped on validation, as I don't have an external criterion to compare my results against:

1) What is an acceptable re-sampling scheme? Would it be acceptable to split the MSA into two sets of randomly selected residue columns?

2) Does there exist an external measure of cluster quality for sequence information? I've read: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html It appears that there is no great internal measure of cluster quality. Is there any measure for sequences? My sequences contain orthologs and paralogs, so I'm hesitant to compare the clustering to phylogenetic data.

Thanks!

sequence clustering pipeline • 2.6k views
13.2 years ago

I would say that it is not acceptable to split the alignment into two sets of randomly selected residue columns (Q1), because the residues in one half of the split, or the way they are aligned, may influence the alignment in the other half.

One measure of quality is to compare to known protein structure because sequence begets structure and structure begets function: Do different groups of rows in the alignment correspond to different or the same protein structure? If any members of the MSA have solved structures, you can begin this comparison.
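A rough sketch of how that comparison could be scored, assuming you can assign a structural class label to each sequence that has a solved structure. It computes purity, one of the external measures described in the Stanford IR book chapter linked in the question. The function name `purity` and the toy labels are made up for illustration and are not from the original analysis.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, structure_labels):
    """Fraction of sequences that carry the majority structural label of their cluster."""
    if len(cluster_ids) != len(structure_labels):
        raise ValueError("need one structural label per clustered sequence")
    if not cluster_ids:
        raise ValueError("no sequences to score")

    # Group the structural labels by the cluster each sequence was assigned to.
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, structure_labels):
        labels_by_cluster[cluster].append(label)

    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six sequences with solved structures.
clusters = [0, 0, 0, 1, 1, 2]           # output of the sequence clustering
folds = ["a", "a", "b", "b", "b", "c"]  # structural class of each sequence

score = purity(clusters, folds)  # 5 of 6 sequences match their cluster's majority class
```

Only sequences with a solved structure can be scored this way. Purity also rewards over-splitting (one sequence per cluster trivially scores 1.0), so it is worth reporting alongside the number of clusters.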

Thanks, this helps. SCOP classifies them together, but I can make a case for their differences as there's an insertion in one of the domains.

You're welcome. That insert is exactly the kind of discriminator that makes sense from both a computational/algorithmic perspective and a biological-function one.

13.2 years ago
Casbon ★ 3.3k

I used SCOP superfamilies. As Larry says, you can't go wrong with a manually curated structural database when testing a sequence-based clustering.
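For anyone who wants a single number out of that comparison, here is a minimal sketch of the adjusted Rand index between cluster assignments and SCOP superfamily labels. This is an illustration under assumptions, not the exact procedure I used: the function name and the superfamily labels in the example are hypothetical, and you would supply the real SCOP assignments yourself.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same sequences."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both labelings must cover the same sequences")
    if n < 2:
        raise ValueError("need at least two sequences")

    # Pairs of sequences placed together in both partitions, in A only, in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2          # agreement of a perfect match

    if maximum == expected:
        # Only happens when both partitions are all singletons or one big cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Hypothetical example: four sequences, two clusters, two superfamilies.
clusters = [0, 0, 1, 1]
superfamilies = ["sf1", "sf1", "sf2", "sf2"]  # placeholder labels, not real SCOP ids

ari_match = adjusted_rand_index(clusters, superfamilies)
ari_mismatch = adjusted_rand_index(clusters, ["sf1", "sf2", "sf1", "sf2"])
```

A value near 1 means the clusters reproduce the superfamilies, a value near 0 means agreement no better than chance, and negative values mean worse than chance. Unlike purity, this measure is not inflated by splitting everything into tiny clusters.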
