Question

Validation Of Sequence Clustering

1

13.6 years ago

Jake Mick ▴ 50

I've developed a novel algorithm for flat clustering of protein sequences as part of a recent analysis.

I'm completely stumped on validation, as I don't have an external criterion to compare my results against:

1) What is an acceptable re-sampling scheme? Would it be acceptable to split the MSA into two randomly selected sets of residue columns?

2) Is there an external measure of cluster quality for sequence data? I've read: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html It appears that there is no great internal measure of cluster quality. Is there any such measure for sequences? My sequences contain orthologs and paralogs, so I'm hesitant to compare the clustering to phylogenetic data.

Thanks!

sequence clustering pipeline • 2.9k views

updated 13.6 years ago by Casbon ★ 3.3k • written 13.6 years ago by Jake Mick ▴ 50

score 1 · Answer 1 · 2011-09-29

1

13.6 years ago

Larry_Parnell 16k

I would say that it is not acceptable to split the alignment into two randomly selected sets of columns (Q1), because residues in one half of the split, or the way they are aligned, may influence the alignment in the other half.

One measure of quality is to compare against known protein structure, because sequence begets structure and structure begets function: do the different groups of rows in the alignment correspond to different protein structures or to the same one? If any members of the MSA have solved structures, you can begin this comparison.
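To make that comparison concrete, here is a minimal sketch of one external measure, purity, assuming each sequence already has a cluster ID from the algorithm and a structural class label from its solved structure. The function name `cluster_purity` and the toy labels are hypothetical and only for illustration; this is not the asker's method.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, structure_labels):
    """Fraction of sequences that carry the majority structural label of their cluster."""
    if len(cluster_ids) != len(structure_labels):
        raise ValueError("need one structural label per clustered sequence")
    if not cluster_ids:
        raise ValueError("no sequences given")

    # Collect the structural labels observed in each cluster.
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, structure_labels):
        members[cluster].append(label)

    # For each cluster, count how many members share its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six sequences, three clusters, two structural classes.
clusters = ["c1", "c1", "c1", "c2", "c2", "c3"]
structures = ["a.1", "a.1", "b.2", "b.2", "b.2", "a.1"]
purity = cluster_purity(clusters, structures)
print(f"purity = {purity:.3f}")
```

Purity rises trivially as the number of clusters grows (one sequence per cluster gives 1.0), so it is best read alongside the cluster count rather than on its own.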

0

Thanks, this helps. SCOP classifies them together, but I can make a case for their differences, as there's an insertion in one of the domains.

13.6 years ago by Jake Mick ▴ 50

0

You're welcome. That insertion is exactly the kind of discriminator that makes sense from both a computational/algorithmic perspective and a biological-function perspective.

13.6 years ago by Larry_Parnell 16k

score 1 · Answer 2 · 2011-09-29

1

13.6 years ago

Casbon ★ 3.3k

I used SCOP superfamilies. As Larry says, you can't go wrong with a manually curated structural database when testing a sequence-based clustering.
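As a rough sketch of how such a test could be scored, the adjusted Rand index (a chance-corrected form of the Rand index) compares the clustering with the superfamily assignment pair by pair. The code below assumes you have already extracted one superfamily label per sequence; `adjusted_rand_index` and the toy labels are hypothetical names and data for illustration, not part of SCOP or of any particular library.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(predicted, reference):
    """Chance-corrected pair-counting agreement between two flat partitions."""
    if len(predicted) != len(reference):
        raise ValueError("both labelings must cover the same sequences")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two sequences to count pairs")

    # Pairs placed together in both partitions, and in each partition alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(predicted, reference)).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    together_ref = sum(comb(c, 2) for c in Counter(reference).values())

    expected = together_pred * together_ref / comb(n, 2)
    maximum = (together_pred + together_ref) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy data: the clustering merges two superfamilies into one cluster.
my_clusters = [0, 0, 1, 1]
superfamilies = ["sf1", "sf1", "sf2", "sf3"]
ari = adjusted_rand_index(my_clusters, superfamilies)
print(f"ARI = {ari:.3f}")
```

A value near 1 means the clusters track the superfamilies closely, and a value near 0 means the agreement is about what random labeling would give. A lower score is not automatically a failure: as in the insertion case discussed above, the clustering may be splitting a superfamily for a defensible reason.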
