Hi All,
I have a set of 400 nucleotide sequences that I want to cluster on the basis of similarity. I am expecting similarity <=45% among members of a cluster. There will also be a few sequences that show no similarity to any other member. Is there a clustering approach that lets us set a cut-off for the relation (similarity) between members and keeps the members with very low similarity in an "unclustered" set?
I have generated a percentage identity matrix (400 x 400) using Clustal-Omega, but clustering this matrix with the "affinity-propagation" approach is not giving good results.
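For reference, one approach that fits this description is hierarchical clustering cut at a fixed distance threshold, with singleton clusters reported as "unclustered". Below is a minimal sketch, not a definitive recipe. It assumes the matrix is loaded as a square NumPy array of percent identities (0 to 100), that distance can be taken as 100 minus identity, and that the cut-off is meant as a minimum pairwise identity. The function name `cluster_with_cutoff` and its parameters are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def cluster_with_cutoff(identity, min_identity=45.0, method="complete"):
    """Cluster a percent identity matrix with a hard similarity cut-off.

    Returns (clusters, unclustered): clusters is a list of index lists with
    at least two members each, unclustered is a list of singleton indices.
    """
    identity = np.asarray(identity, dtype=float)

    # Assumption: distance = 100 - percent identity.
    dist = 100.0 - identity
    # Force symmetry and a zero diagonal so squareform accepts the matrix.
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)

    # SciPy's linkage expects the condensed (upper-triangle) form.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=method)

    # Cut the tree where distance exceeds 100 - min_identity.
    # With complete linkage, every pair inside a cluster then has
    # identity >= min_identity.
    labels = fcluster(tree, t=100.0 - min_identity, criterion="distance")

    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(int(lab), []).append(idx)

    clusters = [members for members in groups.values() if len(members) > 1]
    unclustered = sorted(m[0] for m in groups.values() if len(m) == 1)
    return clusters, unclustered
```

Calling `cluster_with_cutoff(matrix, min_identity=45.0)` returns the multi-member clusters and the indices left unclustered. Switching `method` to "average" or "single" relaxes the guarantee: the cut-off then applies to the average or the nearest pair rather than to every pair in a cluster.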
P.S. I have already tried "cd-hit" and "uclust", but they are not recommended when the expected sequence similarity is below 70%.
Bade
@Damian - Thanks for the suggestion. What would you suggest instead of percent identity? The sequences are not protein-coding genes. Would using the distance matrix from Clustal-Omega be better?