7.7 years ago
bbb
Which clustering method is best for clustering DNA from different species based on alignment information (matches, deletions, insertions)? For example, with a reference sequence of 4000 bp, the feature set is 4000 * |{bases from reads that matched exactly, insertions, deletions}| = 4000 * 3 = 12000 features.
What about CD-HIT?
CD-HIT is very good at removing redundancy but is not adequate for clustering. I didn't fully understand the question, though. For clustering you need a similarity or distance metric.
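To make the point about needing a metric concrete, here is a minimal sketch of how per-position alignment counts could be turned into a distance matrix. Everything in it is an assumption for illustration: the `profiles` data, the sample names, the per-position normalisation, and the choice of Euclidean distance are not from the original question, and another metric may suit real data better.

```python
from math import sqrt

# Hypothetical toy input: for each sample, one (matches, insertions, deletions)
# count triple per reference position. A 4000 bp reference would give 4000 triples,
# i.e. the 12000 features mentioned in the question.
profiles = {
    "species_A": [(10, 0, 0), (9, 1, 0), (0, 0, 10)],
    "species_B": [(10, 0, 0), (8, 2, 0), (1, 0, 9)],
    "species_C": [(0, 0, 10), (0, 5, 5), (10, 0, 0)],
}


def flatten(profile):
    """Flatten per-position triples into one vector of per-position fractions."""
    vector = []
    for counts in profile:
        total = sum(counts)
        # Positions without coverage contribute zeros (an assumption of this sketch).
        vector.extend(c / total if total else 0.0 for c in counts)
    return vector


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


names = sorted(profiles)
vectors = [flatten(profiles[name]) for name in names]

# Symmetric pairwise distance matrix, rows and columns ordered as in `names`.
distance_matrix = [[euclidean(u, v) for v in vectors] for u in vectors]
```

Normalising each position to fractions keeps highly covered positions from dominating the distance; whether that is desirable depends on the data.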
Starting with distance matrices, affinity propagation clustering has worked quite nicely for me.
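For anyone who wants to try this, a rough sketch using scikit-learn's `AffinityPropagation` with a precomputed matrix follows. The toy `points` and the resulting `distances` are made up for illustration, and converting distances to similarities via negative squared distance is one common choice rather than the only one. Affinity propagation expects similarities (larger means more alike), so a distance matrix has to be converted first.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical toy data with two obvious groups; in practice you would start
# from your own precomputed distance matrix instead of building one here.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Convert distances to similarities: negative squared distance is an assumption
# of this sketch, chosen because it is a common convention.
similarities = -(distances ** 2)

# affinity="precomputed" tells scikit-learn to treat the input as a similarity matrix.
# The default preference (median similarity) controls how many clusters emerge.
model = AffinityPropagation(affinity="precomputed", random_state=0)
labels = model.fit_predict(similarities)

# Indices of the samples chosen as cluster exemplars.
exemplars = model.cluster_centers_indices_
```

The number of clusters is not set in advance; lowering the `preference` parameter tends to give fewer clusters, raising it tends to give more.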