I have been dealing with a new issue regarding clustering/classifying sequences.
I need an algorithm that can cluster a huge number of short sequences. To simulate the problem, I have made some visuals and a small set of sequences.
From the image it should be clear that there are two conserved sequences, shown in blue and orange. The green shows variations.
What I need is to label the sequences based on which conserved sequence they contain, so that there will be two labels; label A: 0, 1, 2, 5, 6, 7, 9, 10 and label B: 0, 3, 4, 5, 7, 8, 11. A sequence that contains both conserved sequences (0, 5 and 7 here) gets both labels.
If I build a similarity matrix based on pairwise alignment, I get this matrix:
But I am not sure how to process the matrix further to label the sequences. Does anyone have any ideas?
I am not sure I get what you're trying to achieve. If you want to identify conserved sequences then the standard approach is to use multiple sequence alignments. If you already know the conserved sequences and want to find out if they are present in your sequences, you could simply assess similarity of each test sequence to each of the conserved sequences. A pairwise similarity matrix only tells you how closely related two sequences are but not what kind of motif/sequence they share.
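For the second case (conserved sequences already known), here is a minimal sketch in Python of what "assess similarity of each test sequence to each of the conserved sequences" could look like. It is only an illustration, not a tuned tool: it scores each conserved sequence against the best-matching substring of each test sequence using edit distance, and assigns the label when the distance is within a threshold. The function names, the motif strings, the test sequences and the `max_dist` threshold are all made up for the example; a real pipeline would more likely use a proper local aligner with a substitution matrix.

```python
def best_substring_distance(motif, seq):
    """Smallest edit distance between `motif` and any substring of `seq`."""
    # Row 0 is all zeros: a match may start anywhere in seq at no cost.
    prev = [0] * (len(seq) + 1)
    for i in range(1, len(motif) + 1):
        # Column 0: i motif characters aligned against nothing costs i.
        cur = [i] + [0] * len(seq)
        for j in range(1, len(seq) + 1):
            mismatch = 0 if motif[i - 1] == seq[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + mismatch,  # match or substitution
                prev[j] + 1,             # motif character missing from seq
                cur[j - 1] + 1,          # extra character in seq
            )
        prev = cur
    # The match may also end anywhere in seq, so take the best end position.
    return min(prev)


def label_sequences(seqs, motifs, max_dist=1):
    """Map each motif name to the indices of sequences that contain it."""
    labels = {name: [] for name in motifs}
    for idx, seq in enumerate(seqs):
        for name, motif in motifs.items():
            if best_substring_distance(motif, seq) <= max_dist:
                labels[name].append(idx)
    return labels


# Hypothetical toy data, NOT the sequences from the question.
motifs = {"A": "ACGTACGT", "B": "TTGGCCAA"}
seqs = [
    "GGACGTACGTCC",      # contains motif A exactly
    "CCTTGGCCAAGG",      # contains motif B exactly
    "ACGTACGTTTGGCCAA",  # contains both motifs
    "GGACGAACGTCC",      # motif A with one substitution
    "GGGGGGGGGGGG",      # contains neither
]

print(label_sequences(seqs, motifs, max_dist=1))
```

A sequence can end up under both labels, which matches what you describe for 0, 5 and 7. The threshold controls how much variation is tolerated, so it would need to be chosen from how variable the green regions actually are.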