Question

Multilabel clustering/classification based on sequence similarity


7.7 years ago

odoluca ▴ 20

I am dealing with a new problem involving clustering/classifying sequences.

I need an algorithm that can cluster a huge number of short sequences. To illustrate the problem, I have made some figures and a small set of example sequences.

fig1 http://uploads.im/zxAZC.png

From the image it should be clear that there are two conserved sequences, the blue and the orange. The green shows variations.

What I need is to label the sequences according to which conserved sequence they contain, so that there will be two labels; label A: 0, 1, 2, 5, 6, 7, 9, 10 and label B: 0, 3, 4, 5, 7, 8, 11. Note that a sequence can carry both labels (e.g. 0, 5 and 7).

If I build a similarity matrix based on pairwise alignments, I get this matrix:

fig2 http://uploads.im/1jsJx.png

But I am not sure how to process the matrix further to label the sequences. Does anyone have any ideas?

sequence pairwise clustering classification • 1.7k views

updated 7.7 years ago by Biostar 20 • written 7.7 years ago by odoluca ▴ 20


I am not sure I get what you're trying to achieve. If you want to identify conserved sequences then the standard approach is to use multiple sequence alignments. If you already know the conserved sequences and want to find out if they are present in your sequences, you could simply assess similarity of each test sequence to each of the conserved sequences. A pairwise similarity matrix only tells you how closely related two sequences are but not what kind of motif/sequence they share.
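To make the second suggestion concrete, here is a minimal sketch in plain Python, assuming the conserved sequences are already known. It scores each test sequence against each conserved sequence with a simple Smith-Waterman local alignment and assigns every label whose score passes a threshold, so a sequence can end up with both labels or none. The sequences, the scoring values (match 1, mismatch -1, gap -1) and the 0.7 cutoff are all made up for illustration and are not taken from the figures in the question; for a huge data set you would want an optimized aligner rather than this pure-Python loop.

```python
def local_alignment_score(query, target, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def label_sequences(sequences, conserved, min_fraction=0.7):
    """Give each sequence every label whose conserved sequence it contains.

    A label is assigned when the local alignment score reaches
    min_fraction of the best possible score (a perfect match of the
    whole conserved sequence). min_fraction is an arbitrary assumption.
    """
    labels = {}
    for name, seq in sequences.items():
        labels[name] = [
            label
            for label, motif in conserved.items()
            if local_alignment_score(motif, seq) >= min_fraction * len(motif)
        ]
    return labels


# Hypothetical conserved sequences and test sequences (illustration only).
conserved = {"A": "ACGTACGTGG", "B": "TTGCCATTCA"}
sequences = {
    "s0": "GGACGTACGTGGCCTTGCCATTCAAT",  # contains both A and B
    "s1": "TTACGTACGTGGGA",              # contains A
    "s2": "TTACGTACCTGGGA",              # contains A with one mismatch
    "s3": "CATTGCCATTCAGG",              # contains B
    "s4": "GGGGGGGGGGGG",                # contains neither
}

labels = label_sequences(sequences, conserved)
for name, assigned in labels.items():
    print(name, assigned)
```

If the conserved sequences are not known in advance, this does not apply directly; you would first have to find them, e.g. from a multiple sequence alignment as mentioned above.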

7.7 years ago by Jean-Karim Heriche 27k