Question

Sequence Clustering Based On Similarity

11.5 years ago

SK ▴ 110

I have 8000 protein sequences that I want to cluster based on similarity (not identity), selecting the longest sequence from each cluster as its representative. I checked several tools such as HiFix, SiliX, and ClusTR but could not find a suitable solution. I want to cluster the way CD-Hit does to reduce the dataset, but based on sequence similarity rather than sequence identity.

clustering phylogenetics r sequence • 5.5k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 11.5 years ago by SK ▴ 110

Are you asking what the best way to cluster protein sequences is, or are you just looking for any tool that does what you want? I would assume you could use any tool that clusters on similarity, and then write a script that selects the longest member of each cluster, no?

ADD REPLY • link 11.5 years ago by KCC ★ 4.1k
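
The "script that selects the longest member of each cluster" could look something like the sketch below. It assumes the clusters are already available as lists of sequence IDs (whatever tool produced them) and that the IDs match the first word of each FASTA header. The function names `read_fasta_lengths` and `longest_representatives` are made up for this illustration.

```python
def read_fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def longest_representatives(clusters, lengths):
    """Pick the longest member of each cluster.

    Ties are broken by the alphabetically smallest ID so the result
    is deterministic.
    """
    return [min(members, key=lambda s: (-lengths[s], s)) for members in clusters]
```

With a real file, `read_fasta_lengths(open("proteins.fasta"))` would give the length table (the file name is only a placeholder).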

Do you mean you want to cluster based on amino acid properties rather than on identical protein sequences? What are you referring to when you say 'similarity'?

ADD REPLY • link 11.5 years ago by Damian Kao 16k

If you know a programming language, you could implement this rather easily. You could BLAST your sequences against themselves and create your clusters based on similarity (e-value or % identity). Then it's only a matter of choosing the longest sequence in each cluster.

ADD REPLY • link 11.5 years ago by Eric Normandeau 11k
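
One way the all-vs-all BLAST idea above might be sketched: read tabular BLAST output (`-outfmt 6`, default 12 columns, with the e-value in column 11) and merge any two sequences joined by a hit below an e-value cutoff. This is single-linkage clustering with a union-find structure, which is an assumption of this sketch rather than something the comment specifies; it can chain distantly related sequences into one cluster. The function name `cluster_blast_hits` and the `1e-5` default cutoff are also illustrative choices.

```python
def cluster_blast_hits(hit_lines, max_evalue=1e-5):
    """Single-linkage clusters from BLAST tabular (-outfmt 6) lines."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        # Register both IDs even if the hit is too weak to link them.
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_s] = root_q

    clusters = {}
    for seq_id in parent:
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return sorted(sorted(members) for members in clusters.values())
```

Sequences that never appear in the hit file (not even as a self-hit) are absent from the result and would need to be added as singleton clusters. Picking the longest member of each cluster is then a `max` over sequence lengths.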

score 0 · Answer 1 · 2016-10-10

8.2 years ago

Pablo Marin-Garcia ★ 2.0k

This question was cross-posted and answered at http://seqanswers.com/forums/showthread.php?t=31750

Perhaps USEARCH [1] or, if you want something much more complicated, OrthoMCL [2].

[1] http://www.drive5.com/usearch/
[2] http://orthomcl.org/common/downloads/software/v2.0/

ADD COMMENT • link 8.2 years ago by Pablo Marin-Garcia ★ 2.0k