Sequence Clustering Based On Similarity
11.5 years ago
SK ▴ 110

I have 8000 protein sequences that I want to cluster based on similarity (not identity), selecting the longest sequence from each cluster as its representative. I checked several tools such as HiFix, SiliX, and ClusTR but could not find a suitable solution. I want to reduce the dataset the way CD-HIT does, but based on sequence similarity rather than sequence identity.

clustering phylogenetics r sequence • 5.5k views

Are you asking what is the best way to cluster protein sequences or are you just looking for any tool that does what you want? I would assume you could use any tool that clustered on similarity, and then write a script that selects the longest member of each cluster, no?
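The "select the longest member" step is independent of whichever clustering tool is used. A minimal sketch in Python, assuming the tool's output has already been parsed into a mapping from cluster ID to member IDs (the names `clusters`, `sequences`, and `longest_representatives` are hypothetical, not part of any tool):

```python
def longest_representatives(clusters, sequences):
    """Return {cluster_id: member_id}, picking the longest sequence per cluster.

    clusters:  dict mapping a cluster ID to a list of sequence IDs (assumed input)
    sequences: dict mapping a sequence ID to its amino acid string (assumed input)
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        # Compare by length first; ties go to the lexicographically largest ID
        # so that the result is deterministic.
        representatives[cluster_id] = max(
            members, key=lambda member: (len(sequences[member]), member)
        )
    return representatives


if __name__ == "__main__":
    sequences = {"a": "MKV", "b": "MKVLAA", "c": "MK"}
    clusters = {0: ["a", "b"], 1: ["c"]}
    print(longest_representatives(clusters, sequences))
```

Parsing the cluster file itself depends on the tool's output format, so that part is left out here.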

Do you mean you want to cluster based on amino acid properties rather than identical protein sequences? What are you referring to when you say 'similarity'?

If you know a programming language, you could implement this rather easily. You could blast your sequences against themselves and create your clusters based on similarity (e-value or % identity). Then, it's only a matter of choosing the longest.
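A rough sketch of that approach in Python, assuming the all-vs-all BLAST results are in the standard 12-column tabular format (`-outfmt 6`, where column 11 is the e-value) and that sequence lengths are already known. The function name `pick_representatives` and the `1e-5` cutoff are illustrative assumptions, not recommendations. Clusters are built by single linkage (any hit below the cutoff joins two sequences), which is only one of several possible choices:

```python
from collections import defaultdict


def pick_representatives(hit_lines, lengths, max_evalue=1e-5):
    """Single-linkage clustering of BLAST hits; returns the longest ID per cluster.

    hit_lines:  iterable of tab-separated BLAST tabular (-outfmt 6) lines
    lengths:    dict mapping every sequence ID to its length (assumed input)
    max_evalue: hits with a larger e-value are ignored (arbitrary cutoff)
    """
    # Union-find: every sequence starts as its own cluster.
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue  # ignore self-hits and weak hits
        if query in parent and subject in parent:
            parent[find(query)] = find(subject)  # merge the two clusters

    clusters = defaultdict(list)
    for seq_id in lengths:
        clusters[find(seq_id)].append(seq_id)

    # Longest member of each cluster; ties go to the larger ID.
    return sorted(
        max(members, key=lambda m: (lengths[m], m)) for members in clusters.values()
    )


if __name__ == "__main__":
    lengths = {"p1": 120, "p2": 300, "p3": 80, "p4": 50}
    hits = [
        "p1\tp2\t45.0\t110\t60\t1\t1\t110\t5\t114\t1e-30\t120",
        "p2\tp3\t38.0\t75\t46\t1\t10\t84\t1\t75\t1e-10\t60.1",
        "p3\tp4\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t20.0",
    ]
    print(pick_representatives(hits, lengths))
```

Single linkage can chain loosely related sequences into one large cluster, so the cutoff (or an added coverage filter) may need tuning for a real dataset.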

8.2 years ago

This question was cross-posted and answered at http://seqanswers.com/forums/showthread.php?t=31750

Perhaps USEARCH [1] or, if you want something much more complicated, OrthoMCL [2].

[1] http://www.drive5.com/usearch/
[2] http://orthomcl.org/common/downloads/software/v2.0/
