I have 8000 protein sequences that I want to cluster based on similarity (not identity) and then select the longest representative sequence from each cluster. I checked several tools such as HiFix, SiliX and ClusTR but could not find an optimal solution. I want to cluster the way CD-Hit does to reduce a dataset, but based on sequence similarity rather than sequence identity.
Are you asking what the best way to cluster protein sequences is, or are you just looking for any tool that does what you want? I would assume you could use any tool that clusters on similarity, and then write a script that selects the longest member of each cluster, no?
You mean you want to cluster based on amino acid properties rather than identical protein sequence? What are you referring to when you say 'similarity'?
If you know a programming language, you could implement this rather easily. You could BLAST your sequences against themselves (all-vs-all) and build your clusters from the hits, using e-value or % identity as the similarity cutoff. Then it's only a matter of choosing the longest sequence in each cluster.
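To make that concrete, here is a minimal sketch in Python of the "BLAST against themselves, cluster, keep the longest" idea. It is not a tested pipeline, and several things in it are my assumptions: the all-vs-all hits are in BLAST tabular format (-outfmt 6, default columns, so the e-value is the 11th field), clusters are built by single linkage (any hit passing the cutoff joins two sequences), and the 1e-5 cutoff is only a placeholder. The function names parse_blast_tabular and longest_representatives are made up for illustration.

```python
from collections import defaultdict


def parse_blast_tabular(lines):
    """Yield (query, subject, evalue) from BLAST tabular (-outfmt 6) lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Default outfmt 6 layout: qseqid, sseqid, ..., evalue (index 10), bitscore
        yield fields[0], fields[1], float(fields[10])


def longest_representatives(lengths, hits, max_evalue=1e-5):
    """Single-linkage clustering of sequences, keyed by the longest member.

    lengths: dict mapping sequence ID to sequence length
    hits: iterable of (query, subject, evalue) tuples
    max_evalue: assumed cutoff; hits with a larger e-value are ignored
    Returns a dict mapping each representative ID to its sorted cluster members.
    """
    # Union-find: every sequence starts in its own cluster.
    parent = {name: name for name in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if query == subject or evalue > max_evalue:
            continue  # ignore self-hits and weak hits
        if query in parent and subject in parent:
            parent[find(query)] = find(subject)  # merge the two clusters

    clusters = defaultdict(list)
    for name in lengths:
        clusters[find(name)].append(name)

    # Representative = longest sequence; ties are broken by sequence ID.
    return {
        max(members, key=lambda n: (lengths[n], n)): sorted(members)
        for members in clusters.values()
    }


if __name__ == "__main__":
    # Toy data, invented purely to show the call pattern.
    toy_lengths = {"a": 100, "b": 150, "c": 90, "d": 200}
    toy_lines = [
        "a\tb\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200",
        "b\tc\t60.0\t90\t36\t0\t1\t90\t1\t90\t1e-10\t120",
        "c\td\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t25",
    ]
    result = longest_representatives(toy_lengths, parse_blast_tabular(toy_lines))
    for rep, members in sorted(result.items()):
        print(rep, members)
```

Keep in mind that single linkage can chain distantly related sequences into one big cluster, so you may want a stricter cutoff or an extra coverage filter depending on what 'similarity' should mean for your data.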