I used both BlastClust and CD-HIT to cluster protein sequences. The CD-HIT output looks good, but it creates too many clusters (6000 clusters for 18000 protein sequences). When I used BlastClust, I got fewer clusters, but I found what looks like a major bug. The first cluster always contains the largest number of sequences, and all the others contain very few (1405 sequences in the first cluster and fewer than 40 in each of the others). When I checked the sequences in the first cluster manually, I found that they are not similar at all. The other clusters, apart from the first one, seem to be good. Has anybody had a similar problem? Should I use CD-HIT instead of BlastClust? Is there a better tool for protein sequence clustering?
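For anyone who wants to check the same numbers on their own run, here is a minimal sketch that summarizes cluster sizes from a CD-HIT `.clstr` file. It assumes the usual layout, where each cluster starts with a `>Cluster N` line followed by one line per member sequence. The function names and the sample records are made up for illustration.

```python
def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each cluster, in file order."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line: start counting a new cluster.
            sizes.append(0)
        elif sizes:
            # Any other non-empty line is one member of the current cluster.
            sizes[-1] += 1
    return sizes


def summarize(sizes):
    """Collect a few numbers that show how skewed the clustering is."""
    return {
        "clusters": len(sizes),
        "sequences": sum(sizes),
        "largest": max(sizes, default=0),
        "singletons": sum(1 for s in sizes if s == 1),
    }


# Made-up example records in .clstr layout (not real data).
example = """>Cluster 0
0\t350aa, >seqA... *
1\t348aa, >seqB... at 92.10%
>Cluster 1
0\t120aa, >seqC... *
""".splitlines()

print(summarize(cluster_sizes(example)))
```

On a real run you would pass the lines of the `.clstr` file instead of `example`, e.g. `cluster_sizes(open(path))` with your own path. A large share of singletons would suggest the threshold is splitting the families too finely.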
Have you tried simply increasing the identity threshold when using CD-HIT?
Also, what is the biological reasoning behind the clustering? It sounds like these sequences are already highly similar. Are you trying to divide them by a certain mutation or something?
Yes, I ran CD-HIT with sequence identity thresholds of 0.9, 0.8, 0.7, 0.6, 0.5 and 0.4. I am still getting a large number of clusters. I am clustering the homologous genes of three different viruses.
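One thing worth double-checking in a sweep like this: the CD-HIT user guide recommends matching the word size (`-n`) to the identity threshold (`-c`), namely 5 for 0.7 to 1.0, 4 for 0.6 to 0.7, 3 for 0.5 to 0.6 and 2 for 0.4 to 0.5. Below is a rough sketch that builds the command line for each threshold. It only prints the commands; the file names `proteins.fasta` and `clusters_*` are placeholders, and the helper names are my own.

```python
def word_size_for(identity):
    """Word size (-n) recommended in the CD-HIT user guide for a threshold (-c)."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit is not meant for thresholds below 0.4")


def build_command(fasta, out_prefix, identity):
    """Build the cd-hit argument list for one threshold (does not run it)."""
    return [
        "cd-hit",
        "-i", fasta,                              # input FASTA (placeholder name)
        "-o", f"{out_prefix}_{identity:.1f}",     # output prefix per threshold
        "-c", f"{identity:.1f}",                  # sequence identity threshold
        "-n", str(word_size_for(identity)),       # matching word size
    ]


for c in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
    print(" ".join(build_command("proteins.fasta", "clusters", c)))
```

To actually run a command, the list can be passed to `subprocess.run(cmd, check=True)`, assuming `cd-hit` is on the PATH.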