I am working on a machine learning project using SVMs. One of the steps in preparing my data sets is to reduce the sequence similarity within each class to 40%. I have compared CD-HIT and BLASTCLUST for this step. BLASTCLUST keeps more sequences than CD-HIT. It is tempting to use the BLASTCLUST output, since larger data sets are preferable for my work, but I am worried because BLASTCLUST has been deprecated and is not part of the BLAST+ package. Does anyone know why BLASTCLUST was deprecated? Or why I am getting significantly more clusters from BLASTCLUST than from CD-HIT?
In my understanding, BLASTCLUST and CD-HIT are algorithmically quite different. BLASTCLUST clusters by running exhaustive all-against-all pairwise BLAST alignments, which makes it slow but accurate. In contrast, CD-HIT uses heuristics to find high-identity segments, which makes it very fast but not as exact as BLASTCLUST.
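To make the algorithmic difference concrete, here is a minimal toy sketch, not a reimplementation of either tool. It assumes, based on my reading of the tools' documentation, that BLASTCLUST merges sequences by single-linkage over all-against-all hits and that CD-HIT clusters greedily against representative sequences, longest first. The `toy_identity` function and the three sequences are hypothetical stand-ins for a real alignment and real data. The only point is that the same input and the same 40% threshold can yield different cluster counts; on real data the direction and size of the difference also depend on coverage settings and on how each program defines identity.

```python
from itertools import combinations


def toy_identity(a, b):
    """Fraction of matching positions (hypothetical stand-in for a real alignment)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def single_linkage(seqs, threshold):
    """Compare all pairs; any pair at or above the threshold ends up in one cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if toy_identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(seqs))})


def greedy_incremental(seqs, threshold):
    """Longest first; a sequence joins an existing representative or becomes a new one."""
    representatives = []
    for s in sorted(seqs, key=len, reverse=True):
        if not any(toy_identity(s, r) >= threshold for r in representatives):
            representatives.append(s)
    return len(representatives)


# Hypothetical sequences: A~B and B~C at 50% identity, but A and C share nothing.
toy_seqs = ["MKVLAAGIVG", "MKVLATTTTT", "QQQQQTTTTT"]

print(single_linkage(toy_seqs, 0.4))      # chains A-B-C into one cluster
print(greedy_incremental(toy_seqs, 0.4))  # C never meets B, so it stays separate
```

In this toy case the chained sequence B links A and C under single-linkage, while the greedy scheme only ever compares C against the representative A. Real data sets can behave differently, so this should not be read as an explanation of any particular result.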
So I think the two programs have different target use cases. For instance, I use BLASTCLUST to cluster sequences from the PDB, since it is accurate and the number of sequences is not enormous (around 100,000 at the moment), so it only takes a few hours to run.
That's why I upvoted the question: in my opinion it is indeed an issue for the community that BLASTCLUST is now deprecated.
I am not certain why BLASTCLUST has been deprecated, but there are good archives of the legacy BLAST versions. It may have been deprecated because numerous other clustering programs, such as CD-HIT and UCLUST, have been developed or updated in the last five years, and no one has decided to develop or maintain new versions of BLASTCLUST.
I am not sure why you are getting significantly more clusters from BLASTCLUST than from CD-HIT. You did not tell us how large the difference between the BLASTCLUST and CD-HIT results is. Even with exactly the same input sequences, the two programs can produce different clusterings because their algorithms differ, and if the input sequences differ, the clusterings will of course differ too. Have you tried other clustering programs on exactly the same sequences?
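To quantify the difference, something like the following sketch could summarise both outputs side by side. It assumes the CD-HIT `.clstr` layout (a `>Cluster N` header followed by one line per member) and the BLASTCLUST layout (one cluster per line, IDs separated by whitespace) as I understand them; the function names and the example IDs (`seqA`, `seqB`, `seqC`) are hypothetical.

```python
def cdhit_cluster_sizes(clstr_text):
    """Count members per cluster in CD-HIT .clstr text (assumed layout)."""
    sizes = []
    for line in clstr_text.splitlines():
        if line.startswith(">Cluster"):
            sizes.append(0)      # a header line starts a new cluster
        elif line.strip() and sizes:
            sizes[-1] += 1       # every other non-empty line is one member
    return sizes


def blastclust_cluster_sizes(text):
    """Count members per cluster in BLASTCLUST output (one cluster per line)."""
    return [len(line.split()) for line in text.splitlines() if line.strip()]


def summarize(sizes):
    """Basic numbers worth reporting when comparing two clusterings."""
    return {
        "clusters": len(sizes),
        "sequences": sum(sizes),
        "singletons": sum(1 for s in sizes if s == 1),
        "largest": max(sizes, default=0),
    }


# Hypothetical outputs for the same three input sequences.
example_clstr = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 45.00%\n"
    ">Cluster 1\n"
    "0\t95aa, >seqC... *\n"
)
example_blastclust = "seqA\nseqB\nseqC\n"

print("CD-HIT:    ", summarize(cdhit_cluster_sizes(example_clstr)))
print("BLASTCLUST:", summarize(blastclust_cluster_sizes(example_blastclust)))
```

If the `sequences` totals differ between the two summaries, the inputs were not identical, and that should be ruled out before comparing cluster counts.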