Blastclust Has Been Deprecated. Does Anyone Know Why?
11.5 years ago
tyler.weirick ▴ 120

I am working on a machine learning project using SVMs. One of the steps in preparing my data sets is to reduce the sequence similarity within each class to 40%. I have compared CD-HIT and BLASTCLUST for this step, and BLASTCLUST keeps more sequences than CD-HIT. It is tempting to use the BLASTCLUST output, since larger data sets are preferable for my work, but I am worried because BLASTCLUST has been deprecated in the BLAST+ package. Does anyone know why BLASTCLUST was deprecated? Or why I am getting significantly more clusters from BLASTCLUST than from CD-HIT?

clustering blast+ • 8.4k views
10.4 years ago

As I understand it, BLASTCLUST and CD-HIT are algorithmically quite different. BLASTCLUST clusters by running exhaustive all-against-all pairwise BLAST alignments, which makes it slow but accurate. In contrast, CD-HIT uses heuristics to find high-identity segments, which makes it very fast but not as exact as BLASTCLUST.
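
To make the algorithmic difference concrete, here is a toy sketch. It is not the code of either tool. It assumes, as far as I know, that BLASTCLUST groups sequences by single linkage over its pairwise hits and that CD-HIT clusters greedily against one representative per cluster. The sequence names and the `similar` function are hypothetical stand-ins for real alignments.

```python
from itertools import combinations


def single_linkage(ids, similar):
    """Cluster as connected components of the similarity graph (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-against-all comparison: any similar pair merges two clusters.
    for a, b in combinations(ids, 2):
        if similar(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def greedy_incremental(ids, similar):
    """Each sequence joins the first representative it matches, else founds a cluster."""
    clusters = []  # first element of each list is the representative
    for i in ids:
        for cluster in clusters:
            if similar(cluster[0], i):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical toy data: A~B and B~C pass the threshold, A~C does not.
ids = ["A", "B", "C"]
pairs = {frozenset(("A", "B")), frozenset(("B", "C"))}


def similar(x, y):
    return frozenset((x, y)) in pairs


single = single_linkage(ids, similar)
greedy = greedy_incremental(ids, similar)
print(len(single), len(greedy))
```

On this chain, single linkage merges everything into one cluster, while the greedy pass leaves C on its own because it is only compared with the representative A. The direction of the difference on real data also depends on how each tool measures identity and coverage, so this only shows that the same sequences can legitimately yield different cluster counts.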

So I think the two programs target two different kinds of use case. For instance, I use BLASTCLUST to cluster sequences from the PDB: it is accurate, and the number of sequences is not so enormous (around 100,000 at the moment), so it only takes a few hours to run.

That's why I upvoted the question. In my opinion, it is indeed an issue for the community that BLASTCLUST is now deprecated.

11.5 years ago
Josh Herr 5.8k

I am not certain why BLASTCLUST has been deprecated, but there are good archives of legacy BLAST versions. It may have been deprecated because numerous clustering programs, such as CD-HIT and UCLUST, have changed in the last 5 years, and no one has decided to develop or maintain new versions of BLASTCLUST.

I am not sure why you are getting significantly more clusters with BLASTCLUST than with CD-HIT. You did not tell us how many more sequences your BLASTCLUST run keeps compared with your CD-HIT run. Even with exactly the same input sequences, the two programs can cluster differently because of their algorithmic differences, so if the input sequences differ, the clusterings will obviously differ too. Have you tried other clustering programs on the exact same sequences?
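
If it helps to put numbers on the difference, a small sketch like the following could count clusters in each output. It assumes the usual layouts as I understand them: a CD-HIT `.clstr` file with one `>Cluster N` header per cluster, and a BLASTCLUST output file with one cluster per line of space-separated identifiers. The sample text and sequence names are made up for illustration.

```python
def count_cdhit_clusters(clstr_text):
    """Count clusters in CD-HIT .clstr text (assumed: one '>Cluster' header each)."""
    return sum(1 for line in clstr_text.splitlines() if line.startswith(">Cluster"))


def count_blastclust_clusters(out_text):
    """Count clusters in BLASTCLUST output (assumed: one non-empty line per cluster)."""
    return sum(1 for line in out_text.splitlines() if line.strip())


# Hypothetical sample outputs, not real results.
cdhit_sample = (
    ">Cluster 0\n"
    "0\t120aa, >seq1... *\n"
    "1\t118aa, >seq2... at 85.00%\n"
    ">Cluster 1\n"
    "0\t95aa, >seq3... *\n"
)
blastclust_sample = "seq1 seq2\nseq3\nseq4\n"

n_cdhit = count_cdhit_clusters(cdhit_sample)
n_blastclust = count_blastclust_clusters(blastclust_sample)
print(n_cdhit, n_blastclust)
```

With real files you would read the text from disk first, then report both counts for the same input FASTA.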

11.5 years ago

You can try BLAST 2.2.14; it contains BLASTCLUST!
