I am using CD-HIT to cluster some protein sequences, and I would like to evaluate how well the clustering performs on my dataset. Is there a tool for this, given that I have benchmark clustering results for those sequences?
Also, is there a script available to collect the actual sequences from a CD-HIT result file, i.e. the sequences themselves instead of the names in results like the following?
>Cluster 0
0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
>Cluster 1
0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84%
2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... *
3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84%
4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%
UPDATE: for clustering performance evaluation, I am using scikit
Not a direct answer to your question, but hopefully this will be useful to you. CD-HIT clusters based only on percent identity, which may not be the best property if your goal is to group sequences by evolutionary relatedness. An example: if sequences A-B are 70% identical, B-C are 80% identical and A-C are 75% identical, chances are high that all of them are related to each other. If you cluster at 70% with CD-HIT they will end up in the same cluster, but not if you cluster at 80%.
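
To make the threshold effect in the A/B/C example concrete, here is a minimal Python sketch. It groups sequences by plain single-linkage on a fixed identity cutoff, which is an assumption chosen for illustration and not CD-HIT's own greedy algorithm; the function name `single_linkage` is made up for this example.

```python
# Pairwise percent identities from the A/B/C example.
identity = {("A", "B"): 70, ("B", "C"): 80, ("A", "C"): 75}

def single_linkage(names, identity, threshold):
    """Group names into connected components of the 'identity >= threshold' graph.

    Illustration only: this is simple single-linkage grouping, not CD-HIT.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair whose identity reaches the cutoff.
    for (a, b), pct in identity.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())

print(single_linkage("ABC", identity, 70))  # [['A', 'B', 'C']]
print(single_linkage("ABC", identity, 80))  # [['A'], ['B', 'C']]
```

At the 80% cutoff A is split off even though it is probably related to both B and C, which is the limitation of a hard identity threshold.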
If your goal is to actually cluster sequences by their relatedness, MCL is probably better than CD-HIT. There are scripts in that package that make it easy to do the clustering and to compare clustering solutions against a benchmark.
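
On the second part of the question (getting the actual sequences for each cluster), a small script along the following lines might be enough. It is a minimal sketch that assumes the .clstr layout shown in the question and a FASTA file whose headers start with the same names; the function names are made up for this example. Names in the .clstr file can be truncated, so the lookup falls back to prefix matching when there is no exact match.

```python
import re
from collections import defaultdict

# Matches member lines such as: "0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")

def parse_clstr(lines):
    """Map each cluster label ('Cluster 0', ...) to the sequence names listed under it."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)

def read_fasta(lines):
    """Map the first word of each FASTA header (without '>') to its sequence."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif line and name is not None:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}

def sequences_by_cluster(clusters, seqs):
    """Return {cluster: [(name, sequence), ...]}.

    Tries an exact name match first, then a prefix match in case the
    .clstr names were truncated. A truncated name can match several records.
    """
    out = {}
    for cluster, members in clusters.items():
        records = []
        for member in members:
            if member in seqs:
                records.append((member, seqs[member]))
            else:
                records.extend((n, s) for n, s in seqs.items() if n.startswith(member))
        out[cluster] = records
    return out
```

With real files this could be called as `sequences_by_cluster(parse_clstr(open("my.clstr")), read_fasta(open("my.fasta")))`, where both file names are placeholders. The prefix fallback scans every record, so for very large inputs an index on truncated names would be faster.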
Thanks! I will definitely try the MCL tool.
Could you please point me to the script that I can use to compare clustering results against the benchmark? I have both my clustering and the benchmark in the following TSV format (clusterId \t Name)
To compare MCL clusters you need to run clm in dist mode (clm dist). See here for a summary of various functions contained in the MCL package. Note that input files need to be in MCL's matrix format, not the tsv as you listed above.
The output of clustering evaluation looks like this:
The first line shows the numbers of clusters in reference (c1) and your solution, and d values represent distances - smaller is better. Second line shows Rand, Jaccard and adjusted Rand measures of clustering - closer to 1 is better.
Could you be more specific? There is an enormous amount of documentation on MCL and as a newbie to clustering, I'm finding it difficult to pick out what I actually need from this.
Similar to the original poster, I want to cluster my sequences by relatedness.
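
As a possible starting point that avoids the MCL matrix format, the Rand, Jaccard and adjusted Rand measures can be computed directly from two files in the clusterId \t Name layout described earlier in the thread. The following is a minimal pure-Python sketch using standard pair counting; it is not a reimplementation of clm dist, and the function names are made up for this example. It needs Python 3.8 or later for `math.comb`, and only names present in both clusterings are compared.

```python
from collections import Counter
from math import comb

def read_assignments(lines):
    """Parse 'clusterId<TAB>name' lines into {name: clusterId}."""
    labels = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        cluster_id, name = line.split("\t")[:2]
        labels[name] = cluster_id
    return labels

def pair_counting_scores(reference, solution):
    """Rand, Jaccard and adjusted Rand index for two {name: clusterId} mappings."""
    names = sorted(set(reference) & set(solution))
    if len(names) < 2:
        raise ValueError("need at least two shared names")

    joint = Counter((reference[x], solution[x]) for x in names)
    ref_sizes = Counter(reference[x] for x in names)
    sol_sizes = Counter(solution[x] for x in names)

    both = sum(comb(c, 2) for c in joint.values())           # pairs together in both
    ref_pairs = sum(comb(c, 2) for c in ref_sizes.values())  # pairs together in reference
    sol_pairs = sum(comb(c, 2) for c in sol_sizes.values())  # pairs together in solution
    total = comb(len(names), 2)                               # all pairs

    # Agreements are pairs together in both plus pairs separated in both.
    rand = (total + 2 * both - ref_pairs - sol_pairs) / total
    union = ref_pairs + sol_pairs - both
    jaccard = both / union if union else 1.0
    expected = ref_pairs * sol_pairs / total
    max_index = (ref_pairs + sol_pairs) / 2
    denom = max_index - expected
    adjusted_rand = (both - expected) / denom if denom else 1.0
    return {"rand": rand, "jaccard": jaccard, "adjusted_rand": adjusted_rand}
```

For example, `pair_counting_scores(read_assignments(open("benchmark.tsv")), read_assignments(open("mine.tsv")))` would score a clustering against the benchmark, with both file names being placeholders. If scikit-learn is already installed, `sklearn.metrics.adjusted_rand_score` should give the same adjusted Rand value and is a useful cross-check.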