I am writing to seek assistance regarding the usage of CD-HIT software for clustering a dataset of 135,000 nucleotide sequences. Currently, I am working on a cluster with 16 CPUs, and the maximum time limit available on this cluster is one week.
I have tried to speed up CD-HIT with various parameters, such as -T and -M, but none of them has reduced the run time significantly.
I have also come across cd-hit-para, but I am unsure how to use it: it asks for IP addresses, whereas all I know is the number of CPUs available on the cluster.
I would greatly appreciate any assistance or guidance you can provide to help optimize the CD-HIT clustering process in my current setup.
Thank you in advance for your support.
Maybe not a solution, but have you thought about using other clustering tools (e.g. UCLUST or MMseqs2)?
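
If you want to give MMseqs2 a try, here is a rough sketch of how a linear-time clustering run could be set up for 16 threads. It only builds and prints the command line for `mmseqs easy-linclust`; it does not run anything. The file names (`sequences.fasta`, `clusters`, `tmp`) and the helper name `mmseqs_linclust_cmd` are made up for illustration, and the identity and coverage thresholds are placeholders, so adjust them to whatever you were using with CD-HIT.

```python
import shlex


def mmseqs_linclust_cmd(fasta, out_prefix, tmp_dir,
                        min_seq_id=0.9, coverage=0.8, threads=16):
    """Build an `mmseqs easy-linclust` command as an argument list.

    All defaults here are illustrative assumptions, not recommendations.
    """
    return [
        "mmseqs", "easy-linclust",
        fasta,        # input FASTA file (hypothetical name)
        out_prefix,   # prefix for the output files
        tmp_dir,      # scratch directory for intermediate files
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
        "--threads", str(threads),        # CPU threads to use
    ]


cmd = mmseqs_linclust_cmd("sequences.fasta", "clusters", "tmp")
# Print a shell-safe version of the command to paste into a job script.
print(shlex.join(cmd))
```

The printed line can go straight into your job script. If the results look too coarse, `mmseqs easy-cluster` takes the same positional arguments and is more sensitive, at the cost of a longer run time.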