Hello,
I am trying to cluster a FASTA file of ~1.98M sequences at an 80% identity threshold with CD-HIT-EST, and it is taking a very long time (longer than the 14-day limit on my supercomputing cluster). I am running it with the maximum memory and cores available (2.9 TB, 80 cores). I have read here that a step-down approach could reduce the run time: for example, clustering my initial FASTA at a 95% threshold, then 90%, 85% and finally 80%, each time using the CD-HIT output FASTA from the previous run as input.
Is this a feasible approach? Are there other options for clustering ~1.98M nucleotide sequences at an 80% threshold that would be much faster?
Thanks!
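For reference, the step-down idea described above can be sketched as a small Python helper that only builds the four `cd-hit-est` command lines, each one reading the previous round's output. This is a rough sketch, not a tested pipeline: the file names (`input.fasta`, `clustered_95.fasta`, ...) and the function names are made up for illustration, and the `-n` word sizes follow the ranges given in the CD-HIT user guide for `cd-hit-est`. `-M 0` asks for no memory limit.

```python
import shlex

# Word size (-n) per identity range, as listed in the CD-HIT user guide
# for cd-hit-est: (lower bound of identity, word size).
WORD_SIZES = [(0.95, 10), (0.90, 8), (0.88, 7), (0.85, 6), (0.80, 5), (0.75, 4)]


def word_size(identity):
    """Return a word size suitable for the given identity threshold."""
    for lower_bound, n in WORD_SIZES:
        if identity >= lower_bound:
            return n
    raise ValueError("cd-hit-est does not support identity below 0.75")


def step_down_commands(fasta, thresholds=(0.95, 0.90, 0.85, 0.80),
                       threads=80, memory_mb=0):
    """Build one cd-hit-est command per threshold, chaining outputs to inputs."""
    commands = []
    current = fasta
    for c in thresholds:
        # Hypothetical output name, e.g. clustered_95.fasta
        out = f"clustered_{int(round(c * 100))}.fasta"
        commands.append([
            "cd-hit-est",
            "-i", current,           # input FASTA (previous round's output)
            "-o", out,               # representative sequences for this round
            "-c", f"{c:.2f}",        # identity threshold
            "-n", str(word_size(c)),
            "-T", str(threads),
            "-M", str(memory_mb),    # memory limit in MB; 0 means unlimited
        ])
        current = out
    return commands


if __name__ == "__main__":
    for cmd in step_down_commands("input.fasta"):
        print(shlex.join(cmd))
```

Each round writes its own `.clstr` file, so the cluster memberships from the four rounds would still need to be combined afterwards if you want to know which original sequences ended up in each final 80% cluster.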
vsearch is a great cd-hit alternative, but like Mensur Dlakic commented, it sounds like there's something wrong with your command.
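If you want to try vsearch, a clustering run at 80% identity might look roughly like the command built below. This is only a sketch under assumptions: the input and output file names and the function name are placeholders, and the options shown (`--cluster_fast`, `--id`, `--centroids`, `--uc`, `--threads`) are the standard vsearch clustering options rather than anything taken from your actual setup.

```python
import shlex


def vsearch_cluster_command(fasta, identity=0.80, threads=80,
                            centroids="centroids.fasta", uc="clusters.uc"):
    """Build a vsearch clustering command (file names are placeholders)."""
    return [
        "vsearch",
        "--cluster_fast", fasta,       # sorts by length, then clusters greedily
        "--id", f"{identity:.2f}",     # minimum pairwise identity
        "--centroids", centroids,      # one representative per cluster
        "--uc", uc,                    # cluster membership table
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    print(shlex.join(vsearch_cluster_command("input.fasta")))
```

Note that vsearch and cd-hit-est do not define identity in exactly the same way, and vsearch only compares the plus strand unless you add `--strand both`, so the cluster counts from the two tools will not match exactly.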