6.1 years ago
bitpir
I'm trying to run CD-HIT to cluster ~250M CDS sequences at the nucleotide and protein levels. These are mostly NR-like sequences from NCBI. According to the paper, it takes ~140 min to cluster 4M sequences with 8 cores. When I ran the job, it took > 12 hours to process 1M sequences. I tried increasing the number of CPUs to 24, but that didn't change the speed much. Below are the commands I used to run the clustering. Any help is appreciated! Thanks!
cd-hit-v4.6.8-2017-1208/cd-hit-est -i f1.nuc -o f1.nuc.out -n 10 -M 0 -T 8 -c 0.95 -r 0
cd-hit-v4.6.8-2017-1208/cd-hit -i f1.pep -o f1.pep.out -n 5 -M 0 -T 8 -c 0.95
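To put the gap in numbers, here is a rough back-of-the-envelope comparison using only the figures quoted above. It assumes throughput can be compared as a simple sequences-per-minute average, which may not hold since CD-HIT's runtime depends on the data and not just the sequence count. The function name `seqs_per_minute` is just a helper made up for this sketch.

```python
def seqs_per_minute(n_seqs: float, minutes: float) -> float:
    """Average clustering throughput in sequences per minute."""
    return n_seqs / minutes


# Figures quoted in the post: the paper's benchmark vs. my own run.
paper_rate = seqs_per_minute(4_000_000, 140)          # 4M seqs in ~140 min, 8 cores
observed_rate = seqs_per_minute(1_000_000, 12 * 60)   # 1M seqs in > 12 h, so this is a best case

# Because my run took *more* than 12 h, the true slowdown is at least this large.
slowdown = paper_rate / observed_rate

print(f"paper:    {paper_rate:,.0f} seqs/min")
print(f"observed: {observed_rate:,.0f} seqs/min (at best)")
print(f"slowdown: at least {slowdown:.1f}x")
```

So my run is at least an order of magnitude slower than the published benchmark, which seems like too large a gap to explain by thread count alone.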