Question

Word size CD-HIT-EST

2.6 years ago

Nathan ▴ 10

Hello. I am trying to cluster a huge FASTA file using CD-HIT-EST with an identity threshold of 80%. According to the user's guide (http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf), I should use a word size (-n) of 5. However, it is taking forever. Could I change this parameter to -n 10 to speed up the process without changing the final result, i.e., get the same result as with -n 5?

This is my command:

cd-hit-est -i input -o output -d 0 -T 16 -g 0 -M 75000 -aL 0.97 -aS 0.97 -c 0.8 -n 5 -b 1

clustering cd-hit-est cd-hit • 1.7k views

score 0 · Answer 1 · 2022-04-14

2.6 years ago

Mensur Dlakic ★ 28k

In a word, no. A word size of 5 is already the upper limit for clustering at 80% identity. You will have to get a faster computer with more threads and memory, work with a smaller database, or just be patient. It may help to know that progress is not linear: the longest sequences are clustered first, so the run will speed up as it goes along.
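To make the link between -c and -n concrete, here is a rough sketch of a lookup helper. The function name suggest_word_size is made up for illustration, and the ranges are my reading of the cd-hit-est word-size table in the user guide linked in the question, so please check them against the guide for your CD-HIT version. Where the guide lists two word sizes for a range, the sketch returns the smaller one.

```shell
# Hypothetical helper (not part of CD-HIT): print a word size (-n) that
# matches a cd-hit-est identity threshold (-c).
# Assumption: the ranges below mirror the word-size table in the CD-HIT
# user guide; verify them before relying on this.
suggest_word_size() {
    awk -v c="$1" 'BEGIN {
        if      (c >= 0.95) n = 10   # guide lists 10 or 11 for 0.95-1.0
        else if (c >= 0.90) n = 8    # guide lists 8 or 9 for 0.90-0.95
        else if (c >= 0.88) n = 7
        else if (c >= 0.85) n = 6
        else if (c >= 0.80) n = 5
        else if (c >= 0.75) n = 4
        else { print "unsupported"; exit }   # below the range covered here
        print n
    }'
}

# Example use (commented out so the snippet does not launch a clustering run):
# cd-hit-est -i input -o output -c 0.8 -n "$(suggest_word_size 0.8)"
```

For the 80% threshold in the question this prints 5, which is why raising -n to 10 is not an option unless -c is raised as well.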