Hi,
I want to use cd-hit to stringently cluster and remove redundant DNA sequences (~14,500 sequences). Until now I have been running an all-vs-all blastn, filtering for 98% query coverage per HSP (qcovhsp) and 98% identity, and then running a script that keeps the longer sequence of each match. I then found cd-hit, which lets me do the same thing but also keeps track of the clusters for me. Going through the options, I found some that should do what I have been doing: 1) remove sequences that match at 98% query coverage and 98% identity, 2) keep the longer sequence of each match.
Here is what I came up with to try to replicate the all-vs-all blastn (please correct me if I am wrong!):
cdhit -i input.fa -o output.fa -n 11 -g 1 -G 0 -aL .98
-n: word size
-g: accurate mode
-G: local sequence identity
-aL: alignment coverage of the longer sequence (# of bases of the longer sequence in the alignment / length of the longer sequence)
However, I can't seem to find an argument for percent identity. I want 98% of the bases in the alignment to match. Any help would be appreciated!
If you are dealing with DNA, you probably want to use the nucleotide version, since plain cdhit is designed for protein sequences:
cdhit-est
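As for the percent identity cutoff asked about above: the option is -c (sequence identity threshold), and both programs accept it. Here is a rough sketch of how the full command might look. It assumes the upstream binary name cd-hit-est (written cdhit-est in this thread; use whatever your install provides) and made-up file names. The remaining options are simply carried over from the command in the question, so treat this as a starting point rather than a verified equivalent of the blastn filter.

```shell
# Hypothetical file names; substitute your own.
input=input.fa
output=output_98.fa
identity=0.98   # -c: sequence identity threshold (the percent identity cutoff)
coverage=0.98   # -aL: alignment coverage of the longer sequence

# -n 11 : word size, as in the question
# -g 1  : accurate mode (assign each sequence to its most similar cluster)
# -G 0  : local identity, i.e. identity computed over the aligned region
cmd="cd-hit-est -i $input -o $output -c $identity -n 11 -g 1 -G 0 -aL $coverage"
echo "$cmd"

# Only run it when the program is installed and the input file exists.
if command -v cd-hit-est >/dev/null 2>&1 && [ -s "$input" ]; then
  $cmd
fi
```

The representative sequences go to the -o file, and the cluster membership goes to a file with the same name plus .clstr. If you would rather require that 98% of the shorter sequence is covered, -aS is the counterpart of -aL.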
I thought cdhit-est was not good for long sequences. My longest sequence is about 71 kb. Is cdhit-est OK for that?
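Whatever the limit turns out to be, it helps to know the exact length of the longest record before running anything. A minimal sketch, assuming a plain FASTA file whose sequences may wrap over several lines; the file name example.fa and its contents are made up so the snippet runs on its own:

```shell
# Tiny made-up FASTA for illustration; point $fasta at your real file instead.
fasta=example.fa
printf '>seq1\nACGTACGT\nACGT\n>seq2\nACG\n' > "$fasta"

# Add up the bases of each record and remember the largest total.
longest=$(awk '
  /^>/ { if (len > max) max = len; len = 0; next }
       { len += length($0) }
  END  { if (len > max) max = len; print max + 0 }
' "$fasta")

echo "longest sequence: $longest bp"
```

You can then compare that number against whatever maximum sequence length your cd-hit build reports or documents.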