Help with CD-hit commands for DNA sequences
7.7 years ago
frcamacho ▴ 210

Hi,

I want to use cd-hit to stringently cluster and remove redundant DNA sequences (~14,500 sequences). I had been running an all-vs-all blastn, filtering for 98% qcovhsp and 98% percent identity, and then running a script to keep the longer sequence of each match. However, I found cd-hit, which lets me do the same thing and also keeps track of the clusters for me. Going through the options, I found some that should reproduce what I have been doing: 1) filtering at 98% query coverage and 98% percent identity, and 2) keeping the longer sequence in a match.

Here is what I came up with to replicate an all-vs-all blastn (please correct me if I am wrong!):

cdhit -i input.fa -o output.fa -n 11 -g 1 -G 0 -aL .98

-n 11: word size
-g 1: accurate mode
-G 0: local sequence identity
-aL .98: number of bases of the longer sequence in the alignment / length of the longer sequence
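
To make the -aL ratio described above concrete, here is a minimal Python sketch. The function name and the example numbers are hypothetical and chosen only for illustration; cd-hit computes this ratio internally.

```python
def longer_seq_coverage(aligned_bases_in_longer: int, longer_len: int) -> float:
    """Fraction of the longer sequence covered by the alignment (the -aL ratio)."""
    if longer_len <= 0:
        raise ValueError("longer_len must be positive")
    return aligned_bases_in_longer / longer_len


# Hypothetical pair: the longer sequence has 1,000 bases, 985 of them aligned.
coverage = longer_seq_coverage(985, 1000)

# True here means this pair would satisfy a 0.98 coverage cutoff.
print(coverage >= 0.98)
```

With only 970 of the 1,000 bases aligned, the same check would fail the 0.98 cutoff.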

However, I can't seem to find an argument for percent identity. I want 98% of the aligned bases to be identical. Any help will be appreciated!

cdhit software • 4.0k views

If you are dealing with DNA, you probably want to use cd-hit-est.


I thought cd-hit-est was not well suited to long sequences. My longest sequence is 71 kb. Is cd-hit-est OK for that?

7.7 years ago
abascalfederico ★ 1.2k

Hi, in the version I have, the percent identity threshold is set with "-c":

    -c  sequence identity threshold, default 0.9
        this is the default cd-hit's "global sequence identity" calculated as:
        number of identical amino acids in alignment
        divided by the full length of the shorter sequence
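
As a rough illustration of the definition quoted above, the following Python sketch contrasts the default global identity (identical positions divided by the length of the shorter sequence) with the local identity used with -G 0 (identical positions divided by the alignment length). The function names and numbers are hypothetical; this is not cd-hit's own code.

```python
def global_identity(identical: int, shorter_len: int) -> float:
    """Identical positions divided by the full length of the shorter sequence."""
    return identical / shorter_len


def local_identity(identical: int, alignment_len: int) -> float:
    """Identical positions divided by the length of the alignment."""
    return identical / alignment_len


# Hypothetical pair: the shorter sequence has 1,000 bases, but the alignment
# spans only 500 columns, 495 of which are identical.
g = global_identity(495, 1000)
l = local_identity(495, 500)

print(f"global identity: {g:.3f}, passes 0.98 cutoff: {g >= 0.98}")
print(f"local identity:  {l:.3f}, passes 0.98 cutoff: {l >= 0.98}")
```

In this made-up case the pair passes a 0.98 cutoff only under the local definition, which is why a coverage option such as -aL matters when -G 0 is used.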

Which version are you running? I am running 4.6 (built on Jul 29 2016)


Mine is version 4.5.4, but I've checked that more recent versions still use -c.
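
Putting the suggestions in this thread together, one possible invocation can be assembled as below. This is only a sketch: the helper name is hypothetical, the file names and remaining flags are copied from the question, and whether -n 11 suits a given threshold should be checked against the help output of the installed version.

```python
import shlex


def build_cd_hit_est_command(fasta_in, fasta_out, identity=0.98, coverage_longer=0.98):
    """Return an argument list for cd-hit-est (hypothetical helper, runs nothing)."""
    return [
        "cd-hit-est",                  # nucleotide variant suggested in the comments
        "-i", fasta_in,                # input FASTA
        "-o", fasta_out,               # output FASTA of representative sequences
        "-c", str(identity),           # sequence identity threshold
        "-n", "11",                    # word size, as in the question
        "-g", "1",                     # accurate mode
        "-G", "0",                     # local sequence identity
        "-aL", str(coverage_longer),   # alignment coverage of the longer sequence
    ]


cmd = build_cd_hit_est_command("input.fa", "output.fa")
print(shlex.join(cmd))
```

If cd-hit-est is installed, the list could be passed to `subprocess.run(cmd, check=True)`; the sketch itself only prints the command.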
