Question

Help with CD-hit commands for DNA sequences

0

7.7 years ago

frcamacho ▴ 210

Hi,

I want to use cd-hit to stringently cluster and remove redundant DNA sequences (~14,500 sequences). Until now I have been running an all-vs-all blastn, filtering for 98% qcovsHSP and 98% identity, and then running a script that keeps the longer sequence of each match. However, I found cd-hit, which lets me do the same and also keeps track of the clusters for me. Going through the options, I found some that should do what I have been doing: 1) remove sequences that match at 98% query coverage and 98% identity, and 2) keep the longer sequence of each match.

Here is what I got to try to replicate a blastn all vs all: (Please correct me if I am wrong!)

cdhit -i input.fa -o output.fa -n 11 -g 1 -G 0 -aL .98

-n: word size
-g: accurate mode
-G: local sequence identity
-aL: number of bases of the longer sequence in the alignment / length of the longer sequence

However, I can't seem to find an argument for percent identity. I want 98% of the bases to match correctly in alignment. Any help will be appreciated!

cdhit software • 4.0k views

0

If you are dealing with DNA, you probably want to use cd-hit-est.

ADD REPLY • link 7.7 years ago by h.mon 35k

0

I thought cd-hit-est was not good for long sequences. My longest sequence is 71 kb. Is cd-hit-est OK for that?

ADD REPLY • link 7.7 years ago by frcamacho ▴ 210

score 2 · Answer 1 · 2017-03-15

2

7.7 years ago

abascalfederico ★ 1.2k

Hi, in the version I have, the % identity threshold is set with "-c":

    -c  sequence identity threshold, default 0.9
        this is the default cd-hit's "global sequence identity" calculated as:
        number of identical amino acids in alignment
        divided by the full length of the shorter sequence
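
Putting the pieces of this thread together, a command along these lines might be a starting point. It is only a sketch: the file names are placeholders, the remaining flags are taken from the command in the question, and `-d 0` and the `-n 10` word size are assumptions added here (the user's guide suggests a word size of 10 or 11 for thresholds of 0.95 to 1.0). Whether `-aL` is the closest match to a blastn query-coverage filter depends on your data.

```shell
#!/usr/bin/env bash
# Sketch only: file names are placeholders and flag values are assumptions
# to adapt to your own data.
set -euo pipefail

input="input.fa"     # hypothetical input FASTA
output="output.fa"   # hypothetical output name; clusters are written to output.fa.clstr

cmd=(
  cd-hit-est
  -i "$input"
  -o "$output"
  -c 0.98    # sequence identity threshold
  -n 10      # word size (user's guide: 10 or 11 for thresholds 0.95-1.0)
  -G 0       # local sequence identity; meant to be combined with coverage controls
  -aL 0.98   # alignment coverage of the longer sequence
  -g 1       # accurate mode
  -d 0       # name sequences in .clstr by the FASTA defline up to the first space
)

# Print the command so it can be inspected before anything runs.
printf '%s ' "${cmd[@]}"; echo

# Run only when the tool and the input file are actually present.
if command -v cd-hit-est >/dev/null 2>&1 && [[ -f "$input" ]]; then
  "${cmd[@]}"
fi
```

Building the command in an array keeps each flag on its own commented line and lets you check it before committing to a long run.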

0

Which version are you running? I am running 4.6 (built on Jul 29 2016)

ADD REPLY • link 7.7 years ago by frcamacho ▴ 210

1

Mine is version 4.5.4, but I've checked that more recent versions still use -c.

ADD REPLY • link 7.7 years ago by abascalfederico ★ 1.2k