Question

Remove redundancy from GenBank plasmid database using cd-hit-est

0

8.6 years ago

wanderingstefan ▴ 30

Hey all,

I have a plasmid sequence database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/) that is heavily redundant. I have been trying to remove the redundancy and obtain a set of representative sequences using cd-hit-est (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) as follows:

cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9

This produces one file containing the clusters and another containing the representative sequences. Now to my problem: removing the redundancy from the database does not seem to work. Two sequences that are 100% identical over 100% of their length (they have the same length) end up in different clusters instead of the same one. I checked the similarity of the sequences by aligning them with BLAST, and as stated above, the sequences are identical.

Does anyone know what the problem here might be? Am I missing something?

Thanks in advance!

sequence next-gen alignment • 2.4k views

ADD COMMENT • link updated 8.6 years ago by 5heikki 11k • written 8.6 years ago by wanderingstefan ▴ 30

score 2 · Accepted Answer · 2016-05-19

2

8.6 years ago

5heikki 11k

The problem is that you did not bother to check what the default options are.

   -g   1 or 0, default 0
    by cd-hit's default algorithm, a sequence is clustered to the first 
    cluster that meet the threshold (fast cluster). If set to 1, the program
    will cluster it into the most similar cluster that meet the threshold
    (accurate but slow mode)
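
As a rough illustration of what the help text above describes, here is a toy sketch of the two assignment modes. The `identity` function is a hypothetical stand-in (a position-by-position match fraction), not cd-hit's alignment-based identity, and `assign` is an illustrative name rather than anything from the cd-hit source.

```python
def identity(a: str, b: str) -> float:
    """Hypothetical stand-in: fraction of matching positions.
    This is NOT how cd-hit computes sequence identity."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def assign(seq, representatives, threshold, accurate):
    """Return the index of the cluster `seq` would join, or None if it
    would start a new cluster.

    accurate=False mimics the described -g 0 behaviour (first cluster
    that meets the threshold); accurate=True mimics -g 1 (most similar
    cluster that meets the threshold).
    """
    best_index, best_score = None, 0.0
    for index, rep in enumerate(representatives):
        score = identity(seq, rep)
        if score < threshold:
            continue
        if not accurate:
            return index  # fast mode: stop at the first hit
        if score > best_score:
            best_index, best_score = index, score  # keep the best hit
    return best_index


reps = ["AAAAAAAAAA", "AAAAAAAAAC"]
print(assign("AAAAAAAACC", reps, 0.8, accurate=False))
print(assign("AAAAAAAACC", reps, 0.8, accurate=True))
```

In both modes a sequence only joins a cluster whose representative meets the threshold, so the flag changes which cluster is chosen, not whether any cluster qualifies.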

ADD COMMENT • link 8.6 years ago by 5heikki 11k

1

Hey, thanks for your answer. However, running it as cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9 -g 1 does not resolve my problem. My clustering file still looks like this:

>Cluster 39
0   6222nt, >gi|410475454|ref|NC... *
>Cluster 40
0   6211nt, >gi|387504713|ref|NC... at +/98.10%
1   6222nt, >gi|41056918|ref|NC_... *
2   6222nt, >gi|118480566|ref|NC... at +/98.09%
>Cluster 41
0   6222nt, >gi|844749291|ref|NZ... *

The sequences that are 6222 bases long are at least 99% similar over the whole length, but still end up in different clusters.
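
To spot cases like this across a whole clustering file, a small parser can help. The sketch below assumes the file has the layout shown above (a `>Cluster N` header, then one member per line, with `*` marking the representative); `parse_clstr` and `same_length_representatives` are hypothetical helper names, and the truncated sequence names are kept exactly as they appear in the file.

```python
import re
from collections import defaultdict

# Member line, e.g. "0   6222nt, >gi|410475454|ref|NC... *"
MEMBER = re.compile(r"^(\d+)\s+(\d+)nt, >(.+?)\.\.\. (.+)$")


def parse_clstr(lines):
    """Map cluster number -> list of member dicts."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = MEMBER.match(line)
        if match and current is not None:
            clusters[current].append({
                "length": int(match.group(2)),
                "name": match.group(3),
                "representative": match.group(4) == "*",
            })
    return clusters


def same_length_representatives(clusters):
    """Group clusters whose representatives have identical length.
    Such groups are only candidates for a closer look, nothing more."""
    by_length = defaultdict(list)
    for cluster_id, members in clusters.items():
        for member in members:
            if member["representative"]:
                by_length[member["length"]].append(cluster_id)
    return {n: ids for n, ids in by_length.items() if len(ids) > 1}


SAMPLE = """>Cluster 39
0   6222nt, >gi|410475454|ref|NC... *
>Cluster 40
0   6211nt, >gi|387504713|ref|NC... at +/98.10%
1   6222nt, >gi|41056918|ref|NC_... *
2   6222nt, >gi|118480566|ref|NC... at +/98.09%
>Cluster 41
0   6222nt, >gi|844749291|ref|NZ... *
""".splitlines()

print(same_length_representatives(parse_clstr(SAMPLE)))
```

Equal length alone says nothing about similarity; the grouping just narrows down which representatives are worth comparing directly.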

ADD REPLY • link 8.6 years ago by wanderingstefan ▴ 30

2

Of those sequences, only the members of cluster 40 are within 95% similarity under cd-hit-est's default alignment coverage cutoffs.

Let's have a look with blastn:

blastn -query 410475454.fna -subject 844749291.fna -outfmt 6
gi|410475454|ref|NC_019040.1|   gi|844749291|ref|NZ_CP006639.1| 100.000 4693    0       0       1530    6222    1       4693    0.0     8667
gi|410475454|ref|NC_019040.1|   gi|844749291|ref|NZ_CP006639.1| 99.935  1529    1       0       1       1529    4694    6222    0.0     2819

The sequences are indeed very similar. However, their linear representations begin from completely different locations! I don't think any clustering algorithm considers circular topology as an option.
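
For the simplest version of this situation, where two records are exactly the same circular molecule written from different start positions, a quick check is possible without alignment. The sketch below is a minimal illustration, assuming plain A/C/G/T strings already loaded into memory; the function names are hypothetical. It only detects exact rotations, so a pair like the one above (one mismatch in the second segment) would still need an alignment-based comparison.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotation_offset(a: str, b: str):
    """Return k such that b == a[k:] + a[:k], or None if b is not an
    exact rotation of a. Uses the doubled-string trick: every rotation
    of a occurs as a substring of a + a."""
    if len(a) != len(b):
        return None
    k = (a + a).find(b)
    return k if k != -1 else None


def rotate(seq: str, k: int) -> str:
    """Rewrite a circular sequence so that it starts at 0-based index k."""
    return seq[k:] + seq[:k]


def same_circular_sequence(a: str, b: str) -> bool:
    """True if b is an exact rotation of a on either strand."""
    a, b = a.upper(), b.upper()
    return (rotation_offset(a, b) is not None
            or rotation_offset(reverse_complement(a), b) is not None)


print(rotation_offset("AACGT", "CGTAA"))
print(same_circular_sequence("AACGT", "CGTTA"))
```

Once an offset is known, `rotate` can bring both records to a common start position before clustering; whether that is appropriate for a whole database is a separate judgment call.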

ADD REPLY • link 8.6 years ago by 5heikki 11k

1

Ah, now I see! I just looked at the graphical output of blast, but was not aware that the slash in the middle of the sequence marked the beginning of the alignment! Then I know why they are in different clusters. Thank you for your answer!

ADD REPLY • link 8.6 years ago by wanderingstefan ▴ 30