Hey all,
I have the following problem. I have a plasmid sequence database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/) that is heavily redundant. I have been trying to remove the redundancy and obtain a set of representative sequences using cd-hit-est (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) as follows:
cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9
This produces one file containing the clusters and another containing the representative sequences. Now to my problem: removing the redundancy from the database does not seem to work. Two sequences that are 100% identical over 100% of the sequence length (they have the same length) end up in different clusters instead of the same one. I have checked the similarity of the sequences by aligning them with BLAST, and as stated above, the sequences are identical.
Does anyone know what the problem here might be? Am I missing something?
Thanks in advance!
Hey, thanks for your answer. However, running it like
cd-hit-est -i fastadb -o outfilename -c 0.95 -n 9 -g 1
does not resolve my problem. My clustering file still looks like this:
The sequences that are 6222 bases long are at least 99% similar over their whole length, but still end up in different clusters.
Of those sequences, only the members of cluster 40 are within 95% similarity under cd-hit-est's default alignment coverage cutoffs.
Let's have a look with blastn:
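The sequences are indeed very similar. However, their linear representations begin at completely different locations! Plasmids are circular, so the same molecule can be written out starting from any position, and a linear aligner then sees two pieces in swapped order rather than one full-length match. I don't think any clustering algorithm considers circular topology as an option.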
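To illustrate the circular-topology point, here is a minimal Python sketch that tests whether two sequences are the same circular molecule written from different start positions. The function names are made up for this example and are not part of cd-hit or BLAST. It only catches exact rotations (on either strand); for near-identical sequences such as the 99% case, you would more likely align one sequence against a doubled copy of the other.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def is_circular_match(a: str, b: str) -> bool:
    """True if b is an exact rotation of a, on either strand.

    Every rotation of a circular sequence appears as a substring of the
    sequence concatenated with itself, so a substring test is enough.
    """
    a, b = a.upper(), b.upper()
    if len(a) != len(b):
        return False
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled


if __name__ == "__main__":
    # Same toy "plasmid", written from two different start positions.
    print(is_circular_match("ACGTTG", "TTGACG"))  # True
    # One base differs, so this is not an exact rotation.
    print(is_circular_match("ACGTTG", "ACGTTA"))  # False
```

Pairs flagged this way could be collapsed before running cd-hit-est, though that only removes exact circular duplicates.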
Ah, now I see! I just looked at the graphical output of blast, but was not aware that the slash in the middle of the sequence marked the beginning of the alignment! Then I know why they are in different clusters. Thank you for your answer!