Question

Clustering Highly Similar Sequences

1

12.1 years ago

edm1 ▴ 10

Hi,

I am looking to cluster a large number (0.5-3 million) of highly similar MiSeq reads. Paired-end reads are being sequenced, but clustering will only be carried out on the forward reads. The sequences will be trimmed down to ~150 bp before clustering.

I have already given USEARCH and CD-HIT a go, but the problem is that the sequences are so similar that the difference between them can be smaller than the expected sequencing error rate (>~2%). When I manually inspected the clusters (after an MSA), I could see that there are multiple species within a single cluster.

I am hoping to find a tool that can cluster while accounting for the frequency of mismatches at specific positions within the sequences, similar to the way a de Bruijn graph assembler works. If a mismatch occurs at a low frequency then it is likely a sequencing error, but if it occurs more frequently then the sequence would be considered a different species.
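To make the idea concrete, here is a minimal Python sketch of the per-position frequency test described above. It is only an illustration under some strong assumptions: the reads are already aligned and of equal length (no indels), and the 5% cutoff is an arbitrary placeholder. The names `variable_positions`, `split_by_variable_positions`, and `min_minor_freq` are hypothetical and do not come from any existing tool.

```python
from collections import Counter, defaultdict


def variable_positions(reads, min_minor_freq=0.05):
    """Return column indices where the second most common base
    reaches at least `min_minor_freq` of the reads.

    Assumes all reads are aligned and have the same length.
    """
    n = len(reads)
    positions = []
    # zip(*reads) walks the reads column by column
    for i, column in enumerate(zip(*reads)):
        top_two = Counter(column).most_common(2)
        if len(top_two) == 2 and top_two[1][1] / n >= min_minor_freq:
            positions.append(i)
    return positions


def split_by_variable_positions(reads, min_minor_freq=0.05):
    """Group reads by the bases they carry at the variable positions only.

    Mismatches at all other positions are treated as sequencing errors
    and ignored when forming the groups.
    """
    positions = variable_positions(reads, min_minor_freq)
    groups = defaultdict(list)
    for read in reads:
        key = "".join(read[i] for i in positions)
        groups[key].append(read)
    return positions, dict(groups)


# Toy data: two abundant variants that differ at index 4,
# plus two reads carrying a rare mismatch at index 3.
reads = ["ACGTACGT"] * 60 + ["ACGTTCGT"] * 38 + ["ACGAACGT"] * 2
positions, groups = split_by_variable_positions(reads)
```

In the toy data, only index 4 passes the frequency cutoff, so the two reads with the rare mismatch at index 3 are grouped with the abundant variant they otherwise match. A rare error that happens to fall on a variable position would still create its own small group, so a real implementation would need a further abundance filter.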

At the moment I am considering writing my own algorithm to do the job, but if something already exists then I'm sure it would be much more efficient.

Thanks for any help.

clustering ngs • 2.7k views

updated 12.1 years ago by Josh Herr 5.8k • written 12.1 years ago by edm1 ▴ 10

score 1 · Answer 1 · 2013-05-29

The short answer to your question is to go back and read the USEARCH and CD-HIT manuals, as you can certainly set the sequence similarity threshold as a flag when clustering.

I am not aware of a way to restrict mismatches to a specific motif, sequence string, or region, but if you are referring to general mismatches, these can be accounted for. You did not give us any indication of what your research question is, so it's hard to know what to recommend -- it's easy to differentiate SNPs or other variation from sequencing errors when you have millions of reads of the same sequence.

You say multiple species in a cluster? How do you know? What is a species anyway -- do you mean OTUs? Are these sequences all from the same organism, or a population, or a community? Did you obtain them using the same primer? Have you tried Stacks?

score 0 · Answer 2 · 2013-05-28

12.1 years ago

Ido Tamir 5.2k

You could try the various error correction tools to remove sequencing errors before clustering.
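As a rough illustration of what "correct first, then cluster" can look like, the Python sketch below folds rare reads into a much more abundant read that differs by at most one base. This is a simplified stand-in, not the algorithm of any particular error correction tool. The function name `merge_rare_reads` and the parameters `max_dist` and `min_ratio` are hypothetical, and the pairwise comparison would be too slow for millions of distinct sequences.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def merge_rare_reads(reads, max_dist=1, min_ratio=10):
    """Replace each rare read with an abundant neighbour, if one exists.

    A read is replaced when some already-kept sequence of the same length
    is at least `min_ratio` times more abundant and within `max_dist`
    mismatches. Both thresholds are illustrative assumptions.
    """
    counts = Counter(reads)
    # Visit unique sequences from most to least abundant (ties broken by sequence)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    kept = []
    corrected = {}
    for seq in ordered:
        target = seq
        for parent in kept:
            if (len(parent) == len(seq)
                    and counts[parent] >= min_ratio * counts[seq]
                    and hamming(parent, seq) <= max_dist):
                target = parent
                break
        if target == seq:
            kept.append(seq)
        corrected[seq] = target
    return [corrected[r] for r in reads]


# Toy data: two abundant variants and two reads with a likely error.
reads = ["ACGTACGT"] * 60 + ["ACGTTCGT"] * 38 + ["ACGAACGT"] * 2
cleaned = Counter(merge_rare_reads(reads))
```

Here the two abundant variants survive because neither is ten times more abundant than the other, while the two rare reads are absorbed into the abundant sequence one mismatch away. The cleaned reads can then be passed to the clustering step.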
