Hi,
I am looking to cluster a large number (0.5-3 million) of highly similar MiSeq reads. Paired-end reads are being sequenced, but clustering will only be carried out on the forward reads. The sequences will be trimmed down to ~150 bp before clustering.
I have already tried USEARCH and CD-HIT, but the problem is that the sequences are so similar that the real differences between them can be smaller than what would be expected from sequencing error (>~2%). When I manually inspected the clusters (after an MSA), I could see that there are multiple species within a single cluster.
I am hoping to find a tool that can do clustering while accounting for the frequency of mismatches at specific positions within the sequences, similar to the way a de Bruijn graph assembler works. If a mismatch occurs at a low frequency, it is likely a sequencing error, but if it occurs more frequently, the sequence would be considered a different species.
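To make the idea concrete, here is a minimal Python sketch of the kind of frequency-based check I have in mind. It is only an illustration, not an existing tool: it assumes the reads in one cluster are already aligned to equal length (for example, rows of an MSA), and the function names `variable_positions` and `split_cluster` and the 10% threshold are made up for this example.

```python
from collections import Counter


def variable_positions(aligned_reads, min_minor_freq=0.10):
    """Find alignment columns where a non-majority base is common.

    Returns {column: {base: frequency}} for every column in which some base
    other than the most common one reaches at least min_minor_freq.
    The threshold is an arbitrary placeholder, not a recommended value.
    """
    if not aligned_reads:
        return {}
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    n = len(aligned_reads)
    informative = {}
    for col in range(length):
        counts = Counter(read[col] for read in aligned_reads)
        major_base, _ = counts.most_common(1)[0]
        # Keep only minority bases that are too frequent to be random error.
        minor = {
            base: count / n
            for base, count in counts.items()
            if base != major_base and count / n >= min_minor_freq
        }
        if minor:
            informative[col] = minor
    return informative


def split_cluster(aligned_reads, min_minor_freq=0.10):
    """Split a cluster by the bases found at the informative columns only.

    Low-frequency mismatches elsewhere are ignored, so reads that differ
    only by rare (probably erroneous) bases stay in the same group.
    """
    cols = sorted(variable_positions(aligned_reads, min_minor_freq))
    groups = {}
    for read in aligned_reads:
        key = tuple(read[c] for c in cols)
        groups.setdefault(key, []).append(read)
    return groups


if __name__ == "__main__":
    # Toy cluster: two real variants plus one read with a likely error.
    reads = ["ACGTACGT"] * 10 + ["ACGAACGT"] * 8 + ["TCGTACGT"]
    for signature, members in split_cluster(reads).items():
        print(signature, len(members))
```

In this toy case the single read with a mismatch at the first position stays with its parent variant, while the frequent mismatch at the fourth position splits the cluster in two. A real version would also need to handle an error that falls on an informative column, and quality scores, neither of which this sketch does.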
At the moment I am considering writing my own algorithm to do the job, but if something already exists then I'm sure it would be much more efficient.
Thanks for any help.