I have a collection of fastq files, each of which represents sequencing data from between one and three distinct but similar clones. There are between a few hundred and a few hundred thousand reads per file. The reads have already been curated, so they are trimmed, demultiplexed, paired and overlapped. Each sequence is about 300-400 bases long, and each represents nearly the full sequence of interest (i.e. the final sequence is also about 300-400 bases, and each read covers it nearly fully).
I want to assemble the reads to find consensus sequences. I can't just use a multiple alignment for this, because a significant minority of the wells are polyclonal and contain very similar, but distinct, sequences. I want an assembler that will simply line up the reads, determine which sets represent distinct clones, and return the consensus sequences along with the number of reads that fit that consensus.
Most of the assemblers I've tried simply won't accept that the situation is as simple as it is; they earnestly want to perform much more complex operations, and won't let me override them. cap3 will take the paired-end set, overlap them, and return what I want, but cap3 is quite slow (several days to assemble the largest collections). Ray does most of the work, but I can't find a way to force Ray to tag the contigs with the number of reads in each. (This needs to be automated, because there are hundreds of files.)
What assemblers will take a single file (or a paired-end file if it can overlap them rapidly), and align reads to identify distinct clones without trying to build a full genome out of them?
Hi,
Have you looked at tools used for OTU clustering, like Usearch and Swarm?
Thanks, "clustering" was the key word I should have been searching for, and turns up several applications including Swarm that seem to do exactly what I want.
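For anyone landing here later, this is a rough Python sketch of the general idea behind abundance-based clustering, not how Usearch or Swarm are actually implemented. It collapses identical reads, then greedily attaches each unique sequence to the first, more abundant centroid it nearly matches. The function names and the `max_diffs` parameter are made up for illustration, and the comparison is a plain mismatch count that only works for reads of equal length (a real tool aligns the reads and handles indels).

```python
from collections import Counter
from itertools import islice


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    # Assumes plain 4-line records; line 2 of each record holds the bases.
    for seq in islice(lines, 1, None, 4):
        seq = seq.strip()
        if seq:
            yield seq.upper()


def hamming_within(a, b, max_diffs):
    """True if a and b have equal length and differ at <= max_diffs positions."""
    # Simplifying assumption: reads of different length never match.
    if len(a) != len(b):
        return False
    diffs = 0
    for x, y in zip(a, b):
        if x != y:
            diffs += 1
            if diffs > max_diffs:
                return False
    return True


def cluster_reads(sequences, max_diffs=1):
    """Greedy abundance-ordered clustering.

    Returns a list of (centroid_sequence, read_count) tuples.
    """
    # Dereplication: identical reads collapse to one entry with a count.
    unique = Counter(sequences)
    clusters = []  # each item is [centroid, total_reads]
    # Visit unique sequences from most to least abundant, so likely true
    # clones become centroids before their error-containing variants.
    for seq, count in unique.most_common():
        for cluster in clusters:
            if hamming_within(seq, cluster[0], max_diffs):
                cluster[1] += count
                break
        else:
            clusters.append([seq, count])
    return [(centroid, total) for centroid, total in clusters]
```

To try it on one file, something like `with open("well_A1.fastq") as fh: print(cluster_reads(read_fastq_sequences(fh)))` should work, where `well_A1.fastq` is a placeholder name. The centroid here is just the most abundant sequence in each cluster, which stands in for a consensus; the read count next to it is the number I was after.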