My data: long PCR amplicons of 6000 bp with long indels (>100 bp). The target is a multicopy gene, so a single PCR yields multiple amplicons that differ only by indels and have minimal substitutions (variance ~1%). I have 2000 reads per sample. The sequence contains homopolymers (12 bp) and tandem repeats (up to 55-fold, length 12-250 bp). No reference is available.
I want to de-noise my amplicons and generate one consensus sequence for each of the included variants of my target amplicon.
I tried MAFFT with the fastest setting (a progressive alignment), but it is still too slow and does not make use of multiple cores. I tried supposedly fast aligners like MAGUS, but it never finished in under one hour per sample. I tried Flye, and it cuts some of the tandem repeats, although other tools do much worse. I tried the Geneious assembler meant for Sanger data; it works quite well for small read numbers, but this data set is too large for it. I tried Amplicon_sorter, but it is not sensitive enough to catch the variants. I tried HAlign, but it does not handle indels properly and oversplits the alignment. I tried k-mer clustering, but it does not work because the variants differ in the number of repeats, and this is not reflected in k-mer comparisons (two sequences with different repeat counts generate the same k-mers).
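To make the k-mer problem concrete, here is a minimal sketch in plain Python. The flanks, the repeat unit and k = 6 are made-up values for illustration, not taken from my data, and it assumes the clustering compares k-mer presence/absence. Two sequences that differ only in repeat copy number share exactly the same k-mer set, so a set-based distance is zero; only the k-mer counts (and the read length) differ.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard(seq_a, seq_b, k):
    """Jaccard similarity of the k-mer sets (presence/absence only)."""
    set_a, set_b = set(kmers(seq_a, k)), set(kmers(seq_b, k))
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical toy sequences: same flanks, same repeat unit,
# different number of tandem copies (5 vs 9).
LEFT, UNIT, RIGHT = "GGATCCTTAA", "ACGTTGCA", "CCGGAATTCG"
variant_a = LEFT + UNIT * 5 + RIGHT
variant_b = LEFT + UNIT * 9 + RIGHT

K = 6
similarity = jaccard(variant_a, variant_b, K)

# The count profiles do differ, because the repeat k-mers occur more often
# in the variant with more copies.
counts_differ = Counter(kmers(variant_a, K)) != Counter(kmers(variant_b, K))

print(f"lengths: {len(variant_a)} vs {len(variant_b)}")
print(f"Jaccard similarity of k-mer sets: {similarity}")
print(f"k-mer count profiles differ: {counts_differ}")
```

With real reads, sequencing errors would blur the count differences further, which is presumably why the variants do not separate.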
Any ideas? Most assemblers do not understand that no correction of repeats is needed: all my reads are complete, i.e. they span the amplicon from start to finish. As a result, the assemblers introduce a lot of errors. For the same reason I do not want my reads extended, yet some assemblers somehow still find a way to extend them.
This is what the data looks like after a MAFFT alignment and elimination of all nucleotides with less than 5% frequency per column:
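For reference, the per-column filter can be sketched roughly as follows. This is one possible reading of the step, not the exact script used: any base whose frequency in its column is below 5% is replaced by a gap, and columns that end up gap-only are dropped. The function name `mask_rare_bases` and the toy rows in the usage lines are hypothetical; the input is assumed to be a list of equal-length aligned strings, and gaps count towards the column total.

```python
from collections import Counter


def mask_rare_bases(aligned, min_freq=0.05, gap="-"):
    """Replace bases rarer than min_freq in their column with a gap,
    then drop columns that contain only gaps."""
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned rows must have the same length")

    n_rows = len(aligned)
    kept_columns = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Frequency is relative to all rows, gaps included.
        masked = tuple(
            c if c == gap or counts[c] / n_rows >= min_freq else gap
            for c in column
        )
        if any(c != gap for c in masked):
            kept_columns.append(masked)

    if not kept_columns:
        return [""] * n_rows
    # Transpose the kept columns back into rows.
    return ["".join(row) for row in zip(*kept_columns)]


# Toy usage: one read in 40 carries a spurious insertion (2.5% frequency).
rows = ["AC-GT"] * 39 + ["ACTGT"]
print(mask_rare_bases(rows)[-1])
```

Note that this kind of filter removes noise but also any true variant present in fewer than 5% of the reads.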
Yes. It does not work. It requires 80% distance between amplicons. It is meant to identify different genes, not variants of the same gene.