I have a very deep Illumina run on 5 amplicons subjected to mutation stress. The PI wants per-base mutation frequencies at each position of each amplicon, mapped back to the original amplicon reference. It was a paired-end Illumina run with a read length of 250, and the read pairs were merged. The longest amplicon is 348 nt, so the merged reads cover the amplicons end to end. Depth is about 18,000x. There are 12 samples, each with the same 5 amplicons, which I have separated into 60 FASTQ files.
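For context, the end product is just a per-position substitution-frequency table. Here's a minimal sketch of what I mean, using pysam against a sorted/indexed BAM of the merged reads (the file paths and contig name are hypothetical, and `count_coverage` counts substitutions only, not indels):

```python
# Minimal sketch of the per-base frequency table the PI wants.
# Assumes merged reads aligned to the amplicon reference, BAM sorted+indexed.
import pysam

ref = pysam.FastaFile("amplicon_refs.fa")            # hypothetical path
bam = pysam.AlignmentFile("sample1_amp1.bam", "rb")  # hypothetical path
contig = "amplicon_1"                                # hypothetical contig name
refseq = ref.fetch(contig).upper()

# count_coverage returns per-position counts for A, C, G, T (in that order)
a, c, g, t = bam.count_coverage(contig, 0, len(refseq), quality_threshold=20)

print("pos\tref\tdepth\tnonref_freq")
for i, refbase in enumerate(refseq):
    counts = dict(zip("ACGT", (a[i], c[i], g[i], t[i])))
    depth = sum(counts.values())
    if depth == 0 or refbase not in counts:
        continue
    print(f"{i + 1}\t{refbase}\t{depth}\t{(depth - counts[refbase]) / depth:.5f}")
```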
I've been running freebayes on Galaxy using the frequency-based pooled calling option, but roughly half of the sample/amplicon combinations are not finishing. The other half finish in under an hour.
From looking at the ones that finished, I see some really long rows in the output in regions where there are some large-ish indels (on the order of 50 nt). The indels themselves do not appear to be very frequent, but there are many, many alternate observations for each SNP therein. So I took a look at the distribution of read lengths in the FASTQ files. Reliably, the files that are taking forever for freebayes to process have more than double the number of distinct read lengths.
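The length check was essentially this (a minimal sketch of the tally; it assumes plain 4-line FASTQ records, optionally gzipped):

```python
# Tally the spectrum of read lengths per FASTQ to compare fast vs. slow inputs.
import gzip
import sys
from collections import Counter

def length_spectrum(path):
    opener = gzip.open if path.endswith(".gz") else open
    lengths = Counter()
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence line of each 4-line FASTQ record
                lengths[len(line.strip())] += 1
    return lengths

for path in sys.argv[1:]:
    spec = length_spectrum(path)
    if spec:
        print(f"{path}\t{len(spec)} distinct lengths\t"
              f"modal length {spec.most_common(1)[0][0]}")
```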
Is there a way to get freebayes to ignore large-ish indels yet still give me stats on small ones? There are a lot of options, and it's not clear to me which is the best avenue around this issue. And do you think my hunch is right - that these large-ish indels are what is causing the poorly scaling run-times?
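If there's no clean flag for this, one workaround I've been considering (just a sketch, not something I've run; the expected length and the ±10 nt cutoff are placeholders) is to pre-filter the merged FASTQs by read length before calling, since a full-length merged read carrying a ~50 nt indel should stand out by length alone:

```python
# Sketch of a possible pre-filter: drop merged reads whose length deviates
# too far from the expected amplicon length, on the theory that only those
# carry the large indels. Thresholds here are made up.
import gzip
import sys

EXPECTED_LEN = 348   # hypothetical: the relevant amplicon's length
MAX_DEVIATION = 10   # keeps indels up to ~10 nt, drops the ~50 nt ones

def filter_fastq(in_path, out_path):
    opener = gzip.open if in_path.endswith(".gz") else open
    kept = dropped = 0
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break
            if abs(len(record[1].strip()) - EXPECTED_LEN) <= MAX_DEVIATION:
                fout.writelines(record)
                kept += 1
            else:
                dropped += 1
    print(f"{in_path}: kept {kept}, dropped {dropped}", file=sys.stderr)

filter_fastq(sys.argv[1], sys.argv[2])
```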
BTW, I know there's the separate issue of distinguishing sequencing errors from real mutations. Right now, I just want to get past the run-time issue. So far, I've only let the stuck jobs run for 24 hours; the freebayes processes are chugging along at 99% CPU.