I'm dealing with sequences from 39 different S. cerevisiae strains obtained from low-coverage NGS (from this paper [1]). I noticed that when I'm trying to align homologous genes, I get many small gaps in places where there are indels, primarily in low complexity regions. Typically, there's an insertion in one or two strains within a stretch of As (or, conversely, a deletion in some strains). Since it's unlikely that frameshift mutations would be so widespread, I'm assuming these are due to read errors.
All the multiple alignment programs I tried assume that every base is 'true', i.e. they simply insert a gap whenever the situation I described occurs. This pushes some of the sequences out of frame. I would like to calculate codon bias measures for these sequences, so it's vital that they are all in the same frame.
Is there multiple alignment software, or a frameshift-correction algorithm, that can remove such spurious indels?
[1] http://www.nature.com/nature/journal/v458/n7236/full/nature07743.html
I couldn't find a ready-made tool to do this. I'm going to roll my own script to remove those gaps, but that's likely not the 'right' way to do it. An assembly tool that takes conservation and read quality scores into account would be a more principled approach.
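
For illustration, here is a minimal sketch of what such a gap-removal script might look like. It is not a vetted method, just one way to express the idea: it assumes the alignment is already loaded as a dict of equal-length strings with `-` as the gap character. The function name `remove_spurious_indels`, the `max_minority` threshold and the `N` fill character are all hypothetical choices made for this example. A column where only a few strains carry a base is treated as a spurious insertion and dropped; a gap found in only a few strains is treated as a spurious deletion and filled with `N`.

```python
GAP = "-"


def remove_spurious_indels(aligned, max_minority=2, fill="N"):
    """Return a copy of the alignment with rare indels removed.

    aligned      -- dict mapping strain name to its aligned sequence
    max_minority -- assumed cutoff: an indel seen in at most this many
                    strains is treated as a read error
    fill         -- character written where a rare deletion is patched
    """
    names = list(aligned)
    seqs = [aligned[name].upper() for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    cleaned = {name: [] for name in names}
    for col in range(length):
        column = [seq[col] for seq in seqs]
        n_bases = sum(1 for char in column if char != GAP)
        n_gaps = len(column) - n_bases

        if n_bases == 0:
            # Column holds only gaps: nothing to keep.
            continue
        if n_bases <= max_minority and n_gaps > n_bases:
            # Base present in only a few strains: likely a spurious
            # insertion, so drop the whole column.
            continue
        if 0 < n_gaps <= max_minority:
            # Gap present in only a few strains: likely a spurious
            # deletion, so patch it to keep the frame.
            column = [fill if char == GAP else char for char in column]

        for name, char in zip(names, column):
            cleaned[name].append(char)

    return {name: "".join(chars) for name, chars in cleaned.items()}


if __name__ == "__main__":
    # Toy alignment: s3 has an extra A in a poly-A run, s4 lacks one base.
    toy = {
        "s1": "ATGAAA-AACTGA",
        "s2": "ATGAAA-AACTGA",
        "s3": "ATGAAAAAACTGA",
        "s4": "ATGAAA-AA-TGA",
        "s5": "ATGAAA-AACTGA",
    }
    for name, seq in remove_spurious_indels(toy, max_minority=1).items():
        print(name, seq, len(seq) % 3)
```

Note that this ignores read quality entirely and would also discard a genuine rare indel, including in-frame ones, so the threshold needs to be chosen with the number of strains in mind. Gaps shared by more strains than `max_minority` are left untouched.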