I recently ran a whole-genome assembly with ALLPATHS-LG. Several scaffolds in the resulting assembly contain R and Y characters in the sequence. These are the IUPAC symbols for purines (A or G) and pyrimidines (C or T), respectively, but I have no idea why they would show up in an assembly. The ALLPATHS-LG manual doesn't seem to shed any light on the issue.
Do these symbols have an alternative meaning in ALLPATHS-LG, or am I missing something?
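ALLPATHS-LG preserves heterozygous sites as much as possible. The R and Y characters mark positions it believes are truly heterozygous in the genome: an R means both A and G were supported at that position, a Y both C and T. This is a welcome feature.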
The documentation did mention that it attempts to preserve as much ambiguity as possible, but I thought this referred to unresolved repeat regions, homopolymers, etc. I agree that preserving this information is welcome; it just complicates downstream analysis with software that requires a simple A/C/G/T alphabet.
Just convert each R randomly to A or G (and likewise each Y to C or T). The majority of assemblers effectively do this by reporting only one allele.
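If it helps, here is a minimal Python sketch of that random replacement. It is not part of ALLPATHS-LG; the names `resolve_ambiguity` and `resolve_fasta` are made up for illustration, and it assumes a plain FASTA file where every non-header line is sequence. N is left untouched on the assumption that it marks gaps rather than heterozygous sites.

```python
import random

# IUPAC ambiguity codes and the bases each one stands for.
# N is deliberately left out (assumed here to mark gaps, not het sites).
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve_ambiguity(seq, rng=None):
    """Replace each ambiguity code in seq with one randomly chosen base."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        choices = IUPAC.get(base.upper())
        if choices is None:
            # A, C, G, T, N and anything else pass through unchanged.
            out.append(base)
        else:
            pick = rng.choice(choices)
            # Keep soft-masking (lowercase) if the input used it.
            out.append(pick if base.isupper() else pick.lower())
    return "".join(out)


def resolve_fasta(in_path, out_path, seed=None):
    """Copy a FASTA file, resolving ambiguity codes on sequence lines only."""
    rng = random.Random(seed)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)  # header lines are copied verbatim
            else:
                dst.write(resolve_ambiguity(line.rstrip("\n"), rng) + "\n")
```

Passing a fixed `seed` makes the output reproducible, which is worth doing if the converted assembly feeds into anything you might need to rerun. Keep the original FASTA around, since the heterozygosity information is lost after conversion.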
Thanks! You're welcome to get the accepted answer if you add your comment as an answer.