I have gene alignments between humans and chimpanzees and I need to remove GC dinucleotides from both sequences. My question is about the best way to proceed: should this be done at the codon level or at the sequence level?
For instance, if I have the sequence ACTGCA, this can be split into the two codons ACT and GCA. I can therefore remove the second codon from both the human and chimpanzee sequences and the alignment length should be fine. The problem with this method is that it doesn't account for GC dinucleotides that span two codons (e.g. TCGCAA, where splitting into codons gives us TCG and CAA).
The alternative is simply to remove every GC dinucleotide from the sequence, but this may reduce the sequence to a length that isn't divisible by 3 (i.e. we cannot neatly split it into codons). For example, if we remove all GC dinucleotides from the sequence TCAGCGCAT we are left with TCAAT, which has length 5 and so cannot be split into codons. As I am dealing with alignments between humans and chimpanzees (and will be running PAML, which requires sequences to be of length divisible by 3), this could be problematic. This is likely quite an obvious problem but I am unsure of how to proceed. Any suggestions?
EDIT: As per the comment below, the reason we wish to do this is that CpGs have much higher rates of mutation than other dinucleotides in humans. The problem here is that the density of CpGs differs between synonymous and non-synonymous sites. We are pooling sites to calculate rates of adaptive evolution for different amino acids.
Why? Your entire question revolves around this need yet there is no explanation for this need.
Hi, please see the edit. Thanks.
You may be better off soft/hard masking GCs and using an alignment tool that works well with masked sequences (most of them should).
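As a rough illustration of the two options discussed in this thread, here is a minimal Python sketch. It assumes the human and chimpanzee sequences are already aligned, in frame, equal in length, and free of gap characters; the function names (`dinucleotide_positions`, `hard_mask`, `drop_codons`) are made up for this example and do not come from any library. The dinucleotide is a parameter, so if the target is really CpG (conventionally 5'-CG-3'), `"CG"` can be passed instead of `"GC"`. Whether a downstream tool accepts `N`-masked codons is something to check for that tool.

```python
def dinucleotide_positions(seq, dinuc="GC"):
    """Return the 0-based positions covered by any occurrence of dinuc."""
    seq = seq.upper()
    covered = set()
    start = seq.find(dinuc)
    while start != -1:
        covered.update((start, start + 1))
        # Step forward by one so every occurrence is checked.
        start = seq.find(dinuc, start + 1)
    return covered


def hard_mask(seq, dinuc="GC", mask_char="N"):
    """Replace each base of every dinuc with mask_char; length is unchanged."""
    covered = dinucleotide_positions(seq, dinuc)
    return "".join(mask_char if i in covered else base
                   for i, base in enumerate(seq))


def drop_codons(human, chimp, dinuc="GC"):
    """Drop every codon column touched by dinuc in either sequence.

    A dinucleotide that spans a codon boundary removes both codons,
    so the result stays in frame and both sequences stay the same length.
    Assumes gapless, in-frame, equal-length input (an assumption of this sketch).
    """
    if len(human) != len(chimp) or len(human) % 3 != 0:
        raise ValueError("sequences must be equal length and divisible by 3")
    covered = (dinucleotide_positions(human, dinuc)
               | dinucleotide_positions(chimp, dinuc))
    bad_codons = {pos // 3 for pos in covered}
    keep = [i for i in range(len(human) // 3) if i not in bad_codons]
    new_human = "".join(human[3 * i:3 * i + 3] for i in keep)
    new_chimp = "".join(chimp[3 * i:3 * i + 3] for i in keep)
    return new_human, new_chimp


if __name__ == "__main__":
    print(hard_mask("TCAGCGCAT"))
    print(drop_codons("ATGTCGCAAACT", "ATGTCACAAACT"))
```

Masking keeps every position in place, so the alignment and reading frame are untouched. Dropping whole codon columns shortens the alignment but handles the cross-codon case from the question, because a GC at the boundary of two codons marks both of them for removal.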