Hi all,
For background: I am working on building a phlyogenetic tree for ~200 taxa, using published WGS sequences with genes predicted/annotated using TOGA. So far I have been using the whole coding sequence for each gene and taxon, but for our purposes it would be better to use only the 4-fold degenerate sites to ensure neutral evolution.
I was wondering if anyone can suggest any straightforward way to pull out only the 4-fold degenerate ("4d") sites either from the individual genes (pre-allignment), or from the alignment files directly?
It seems like this is fairly standard practice for a lot of phylogenomic analysis, going by the literature, but I can't see any straightforward way to do it without writing my own script (for e.g. Extracting four fold degenerate sites from nucleotide alignment) or manually selecting bases in a GUI (e.g. https://www.megasoftware.net/web_help_10/Writing_only_4-fold_degenerate_sites_to_an_output_file.htm). I feel like I am surely missing something here, as it seems like a) a fairly common task and b) very computationally simple to do, given an in-frame nucleotide sequence (or any sequence with codon annotation). A colleague suggested it is possible to output just the 4d sites from a sequence by running the genes/alignments through codeml in PAML, but I don't see any reference to such a feature in the codeml manual, unless I'm missing something.
Do I just have to bite the bullet and write my own python script (or similar), or is there an existing solution somewhere I'm not seeing?
Thanks in advance,
J.