Question

Extracting only 4-fold degenerate sites from gene sequences/alignments?

2

Entering edit mode

14 months ago

J. ▴ 50

Hi all,

For background: I am working on building a phlyogenetic tree for ~200 taxa, using published WGS sequences with genes predicted/annotated using TOGA. So far I have been using the whole coding sequence for each gene and taxon, but for our purposes it would be better to use only the 4-fold degenerate sites to ensure neutral evolution.

I was wondering if anyone can suggest any straightforward way to pull out only the 4-fold degenerate ("4d") sites either from the individual genes (pre-allignment), or from the alignment files directly?

It seems like this is fairly standard practice for a lot of phylogenomic analysis, going by the literature, but I can't see any straightforward way to do it without writing my own script (for e.g. Extracting four fold degenerate sites from nucleotide alignment) or manually selecting bases in a GUI (e.g. https://www.megasoftware.net/web_help_10/Writing_only_4-fold_degenerate_sites_to_an_output_file.htm). I feel like I am surely missing something here, as it seems like a) a fairly common task and b) very computationally simple to do, given an in-frame nucleotide sequence (or any sequence with codon annotation). A colleague suggested it is possible to output just the 4d sites from a sequence by running the genes/alignments through codeml in PAML, but I don't see any reference to such a feature in the codeml manual, unless I'm missing something.

Do I just have to bite the bullet and write my own python script (or similar), or is there an existing solution somewhere I'm not seeing?

Thanks in advance,

J.

alignments • 1.0k views

ADD COMMENT • link 14 months ago by J. ▴ 50

score 3 · Answer 1 · 2024-05-14

3

Entering edit mode

14 months ago

J. ▴ 50

FYI if you also have this problem, this seems to be a pretty good solution: https://github.com/harvardinformatics/degenotate

ADD COMMENT • link 14 months ago by J. ▴ 50